Random Globus Notes

From BeSTGRID

Solutions to Common Problems

  • Globus commands cannot be found, e.g. "-bash: globusrun-ws: command not found"
    • The VDT environment has not been set up. Execute . /opt/vdt/setup.sh (this line should also be added to your shell profile file)
  • grid-proxy-init fails with "ERROR: Couldn't find valid credentials to generate a proxy."
    • Try running it with the -debug option. It should give a more detailed error message.
  • Job submission fails with "Delegating user credentials...Failed."
    • Renew your proxy by running grid-proxy-init
# Example error message
Delegating user credentials...Failed.
globus_gsi_credential.c:globus_gsi_cred_read:317:
Error with credential: The proxy credential: /tmp/x509up_u506
     with subject: /C=NZ/O=BeSTGRID/OU=The University of Auckland/CN=Yuriy Halytskyy/CN=1138554618
     expired 750 minutes ago.
  • Job submission hangs for a while and then reports "Current job state: Unsubmitted":
    • Telltail has to be running on the cluster, and logmaker on the gateway.
# Complete error message
Job ID: uuid:e141f68e-2b66-11dd-b240-00163e500t1f2
Termination time: 05/27/2008 21:01 GMT
Current job state: Unsubmitted
Canceling...Canceled.
  • Job submission is too slow (delays between the various stages) but the job executes successfully.
  • Jobs with fileStageIn and fileStageOut fail.
    • To use fileStageIn and fileStageOut with files on the client machine, the client must run an ftp or gftp server. It is better to call url-copy separately.
    • More information: Staging error

When reporting a problem, it is best to include the output of globusrun-ws run with the -debug option.
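The expired-proxy failure above can be caught before a job is submitted. The following is a minimal sketch, assuming grid-proxy-info is on the PATH and that its -timeleft option prints the remaining proxy lifetime in seconds. The one-hour threshold MIN_LEFT and the function name proxy_needs_renewal are illustrative choices, not part of Globus.

```shell
#!/bin/sh
# Minimum remaining proxy lifetime in seconds (hypothetical threshold: one hour).
MIN_LEFT=3600

# Succeeds (returns 0) when the given lifetime is below the threshold.
proxy_needs_renewal() {
    left=$1
    [ "$left" -lt "$MIN_LEFT" ]
}

# Only query the proxy when the Globus tools are actually installed.
if command -v grid-proxy-info >/dev/null 2>&1; then
    # Assumption: -timeleft prints the seconds remaining; no output is treated as 0.
    left=$(grid-proxy-info -timeleft 2>/dev/null | head -n 1)
    left=${left:-0}
    if proxy_needs_renewal "$left"; then
        echo "Proxy has ${left}s left; run grid-proxy-init before submitting."
    fi
fi
```

Running this at the top of a submission script gives an early warning instead of the "Delegating user credentials...Failed." message.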

How to Arrange Job Communication

When a job with multiple processes is submitted to PBS directly, PBS starts a script on a single node and provides the file $PBS_NODEFILE, which lists the host name for each process. It is the task of that script to distribute the workload.

On the other hand, when a job is submitted via Globus with jobType multiple, Globus distributes the processes without user intervention, and $PBS_NODEFILE is insufficient to determine the identity of a process when there are multiple processes on a single host. Message-passing middleware such as MPI solves this problem by managing communication and providing a way to discover the "rank" of a process and the total number of processes. The OpenMPI implementation we are using has C and Fortran bindings; there are libraries for higher-level languages built on top of them, but they have much greater communication overhead.

For example, suppose we want to process 100 XML files named 1.xml, 2.xml, etc. with beast. Running them through the multiple jobType will produce 100 beast instances that are not aware of each other and have no way to pick the right file. One way to solve this problem is a server process that distributes the work and gives a file name to every process; the beast call is then wrapped in a shell script that requests a file name. Another solution is to use an MPI wrapper:

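#include <mpi.h>
#include <stdio.h>
#include <unistd.h>

int main(int argc, char **argv){
 int numtasks, rank, rc;
 rc = MPI_Init(&argc, &argv);
 if (rc != MPI_SUCCESS){
   printf("error when starting MPI program. Terminating\n");
   MPI_Abort(MPI_COMM_WORLD,rc);
 }
 MPI_Comm_size(MPI_COMM_WORLD, &numtasks);
 MPI_Comm_rank(MPI_COMM_WORLD, &rank);
 /* each rank takes its own input file: argv[1] for rank 0, argv[2] for rank 1, ... */
 if (rank + 1 >= argc){
   fprintf(stderr, "rank %d of %d: no input file given\n", rank, numtasks);
   MPI_Abort(MPI_COMM_WORLD, 1);
 }
 /* replace this process with beast; execlp returns only on failure */
 execlp("/usr/local/bin/beast","beast",argv[rank+1],(char*)0);
 perror("execlp");
 MPI_Abort(MPI_COMM_WORLD, 1);
 return 1;
}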

And then start it as

mpirun -np 100 beastMPI *.xml

(see "Submitting MPI jobs" earlier for how to submit it through Globus)
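The first approach mentioned above (a process that hands out file names) can be approximated without a dedicated server when all nodes see the same filesystem. The sketch below substitutes a different technique for the server: each process claims a file by creating a marker directory next to it, relying on mkdir succeeding for only one caller. The function name claim_next_file, the .claimed suffix, and the shared input directory are assumptions made for illustration.

```shell
#!/bin/sh
# Claim one unprocessed input file from a directory shared by all processes.
# Prints the claimed file name and returns 0, or returns 1 when none is left.
claim_next_file() {
    dir=$1
    for f in "$dir"/*.xml; do
        [ -e "$f" ] || continue        # the glob matched nothing
        # mkdir fails if the marker already exists, so only one process wins
        if mkdir "$f.claimed" 2>/dev/null; then
            printf '%s\n' "$f"
            return 0
        fi
    done
    return 1
}

# Hypothetical use inside the wrapper script started for each process:
#   input=$(claim_next_file /path/to/shared/inputs) && /usr/local/bin/beast "$input"
```

With this wrapper submitted through the multiple jobType, each instance picks a different file, and instances that find nothing left simply exit. Whether mkdir is reliably atomic depends on the shared filesystem in use, so this should be checked on the target cluster.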

Limits on Number of Processes

For non-MPI jobs, errors start to occur if the number of processes is larger than approximately twice the number of cores requested. For MPI jobs the limit is larger and currently uncertain.

More details: Limits on Number of Processes with Torque
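As a rough pre-submission check for non-MPI jobs, the rule of thumb above can be expressed as a small shell function. This is only a sketch: the factor of two is the approximate figure observed above, not a hard limit enforced by Torque, and the function name within_process_limit is made up for this example.

```shell
#!/bin/sh
# Succeeds (returns 0) when the process count is at most twice the core count.
# The factor 2 is the approximate threshold observed for non-MPI jobs.
within_process_limit() {
    procs=$1
    cores=$2
    [ "$procs" -le $((2 * cores)) ]
}

# Example with hypothetical numbers: 16 processes on 8 requested cores.
if within_process_limit 16 8; then
    echo "process count is within the observed limit"
else
    echo "process count is likely to cause errors"
fi
```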