Random Globus Notes
From BeSTGRID
Solutions to Common Problems
- Globus commands cannot be found, e.g. "-bash: globusrun-ws: command not found"
- The VDT environment has not been set up. Execute . /opt/vdt/setup.sh (add this line to your shell profile file so that it runs at every login).
- grid-proxy-init fails with "ERROR: Couldn't find valid credentials to generate a proxy."
- Try running it with the -debug option. It should give a more detailed error message.
- Job submission fails with "Delegating user credentials...Failed."
- Update your proxy by running grid-proxy-init.
# Example error message
Delegating user credentials...Failed.
globus_gsi_credential.c:globus_gsi_cred_read:317:
Error with credential: The proxy credential: /tmp/x509up_u506
with subject: /C=NZ/O=BeSTGRID/OU=The University of Auckland/CN=Yuriy Halytskyy/CN=1138554618
expired 750 minutes ago.
- Job submission hangs for a while and then reports "Current job state: Unsubmitted":
- Telltail has to be running on the cluster, and logmaker on the gateway.
# Complete error message
Job ID: uuid:e141f68e-2b66-11dd-b240-00163e500t1f2
Termination time: 05/27/2008 21:01 GMT
Current job state: Unsubmitted
Canceling...Canceled.
- Job submission is too slow (delays between the various stages) but the job executes successfully.
- Try setting $GLOBUS_HOSTNAME on the client to its FQDN. A reverse DNS lookup of the client's IP address has to return that name.
- For more details see http://www.globus.org/mail_archive/gt-user/2008/03/msg00032.html
- Jobs with fileStageIn and fileStageOut fail.
- To use fileStageIn and fileStageOut with files on the client machine, the client must run an ftp or gftp server. It is better to copy the files separately with globus-url-copy.
- More information: Staging error
When reporting a problem, it is better to include the output of globusrun-ws with the -debug option.
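Several of the checks above can be run before submitting a job. The following is a minimal sketch, not a tested BeSTGRID tool: the function names proxy_is_fresh and preflight and the 600-second safety margin are assumptions made for illustration, and it assumes that grid-proxy-info -timeleft prints the remaining proxy lifetime in seconds.

```shell
#!/bin/bash
# Pre-submission checks for some of the problems listed above (a sketch).

# proxy_is_fresh SECONDS_LEFT [MIN_SECONDS]
# True when the proxy outlives MIN_SECONDS (default 600, an arbitrary margin).
# An empty or non-numeric SECONDS_LEFT counts as "not fresh".
proxy_is_fresh() {
    local left=${1:-0} min=${2:-600}
    [ "$left" -gt "$min" ] 2>/dev/null
}

# preflight: print a hint for every failed check; return 1 if any check failed.
preflight() {
    local failed=0
    if ! command -v globusrun-ws >/dev/null 2>&1; then
        echo "globusrun-ws not found: run '. /opt/vdt/setup.sh'" >&2
        failed=1
    fi
    if ! proxy_is_fresh "$(grid-proxy-info -timeleft 2>/dev/null)"; then
        echo "proxy missing or about to expire: run grid-proxy-init" >&2
        failed=1
    fi
    if [ -z "${GLOBUS_HOSTNAME:-}" ]; then
        echo "GLOBUS_HOSTNAME is not set: consider setting it to the client's FQDN" >&2
        failed=1
    fi
    return "$failed"
}
```

After sourcing the file, running preflight prints one hint per failed check and returns a non-zero status if anything needs attention.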
How to Arrange Job Communication
When a job with multiple processes is submitted to PBS directly, PBS starts a script on a single node and provides the file $PBS_NODEFILE, which lists the host name for each process. It is the task of that script to distribute the workload.
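As a rough sketch of what such a script might do, the function below pairs each process slot listed in a node file with one input file. The name assign_work and the "host file" output format are illustrative assumptions, not part of PBS; how each pair is then launched (for example over ssh) is left to the site.

```shell
#!/bin/bash
# assign_work NODEFILE FILE...
# Print one "host file" pair per line of NODEFILE, in order.
# Stops when either the hosts or the files run out.
assign_work() {
    local nodefile=$1
    shift
    local host
    while read -r host && [ "$#" -gt 0 ]; do
        printf '%s %s\n' "$host" "$1"
        shift    # move on to the next input file
    done < "$nodefile"
}
```

Inside a PBS job script it could be called as assign_work "$PBS_NODEFILE" *.xml.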
On the other hand, when a job is submitted via Globus with jobType multiple, Globus distributes the processes without user intervention, and $PBS_NODEFILE is insufficient to determine the identity of a process when there are multiple processes on a single host. Message-passing middleware like MPI solves this problem by managing communication and providing a way to discover the "rank" of a process and the total number of processes. The OpenMPI implementation we are using has C and Fortran bindings; there are libraries for higher-level languages built on top of them, but they have much greater communication overhead.
For example, suppose we want to process 100 XML files named 1.xml, 2.xml, etc. with beast. Running them through the multiple jobType will produce 100 beast instances that are not aware of each other and have no way to pick the right file. One way to solve this problem is a server process that distributes the work by handing a file name to every process; the beast call is then wrapped in a shell script that requests a file name. Another solution is to use an MPI wrapper:
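#include <mpi.h>
#include <stdio.h>
#include <unistd.h>

int main(int argc, char **argv) {
    int numtasks, rank, rc;

    rc = MPI_Init(&argc, &argv);
    if (rc != MPI_SUCCESS) {
        printf("Error when starting MPI program. Terminating.\n");
        MPI_Abort(MPI_COMM_WORLD, rc);
    }
    MPI_Comm_size(MPI_COMM_WORLD, &numtasks);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank + 1 >= argc) { /* more processes than input files */
        MPI_Finalize();
        return 0;
    }
    /* Replace this process with beast, passing the file that matches the rank. */
    execlp("/usr/local/bin/beast", "beast", argv[rank + 1], (char *)0);
    perror("execlp"); /* reached only if exec failed */
    MPI_Abort(MPI_COMM_WORLD, 1);
    return 1;
}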
And then start it with one process per input file (100 in the example above):
mpirun -np 100 beastMPI *.xml
(see "Submitting MPI jobs" earlier on how to submit it through globus)
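A shell-only variant of the C wrapper above is sketched here. It assumes that the installed Open MPI exports the rank of each launched process in the OMPI_COMM_WORLD_RANK environment variable, which should be verified on the cluster first. The script name beast-rank.sh and the function name pick_input are hypothetical.

```shell
#!/bin/bash
# beast-rank.sh (hypothetical name): run beast on the file matching this rank.

# pick_input RANK FILE...
# Print the file assigned to the 0-based RANK,
# or return 1 when there are more processes than files.
pick_input() {
    local rank=$1
    shift
    [ "$rank" -lt "$#" ] || return 1
    local files=("$@")
    printf '%s\n' "${files[$rank]}"
}

# Only act when started by mpirun, i.e. when the rank variable is set.
if [ -n "${OMPI_COMM_WORLD_RANK:-}" ]; then
    # Surplus processes (no file left for them) exit quietly.
    input=$(pick_input "$OMPI_COMM_WORLD_RANK" "$@") || exit 0
    exec /usr/local/bin/beast "$input"
fi
```

It would be started the same way as the compiled wrapper, for example mpirun -np 100 ./beast-rank.sh *.xml, with no compilation step.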
Limits on Number of Processes
Errors start to occur if the number of processes is larger than approximately twice the number of cores requested, for non-MPI jobs. For MPI jobs the limit is larger and currently uncertain.
More details: Limits on Number of Processes with Torque
