This page describes the scheduling software components on kali; I will collectively refer to them as "the scheduler" in the following. See also kali's webpage at http://www.math.umbc.edu/kali. This page also contains some suggestions on how to supervise your runs (and kill them if necessary) and how to monitor the performance of your code.
If you find mistakes on this page or have suggestions, please contact me.
All scheduler commands have man pages; additionally, the man page pbs_resources (viewed with man pbs_resources) has particularly useful information. Also look under the "See Also" heading at the bottom of each man page for cross-references to other pages.
#!/bin/bash
:
: The following is a template for job submission for the
: Scheduler on kali.math.umbc.edu
:
: This defines the name of your job
#PBS -N MPI_Aout
: This is the path
#PBS -o .
#PBS -e .
#PBS -q workq
#PBS -l nodes=4:myrinet:ppn=2

cd $PBS_O_WORKDIR
mpiexec -nostdout a.out
You submit this script by typing qsub qsub-script at the Linux prompt.
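Before submitting, it can be worth checking that the script itself is valid shell syntax; bash treats the #PBS lines as ordinary comments, so only the scheduler interprets them. The following is just a local sanity check (the path /tmp/qsub-script is only for illustration):

```shell
# Write the template from above to a throwaway location (illustrative path).
cat > /tmp/qsub-script <<'EOF'
#!/bin/bash
#PBS -N MPI_Aout
#PBS -o .
#PBS -e .
#PBS -q workq
#PBS -l nodes=4:myrinet:ppn=2
cd $PBS_O_WORKDIR
mpiexec -nostdout a.out
EOF

# bash -n parses the script without executing anything in it.
bash -n /tmp/qsub-script && echo "syntax OK"
```

This catches typos such as unbalanced quotes before the scheduler ever sees the script; it cannot, of course, validate the #PBS directives themselves.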
When the scheduler starts executing this script, its working directory is your home directory. But the environment variable PBS_O_WORKDIR holds on to the directory in which you started your job, which is typically not your home directory. To get back to this directory, the script first of all executes the line cd $PBS_O_WORKDIR. From then on, you are again in the directory where this file is located and where you issued the qsub qsub-script command. Hence, we can access the executable in that directory simply as a.out.
This directory change is crucial, in particular if your code reads in an input file and/or creates output files. Without the cd command, your executable will not be found; also, input files cannot be accessed, and output files will all be put in your home directory.
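The effect of the cd line can be mimicked outside the scheduler. In a real job, PBS sets PBS_O_WORKDIR for you; the directory used below is made up purely for this demonstration:

```shell
# In a real job, PBS sets this variable to the directory where you ran qsub.
PBS_O_WORKDIR=/tmp/pbs-demo          # hypothetical value for illustration
mkdir -p "$PBS_O_WORKDIR"

cd "$HOME"                           # the scheduler starts your script in $HOME
cd "$PBS_O_WORKDIR"                  # the first real line of the qsub-script
pwd                                  # now back where the job was submitted
```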
The option -q workq specifies which queue to submit your job to. The queue workq is the only one set up on kali at present.
You choose the name of your job with the option -N; this name will appear in the queue listing that you can see with qstat. Choose a meaningful name for your own code here.
The options -o and -e tell the scheduler in which directory to place the stdout and stderr files, respectively. At present, these files have the form jobnumber.kali.cl.OU and jobnumber.kali.cl.ER, respectively, since the jobnumber is a three-digit number; if it becomes a four-digit number, we will likely lose the letter "l" and get jobnumber.kali.c.OU and jobnumber.kali.c.ER.
These files are created and accumulated in some temporary place
and only moved to your directory after completion of the job.
See below for a remark on this important issue.
In this example, you want to run on 8 processors, as indicated by the crucial line

#PBS -l nodes=4:myrinet:ppn=2

that is, 4 nodes with 2 processors per node (ppn=2), all connected by Myrinet. Your job will execute on the 4 nodes returned by the scheduler. Note that the run-line beginning with mpiexec does not specify the number of processes, so your job will run on all the processors returned by the scheduler (8 in this case).
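The processor count follows directly from the resource request; a one-line check of the arithmetic, with nodes and ppn taken from the #PBS -l line above:

```shell
nodes=4; ppn=2                # from: #PBS -l nodes=4:myrinet:ppn=2
echo $((nodes * ppn))         # total processors the scheduler will return: 8
```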
The line starting with mpiexec starts the job. The -nostdout flag indicates that you do not want output sent to the stdout stream, but rather want it redirected, as explained above.
The above example showed how to pipe stderr and stdout into two separate files. One can also join them together. To accomplish this, replace the -e . by -j oe. (To make this clear, in case it is hard to read in the previous sentence: you replace the option -e and the period "." by the option -j and the value oe.) See the man page for qsub.
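With the join option, the header of the qsub-script from above would look as follows; only the -e line changes, and everything else stays the same:

```shell
#PBS -N MPI_Aout
#PBS -o .
#PBS -j oe
#PBS -q workq
#PBS -l nodes=4:myrinet:ppn=2
```

You then get a single jobnumber.kali.cl.OU file containing both streams, interleaved in the order the job produced them.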
If your executable is not in the current directory, you would simply replace the a.out in the mpiexec line of the qsub-script above by the full path for the executable. For instance, if you have an executable DG-mpigm-O in the directory $HOME/develop/Applications/DG/bin/x86_linux-icc, the mpiexec line could read

mpiexec -nostdout \
    $HOME/develop/Applications/DG/bin/x86_linux-icc/DG-mpigm-O

which uses the predefined environment variable HOME to make the script a little more general. Notice that I used the backslash "\" to continue the line, in order to keep the line length below 80 columns and make the script file more readable. This is particularly useful if you have a long list of command-line arguments.
Advanced options are available, such as using only one processor per node (even on a multiprocessor node), overriding the communication libraries built into the executable, or supplying a config file (to specify exactly which nodes to use). Consult the man page for mpiexec for full details.
A few important facts about mpiexec must be noted:
- The mpiexec utility was designed to run within the scheduler environment; it cannot be executed from the Linux prompt.
- Your job runs one process on each processor obtained through mpiexec (unless your code explicitly spawns threads of its own).
- mpiexec provides a clean wrap-up of MPI jobs. For instance, when any process associated with a job is killed, all processes associated with that job are automatically terminated. The behavior is similar when a job is deleted using the scheduler command qdel; hence qdel (usage explained below) is the clean way to terminate MPI jobs.
Two useful caveats: If you use the csh or tcsh shell, the stdout file (the file ending with OU) will contain a couple of lines of error messages, namely

Warning: no access to tty (Bad file descriptor).
Thus no job control in this shell.

This is due to some conflict between our shell, mpiexec, etc. You can safely ignore this.
The scheduler command qstat shows the current state of the queue, for instance:

                                                    Req'd  Req'd   Elap
Job ID          Username Queue    Jobname    SessID NDS TSK Memory Time  S Time
--------------- -------- -------- ---------- ------ --- --- ------ ----- - -----
635.mgtnode     gobbert  workq    MPI_DG       2320   8   1    --  10000 R 716:0
636.mgtnode     gobbert  workq    MPI_DG       2219   8   1    --  10000 R 716:1
665.mgtnode     gobbert  workq    MPI_Nodesu     --  16   1    --  10000 Q   --
704.mgtnode     gobbert  workq    MPI_Nodesu  12090  15   1    --  10000 E 00:00
705.mgtnode     kallen1  workq    MPI_Aout       --   1   1    --  10000 Q   --
706.mgtnode     gobbert  workq    MPI_Nodesu     --  15   1    --  10000 Q   --
707.mgtnode     gobbert  workq    MPI_Nodesu     --  15   1    --  10000 Q   --
The most interesting column is the one titled S for "status". It shows what your job is doing at this point in time: The letter Q indicates that your job has been queued, that is, it is waiting for resources to become available and will then be executed. The letter R indicates that your job is currently running. Finally, the letter E says that your job is exiting; this will appear during the shut-down phase, after the job has actually finished execution. See man qstat for more information.
Personal suggestions: I feel that qstat -a gives me a little more information. Often, it is necessary to know which nodes your job is running on. You can see that with qstat -n; this implies the -a option. Finally, if your job is listed as queued (a Q in the S column of qstat), you can find out why it is not running using qstat -f; look for the comment field, which might say something like "not sufficient nodes of requested type available".
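Since qstat prints plain whitespace-separated columns, the standard text tools apply to its output. As a sketch, the sample listing from above (saved to a file here for illustration) can be filtered with awk for one user's queued jobs; field 10 is the S column, and the user name gobbert is taken from the sample:

```shell
# Sample rows copied from the qstat listing above.
cat > /tmp/qstat.txt <<'EOF'
635.mgtnode     gobbert  workq  MPI_DG      2320  8  1  --  10000 R 716:0
636.mgtnode     gobbert  workq  MPI_DG      2219  8  1  --  10000 R 716:1
665.mgtnode     gobbert  workq  MPI_Nodesu    -- 16  1  --  10000 Q   --
704.mgtnode     gobbert  workq  MPI_Nodesu 12090 15  1  --  10000 E 00:00
705.mgtnode     kallen1  workq  MPI_Aout      --  1  1  --  10000 Q   --
706.mgtnode     gobbert  workq  MPI_Nodesu    -- 15  1  --  10000 Q   --
707.mgtnode     gobbert  workq  MPI_Nodesu    -- 15  1  --  10000 Q   --
EOF

# Keep only gobbert's queued (Q) jobs; on kali you would pipe qstat -a
# directly into awk instead of using a saved file.
awk '$2 == "gobbert" && $10 == "Q"' /tmp/qstat.txt
```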
There are two other visual tools available on kali to monitor performance and activity:
The command qstat -n additionally lists the nodes assigned to each running job, for example:

                                                    Req'd  Req'd   Elap
Job ID          Username Queue    Jobname    SessID NDS TSK Memory Time  S Time
--------------- -------- -------- ---------- ------ --- --- ------ ----- - -----
6307.kali.math. kali-g2  workq    MPI_SPARK    8472  16   1    --  10000 R 133:5
   node018/1+node018/0+node027/1+node027/0+node026/1+node026/0+node025/1
   +node025/0+node024/1+node024/0+node023/1+node023/0+node022/1+node022/0
   +node020/1+node020/0+node019/1+node019/0+node031/1+node031/0+node030/1
   +node030/0+node029/1+node029/0+node028/1+node028/0+node005/1+node005/0
   +node004/1+node004/0+node003/1+node003/0
6308.kali.math. kali-g2  workq    MPI_SPARK    8894   2   1    --  10000 R 16:44
   node007/1+node007/0+node006/1+node006/0
6309.kali.math. kali-g2  workq    MPI_SPARK    8126   8   1    --  10000 R 07:37
   node014/1+node014/0+node001/1+node001/0+node013/1+node013/0+node012/1
   +node012/0+node011/1+node011/0+node010/1+node010/0+node009/1+node009/0
   +node008/1+node008/0
6310.kali.math. kali-g2  workq    MPI_SPARK    8304   1   1    --  10000 R 00:54
   node015/1+node015/0
kali-g2 is one of the group accounts on kali used by myself (Gobbert). Notice that this sample output shows the job-IDs as they appeared in November 2004, truncated to the 15 available characters for output in that column. While the job-IDs look different now, the software still works the same way.
Recall that the scheduler accumulates the stderr and stdout files (ending with ER and OU, respectively) in a temporary location and moves them to your current directory ($PBS_O_WORKDIR) only after your code has finished running.
Often, it is vital to be able to look at these files while the job is still running, for instance, to determine how far your simulation has progressed or whether your code has encountered a problem. Before the operating system upgrade in March 2005, the temporary location of these files was the user's home directory. This has changed, and the files are now accumulated locally on the node on which Process 0 of your job is running.
We are actually not satisfied with this situation, because it makes it very cumbersome to look at these files during a run. We are investigating how to control the location of these files as desired. But in the meantime, the following explains how to find these files in the present situation.
First you have to find out on which nodes your job is running; use qstat -n to get information such as

803.kali.cl.mat gobbert  workq    MPI_Testio   9069   2   1    --  10000 R 00:00
   node31+node31+node30+node30

telling you that your job runs on nodes node30 and node31, using both CPUs on each. So, in MPI jargon, there are Processes 0, 1, 2, and 3. The order of the nodes in the list returned by qstat -n tells you that in fact Processes 0 and 1 are on node31 and Processes 2 and 3 are on node30. (The confirmation of this fact is the point of having the process numbers and their hostnames printed out in the sample codes available from kali's webpage.)
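The rank-to-node correspondence can be read off mechanically from the node list that qstat -n prints; a small sketch using the list from the example above:

```shell
nodelist='node31+node31+node30+node30'   # as reported by qstat -n

# The i-th entry (counting from 0) is the node hosting MPI Process i.
rank=0
for node in ${nodelist//+/ }; do         # split the list on '+'
    echo "Process $rank runs on $node"
    rank=$((rank + 1))
done
```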
Having determined that Process 0 is on node31, ssh to that node by saying ssh node31. Change directory to /var/spool/pbs/spool by saying cd /var/spool/pbs/spool.
A directory listing (using ls) should show the
ER and OU files here (or only OU if you requested them to be joined).
You can now look at the files using various Linux commands,
for instance, more or less, or
tail or even tail -f;
see their man pages for more information.
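Since the spool file grows while the job runs, tail is the natural tool for checking progress. A local stand-in for the situation (the file name mimics the OU naming scheme but is made up here):

```shell
# Simulate a growing stdout spool file with a few progress lines.
printf 'step 1 done\nstep 2 done\nstep 3 done\n' > /tmp/803.kali.cl.OU

tail -n 1 /tmp/803.kali.cl.OU      # show only the most recent line
# tail -f would keep the file open and print new lines as they appear,
# which is what you want while the job is still writing to it.
```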
You kill a job with the qdel command, using the jobnumber from qstat, as in qdel 636, for instance. See man qdel for more information.
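qdel wants just the leading number of the Job ID column. If you ever script this, the shell can strip the suffix for you; the job ID below is taken from the sample qstat listing:

```shell
jobid='636.mgtnode'        # a Job ID as printed by qstat
jobnumber=${jobid%%.*}     # strip everything from the first '.' onward
echo "$jobnumber"          # prints 636, which is what you pass to qdel
```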