Running codes (serial and parallel)

When your job starts to execute, the batch system will execute the script file you submitted on the first node assigned to your job. If your job is to run on multiple cores and/or multiple nodes, it is your script's responsibility to deliver the various tasks to the different cores and/or nodes. How to do this varies with the application, but some common techniques are discussed here.

The scheduler assigns the variable $PBS_NODEFILE which contains the name of a file that lists all of the nodes that you've been assigned. If you are assigned multiple cores on the same node, the name of that node appears multiple times (once per core assigned) in that file.
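As a rough illustration of how that file can be used, the sketch below counts the cores and the distinct nodes assigned to a job. The helper names count_cores and count_nodes are made up for this example; only $PBS_NODEFILE comes from the scheduler.

```shell
#!/bin/sh
# count_cores FILE: the node file has one line per assigned core,
# so the number of lines is the total number of cores.
count_cores() {
    wc -l < "$1" | tr -d ' '
}

# count_nodes FILE: a node name is repeated once per core on that node,
# so the number of distinct names is the number of nodes.
count_nodes() {
    sort -u "$1" | wc -l | tr -d ' '
}

# Only meaningful inside a batch job, where the scheduler sets PBS_NODEFILE.
if [ -n "${PBS_NODEFILE:-}" ]; then
    echo "cores: $(count_cores "$PBS_NODEFILE")"
    echo "nodes: $(count_nodes "$PBS_NODEFILE")"
    # Cores assigned on each node, printed as "count name".
    sort "$PBS_NODEFILE" | uniq -c
fi
```

Outside a batch job the variable is unset, so the snippet prints nothing; the two functions can still be tried by hand on any file with one host name per line.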

  1. Running Serial jobs
  2. Running multithreaded jobs on a single node
  3. Running OpenMPI jobs
  4. Running LAM jobs
  5. Running MPICH jobs
  6. Running non-MPI jobs on multiple nodes

Running Serial jobs

Serial or single core jobs are the simplest case. Indeed, the batch system starts processing your job script on a core of the first node assigned to your job, and in the single core case this is the only core/node assigned to your job. So there is really nothing special that you need to do; just enter the command for your job and it should run.

NOTE: Do not confuse single-core jobs with the poorly named serial queue. Single-core jobs can run on any of the narrow-* queues or the debug queue (the wide-* and ib queues require multiple nodes), and most users submitting single-core jobs probably want to use a narrow-* queue. The serial queue is a low-priority, preemptible queue, as discussed elsewhere in this documentation.

Running Multithreaded jobs on a single node

The next simplest case is a job that runs on a single node but is multithreaded. For example, OpenMP codes that are not also using MPI will typically fall into this category. Again, usually there is nothing special that you need to do.

One exception is if you are not using all the cores on the node for this job. In this case, you might need to tell the code to limit the number of cores being used. This is true for OpenMP codes, because OpenMP will by default try to use all the cores it can find.

Normally, once a node is assigned to you for one of your jobs, no one else can run jobs on that node (while it is still assigned to you). However, if room exists on the node, other jobs of yours might be placed on it. For example, if you have several jobs requesting 4 cores, two of these jobs might get the same 8-core node. If both of these are OpenMP jobs and you do not limit the number of cores they try to use, both jobs will try to run 8 threads, one per core on the machine, resulting in 16 threads on that machine. This will cause contention between your two jobs and actually reduce performance.

For OpenMP, you can set the environment variable OMP_NUM_THREADS in your job script to match the number of cores per node requested by the job, e.g. for our 4-core example, either

setenv OMP_NUM_THREADS 4
for csh-type shells, or
OMP_NUM_THREADS=4
export OMP_NUM_THREADS
for Bourne-type shells.
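Rather than hard-coding the value, a job script could derive it from $PBS_NODEFILE. The sketch below is one possible approach, not a site-provided tool: it assumes the first line of the node file names the node the script is running on, and cores_on_first_node is a made-up helper name.

```shell
#!/bin/sh
# cores_on_first_node FILE: count how many times the first node name
# appears in FILE, i.e. how many cores the job has on that node.
cores_on_first_node() {
    first=$(head -n 1 "$1")
    # -x: match whole lines only, -F: fixed string, -c: print the count
    grep -c -x -F "$first" "$1"
}

# Only meaningful inside a batch job, where the scheduler sets PBS_NODEFILE.
if [ -n "${PBS_NODEFILE:-}" ]; then
    OMP_NUM_THREADS=$(cores_on_first_node "$PBS_NODEFILE")
    export OMP_NUM_THREADS
    echo "OMP_NUM_THREADS=$OMP_NUM_THREADS"
fi
```

For a 4-core request on one node, the node name appears four times in the file and the thread count becomes 4, so two such jobs sharing an 8-core node no longer oversubscribe it.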

Running OpenMPI jobs

OpenMPI is the preferred MPI unless your application specifically requires one of the alternate MPI variants. OpenMPI automatically "knows" about the contents of $PBS_NODEFILE and as such you don't need to include it on the command line. OpenMPI is also compiled to support all of the various interconnect hardware, so for nodes with fast transport (InfiniBand/Myrinet), the fastest interface will be selected automatically.

NOTE: All of the nodes in your job must be configured to use the same MPI library, version, and language bindings. This is best done by editing your ~/.cshrc.mine (or ~/.bashrc.mine if you are using the Bourne shell) to include the appropriate tap -q or module load command. You should ONLY tap/module load a single MPI library in your dot file; if you have multiple tap/module load lines, at best only the last one is effective (and likely none will work properly).

Beyond that, you can simply invoke your MPI-enabled application with the mpirun command, e.g.

mpirun -np NUMCORES MY_APPLICATION
where NUMCORES is the number of cores/tasks to use, and MY_APPLICATION is your MPI-enabled application.

If you are doing hybrid OpenMP/OpenMPI parallelization, NUMCORES should be the number of MPI tasks you wish to start, each using OMP_NUM_THREADS cores via OpenMP. If you wish to disable OpenMP parallelization, just set OMP_NUM_THREADS to 1.
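For the hybrid case, the number of MPI tasks is typically the total number of assigned cores divided by the threads per task. The sketch below shows only that arithmetic; mpi_tasks is a hypothetical helper, the value 4 is an example, and how the tasks are actually distributed across nodes depends on the OpenMPI configuration and is not addressed here.

```shell
#!/bin/sh
# mpi_tasks TOTAL_CORES THREADS_PER_TASK: number of MPI tasks to start so
# that tasks * threads does not exceed the assigned cores (integer division).
mpi_tasks() {
    echo $(( $1 / $2 ))
}

# Only meaningful inside a batch job, where the scheduler sets PBS_NODEFILE.
if [ -n "${PBS_NODEFILE:-}" ]; then
    OMP_NUM_THREADS=4                 # threads per MPI task (example value)
    export OMP_NUM_THREADS
    # One line per assigned core in the node file.
    TOTAL_CORES=$(wc -l < "$PBS_NODEFILE" | tr -d ' ')
    NUMCORES=$(mpi_tasks "$TOTAL_CORES" "$OMP_NUM_THREADS")
    mpirun -np "$NUMCORES" MY_APPLICATION
fi
```

With 16 assigned cores and 4 threads per task this starts 4 MPI tasks. OpenMPI's mpirun also has a -x option for exporting environment variables such as OMP_NUM_THREADS to remote tasks; whether it is needed depends on how your shell startup files are set up.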

NOTE: Your code must be MPI aware for the above to work. Running a non-MPI code with mpirun might succeed, but you will have NUMCORES processes running the exact same calculations, duplicating each other's work, and wasting resources.

For more information, see the examples.

Running LAM jobs

NOTE: Please consider using OpenMPI if your application supports it. Use of LAM is deprecated.

NOTE: All of the nodes in your job must be configured to use the same MPI library, version, and language bindings. This is best done by editing your ~/.cshrc.mine (or ~/.bashrc.mine if you are using the Bourne shell) to include the appropriate tap -q or module load command. You should ONLY tap/module load a single MPI library in your dot file; if you have multiple tap/module load lines, at best only the last one is effective (and likely none will work properly).

The LAM MPI library requires you to explicitly set up the MPI daemons on all the nodes before you start using MPI, and to tear them down after your code exits. So to run an MPI code you would typically have the following three lines:

lamboot $PBS_NODEFILE
mpirun C YOUR_APPLICATION
lamhalt
  1. The first line sets up the MPI pool between the nodes assigned to your job.
  2. The second line starts up a copy of YOUR_APPLICATION on each core (hence the 'C') assigned to your job.
  3. The last line cleans up the MPI pool
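If the application exits with an error, it is still worth tearing the pool down. The three steps above could be wrapped as in the sketch below; run_lam_job is a made-up helper name, and only lamboot, mpirun C, and lamhalt come from the text.

```shell
#!/bin/sh
# run_lam_job APP [ARGS...]: boot the LAM pool, run APP on every assigned
# core, then always halt the pool and return the exit status of mpirun.
run_lam_job() {
    lamboot "$PBS_NODEFILE" || return 1   # no pool started, nothing to clean up
    rc=0
    mpirun C "$@" || rc=$?                # remember the application's result
    lamhalt                               # tear down even if mpirun failed
    return "$rc"
}

# Usage inside a job script:
#   run_lam_job YOUR_APPLICATION
```

Returning the status of mpirun rather than that of lamhalt lets the job script report whether the application itself succeeded.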

NOTE: Your code must be MPI aware for the above to work. Running a non-MPI code with mpirun might succeed, but you will have one process per assigned core running the exact same calculations, duplicating each other's work, and wasting resources.

For more information, see the examples.

Running MPICH jobs

NOTE: Please consider using OpenMPI if your application supports it. Use of MPICH is deprecated.

NOTE: All of the nodes in your job must be configured to use the same MPI library, version, and language bindings. This is best done by editing your ~/.cshrc.mine (or ~/.bashrc.mine if you are using the Bourne shell) to include the appropriate tap -q or module load command. You should ONLY tap/module load a single MPI library in your dot file; if you have multiple tap/module load lines, at best only the last one is effective (and likely none will work properly).

Note also that if you've never run MPICH before, you'll need to create the file .mpd.conf in your home directory. This file should contain at least a line of the form MPD_SECRETWORD=we23jfn82933. (DO NOT use the example provided; make up your own secret word.)
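One way to create that file with a random secret word is sketched below. The helper name make_mpd_conf is hypothetical, and creating the file readable only by its owner reflects an assumption about what mpd expects rather than anything stated on this page.

```shell
#!/bin/sh
# make_mpd_conf FILE: create FILE with a random secret word unless it exists.
make_mpd_conf() {
    [ -e "$1" ] && return 0                  # keep an existing file untouched
    # 16 random bytes from /dev/urandom, printed as 32 hex characters
    secret=$(od -An -N16 -tx1 /dev/urandom | tr -d ' \n')
    # Assumption: the file should be private, so create it under umask 077.
    # The subshell keeps the umask change from leaking into the caller.
    ( umask 077 && printf 'MPD_SECRETWORD=%s\n' "$secret" > "$1" )
}

# Usage, run once from a login shell:
#   make_mpd_conf "$HOME/.mpd.conf"
```

Because an existing file is left alone, running this a second time does not change a secret word you have already set.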

The MPICH implementation of MPI also requires the MPI pool to be explicitly set up and torn down. The set up step involves starting mpd daemon processes on each of the nodes assigned to your job.

A typical MPICH job will use the following lines:

mpdboot -n NUM_NODES -f NODE_FILE
mpiexec -n NUM_CORES YOUR_PROGRAM
mpdallexit
  1. The first line starts NUM_NODES mpd daemons, one per node listed in the NODE_FILE. NOTE: NUM_NODES is the number of NODES, not cores or tasks for the job. Also, NODE_FILE is not quite the same format as PBS_NODEFILE, as it should only contain the name of each node once.
  2. The second line launches NUM_CORES copies of YOUR_PROGRAM across all the nodes. NOTE: NUM_CORES here is the number of cores/tasks, not the number of nodes.
  3. The last line shuts down the mpd daemons, etc.
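Since NODE_FILE must list each node only once, it can be derived from $PBS_NODEFILE inside the job script. The following is a minimal sketch under that assumption; make_mpd_hosts is a made-up helper name, and the temporary file location is simply whatever mktemp picks.

```shell
#!/bin/sh
# make_mpd_hosts SRC DEST: copy SRC to DEST keeping only the first
# occurrence of each node name, so the original node order is preserved.
make_mpd_hosts() {
    awk '!seen[$0]++' "$1" > "$2"
}

# Only meaningful inside a batch job, where the scheduler sets PBS_NODEFILE.
if [ -n "${PBS_NODEFILE:-}" ]; then
    NODE_FILE=$(mktemp)
    make_mpd_hosts "$PBS_NODEFILE" "$NODE_FILE"
    NUM_NODES=$(wc -l < "$NODE_FILE" | tr -d ' ')       # one line per node
    NUM_CORES=$(wc -l < "$PBS_NODEFILE" | tr -d ' ')    # one line per core

    mpdboot -n "$NUM_NODES" -f "$NODE_FILE"
    mpiexec -n "$NUM_CORES" YOUR_PROGRAM
    mpdallexit
    rm -f "$NODE_FILE"
fi
```

For a job with 2 nodes and 4 cores on each, this gives NUM_NODES=2 and NUM_CORES=8.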

NOTE: Your code must be MPI aware for the above to work. Running a non-MPI code with mpiexec might succeed, but you will have NUM_CORES processes running the exact same calculations, duplicating each other's work, and wasting resources.

For more information, see the examples.

The above will work as long as you do not run more than one MPI job on the same node at the same time; since most MPI jobs use all the cores on a node anyway, it is fine for most people. If you do run into the situation where multiple MPI jobs are sharing nodes, then when the first job calls mpdallexit, all the mpds for all jobs will be killed, which will make the second and later jobs unhappy. In these cases, you will want to set the environment variable MPD_CON_EXT to something unique (e.g. the job id) before calling mpdboot, and add the --remcons option to mpdboot, e.g.

mpdboot -n NUM_NODES -f NODE_FILE --remcons
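In a Bourne-type job script, setting the variable could look like the sketch below. PBS_JOBID is the job id variable set by the batch system; the fallback value used when it is unset is an assumption added so the snippet can also be tried outside a job.

```shell
#!/bin/sh
# Give this job's mpd daemons their own console name so that another
# job's mpdallexit on a shared node does not take them down.
# The "manual.<pid>" fallback is an assumption for use outside a batch job.
MPD_CON_EXT="${PBS_JOBID:-manual.$$}"
export MPD_CON_EXT

# The MPICH commands then follow as before, with --remcons added:
#   mpdboot -n NUM_NODES -f NODE_FILE --remcons
#   mpiexec -n NUM_CORES YOUR_PROGRAM
#   mpdallexit
```

The variable has to be exported before mpdboot runs, since the later mpiexec and mpdallexit calls in the same script rely on the same value.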

Running Non-MPI jobs on multiple nodes

MPI is currently the most common way of launching, controlling, synchronizing, and communicating across multi-node jobs, but it is not the only way. Some applications have their own process for running across multiple nodes, and in such cases you should follow their instructions.

The examples page shows an example of using the basic ssh command to start a process on each of the nodes assigned to your job. Something like this could be used to break a problem into N chunks that can be processed independently, and send each chunk to a different core/node. However, most real parallel jobs require much more than just launching the code: the passing of data back and forth, synchronization, etc. And for a simple job as described, it is often better to submit a separate job to the batch system for each chunk.
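The chunk-per-core idea can be sketched roughly as follows. This is not the example from the examples page: run_chunks is a made-up helper, it assumes ssh between the assigned nodes works without a password prompt, and it assumes the command accepts a chunk index as its last argument.

```shell
#!/bin/sh
# run_chunks NODEFILE CMD [ARGS...]: start CMD once per line of NODEFILE
# (i.e. once per assigned core) on the node named on that line, passing a
# chunk index 0, 1, 2, ... as the last argument, then wait for all of them.
run_chunks() {
    nodefile=$1
    shift
    i=0
    while read -r node; do
        # -n keeps ssh from reading stdin, which would swallow the node list
        ssh -n "$node" "$@" "$i" &
        i=$((i + 1))
    done < "$nodefile"
    wait    # block until every remote process has finished
}

# Usage inside a job script (process_chunk is a hypothetical program):
#   run_chunks "$PBS_NODEFILE" /path/to/process_chunk
```

The remote processes share nothing and do not communicate, which is why this only suits problems whose chunks are fully independent.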