Job Submission Examples

When submitting a job with sbatch, if you request more than one node, you'll need to know how to get your job to use all of the nodes that you've been assigned. The scheduler runs your script on the first node in the list; it's up to you to decide what to do with the remaining nodes. Slurm records the assigned nodes in abbreviated form in the variable $SLURM_JOB_NODELIST; the examples below use the PBS compatibility wrapper generate_pbs_nodefile to expand that into a PBS-style nodefile, i.e. a file listing every node you've been assigned, one line per assigned core.
  1. Submitting an MPI Job Using OpenMPI
  2. Submitting an MPI Job Using Intel MPI
  3. Submitting an MPI Job Using LAM
  4. Submitting an MPI Job Using MPICH
  5. Submitting a Non-MPI job
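
As a quick illustration of the PBS-style nodefile format that several examples below rely on, here is a sketch. The node names and file contents are fabricated purely for illustration; inside a real job the file would come from the scheduler.

```shell
#!/bin/sh
# Sketch: inspecting a PBS-style nodefile (one line per assigned core).
# This fabricates a nodefile; in a real job you would get one from
# the scheduler's compatibility wrapper instead.
NODEFILE=$(mktemp)
cat > "$NODEFILE" <<'EOF'
compute-0-1
compute-0-1
compute-0-2
compute-0-2
compute-0-2
EOF

echo "Unique nodes:"
sort -u "$NODEFILE"

echo "Cores per node:"
sort "$NODEFILE" | uniq -c

rm -f "$NODEFILE"
```

The sort/uniq idiom shown here is the same one the LAM and MPICH examples below use to convert the nodefile into each library's preferred host list format.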

Submitting an MPI Job Using OpenMPI

OpenMPI is the preferred MPI implementation unless your application specifically requires one of the alternate MPI variants. Slurm and OpenMPI interact well together, which makes OpenMPI easy to use. OpenMPI is also compiled with support for all of the various interconnect hardware, so on nodes with a fast transport (e.g. InfiniBand) the fastest interface will be selected automatically.

The following example will run the MPI executable alltoall on a total of 40 cores. For further information on the module load command check out the section Setting Up Your Environment.

#!/bin/tcsh
#SBATCH --ntasks=40
#SBATCH -t 00:01:00
#SBATCH --mem-per-cpu=2048
#SBATCH --exclusive

module unload intel
module load openmpi/gnu

mpirun alltoall

The above is for cshell style shells. The bourne style shell version is similar:

#!/bin/bash
#SBATCH --ntasks=40
#SBATCH -t 00:01:00
#SBATCH --mem-per-cpu=2048
#SBATCH --exclusive

. ~/.profile

module unload intel
module load openmpi/gnu

mpirun alltoall
WARNING
Note the addition of the . ~/.profile line. This is necessary if your default shell is not bash, as otherwise the dot files (and the definitions of the module and tap commands) will NOT get loaded.
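
If you are unsure whether the dot files have been read, a defensive sketch like the following can be used near the top of a script. It assumes (as on this cluster) that ~/.profile is what defines the module and tap commands; the function name is our own invention.

```shell
#!/bin/sh
# Sketch: source ~/.profile only when it is needed and available.
# Assumption: ~/.profile is what defines the module/tap commands here.
ensure_profile() {
    if ! command -v module >/dev/null 2>&1 && [ -r "$HOME/.profile" ]; then
        . "$HOME/.profile"
    fi
}
ensure_profile
```

This is harmless when the dot files were already read, and avoids an error when ~/.profile does not exist.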

Submitting an MPI Job Using Intel MPI

The Intel MPI libraries are available if you compiled your code with the Intel compilers. Slurm and the Intel MPI libraries interact well together, which makes them easy to use.

The following example will run the MPI executable alltoall on a total of 40 cores. For further information on the module load command check out the section Setting Up Your Environment.

#!/bin/tcsh
#SBATCH --ntasks=40
#SBATCH -t 00:01:00
#SBATCH --mem-per-cpu=2048
#SBATCH --exclusive

module load intel

mpirun alltoall

The above is for cshell style shells. The bourne style shell version is similar:

#!/bin/bash
#SBATCH --ntasks=40
#SBATCH -t 00:01:00
#SBATCH --mem-per-cpu=2048
#SBATCH --exclusive

. ~/.profile

module load intel

mpirun alltoall
WARNING
Note the addition of the . ~/.profile line. This is necessary if your default shell is not bash, as otherwise the dot files (and the definitions of the module and tap commands) will NOT get loaded.

Submitting an MPI Job Using LAM

WARNING
Use of the LAM MPI libraries is no longer supported on the Deepthought HPC clusters. Please use either the latest OpenMPI or Intel MPI libraries instead.
WARNING
The LAM MPI library function which parses the host string from Slurm appears to be broken. As the LAM MPI libraries are no longer maintained by the authors, it cannot be fixed by upgrading. The following provides a workaround, but it is STRONGLY recommended that you move to another MPI library.

The following example will run the MPI executable alltoall on a total of 40 cores. For further information on the tap command check out the section Setting Up Your Environment.

#!/bin/tcsh
#SBATCH -t 00:01:00
#SBATCH --ntasks=40
#SBATCH --mem-per-cpu=2048
#SBATCH --exclusive

#Generate a PBS_NODEFILE format nodefile
set PBS_NODEFILE=`/usr/local/slurm/bin/generate_pbs_nodefile`
#and convert it to LAM's desired format
set MPI_NODEFILE=$WORKDIR/mpd_nodes.${SLURM_JOBID}
sort $PBS_NODEFILE | uniq -c | awk '{ printf("%s cpu=%s\n", $2, $1); }' > $MPI_NODEFILE

tap lam-gnu
lamboot $MPI_NODEFILE
mpirun -np $SLURM_NTASKS  C alltoall
lamclean
lamhalt

Submitting an MPI Job Using MPICH

WARNING
Use of the MPICH MPI libraries is no longer supported on the Deepthought HPC clusters. Please use either the latest OpenMPI or Intel MPI libraries instead.

The following example will run the MPI executable alltoall on a total of 40 cores. For further information on the tap command check out the section Setting Up Your Environment.

Note also that if you've never run MPICH before, you'll need to create the file .mpd.conf in your home directory. This file should contain at least a line of the form MPD_SECRETWORD=we23jfn82933 (do NOT use this example value; make up your own secret word) and should be readable only by you (mode 600), as the mpd daemons refuse to start otherwise.
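
This one-time setup can be sketched as follows; the secret word written below is a placeholder that you should replace with your own.

```shell
#!/bin/sh
# Sketch: one-time creation of ~/.mpd.conf for MPICH's mpd daemons.
# The secret word is a placeholder -- substitute your own.
CONF="$HOME/.mpd.conf"
if [ ! -f "$CONF" ]; then
    echo "MPD_SECRETWORD=replace-with-your-own-secret" > "$CONF"
    # mpd insists the file not be readable by anyone else
    chmod 600 "$CONF"
fi
```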

#!/bin/tcsh
#SBATCH -t 1:00
#SBATCH --ntasks=40
#SBATCH --mem-per-cpu=2048
#SBATCH --exclusive

tap mpich-gnu

#Generate a PBS_NODEFILE format nodefile
set PBS_NODEFILE=`/usr/local/slurm/bin/generate_pbs_nodefile`
#and convert it to MPICH's desired format
set MPI_NODEFILE=/tmp/mpd_nodes.${SLURM_JOBID}
sort $PBS_NODEFILE | uniq -c | awk '{ printf("%s:%s\n", $2, $1); }' > $MPI_NODEFILE

mpdboot -n $SLURM_JOB_NUM_NODES -f $MPI_NODEFILE
mpiexec -n $SLURM_NTASKS alltoall
mpdallexit

The above assumes a csh-like shell. For bourne shell/bash users, the equivalent script would be

#!/bin/bash
#SBATCH -t 1:00
#SBATCH --ntasks=40
#SBATCH --mem-per-cpu=2048
#SBATCH --exclusive

. ~/.profile
SHELL=bash

tap mpich-gnu

#Generate a PBS_NODEFILE format nodefile
PBS_NODEFILE=`/usr/local/slurm/bin/generate_pbs_nodefile`
#and convert it to MPICH's desired format
MPI_NODEFILE=/tmp/mpd_nodes.${SLURM_JOBID}
sort $PBS_NODEFILE | uniq -c | awk '{ printf("%s:%s\n", $2, $1); }' > $MPI_NODEFILE

mpdboot -n $SLURM_JOB_NUM_NODES -f $MPI_NODEFILE
mpiexec -n $SLURM_NTASKS alltoall
mpdallexit
WARNING
Note the addition of the . ~/.profile line. This is necessary if your default shell is not bash, as otherwise the dot files (and the definitions of the module and tap commands) will NOT get loaded.

Submitting a Non-MPI job

The following example will run a command on each of the nodes in the assigned list. It uses ssh to communicate between nodes. If your shell is csh/tcsh, use this:

#!/bin/tcsh
#SBATCH --ntasks=40
#SBATCH -t 00:01:00
#SBATCH --mem-per-cpu=2048
#SBATCH --exclusive

set COMMAND=/bin/hostname

#We want a list of nodes, one per line.
#Use the PBS compatibility wrapper to make a PBS style nodefile
set PBS_NODEFILE=`/usr/local/slurm/bin/generate_pbs_nodefile`

foreach node (`cat $PBS_NODEFILE`)
   ssh $node $COMMAND &
end
wait

Here we "cheated" and used the Slurm PBS compatibility wrapper script to convert Slurm's abbreviated list of nodes into a PBS-like nodes file, which we then use to launch our ssh tasks.

Note the use of the ampersand & in the ssh, and the wait command at the end of the loop. The ampersand causes the processes to run in parallel (otherwise each invocation of $COMMAND via ssh would need to complete before the next one starts). The wait command is necessary to prevent the main script from exiting before all of the spawned ssh processes have completed.
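
The effect of the ampersand and the wait can be demonstrated in isolation with a sketch like this, using sleep as a stand-in for the real tasks:

```shell
#!/bin/sh
# Sketch: why the & and the wait matter. Three one-second tasks run
# concurrently, so total elapsed time is about one second, not three.
start=$(date +%s)
for n in 1 2 3; do
    (sleep 1; echo "task $n done") &
done
wait   # block until every background task has finished
end=$(date +%s)
echo "elapsed: $((end - start))s"
```

Omitting the & would make the loop serial (about three seconds here); omitting the wait would let the script exit while tasks are still running.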

The above example has issues with accurately reporting the exit code of each of the spawned commands. This could be implemented in the bash version below, but would significantly complicate the script; and it does not appear to be possible at all given the limitations of the wait command in the C-shell variants.
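
A minimal sketch of collecting per-task exit codes in a bourne-style shell follows, using /bin/true and /bin/false as stand-ins for the real ssh commands. Given a process ID, wait returns that child's exit status, so we record each PID and wait on them individually.

```shell
#!/bin/sh
# Sketch: collecting per-task exit codes with "wait PID".
# /bin/true and /bin/false stand in for the real ssh commands.
pids=""
for cmd in /bin/true /bin/false /bin/true; do
    "$cmd" &
    pids="$pids $!"
done

failures=0
for pid in $pids; do
    # wait PID returns the exit status of that child
    wait "$pid" || failures=$((failures + 1))
done
echo "failed tasks: $failures"
```

This reports how many tasks failed; a real script might instead record which node's command failed.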

And if you prefer bash, use this:

#!/bin/bash
#SBATCH --ntasks=40
#SBATCH -t 00:01:00
#SBATCH --mem-per-cpu=2048
#SBATCH --exclusive

COMMAND=/bin/hostname

#We want a list of nodes, one per line.
#Use the PBS compatibility wrapper to make a PBS style nodefile
PBS_NODEFILE=`/usr/local/slurm/bin/generate_pbs_nodefile`

for node in `cat $PBS_NODEFILE`; do
   ssh $node $COMMAND &
done
wait

WARNING
Unlike the earlier bash examples, this script does not source ~/.profile, since it does not use the module or tap commands. If your own script does use them and your default shell is not bash, add a . ~/.profile line near the top, as otherwise the dot files (and the definitions of the module and tap commands) will NOT get loaded.

The above examples are general enough to handle tasks running on different nodes. If you know (because of the number of cores requested relative to the smallest number of cores available on a node, or because of the way you requested the cores) that all the cores will be on the same node, you can forgo the ssh part and just have the main script invoke the command on the current node. E.g., for csh,

#!/bin/tcsh
#SBATCH --ntasks=8
#SBATCH -t 00:01:00
#SBATCH --mem-per-cpu=2048
#SBATCH --share

set COMMAND=/bin/hostname

#We want a list of nodes, one per line.
#Use the PBS compatibility wrapper to make a PBS style nodefile
set PBS_NODEFILE=`/usr/local/slurm/bin/generate_pbs_nodefile`

foreach node (`cat $PBS_NODEFILE`)
   #In this case, it is assumed we *know* that all the assigned
   #cores are on the same node.
   $COMMAND &
end
wait

If you have any doubts, however, the general, multinode capable version is better. It will handle the case when all cores are on the same node, or when they are divided across multiple nodes, and the penalty for the extra ssh is usually negligible.

NOTE: all of the above are simplistic cases for example purposes. Your code still needs to somehow implement communication between the tasks, which is the main raison d'être for the MPI standard. If your code does not need communication between the tasks, then it is by definition embarrassingly parallel and should be submitted as N distinct jobs rather than a single job with N tasks.
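
Submitting N distinct jobs can be sketched as a simple loop from the login node. The job script name myjob.sh is hypothetical, and SUBMIT defaults to a dry run that just prints the commands; set SUBMIT=sbatch to actually submit on the cluster.

```shell
#!/bin/sh
# Sketch: submitting N independent tasks as N distinct jobs.
# "myjob.sh" is a hypothetical job script. SUBMIT defaults to a dry
# run (prints the commands); set SUBMIT=sbatch to really submit.
SUBMIT=${SUBMIT:-echo sbatch}
for i in 1 2 3 4; do
    $SUBMIT --ntasks=1 --export=ALL,TASK_ID=$i myjob.sh
done
```

Recent Slurm versions also provide job arrays (sbatch --array), which handle this pattern more cleanly than a submission loop.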