Submitting Jobs

  1. Basic Job Submission
  2. Your Job Script
  3. Choosing a Queue
  4. Specifying how long the job will run
  5. Specifying node and core requirements
  6. Specifying memory requirements
  7. Requesting nodes with specific features
  8. Using infiniband
  9. Using GPUs
  10. Specifying the amount/type of scratch space needed
  11. Specifying the account to be charged
  12. Specifying email options
  13. Specifying output options
  14. Specifying which shell to run the job in
  15. Specifying the directory to run the job in
  16. Specifying whether or not other jobs can be on the same node
  17. Specifying a reservation

Basic Job Submission

The Deepthought HPC clusters use a batch scheduling system called Slurm to handle the queuing, scheduling, and execution of jobs. This scheduler is used on many recent HPC clusters throughout the world. This page discusses the Slurm commands for submitting jobs and how to specify job requirements. For users familiar with PBS/Torque, Maui/Torque, or Moab/Torque based clusters, we have a document which translates commonly used commands from those scheduler systems into their Slurm equivalents.

Users generally submit jobs by writing a job script file and submitting the job to Slurm with the sbatch command. The sbatch command takes a number of options (some of which can be omitted or defaulted). These options define various requirements of the job, which the scheduler uses to figure out what is needed to run your job and to schedule it to run as soon as possible, subject to the constraints of the system, usage policies, and the demands of the other users of the cluster. It is also possible to submit an interactive job, but that is usually most useful for debugging purposes.

The options to sbatch can be given on the command line, or in most cases inside the job script. When given inside the job script, the option is placed alone on a line starting with #SBATCH (there must be a space between #SBATCH and the option). These #SBATCH lines SHOULD come before any non-comment/non-blank line in the script --- any #SBATCH lines AFTER a non-comment/non-blank line in the script might get ignored by the scheduler. See the examples page for examples. The # at the start of these lines means they will be ignored by the shell; i.e. only the sbatch command will read them.
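
For example, a 30 minute walltime (the time syntax is described below) can be requested either on the command line, as in the first form below, or inside the (hypothetical) script myjob.sh itself, as in the second; sbatch reads the #SBATCH line while the shell treats it as a comment:

sbatch -t 30 myjob.sh

or, equivalently, inside myjob.sh:

#!/bin/bash
#SBATCH -t 30

hostname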

Your Job Script

The most basic parameter given to the sbatch command is the script to run. This obviously must be given on the command line, not inside the script file. The job script must start with a shebang line which specifies the shell under which the script is to run. I.e., the very first line of your script should generally be either

#!/bin/tcsh
or
#!/bin/bash
for the tcsh or bash shell, respectively. This must be the first line, with no space before the name of the shell. This line is typically followed by a number of #SBATCH lines specifying the job requirements (these are discussed below), and then the actual commands that you wish to have executed when the job is started on the compute nodes.

There are many options you can give to sbatch either on the command line or using #SBATCH lines within your script file. Other parts of this page discuss the more common ones. It is strongly recommended that you include, at a minimum, directives specifying the walltime and the node/core/memory requirements of your job (see the corresponding sections below).

NOTE: The #SBATCH lines should come BEFORE any non-blank/non-comment lines in your script file. Any #SBATCH lines which come after non-blank/non-comment lines might get ignored by the scheduler.

If your default shell is tcsh (on the Deepthought clusters, that is your default shell unless you explicitly changed it) and you are submitting a bash job script (i.e., the first line of your job script is #!/bin/bash), it is strongly recommended on the Deepthought clusters that the first command after the #SBATCH lines is

. ~/.profile

This will properly set up your environment on these clusters, including defining the module command. It is also recommended that after that line, you include module load commands to set up your environment for any software packages you wish to use.
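
Putting these pieces together, a minimal sketch of a bash job script for the Deepthought clusters might look like the following (the walltime, the module name gcc, and my_program are placeholders for whatever you actually need):

#!/bin/bash
#SBATCH -t 1:00:00
#SBATCH --ntasks=1

. ~/.profile
module load gcc

./my_program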

The remainder of the file should be the commands to run to do the calculations you want. After you submit the job, the job will wait for a while in the queue until resources are available. Once resources are available, the scheduler will run this script on the first node assigned to your job. If your job involves multiple nodes (or even multiple cores on the same node), it is this script's responsibility to launch all the tasks for the job. When the script exits, the job is considered to have finished. More information can be found in the section on Running parallel codes and in the examples section.

WARNING
Do NOT run jobs from your home directory; the home directories are not optimized for intensive I/O. You have a lustre directory at /export/lustre_1/USERNAME on the original Deepthought or /lustre/USERNAME on Deepthought2. Use that or a /data/... directory instead.

Note: If your script does not end with a proper Unix end-of-line (EOL) character, the last line of your script will usually be ignored by the shell when it is run. This often happens when files are transferred from Windows (which uses different EOL characters) to Unix, and it can be quite confusing: you submit your job, it runs and finishes almost immediately, and there is seemingly no output, because the last line of your job script, which got ignored, is the command that actually does the calculation. Although you can use a command like dos2unix to fix the script file, it is usually easiest to just remember to add a couple of blank lines to the end of your file. This does not actually fix the problem, as the script still does not end with a proper Unix EOL, but the line that gets ignored is now blank, so it does not matter.
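
If you prefer to fix the file properly, and assuming the dos2unix utility is available on the login nodes, you can convert the (hypothetical) script myjob.sh in place before resubmitting:

dos2unix myjob.sh
sbatch myjob.sh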

Choosing a Queue/Partition

On the Deepthought clusters, you generally should not be specifying a partition. The only time you should be specifying a partition on these clusters is if you want to run your job:

  1. in the debug partition. This partition is intended for quick turnaround of small debugging jobs, but has a 15 minute maximum walltime and so is not suitable for production runs.
  2. in the scavenger partition. This is an ultra-low priority, preemptible partition, described in more detail below.

In all other cases on the Deepthought clusters, the scheduler will automatically place your job in the correct partition, usually based on the allocation account you are charging against.

The debug partition on the Deepthought clusters is for short, debugging jobs. It is intended to allow quick turn-around for the debugging process, but not for running production jobs. As such, it has a severely limited run-time limit (15 minutes).

The scavenger partition does not charge against your allocation, but it is very low priority (all other jobs will cut in front of it in the pending queue), and even once your job starts, any job not also in this low-priority partition can preempt it, knocking it off a node after it has started running. As such, scavenger jobs need to do some form of checkpointing in order to make progress in the snatches of CPU time they are allocated. If you do not know what checkpointing is or how to do it, this partition is NOT for you. Since no allocation is charged, jobs in this partition do not get a priority increase if you charge against your high-priority allocation.

To specify either of the above partitions, you just give the sbatch command the --partition=PART argument, or equivalently the -p PART argument, replacing PART with the name of the partition. E.g., to submit a job to the debug partition, you could add to your sbatch command the arguments -p debug. Similarly, to submit a job to the scavenger partition, you could add -p scavenger to the sbatch. In either case, you can either append the partition flag to the end of the command line, or add it near the top of your job script with a #SBATCH prefix, e.g. for the debug partition

#SBATCH -p debug

On the MARCC/Bluecrab cluster, you generally will need to specify a partition, since on that cluster partitions are used to classify job requirements. A complete list of partitions on Bluecrab can be found on the MARCC/Bluecrab website.

In general, on MARCC/Bluecrab, jobs requiring more than one node (24 cores on normal nodes, 48 cores on large memory nodes) should be submitted to the parallel partition. Jobs requiring the large memory nodes (1024 GB) should be submitted to the lrgmem partition, and jobs requiring GPUs to the gpu partition. Jobs requiring one node (or a fraction thereof) should generally be submitted to the shared partition.

Note that jobs submitted to the parallel or gpu partitions on the MARCC/Bluecrab cluster will be forced into --exclusive mode (even if you explicitly specify --share, this will be overridden). Jobs submitted to the other partitions will default to --share mode (although you can override that if you really want to). Most partitions are limited to one week of walltime.

The MARCC/Bluecrab cluster also provides a preemptible scavenger partition.

To specify any of the MARCC/Bluecrab partitions, you should either give a -p PART option in the sbatch command line, or include an

#SBATCH -p PART
in your job script, where PART is the name of the partition.
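
For example, a hypothetical job using 8 cores on a single Bluecrab node might be submitted to the shared partition with something like the following (the walltime and my_program are placeholders):

#!/bin/bash
#SBATCH -p shared
#SBATCH --ntasks=8
#SBATCH -t 2:00:00

./my_program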

Specifying the Amount of Time Your Job Will Run

When submitting a job, it is very important to specify the amount of time you expect your job to take. If you specify a time that is too short, your job will be terminated by the scheduler before it completes. So you should always add a buffer to account for variability in run times; you do not want your job to be killed when it is 99.9% complete. However, if you specify a time that is too long, you may run the risk of having your job sit in the queue for longer than it should, as the scheduler attempts to find available resources on which to run your job. See the section on job scheduling for more information on the scheduling process and advice regarding the setting of walltime limits. See the section on Quality of Service levels for more information on the walltime limits on the Deepthought clusters.

In general, on the Deepthought clusters, all users can run jobs up to 3 days in length, and members of contributing units can run jobs up to 14 days in length. On the MARCC/Bluecrab cluster, all users can run jobs up to a week in length.

To specify your estimated runtime, use the --time=TIME or -t TIME parameter to sbatch. The value TIME can be in any of the following formats:

  • M (M minutes)
  • M:S (M minutes, S seconds)
  • H:M:S (H hours, M minutes, S seconds)
  • D-H (D days, H hours)
  • D-H:M (D days, H hours, M minutes)
  • D-H:M:S (D days, H hours, M minutes, S seconds)

WARNING
NOTE: If you do not specify a walltime, the default walltime on the Deepthought HPC clusters is 15 minutes. I.e., your job will be killed after 15 minutes. Since that is not likely to be sufficient for it to complete, specify a reasonable walltime. This greatly aids the scheduler in making the best utilization of resources.

The following example specifies a walltime of 60 seconds, which should be more than enough for the job to complete.

#SBATCH -n 1
#SBATCH -t 0:60

hostname
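
Similarly, the following sketch requests a walltime of 2 days and 12 hours using the D-H:M format (my_long_program is a placeholder for your actual command):

#SBATCH -n 1
#SBATCH -t 2-12:00

./my_long_program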

Specifying Node and Core Requirements

Slurm provides many options for specifying your node and core requirements, and we only cover the basics here. More details can be found at the official Slurm site. Also see the man pages for sbatch (i.e. man sbatch).

Normally, jobs consist of a specified number of "tasks", and you want each task to run on its own CPU core to maximize performance. In this case, you can simply give the --ntasks=NUMTASKS or -n NUMTASKS argument to sbatch. Note that your job might be put onto one node if NUMTASKS is small enough, or split across multiple nodes.

If you know you'll need 12 cores, but don't care how they're distributed, try the following:

#SBATCH --ntasks=12

myjob

The above might allocate a single 12 or more core node for your job, or allocate your job three 4-core nodes, or an 8-core node and a 4-core node, or even two 8-core nodes.

If you are concerned about how your cores are allocated, you can also give the --nodes=NUMNODESDESC or -N NUMNODESDESC argument. NUMNODESDESC can be of the form MINNODES or MINNODES-MAXNODES; in the former case, MAXNODES is set to the same value as MINNODES. The scheduler will attempt to allocate between MINNODES and MAXNODES (inclusive) nodes to your job. So for the above example (--ntasks=12), we might have

  • all cores assigned on the same node if -N 1 is given.
  • the cores split among two nodes if -N 2 is given. You might get an even split, 6 cores on each node, or an asymmetric split, e.g. 4 on one node and 8 on the other. But you will get two distinct nodes.
  • either of the two above cases if -N 1-2 is given.

If you only specify the number of nodes (i.e. only the -N parameter), you will be assigned (and charged for) all cores on the assigned nodes.

In general, for distributed memory (e.g. MPI) jobs, we recommend that most users just specify the --ntasks or -n parameter and let Slurm figure out how best to divide the cores among the nodes, unless you have specific requirements. Of course, for shared memory (e.g. OpenMP or multithreaded) jobs, you need to give --nodes=1 to ensure that all of the cores assigned to you are on the same node.
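
For example, a minimal sketch of a shared-memory (e.g. OpenMP) job that needs 8 cores all on one node (my_openmp_program and the walltime are placeholders; see also the warning below about the --share flag):

#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks=8
#SBATCH -t 4:00:00

. ~/.profile
# OpenMP programs typically read OMP_NUM_THREADS to decide how many threads to start
OMP_NUM_THREADS=8
export OMP_NUM_THREADS
./my_openmp_program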

WARNING
If you are requesting more than one core but fewer than all the cores on a node on the Deepthought clusters, you should consider using the --share flag. The default --exclusive flag will result in your account being charged for all cores on the node, whether you use them or not.

On the MARCC/Bluecrab cluster, if you are requesting more than one node, you must use the parallel partition.

Slurm's sbatch command has a large number of other options allowing you to specify node and CPU requirements for a wide variety of cases; the above is just the basics. More detail can be found reading the man page, e.g. man sbatch

Specifying Memory Requirements

If you want to request a specific amount of memory for your job, try something like the following:

#!/bin/sh
#SBATCH -N 2
#SBATCH --mem=1024

myjob

This example requests two nodes, each with at least 1 GB (1024 MB) of memory. Note that the --mem parameter specifies the memory on a per node basis.

If you want to request a specific amount of memory on a per-core basis, use the following:

#!/bin/sh
#SBATCH --ntasks=8
#SBATCH --mem-per-cpu=1024

myjob

This requests 8 cores, with at least 1 GB (1024 MB) per core.

NOTE: for both --mem and --mem-per-cpu, the specified memory size must be in MB.

You should also note that the memory reported for node selection does not include memory used by the operating system, etc. So a node which nominally has 8 GB of RAM might only show 7995 MB available; if your job specified a requirement of 8192 MB, it would not be able to use that node. A bit of care should therefore be used in choosing memory requirements; going a little bit under multiples of GBs may be advisable.

On the MARCC/Bluecrab cluster, jobs requiring more than 128 GB/node (or about 5.3 GB/core if using all cores on the node), need to be submitted to the lrgmem partition. These jobs are restricted to a single node.
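
For example, a hypothetical Bluecrab job needing a large amount of memory on a single node might look like the following (the core count, memory, walltime, and my_bigmem_program are illustrative placeholders):

#!/bin/bash
#SBATCH -p lrgmem
#SBATCH -N 1
#SBATCH --ntasks=16
#SBATCH --mem=524288
#SBATCH -t 12:00:00

./my_bigmem_program

This requests one large-memory node with 512 GB (524288 MB) of memory.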

Requesting Nodes with Specific features or resources

Sometimes your job requires nodes with specific features or resources. E.g., some jobs require the higher interconnect speeds afforded by infiniband, or your job may make use of GPUs for processing. Such requirements need to be communicated to the scheduler to ensure you are assigned appropriate nodes.

In Slurm, we break these situations into two cases:

  • features: This refers to something which can be present or not on a system, and if it is present, it is available to all processes on the system. (Obviously, if it is not present, it is not available to any processes on the system.) It is a simple boolean: present or not present. E.g., the presence of an infiniband adapter, or whether the processors on the system support the SSSE3 instruction set.
  • resource: This refers to something which not only is present (or not), but has an amount attached to it. Unlike features, resources have a quantity, both in terms of what is present on the node and in terms of what is being consumed by jobs running on the node. E.g., a system can have 0, 1, or 2 GPUs; in addition, a job running on a 2 GPU system might consume 0, 1, or 2 of the GPUs.

You can see which nodes support which features and resources with the sinfo command. By default, this information is not shown. Features can be shown by using the sinfo --Node --long options; resources require additional fields to be specified in the --format option. To see both, one can use:

login-1> sinfo --format="%N  %.5D  %11T %.4c %.8z %.6m %.8d  %12f %12G"
NODELIST NODES STATE       CPUS    S:C:T MEMORY TMP_DISK FEATURES     GRES        
compute-f09-2     1 drained*      16    2:8:1  64000   190000 sb           (null)      
compute-f10-23     1 drained*      12    2:6:1  48000   100000 (null)       (null)      
compute-f10-34     1 drained*      12    2:6:1  48000   100000 ssd          (null)      
compute-f09-[6-7]     2 down*         16    2:8:1  64000   190000 sb           (null)      
compute-f18-[0-3,18,26]     6 down*          8    2:4:1  32000   190000 mhz2333      (null)      
compute-g19-[24-27]     4 drained       12    2:6:1  48000   190000 qib          gpu:1       
compute-f09-[0-1,3-5,8-17]    15 allocated     16    2:8:1 48000+   190000 sb           (null)      
compute-f10-[0-6,16-22,24-31]    22 allocated     8+   2:4+:1 24000+   100000 (null)       (null)      
compute-f10-[32-33,35-46]    14 allocated     12    2:6:1  48000   100000 ssd          (null)      
compute-f15-7     1 allocated      8    2:4:1   7900   100000 mhz2333      (null)      
compute-f17-[8,15]     2 allocated      4    2:2:1   7900    35000 mhz2667      (null)      
compute-g17-[0-31],compute-g18-[0-23]    56 allocated     12    2:6:1  24000   100000 qib          (null)      
compute-g17-[32-39]     8 allocated     16    2:8:1  64000   420000 sb,qib,fib   (null)      
compute-g18-[24-27]     4 allocated     16    2:8:1  64000   190000 sb,qib       (null)      
compute-g19-[1-5,7]     6 allocated      8    2:4:1   7900    35000 ib,mhz2333   (null)      
compute-f10-[7-15]     9 idle           8    2:4:1  24000   100000 (null)       (null)      
compute-f15-[0-6,8-12,15,18-39],compute-f16-[34-39],compute-f17-[34-39],compute-f18-[27-32],compute-f20-[0-3,27-30]    61 idle          4+   2:2+:1   7900   35000+ mhz2333      (null)      
compute-f16-[0-33],compute-f17-[0-7,9-14,16-33],compute-f18-[4-10,12-17,19-25]    86 idle           4    2:2:1   7900    1000+ mhz2667      (null)      
compute-f19-[1-39]    39 idle           4    2:2:1   3900    35000 dell1950,mhz (null)      
compute-g19-[8-23],compute-g20-[1-8,10-29]    44 idle           8    2:4:1   7900    1000+ ib,mhz2333   (null)      

To request a specific feature, use the --constraint option to sbatch. In its simplest form, you just give --constraint=TAG, where TAG is the name of the feature you are requesting. E.g., to request a node with a SandyBridge processor (sb feature), you would use something like:

#!/bin/tcsh
#SBATCH -t 15:00
#SBATCH --ntasks=8
#SBATCH --constraint="sb"

module load openmpi/gnu
mpirun mycode

The --constraint option can get rather more complicated, as Slurm allows multiple constraints to be given, with the constraints either ANDed or ORed. You can request that only a subset (e.g. 2 out of 4) of the nodes have the constraint, or that either of two features is acceptable but all nodes assigned must have the same feature. If you need that level of complexity, please see the man page for sbatch (man sbatch).
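
As a simple illustration, features can be ANDed with & (and ORed with |, as discussed in the infiniband section below). For example, to require nodes on the original Deepthought cluster that have both SandyBridge processors and QDR infiniband (mycode is the same placeholder used in the example above):

#!/bin/tcsh
#SBATCH -t 15:00
#SBATCH --ntasks=8
#SBATCH --constraint="sb&qib"

module load openmpi/gnu
mpirun mycode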

Resources are requested with the --gres option to sbatch. The usage is --gres=RESOURCE_LIST, where RESOURCE_LIST is a comma-delimited list of resource names, each optionally followed by a colon and a count. The specified resources are required on each node assigned to the job. E.g., to request 3 nodes with 2 GPUs on each node (for 6 GPUs total) on the Deepthought2 cluster, one would use something like:

#!/bin/tcsh
#SBATCH -t 15:00
#SBATCH -N 3
#SBATCH --gres=gpu:2

cd /lustre/payerle
./run_my_gpu_code

To get a list of the resources defined on a cluster, you can use the command sbatch --gres=help temp.sh. NOTE: temp.sh must be an existing submit script, as basic validation of the script occurs before the --gres=help option is evaluated. When --gres=help is given, the script will not actually be submitted.

On the Division of IT maintained clusters, the following resources and features are available:

Features

Name      DT?  DT2?  Description                    Comments
ib        Y    N     Node has DDR infiniband
qib       Y    N     Node has QDR infiniband
fib       Y    N     Node has FDR infiniband        Currently, all DT2 nodes have FDR, so feature is not needed
dell1950  Y    N     Node is Dell Poweredge 1950    Deprecated
mhz2000   Y    N     Node has 2 GHz CPUs            Deprecated
mhz2333   Y    N     Node has 2.3 GHz CPUs          Deprecated
mhz2667   Y    N     Node has 2.7 GHz CPUs          Deprecated
sb        Y    N     Node has SandyBridge CPUs      Deprecated
ssd       Y    N     Node has SSD scratch space

Resources

Name      DT?  DT2?  Description                    Comments
gpu       Y    Y     Node has GPUs

In the above table, the DT? and DT2? columns indicate whether the feature/resource is available on the Deepthought and Deepthought2 clusters, respectively.

The sections below on using infiniband and using GPUs provide more information about those specific features and resources.

Using InfiniBand Nodes

All nodes on the Deepthought2 cluster have FDR (54 Gb/s) infiniband. You do not need to request infiniband nodes; all of the nodes have it.

NOTE: Although all nodes on Deepthought2 have infiniband, the network topology is such that there is 2:1 blocking in the bandwidth when going between the rack-top switches. If your job cannot fit within a single rack (56 nodes or 1120 cores), you cannot really avoid that. For smaller jobs, if you specify --switches=1, your job will be allocated nodes that are all connected to the same switch, avoiding the blocking issue. You can also use --switches=1@MAXTIME, which limits the amount of time your job will wait as pending for nodes all on the same switch to become available; after that time, it will accept nodes spread across more than one switch.
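
For example, a minimal sketch of a job willing to wait up to 4 hours for a set of nodes on a single switch (the task count, walltime, and mycode are placeholders):

#!/bin/bash
#SBATCH --ntasks=80
#SBATCH -t 1-00:00
#SBATCH --switches=1@4:00:00

. ~/.profile
module load openmpi
mpirun ./mycode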

The original deepthought cluster currently has a mix of nodes with:

  • No infiniband, just GB ethernet
  • DDR infiniband, 16 Gb/s (feature=ib)
  • QDR infiniband, 32 Gb/s (feature=qib)

To see whether a given node has infiniband, and of what type, you can use the scontrol show nodes NODENAME command and look at the Features= setting. It will show ib for DDR infiniband, qib for QDR infiniband, or neither if no infiniband is available on the node. Or, to get a count of nodes by features, use something like:

login-1> sinfo -o "%5D %10A %4c %8z %8m %25f %10G" -e
NODES NODES(A/I) CPUS S:C:T    MEMORY   FEATURES                  GRES      
456   0/456      20   2:10:1   128000   (null)                    (null)    
4     0/4        40   4:10:1   1020000  (null)                    (null)

To request infiniband for your job on Deepthought, you must give the --constraint option, e.g. to request a job with 96 tasks on nodes with QDR infiniband, use

#!/bin/sh
#SBATCH --ntasks=96
#SBATCH --constraint="qib"

#rest of your code
...

When infiniband is available on Deepthought, it is non-blocking.

There is no longer an ib queue for infiniband, nor any need for it.

In general, multiple arguments can be given to the --constraint flag, with a & between them to logically AND the multiple constraints, or a | to OR them. In the latter case, you can also enclose all the ORed options in square brackets ([ and ]) to ensure the SAME feature is selected on ALL nodes assigned.

So for specifying infiniband, the same job script as in the previous example, but requiring either DDR or QDR infiniband, would be:

#!/bin/sh
#SBATCH --ntasks=96
#SBATCH --constraint="[ib|qib]"

#rest of your code
...

Using GPUs

Although originally designed to drive high-end graphics displays, graphics processing units (GPUs) turn out to be very good at number crunching for certain types of problems. The Deepthought2 cluster has 40 nodes, each with 2 Nvidia Tesla K20m GPU cards, each card providing over 2000 cores.

Although there are a lot of cores present in a GPU, they are not compatible with the standard Intel x86 architecture, and code needs to be written especially for these cards, using the CUDA platform. Some applications already support CUDA, although even in those cases you need to use versions that were built with CUDA support.

See the section on CUDA for more information on using and compiling CUDA and OpenCL programs. See the section on software supporting GPUs for more information on currently installed software which supports GPU processing.

To request GPUs for your job on the Deepthought2 cluster, you need to give sbatch the --gres=gpu or --gres=gpu:N option, where N specifies the number of GPUs per node that you are requesting. N defaults to 1 in the first form, and since we have at most 2 GPUs per node, the only other viable option is N=2. E.g., to request 4 nodes with 1 GPU on each, you could use something like:

#!/bin/tcsh
#SBATCH -t 15:00
#SBATCH -N 4
#SBATCH --gres=gpu

cd /lustre/payerle
./run_my_gpu_code

Currently, we do NOT directly charge for the use of GPUs. GPU based jobs are only charged for the CPUs they consume on the GPU node. Your job must use at least 1 CPU core. If your job runs in exclusive mode (which is the default for jobs using more than 1 CPU core), you will be charged for all CPU cores on the node. Otherwise, in "shared" mode, other jobs (CPU and/or GPU if there are GPUs you are not using) can run on the node while your job is running. This will reduce the cost of your job, but does increase risk (it is possible for the other jobs to effectively crash the node).

Currently, the GPU nodes on Deepthought2 have 2 GPUs each. It is possible for two single-GPU jobs to run on the same node in "shared" mode. Slurm will set the environment variable CUDA_VISIBLE_DEVICES to the GPU(s) it allocated to your job, e.g. to 0 if it assigned you only the first GPU, 1 if it assigned only the second, or 0,1 if it assigned both. By default, CUDA will honor this variable and only use the specified GPU(s). So two single-GPU CUDA jobs should be able to coexist on the same node without interfering with each other. (However, problems might occur if one of the jobs is not CUDA based, or if a job does things it should not be doing.)
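
For example, a job script can report which GPU(s) it was assigned before launching the GPU code (a sketch using the same placeholder run_my_gpu_code as above):

#!/bin/bash
#SBATCH -t 15:00
#SBATCH -N 1
#SBATCH --gres=gpu:1

. ~/.profile
# Slurm sets CUDA_VISIBLE_DEVICES to the index(es) of the GPU(s) assigned to this job
echo "Assigned GPU(s): $CUDA_VISIBLE_DEVICES"
./run_my_gpu_code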

On the MARCC/Bluecrab cluster, to request GPUs for your job you need to submit your job to the gpu partition. Again, you are only charged for the CPUs consumed, not the GPU cores, but in this case you are charged for the entire node.

Specifying the Amount/Type of Scratch Space Needed

If your job requires more than a small amount (1GB) of local scratch space, it would be a good idea to specify how much you need when you submit the job so that the scheduler can assign appropriate nodes to you.

Most of the nodes currently have at least 30GB of scratch space, some have as much as 250GB available, and a few have as little as 1GB available. Scratch space is currently mounted as /tmp. Scratch space will be cleared once your job completes.

The following example specifies a scratch space requirement of 5GB. Note however that if you do this, the scheduler will set a filesize limit of 5GB. If you then try to create a file larger than that, your job will automatically be killed, so be sure to specify a size large enough for your needs.

#!/bin/sh
#SBATCH --ntasks=8
#SBATCH --tmp=5120

myjob

Note that the disk space size must be given in MB.

The Deepthought cluster also has some nodes with solid state scratch drives. These can be requested with the ssd feature, e.g.

#!/bin/sh
#SBATCH --ntasks=8
#SBATCH --constraint="ssd"

myjob

Specifying the account to be charged

All users of the cluster belong to at least one project associated with the cluster, and each project has at least one account its users can charge against. Projects which have contributed hardware to the cluster generally have a normal priority and a high priority account; other projects typically have only a normal priority account.

Jobs charged to the high-priority account take precedence over jobs charged to normal priority accounts, as well as low priority (e.g. scavenger queue) jobs. And normal priority jobs take precedence over low priority jobs. No job will preempt another job (i.e., kick it off a node once it starts execution) regardless of priority, with the exception of jobs in the scavenger queue, which will be preempted by any job with a higher priority.

To submit jobs to an account other than your default (normal priority) account, use the -A option to sbatch.

login-1:~: sbatch -A test-hi test.sh
Submitted batch job 4194

If no account is explicitly specified, your job will be charged against your default account. You can view and/or change your default account with the sacctmgr command. The following example shows how the user payerle would change his default allocation account from test to tptest using the sacctmgr command; you should change the user and allocation account names appropriately.

login-1:~: sacctmgr list user payerle
      User   Def Acct     Admin 
---------- ---------- --------- 
   payerle       test      None 

login-1:~/slurm-tests: sacctmgr modify user payerle cluster=dt2 set DefaultAccount=tptest
 Modified users...
  payerle
Would you like to commit changes? (You have 30 seconds to decide)
(N/y): y
login-1:~/slurm-tests: 
login-1:~/slurm-tests: sacctmgr list user payerle
      User   Def Acct     Admin 
---------- ---------- --------- 
   payerle     tptest      None 

NOTE: the above example is for changing your default account on the Deepthought2 HPC. Change the cluster=dt2 to cluster=dt to change the default account on the original Deepthought cluster. Also note that you must have permission to access the specified new default account on the HPCC selected. Furthermore, if you omit the cluster= specification, the command will assume that you wish to change it on both clusters, which usually will NOT work since most accounts do NOT exist on both clusters (you will get an error like Can't modify because these users aren't associated with new default account 'tptest' due to the account not existing on one of the two clusters, and the change will be aborted on BOTH clusters).

If you belong to multiple projects, you should charge your jobs against an account for the appropriate project (i.e. if your thesis advisor is Prof. Smith, and you are doing work for Prof. Jones, thesis work should be charged against one of Prof. Smith's accounts, and your work for Prof. Jones against one of his accounts). If there are both high and normal priority accounts in the project, you generally should be charging against the high priority account. Exception: You should generally run jobs in the debug partition against normal priority accounts, as jobs in the debug partition do NOT get any increase in priority when run against high priority accounts, since jobs in the debug partition already run with increased priority.

The above recommendations assume that there are sufficient funds available in your high priority account. If there do not appear to be sufficient funds to complete the job (and all currently running jobs that are being charged against that allocation), then the job will not start. The scheduler will NOT draw funds from the normal priority account to make up the difference. (The reverse also does not occur; if you attempt to run a job against the standard priority account but there are insufficient funds, the scheduler will NOT draw funds from the high priority account even if there are sufficient funds there.)

But in general, if you have both normal and high priority accounts, use the high priority account preferentially. The main reasons to charge a job against the normal priority account are:

  1. you are running it in the debug partition
  2. you have exceeded your monthly high priority allotment

This latter case is the whole reason for the dual account setup; you can effectively borrow SUs from the next month in the quarter (or the previous if you did not use them), but such "borrowed" SUs only run at normal priority.

For more information on accounts, including monitoring usage of your account, see the section Allocations and Account Management.

Email Options

The scheduler can email you when certain events related to your job occur, e.g. on start of execution, or when it completes. By default, any such mail is sent to your @umd.edu email address, but you can specify otherwise with the --mail-user=EMAILADDR flag to sbatch.

You can control when mail is sent with the --mail-type=TYPE option. Valid options are:

  • BEGIN: when the job starts to execute
  • END: when the job completes
  • FAIL: if and when the job fails
  • REQUEUE: if and when the job is requeued.
  • ALL: for all of the above cases.

You can give multiple --mail-type=TYPE options to have mail sent for multiple conditions. The following job script will send mail to hpc-wizard@hpcc.umd.edu when the job starts and when the job finishes:

#!/bin/tcsh
#SBATCH --ntasks=24
#SBATCH --mail-user=hpc-wizard@hpcc.umd.edu
#SBATCH --mail-type=BEGIN
#SBATCH --mail-type=END

start-long-hpc-job

WARNING
NOTE: It is recommended that you use care with these options, especially if you are submitting a large number of jobs. Not only will you get a large amount of email, but it can cause issues with some email systems (e.g. GMail imposes limits on the number of emails you can receive in a given time period).

Specifying output options

By default, slurm will direct both the stdout and stderr streams for your job to a file named slurm-JOBNUMBER.out in the directory where you submitted the sbatch command. For job arrays, the file will be slurm-JOBNUMBER_ARRAYINDEX.out. In both cases, JOBNUMBER is the number for the job.

You can override this with the --output=FILESPEC (or -o FILESPEC, for short) option. FILESPEC is the name of the file to write to, but the following replacement symbols are supported:

  • %A: The master job allocation number for the job array; only meaningful for job arrays.
  • %a: The job array index number, only meaningful for job arrays.
  • %j: The job allocation number.
  • %N: The name of the first node in the job.
  • %u: Your username

Multiple replacement symbols are allowed in the same FILESPEC. For example, the default values correspond to slurm-%j.out and slurm-%A_%a.out for simple and array jobs, respectively.

You can also use --error=FILESPEC (or -e FILESPEC) to have the stderr sent to a different file from stdout. The same replacement symbols are allowed here.
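
For example, the following sketch writes stdout and stderr to separate files named after your username and job number (the /lustre/USERNAME path is illustrative; substitute your own lustre or data directory):

#!/bin/bash
#SBATCH --ntasks=1
#SBATCH -t 30
#SBATCH --output=/lustre/USERNAME/myjob-%u-%j.out
#SBATCH --error=/lustre/USERNAME/myjob-%u-%j.err

./my_program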

WARNING
If you use the --output or --error options, be sure to specify the full path of your output file. Otherwise the output might be lost.

Specifying the shell to run in

Under Slurm, your job will be executed in whatever shell the shebang line in the script file specifies. Note: this differs from Moab/Torque, where the job would run under your default shell unless you gave an explicit -S option to qsub to change it.

Thus, the following job script will be processed via the C-shell:

#!/bin/csh
#SBATCH --ntasks=16
#SBATCH -t 00:15

setenv MYDIR /tmp/$USER
...

and the following job script will be processed with the Bourne again shell:

#!/bin/bash
#SBATCH --ntasks=16
#SBATCH -t 00:15

. ~/.profile

MYDIR="/tmp/$USER"
export MYDIR
...

WARNING
NOTE: If your default shell is csh or tcsh based (which is the default on the Deepthought clusters), and you submit a job with a Bourne-style shell (e.g. sh or bash), your .profile (and therefore your .bashrc and .bashrc.mine scripts) will not be read automatically. You therefore need to include a . ~/.profile line near the top of your script to get full functionality. This is needed if you wish to load modules, etc.
WARNING
NOTE: If you wish to use a Bourne-style shell, we strongly recommend #!/bin/bash instead of #!/bin/sh. Under Linux, both run the same bash executable, but in the latter case certain non-backwards-compatible features are disabled, which can cause problems.

Running Your Job in a Different Directory

The working directory in which your job runs will be the directory from which you ran the sbatch command, unless you specify otherwise. The easiest way to change this behavior is to add the appropriate cd command before any other commands in your job script.

Also note that if you are using MPI, you may also need to add either the -wd or -wdir option for mpirun to specify the working directory.

The following example switches the working directory to /data/dt-raid5/bob/my_program

#!/bin/csh
#SBATCH -t 01:00
#SBATCH --ntasks=24

module load openmpi

cd /data/dt-raid5/bob/my_program

mpirun -wd /data/dt-raid5/bob/my_program C alltoall

There is also a --workdir=DIR option that you can give to sbatch (or add a #SBATCH --workdir=DIR line to your job script), but use of that method is not recommended. It should work for the Lustre file system, but does not work well with any NFS file systems (since these get automounted using symlinks, and sbatch appears to expand all symlinks, which breaks the automount mechanism).

Specifying whether or not other jobs can be on the same node

The sbatch command has the (mutually-exclusive) flags --exclusive and --share which control whether the scheduler should allow multiple jobs to co-exist on the same nodes. This is only an issue when the jobs individually do not consume all of the resources on the node; e.g. consider a node with 8 cores and 8 GB of RAM. If one job requests 2 cores and 4 GB, and a second job requests 4 cores and 3 GB, they should both be able to fit comfortably on that node at the same time. If both jobs have the shared flag set, then the scheduler is free to place them on the same node at the same time. If either has the exclusive flag set, however, then the scheduler should not put them on the same node; the job(s) with the exclusive flag set will be given their own node.

The problem is that there can be interference between the jobs. First off, to optimize performance, we are not perfectly enforcing the core and memory usage of jobs, and it is possible for a job to "escape" its bounds. But even assuming the jobs keep within their requested CPU and memory limits, they still would be sharing IO bandwidth, particularly disk and network, and depending on the jobs this might cause significant performance degradation. On the other hand, it is wasteful to give a smaller job a node all to itself if it will not use all the resources on the node.

From your perspective as a user, this potential for interference between jobs means that your job might suffer from slower performance or, even worse, crash (or the node it is running on might crash). While that might make submitting jobs in exclusive mode seem like the easy answer, that could significantly impact utilization of the cluster. Therefore, if you submit a job in exclusive mode, we will have to charge you for all the cores on the node, not just the ones you asked to use, for the lifetime of your job (since no one else can use those cores). Thus, the funds in your allocation will be depleted faster.

The default behavior on the two Deepthought clusters is that serial (single core) jobs get the share flag set unless you explicitly submit them with the exclusive flag; all other jobs by default have the exclusive flag set unless you explicitly submit them with the share flag. Large parallel jobs typically consume all the resources on the nodes they are assigned anyway, and so effectively run in exclusive mode regardless; they also cost the most to rerun if a node crashes. Serial jobs would pay the highest penalty in terms of being charged for cores they are not using, so having them share makes sense. For jobs between those extremes, the policy is somewhat conservative, but allows users to choose for themselves.

It is strongly recommended that users of the Deepthought clusters explicitly set the --share or --exclusive flags for jobs using more than one core and not using the entire node. In general, you will probably want to use the --share flag to reduce the amount charged to your allocation.
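
For example, a minimal sketch of a 4-core job on a Deepthought cluster that explicitly allows other jobs on the same node, so that only the 4 requested cores are charged (my_program and the walltime are placeholders):

#!/bin/bash
#SBATCH --ntasks=4
#SBATCH -t 8:00:00
#SBATCH --share

. ~/.profile
./my_program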

On the MARCC/Bluecrab cluster, this setting is made automatically depending on the partition chosen. The parallel and gpu partitions always use --exclusive mode (and will override any setting you give). The others default to --share mode (although you can explicitly override this if you so desire).

Specifying a reservation

On rare occasions, a reservation might be set up for a certain allocation account. This means that certain CPU cores/nodes have been reserved for a specific period of time for that allocation. This is not done often, and when it is done it is typically reserving some nodes on the original Deepthought cluster for a class, during class hours only, so that students can launch jobs and get the results back while the class is still in session (and the instructor is still available to assist them with issues). Again, this is only done rarely, and you should have been informed (e.g. by your instructor) if that is the case. Most users do not have access to reservations and can safely ignore this section.

If you do have access to a reservation which is active, you can submit jobs which can use the reserved resources by adding the following flag to your sbatch command: --reservation=RESERVATION where RESERVATION is the name of the reservation (which should have been provided to you, e.g. by your instructor). If you were not informed of a reservation name, your allocations probably do not have reservations and this section does not apply to you. The --reservation=RESERVATION flag can either be given as an explicit argument on the sbatch command line, or as a #SBATCH --reservation=RESERVATION line in your job script.
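
For example, a hypothetical job script for a class using a reservation might include the following, where RESERVATION and class-account are placeholders for the reservation name and allocation account you were given:

#!/bin/bash
#SBATCH -t 30
#SBATCH --ntasks=1
#SBATCH -A class-account
#SBATCH --reservation=RESERVATION

./my_class_program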

NOTE: to effectively use a reservation, the following conditions must hold:

  1. You must be charging the job to an allocation account that has access to the reservation. For class reservations, this typically means that you must be submitting the job from your class temporary login account, and charging it to the class allocation account.
  2. You must specify that the job should use the reservation, i.e. use the --reservation flag described above.
  3. The reservation must be "active". Class reservations are typically only active during the hours the class meets, and often only on specific days that the class is meeting. If you submit a job specifying a reservation when the reservation is not active, instead of expediting things it will likely delay the job until the reservation becomes active.