This is a "quick start" introduction to using the HPC clusters at the University of Maryland. It covers the general activities most users will deal with when using the clusters:
MATLAB DCS Users: The system interaction for users of Matlab Distributed Computing Server is rather different from that of other users of the cluster, and so is not covered in this document. Please see Matlab DCS Quick Start.
- Logging into one of the login nodes
- Creating a job script
- Submitting a job
- Monitoring job status
- Monitoring your allocation
Prerequisites for the Quick Start
This quick start assumes that you already
- have a TerpConnect/Glue account (REQUIRED for the Deepthought clusters, advisable for the others)
- have an account on/access to the cluster (i.e., have an allocation or have been granted access to someone's allocation)
- know how to use ssh
- have at least a basic familiarity with Unix
Logging into one of the login nodes
All of the clusters have at least 2 nodes available for users to log into. From these nodes you can submit and monitor your jobs, look at results of the jobs, etc.
DO NOT RUN computationally intensive processes on the login nodes! Such processes violate policy, interfere with other users of the clusters, and will be killed without warning. Repeated offenses can lead to suspension of your privilege to use the clusters.
For most tasks, you will start by logging into one of the login nodes for the appropriate cluster. These are:
- Deepthought: ssh -l johndoe login.deepthought.umd.edu
- Deepthought2: ssh -l johndoe login.deepthought2.umd.edu
- MARCC/Bluecrab: ssh -l johndoe@umd.edu gateway2.marcc.jhu.edu
Note: On the MARCC/Bluecrab cluster your username includes the @umd.edu suffix (e.g., johndoe@umd.edu), so you will need either to pass it with the -l argument to the standard Unix ssh, as shown above, or to quote the username when using the user@host format.
See the section on logging into the clusters for more information.
Creating a job script
Next, you'll need to create a job script. This is just a simple shell script that will specify the necessary job parameters and then run your program.
Here's an example of a simple script, which we'll call test.sh:
#!/bin/bash
#SBATCH -t 1
#SBATCH -n 4
#SBATCH --mem-per-cpu=128
#SBATCH --share
. ~/.profile
module load python/2.7.8
hostname
date
The first line, the shebang, specifies the shell to be used to run the script. Note that you must have a shebang specifying a valid shell in order for Slurm to accept and run your job; this differs from Moab/PBS/Torque, which ignore the shebang and run the job in your default shell unless you give qsub an option selecting a different shell.
The next four lines specify parameters to the scheduler.
The first, -t, specifies the maximum amount of time you expect your job to run. It can take various forms, but usually you will want to give minutes, hours:minutes:seconds, or days-hours.
You should always set a reasonable wall time limit;
this will help improve utilization of the cluster and reduce the amount
of time your job will wait in the queue. To encourage this, the
default wall time limit is rather short. In this example, we specified
a wall time limit of 1 minute; normally this would be much longer, but
this is a trivial job.
See the section on specifying the walltime limit for more information.
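To make the wall time formats concrete, here is a small sketch. The function walltime_to_seconds is a hypothetical helper (it is not part of Slurm) that converts a -t value into seconds. It assumes the formats listed in the sbatch man page: minutes, minutes:seconds, hours:minutes:seconds, days-hours, days-hours:minutes and days-hours:minutes:seconds.

```shell
#!/bin/bash
# Hypothetical helper, NOT part of Slurm: convert a Slurm wall time string
# (as given to "sbatch -t") into seconds, to check what a limit means.
walltime_to_seconds() {
    local t="$1" days=0 h=0 m=0 s=0
    local -a parts
    if [[ "$t" == *-* ]]; then
        # days-hours, days-hours:minutes or days-hours:minutes:seconds
        days="${t%%-*}"
        IFS=: read -r h m s <<< "${t#*-}"
    else
        IFS=: read -r -a parts <<< "$t"
        case "${#parts[@]}" in
            1) m="${parts[0]}" ;;                                    # minutes
            2) m="${parts[0]}"; s="${parts[1]}" ;;                   # minutes:seconds
            3) h="${parts[0]}"; m="${parts[1]}"; s="${parts[2]}" ;;  # hours:minutes:seconds
            *) echo "unrecognized time format: $t" >&2; return 1 ;;
        esac
    fi
    # 10# forces base 10, so values such as "08" are not read as octal.
    echo $(( ((10#$days * 24 + 10#${h:-0}) * 60 + 10#${m:-0}) * 60 + 10#${s:-0} ))
}

walltime_to_seconds 1          # prints 60 (the 1 minute limit of the example)
walltime_to_seconds 1:30:00    # prints 5400 (1.5 hours)
walltime_to_seconds 2-12       # prints 216000 (2 days and 12 hours)
```

Note that a bare number is interpreted as minutes, not seconds, which is why "-t 1" in the example script means one minute.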
The second line, -n, tells the scheduler how many tasks/cores your job will have (by default Slurm assigns a distinct core to each task). We do not specify how Slurm should distribute these cores across machines, so Slurm can distribute them however it sees fit. That is usually sufficient for many MPI jobs; there are also other options that allow very detailed specification of how the cores should be distributed, as briefly described here and in the examples page.
In this example, we are requesting 4 cores (which is way more than needed for this trivial example). Most likely we will get all 4 cores on a single node, but that is NOT guaranteed. We could possibly get one core on each of 4 nodes, or some allocation of 4 cores on 2 or 3 nodes.
See the section on specifying the node/core requirements for more information.
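Since the 4 cores may land on one node or on several, it can help to have the job report where it actually ran. The sketch below assumes the standard Slurm job environment variables SLURM_NTASKS, SLURM_JOB_NUM_NODES, SLURM_JOB_NODELIST and SLURM_TASKS_PER_NODE; the function name show_layout is hypothetical, and the #SBATCH lines follow the Deepthought example above.

```shell
#!/bin/bash
#SBATCH -t 1
#SBATCH -n 4
#SBATCH --mem-per-cpu=128
#SBATCH --share

# Print how Slurm distributed the tasks of this job. These variables are
# set by Slurm inside a job; "?" is printed if the script runs outside Slurm.
show_layout() {
    printf 'tasks=%s nodes=%s nodelist=%s tasks_per_node=%s\n' \
        "${SLURM_NTASKS:-?}" "${SLURM_JOB_NUM_NODES:-?}" \
        "${SLURM_JOB_NODELIST:-?}" "${SLURM_TASKS_PER_NODE:-?}"
}

show_layout
```

If you really need all 4 cores on the same node, adding "#SBATCH -N 1" (the --nodes option) restricts the job to a single node, at the possible cost of a longer wait in the queue.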
The third line, --mem-per-cpu=N, tells the scheduler how much memory to allocate for your job. This particular form reserves the requested amount of memory (N MB) per CPU core assigned. A similar form, --mem=N, reserves the requested amount of memory (N MB) for the entire job. The --mem-per-cpu form is usually more convenient. Nodes on the Deepthought cluster should have at least 1 GB/core. On the Deepthought2 cluster, nodes have at least 6 GB/core. For the MARCC/Bluecrab cluster, nodes have at least 5 GB/core.
In this example, we are requesting 128 MB per core, for a total of 512 MB for
our 4 core job. If we used
--mem=128 we would get a total of 128 MB
(or effectively 32 MB per core), which for this trivial job is still way more than
is actually needed.
See the section on specifying the memory requirements for more information.
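The difference between the two memory options for the 4 core example job comes down to simple arithmetic, sketched here with the values from the text:

```shell
#!/bin/bash
# Memory arithmetic for the example job, which has 4 cores (-n 4).
ntasks=4

# --mem-per-cpu=128: 128 MB is reserved for EACH core.
mem_per_cpu=128
printf 'with --mem-per-cpu=%d: %d MB total\n' "$mem_per_cpu" $(( ntasks * mem_per_cpu ))

# --mem=128: 128 MB is reserved for the WHOLE job and shared by all cores.
mem_total=128
printf 'with --mem=%d: %d MB per core\n' "$mem_total" $(( mem_total / ntasks ))
```

This prints 512 MB total for the --mem-per-cpu case and 32 MB per core for the --mem case.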
The fourth line, --share, is valid for the Deepthought HPC clusters, and states that we are willing to share a node with other jobs. E.g., on Deepthought2, most nodes have 20 cores; in --share mode, if all of our cores are assigned to one such node, Slurm will reserve 4 cores for us but can assign the other 16 cores to other jobs while our job is running. The opposite of --share is --exclusive, which prevents other jobs from running on the same node(s) as the exclusive job. If our sample job were --exclusive and assigned to a 20 core node, the other 16 cores would be unassigned and idle while the job ran.
NOTE: exclusive jobs get charged for both the cores they use AND the cores they prevent anyone else from using due to their exclusive status. E.g., if the example job were --exclusive and assigned a 20 core node, it would accrue charges for 20 cores for as long as it ran.
By default, jobs requesting only a single core are run in --share mode, and those requesting more than one core are run in --exclusive mode. You can override this with the --share or --exclusive flag.
See the section on specifying whether other jobs can be on the same node for more information.
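The effect of exclusive mode on charges can be sketched as arithmetic. The helper charged_core_hours below is hypothetical, and it assumes that charges are simply (cores charged) x (hours the job actually ran) for a job on a single node; the 2 hour run time is made up for illustration.

```shell
#!/bin/bash
# Hypothetical helper: core-hours charged for a single-node job, assuming
# charges are (number of cores charged) x (hours the job actually ran).
#   $1 = "share" or "exclusive"
#   $2 = cores requested, $3 = cores on the node, $4 = hours run
charged_core_hours() {
    local mode="$1" requested="$2" node_cores="$3" hours="$4"
    if [[ "$mode" == exclusive ]]; then
        # Exclusive jobs also pay for the cores they keep others from using.
        echo $(( node_cores * hours ))
    else
        echo $(( requested * hours ))
    fi
}

charged_core_hours share 4 20 2       # prints 8
charged_core_hours exclusive 4 20 2   # prints 40
```

Under these assumptions, the same 4 core job costs five times as much when run exclusively on a 20 core Deepthought2 node.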
For the MARCC/Bluecrab cluster, the shared/exclusive mode is determined by the partition your job is submitted to, and in general you will need to specify a partition depending on the type of job you wish to run. By default, jobs will go to the shared partition, which is suitable for serial jobs or jobs doing shared memory parallelization. "Shared" partition jobs are restricted to a single node and, as the name implies, run in shared mode. A separate partition, suitable for parallel jobs requiring multiple nodes, enforces --exclusive mode. On MARCC/Bluecrab, you should just specify the partition (with the -p option) and not specify the shared/exclusive mode, e.g.:
#!/bin/bash
# Bluecrab example
#SBATCH -t 1
#SBATCH -n 4
#SBATCH -p shared
. ~/.profile
module load python/2.7.9
hostname
date
See the section on MARCC/Bluecrab partitions for more information.
Users of the two Deepthought HPC clusters generally should NOT be specifying a partition (the only real exceptions being if you wish to use the debug or the scavenger partitions); Slurm will automatically select the appropriate partition for you.
It is advisable to include at least the four options above (wall time limit, number of cores, memory, and either exclusivity or partition, depending on the cluster) for all jobs, either in the job script as shown or on the sbatch command line (see the section on providing options to the sbatch command for general information). There are many other possible arguments to the sbatch command; the more commonly used ones are described here.
The remaining lines in the file are just standard commands; you will replace them with whatever your job requires. In this case, once the job runs, it will print the hostname and the date to the output file. The script will be run in whatever shell is specified by the shebang on the first line of the script. NOTE: unlike with the Moab scheduler, you MUST provide a valid shebang on the first line.
Note that when your job starts, your job script is executed on the first node assigned to your job. The list of nodes assigned to your job, etc., is available in Slurm environment variables, but Slurm does not do anything to parallelize your job. Your script is responsible for farming out tasks to the different cores/nodes that are part of the job. Normally, a parallel application will handle that, or you will launch your MPI-aware code with an MPI launcher (e.g., mpirun), which handles that.
See the section on running MPI jobs for more information.
In particular, note that the example given is BAD. Although it requests 4 cores, all the commands listed (hostname, date) are single core commands, so 3 of the requested cores will actually be idle while the job is running. Since this job is just a simple example and will finish in seconds, that is not a big issue in this case. But in general, simply submitting serial code as an sbatch job requesting more than one core DOES NOT parallelize a job.
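One way to actually use several cores for serial work is to start independent copies of a serial program in the background and wait for them all. The sketch below assumes the tasks are independent and all cores are on one node (hence -N 1); do_work is a hypothetical stand-in for your program. Tasks spread over several nodes need a launcher such as srun or mpirun instead.

```shell
#!/bin/bash
#SBATCH -t 1
#SBATCH -N 1
#SBATCH -n 4
#SBATCH --mem-per-cpu=128
#SBATCH --share

# Number of tasks Slurm gave us; the fallback of 4 is only for running
# this sketch outside of a Slurm job.
ntasks="${SLURM_NTASKS:-4}"

# Hypothetical serial work unit; replace with your real program.
do_work() {
    echo "task $1 ran on ${HOSTNAME}"
}

# Start one serial task per allocated core in the background ...
for ((i = 1; i <= ntasks; i++)); do
    do_work "$i" &
done
# ... and wait for all of them, since the job ends when the script exits.
wait
```

The final wait matters: without it, the script would exit immediately and Slurm would end the job while the background tasks were still running.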
For users of the Deepthought HPC clusters: if your job script uses bash and that is NOT your default shell, you should begin the code section of your script with
. ~/.profile
Generally, this should be followed by module load commands for whatever modules your job requires.
See the section on using the module command for more information.
It is recommended that you include the relevant module commands for a job in the actual job script, as opposed to relying on modules loaded by your dot files.
Submitting a job
Now that you have a job script, you need to submit the job
to the cluster with the
sbatch command. For example,
login-1:~: sbatch test.sh
Submitted batch job 13222
The number that is returned to you is the identifier for the job. Use it any time you want to find out more about your job, and include it if you open a help ticket about the job.
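If you submit jobs from your own scripts, you may want to capture that job number. The function job_id_from_message below is a hypothetical helper that takes the last word of the "Submitted batch job NNNN" message shown above.

```shell
#!/bin/bash
# Hypothetical helper: pull the job number out of sbatch's
# "Submitted batch job NNNN" message (the last word of the line).
job_id_from_message() {
    local msg="$1"
    echo "${msg##* }"
}

job_id_from_message "Submitted batch job 13222"   # prints 13222
```

Alternatively, sbatch's --parsable option makes it print only the job number (followed by the cluster name, if any), e.g. jobid=$(sbatch --parsable test.sh).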
At this point, your job has been placed in the queue, and will wait its turn for resources to be available. Depending on how heavily used the cluster is at that time, and how many resources you are requesting, your job might start within minutes or it might wait for hours or even days. (And this is assuming that there are sufficient funds in the allocation, etc.) See the FAQ for tips on how to reduce the amount of time your job spends waiting in the queue.
Once resources become available, Slurm will assign resources to your job, including one or more cores on one or more nodes. A shell process will start on the first core of the first node assigned, and your script will run. Normally, your script will start any other tasks on the same or on other nodes as needed.
The standard output and standard error streams will be directed to a file, by default slurm-NNNN.out in the directory where you submitted the job, where NNNN is the job number described above. See the section on specifying output options for more information.
Do NOT start jobs from your home directory. It is NOT optimized for high I/O. Use lustre or scratch space. See the section on storage for more information on storage options.
Output from your job can be viewed in the file above shortly after the job starts running (assuming it has written something). This can be used to check the status of your job, although if your code generates a lot of output, it is advisable to redirect that output to a separate file.
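A sketch of such a redirection is shown below. It assumes the sbatch -o (--output) option, where %j is replaced by the job number, and the SLURM_SUBMIT_DIR variable that Slurm sets to the submission directory; the file names and noisy_program are hypothetical. Remember to submit from lustre or scratch space rather than your home directory.

```shell
#!/bin/bash
#SBATCH -t 1
#SBATCH -n 1
#SBATCH -o myjob-%j.out   # script's stdout/stderr; %j becomes the job number

# Bulky output goes to its own file in the submission directory (falling
# back to the current directory when run outside Slurm).
outfile="${SLURM_SUBMIT_DIR:-$PWD}/results-${SLURM_JOB_ID:-test}.txt"

# Hypothetical stand-in for a program that prints a lot.
noisy_program() {
    for ((i = 1; i <= 1000; i++)); do
        echo "step $i"
    done
}

noisy_program > "$outfile" 2>&1
# Only this short status line ends up in myjob-NNNN.out.
echo "program finished; output is in $outfile"
```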
For our trivial example from the last section, when the job completes we should see something like
login-1:~: cat slurm-13222.out
compute-2-39.deepthought.umd.edu
Wed May 21 18:38:06 EDT 2014
As you can see in the output file above, the script ran and printed the hostname and date as specified by the job script.
Monitoring job status
The basic command for monitoring your jobs' status is the squeue command. Because you are normally only interested in your own jobs, it is advisable to add the -u USERNAME flag (i.e., squeue -u USERNAME), which speeds up the command and shows only your jobs. Replace USERNAME with your username (and remember the @umd.edu on Bluecrab).
For more information on monitoring jobs than is suitable for a quick start document, follow the links below.
- Monitoring and managing your jobs in general
- The squeue command
- Getting the estimated start time of your job
- Obtaining detailed information about a job
Monitoring your allocation
It is often useful to be able to see the status of the cluster as a whole, including information about how busy the cluster is at a given point in time.
The squeue command without any arguments will list all jobs in the queue. This can be overwhelming, however, as there are often many, many jobs.
The sinfo -N command can show you information about the nodes in the cluster. Again, this is dense text output, so it can be difficult to read. The smap command uses ASCII graphics to present this information in a more graphical and hopefully more digestible fashion.
The sview command uses X11 graphics for an even prettier overview of the cluster.
More information about the commands above can be found in the section on monitoring the cluster.