This is a "quick start" introduction into using the HPC clusters at the University of Maryland. This covers the general activities most users will deal with when using the clusters.
MATLAB DCS Users: The system interaction for users of Matlab Distributed Computing Server is rather different from that of other users of the cluster, and so is not covered in this document. Please see Matlab DCS Quick Start.
This quick start assumes that you already have an account on one of the clusters and know the basics of connecting to it. If not, please consult the documentation on obtaining an account and logging in before proceeding with this quick start.
All of the clusters have at least 2 nodes available for users to log into. From these nodes you can submit and monitor your jobs, look at results of the jobs, etc.
DO NOT RUN computationally intensive processes on the login nodes!
These are in violation of policy, interfere with other users of the
clusters, and will be killed without warning. Repeated offenses
can lead to suspension of your privilege to use the clusters.
For most tasks you will wish to accomplish, you will start by logging into one of the login nodes for the appropriate cluster. These are:
login.deepthought.umd.edu (Deepthought)
login.deepthought2.umd.edu (Deepthought2)
login.marcc.jhu.edu (Bluecrab/MARCC)
From a Unix-like system, you would then use commands like:
#To ssh to a deepthought login node, from a Glue/Terpconnect
#system or other system where your username is the same on
#both systems
ssh login.deepthought.umd.edu
#To ssh to a DT2 login node, assuming your username on the system you
#are ssh-ing from does NOT match your DT2 username. Here we
#are assuming johnsmith is your DT2 username
ssh johnsmith@login.deepthought2.umd.edu
# or
ssh -l johnsmith login.deepthought2.umd.edu
#The same as the above, but to a bluecrab login node
ssh "johnsmith@umd.edu"@login.marcc.jhu.edu
# or
ssh -l "johnsmith@umd.edu" login.marcc.jhu.edu
#To connect to DT2 with a tunnelled X11 connection for graphics as well
#If your username is the same on both systems
ssh -X login.deepthought2.umd.edu
#or if they differ
ssh -X -l johnsmith login.deepthought2.umd.edu
#or
ssh -X johnsmith@login.deepthought2.umd.edu
More information about logging into the systems is available in the documentation on logging in.
Next, you'll need to create a job script. This is just a simple shell script that will specify the necessary job parameters and then run your program.
Here's an example of a simple script, which we'll call test.csh:
#!/bin/tcsh
#SBATCH -t 1:00
#SBATCH -n 4
#SBATCH --share
module load python/2.7.8
hostname
date
The first line, the shebang, specifies the shell to be used to run the script. Note that you must have a shebang specifying a valid shell in order for Slurm to accept and run your job; this differs from Moab/PBS/Torque, which ignore the shebang and run the job in your default shell unless you gave qsub an option requesting a different shell.
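For example, if you prefer bash over tcsh, only the shebang needs to change; this is just a sketch, assuming the module command is set up for bash as it is for tcsh on these systems:
#!/bin/bash
#The rest of the script (the #SBATCH lines and the commands) stays the same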
The next three lines specify parameters to the scheduler.
The first, -t, specifies the maximum amount of time you expect your job to run. This parameter accepts the following formats for the duration of the job:
MM             - minutes
MM:SS          - minutes and seconds
HH:MM:SS       - hours, minutes, and seconds
DAYS-HH        - days and hours
DAYS-HH:MM     - days, hours, and minutes
DAYS-HH:MM:SS  - days, hours, minutes, and seconds
You should specify a reasonable estimate for this number, with some padding. If you specify too large of a wall time limit, it can negatively impact the queueing of this or other jobs of yours (see e.g. this FAQ and this FAQ). Too large of a wall time limit can also cause excessive consumption of your allocation's funds by misbehaving jobs. However, you do want to make sure you specify enough time for properly behaving jobs to complete, because once the wall time limit is hit, your job WILL be terminated.
If you fail to specify a walltime limit, it defaults to 15 minutes. Since this is insufficient for most HPC jobs, you should always specify a walltime limit.
In this example, we requested 1 minute of walltime. Although this is a short time, our code is quite trivial, so 1 minute is more than sufficient.
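To illustrate the formats listed above, here are a few sample walltime requests (the values are just examples; a real script would contain only one -t line):
#SBATCH -t 30            # 30 minutes (MM)
#SBATCH -t 1:00          # 1 minute (MM:SS), as used in our sample script
#SBATCH -t 12:00:00      # 12 hours (HH:MM:SS)
#SBATCH -t 2-12          # 2 days and 12 hours (DAYS-HH)
#SBATCH -t 2-12:30:00    # 2 days, 12 hours, and 30 minutes (DAYS-HH:MM:SS)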
The second line, -n, tells the scheduler how many tasks your job will have, and by default Slurm assigns a distinct core for each task. This method of specification doesn't care how those cores are distributed across machines or how those machines are configured, and that is sufficient for many MPI jobs. But Slurm allows for quite detailed specifications of CPU and node requirements, as briefly described here and in the examples page.
In this example, we are requesting 4 cores (which is way more than needed for this trivial example). We do not specify how Slurm should allocate them across nodes; most likely we will get all 4 cores on a single node, but that is NOT guaranteed. We could possibly get one core on each of 4 nodes, or some allocation of 4 cores on 2 or 3 nodes.
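If your job does care about the layout, Slurm's standard node-count and per-node task options can pin it down. The lines below are a sketch of a few alternative ways to request 4 cores (a real script would use only one of them):
#SBATCH -n 4                       # 4 tasks, placed wherever Slurm finds free cores
#SBATCH -N 1 -n 4                  # 4 tasks, all on a single node
#SBATCH -N 2 --ntasks-per-node=2   # 2 tasks on each of 2 nodes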
The third line, --share, is important from the perspectives of
billing and efficient use of the cluster. When scheduling jobs, you have
a choice of whether other jobs (either your jobs or from someone else)
can coexist on the same node or nodes. Although Slurm will not overcommit
resources on a node, not everyone specifies all the resources needed.
And even if both do specify the resources they need, jobs can still interfere
with each other if both are heavily using the disk or network. Or in the
most extreme example, if one job does something which causes the system to
crash, both jobs die.
On the other hand, if jobs cannot share nodes, the cluster will not be
as efficiently used. For example, if this sample job did not allow
other jobs to share a node with it (i.e. requested --exclusive
access to the nodes) and were assigned to a node with 20 cores,
16 of those cores would be idle while this job is on the node, which is not very
efficient from the perspective of cluster utilization. Nor is it efficient from the
perspective of billing: we charge jobs based on the number of cores consumed, not used,
so in this exclusive case the job would be charged for 20 cores for the
lifetime of the job even though it only requested 4 (since all 20 cores are made
unavailable to other jobs).
In the actual sample script, we request --share for shared access.
In this case, if the job is assigned to 4 cores of a 20-core node, the other
16 cores are still available to other jobs, which should improve cluster
utilization. And the job is only charged for 4 cores for its lifetime.
By default, jobs requesting only a single core are run in --share
mode, and those requesting more than one core are run in --exclusive
mode. But you can override this with the --share and --exclusive flags.
It is advisable that large parallel jobs run in --exclusive mode,
since these tend to use most if not all of the cores on a node anyway, so the potential
waste of idle cores is small.
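As a sketch, the two styles of request might look like this (node and core counts are purely illustrative):
#A large parallel job that takes whole nodes for itself
#SBATCH -N 4
#SBATCH --exclusive
#A small job that explicitly allows other jobs on its node
#SBATCH -n 4
#SBATCH --share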
The remaining lines in the file are just standard commands; you will replace them with whatever your job requires. In this case, once the job runs, it will load a python module and then print the hostname and date to the output file. The script will be run in whatever shell is specified by the shebang on the first line of the script. NOTE: unlike with the Moab scheduler, you MUST provide a valid shebang on the first line.
To submit your job, we just use the sbatch command.
login-1:~: sbatch test.csh
Submitted batch job 13222
The number that is returned to you is the identifier for the job, and you should use that anytime you want to find out more information about your job. For information on how to verify that your job is running, see the section Monitoring and Managing Your Jobs.
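As a quick preview of that section, the standard Slurm commands squeue and scancel take that job identifier; for example:
#Show the status of job 13222
squeue -j 13222
#Show all of your own jobs (replace johnsmith with your username)
squeue -u johnsmith
#Cancel job 13222 if it is no longer needed
scancel 13222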
Once your job completes, unless you've specified otherwise, your
output and any errors that occur will be written to a file in the
same directory from which you submitted your job. The file will be
named slurm-NNNN.out, where the Ns are replaced by the job
identifier.
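If you would rather pick the file names yourself, sbatch's -o/--output and -e/--error options can be added to the job script; %j in the pattern expands to the job identifier. A sketch, with made-up file names:
#SBATCH -o myjob-%j.out    # standard output
#SBATCH -e myjob-%j.err    # standard error (omit to combine with the output file)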
Note that by default when you log in to one of the clusters, you are sitting in your home directory, and all output and submissions will be transferred to and from your home directory. For best performance, you should consider running your jobs from a space set aside for them. See Files, Storage, and Securing Your Data, the Specifying which directory to run the job in page, and the examples for more information.
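One simple way to do this is to change into the desired directory at the top of the job script, before the real work begins; the path below is just a placeholder for your own job directory:
#After the #SBATCH lines, before your commands:
cd /path/to/your/job/directory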
Here's what you should see when your job completes:
login-1:~: cat slurm-13222.out
compute-2-39.deepthought.umd.edu
Wed May 21 18:38:06 EDT 2014
As you can see in the output file above, the script ran and printed the hostname and date as specified by the job script.