Queues and such
Table of contents
- Queues vs Partitions
- Quality of Service (QoS) levels and Walltime Limits
- How jobs waiting in the Queue get processed
Queues vs Partitions
In a previous incarnation, the Deepthought cluster had queues that allowed shorter jobs to be favored over longer jobs and helped wider jobs to run. The slurm scheduler has something called partitions, which basically correspond to the queues used by the scheduler for the previous version of Deepthought, but with the increased configuration capabilities of slurm you generally no longer need to worry about queues or partitions. The Deepthought2 scheduler also uses slurm and likewise does not use queues to favor shorter jobs over longer ones.
Instead, the only times you should be specifying a partition (which is what corresponds in slurm to the PBS/Moab/Torque queues) are:
- When you wish to submit a short test job to the debug partition. These jobs have high priority, but will only run on a limited set of nodes and will not run for more than 15 minutes. The partition is intended for testing codes to speed up the development cycle.
- When you wish to submit a job to the ultra-low priority scavenger partition. Jobs in this partition run at the lowest possible priority and are pre-emptible. That is, even once the job has started, if another job comes along and needs the resources the job is using, the scavenger partition job will be terminated and put back in the queue. Your account does not get charged for jobs in the scavenger partition, but in order to make good use of it your job needs to be able to checkpoint itself so that it can make progress in the slices of time it gets between other jobs. This used to be referred to (inappropriately) as the serial queue, but it is hoped the new name better reflects its purpose: to allow jobs that can do so to scavenge free CPU cycles wherever they are available.
See elsewhere for instructions on specifying a partition to run in.
In all other cases, the scheduler will submit your job to the appropriate
partition (standard for normal priority jobs and
high-priority for high priority jobs, depending on the
account the job is charged to). In other words, you generally should
NOT be specifying a partition.
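The checkpointing requirement for scavenger partition jobs mentioned above can be illustrated with a minimal Python sketch. It is not taken from the cluster documentation: the file name checkpoint.json, the helper names, and the stop_after parameter (which only simulates pre-emption) are all hypothetical, and the "work" is a placeholder sum. The idea is simply that the job saves its state after each unit of work and picks up from the saved state when it is restarted.

```python
import json
import os
import tempfile

CHECKPOINT = "checkpoint.json"  # hypothetical file name


def load_state(path=CHECKPOINT):
    """Return the saved state, or a fresh one if no checkpoint exists."""
    if os.path.exists(path):
        with open(path) as fh:
            return json.load(fh)
    return {"next_step": 0, "total": 0}


def save_state(state, path=CHECKPOINT):
    """Write to a temporary file and rename it, so that being killed
    in the middle of a write cannot corrupt the existing checkpoint."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as fh:
        json.dump(state, fh)
    os.replace(tmp, path)


def run(n_steps, path=CHECKPOINT, stop_after=None):
    """Do n_steps units of work, resuming from the last checkpoint.

    stop_after only simulates pre-emption: the function returns early
    after that many steps in this invocation.
    """
    state = load_state(path)
    done_now = 0
    while state["next_step"] < n_steps:
        if stop_after is not None and done_now >= stop_after:
            return state  # simulated pre-emption
        state["total"] += state["next_step"]  # placeholder for real work
        state["next_step"] += 1
        save_state(state, path)
        done_now += 1
    return state
```

Calling run(10, stop_after=4) and then run(10) finishes the remaining six steps without redoing the first four. A real job would checkpoint less often than every step if writing the state is expensive.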
Quality of Service (QoS) levels and Walltime Limits
The favoring of shorter jobs over longer jobs, etc. is handled by
Quality of Service (QoS) levels. Again, you do not generally
need to worry about these, as when you submit a job
the scheduler will pick the correct QoS based on the requested job size and duration.
The only things that matter are the size/duration limits:
|QoS name||Maximum # of nodes||Maximum walltime||Priority||Notes|
|debug||10||15 minutes||High||for development|
|wide-short||no limit||2 hours||Medium|
|wide-medium||no limit||8 hours||Low|
|narrow-long||40||3 days||Low||per-user job limits**|
|med-extended||100||7 days||Low||HW contributors only*|
|narrow-extended||50||14 days||Low||HW contributors only*, per-user job limits**|
|scavenger||no limit||no limit||Very Low||Pre-emptible|
* NOTE: The two -extended QoSes are only available to users in projects which have contributed hardware to the respective cluster.
** NOTE: The narrow-* QoSes are subject to
per-user job limits to try to maintain an equitable balance between wider,
more parallel jobs and large numbers of single core jobs. The exact limits
may be tweaked by systems staff with little or no advance
warning to try to find the best
value for all involved. Currently, there is no per-user job limit on the
narrow-medium QoS, and limits of 250 jobs/user on the
narrow-extended QoSes. You can
submit jobs over and beyond these limits, but the extra jobs will remain in
a pending status (with
QOSResourceLimit as the reason) until
job slots become available (i.e. one of your currently running jobs completes).
- If you belong to a group which has contributed hardware to the respective cluster (i.e. a group with both a normal and high priority account), you can access the *-extended QoSes which allow you to submit jobs using up to 50 nodes for up to 14 days, or jobs using up to 100 nodes for 7 days.
- If you do not belong to a group which has contributed hardware to the cluster, you can submit a job using up to 40 nodes for 3 days, or any job for up to 8 hours.
- In either case, you can also submit jobs to the scavenger partition, which has no fixed size or wallclock limit, but whose jobs are pre-emptible and will be killed whenever anyone else needs the node.
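The limits in the QoS table above can be written out as a small Python lookup. This is only a sketch: the real selection logic used by the scheduler is not described here, so the rule below (take the first QoS, in table order, whose limits the job fits) is an assumption, and the names QOS_TABLE and pick_qos are hypothetical. The debug and scavenger entries are left out because those are reached by explicitly requesting the corresponding partition.

```python
from datetime import timedelta

# (name, max nodes or None for "no limit", max walltime, contributors only)
QOS_TABLE = [
    ("wide-short", None, timedelta(hours=2), False),
    ("wide-medium", None, timedelta(hours=8), False),
    ("narrow-long", 40, timedelta(days=3), False),
    ("med-extended", 100, timedelta(days=7), True),
    ("narrow-extended", 50, timedelta(days=14), True),
]


def pick_qos(nodes, walltime, contributor=False):
    """Return the first QoS whose limits the request fits, or None."""
    for name, max_nodes, max_wall, contrib_only in QOS_TABLE:
        if contrib_only and not contributor:
            continue  # -extended QoSes need a hardware-contributing project
        if max_nodes is not None and nodes > max_nodes:
            continue  # job is too wide for this QoS
        if walltime > max_wall:
            continue  # job is too long for this QoS
        return name
    return None  # no QoS accepts a job of this size and duration
```

For example, a 200 node job requesting 5 hours fits wide-medium, while a 41 node job requesting 2 days fits nothing unless the user belongs to a contributing project, in which case it fits med-extended.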
Note that these walltime limits are quite generous compared to those of many HPC clusters at other universities. A quick (not quite random) sampling from a Google search yields:
- New York University: 4 days maximum
- University of Southern California: 2 weeks for 1 node, otherwise 1 day.
- Penn State: 2 weeks for up to 32 cores (contributors), 4 days for up to 256 cores otherwise
- UMBC: 5 days
- TACC: Stampede: 2 days
- TACC: Lonestar: 1 day
- Princeton: Della: 6 days
- Princeton: Hecate: 15 days
How jobs waiting in the Queue get processed
The scheduler is a process running on the head node which determines when and where jobs will run. It is responsible for seeing that your job gets the resources it requested so it can run, and for doing so in a manner which tries to get everyone's jobs scheduled and running in a reasonable amount of time. The following is a simplified overview of the scheduling process. The problem has a lot of complexity, but this overview should give you a basic understanding of why specifying realistic requirements for your job will help reduce the amount of time it spends in the queue waiting to be scheduled.
Jobs submitted by
sbatch, etc. get placed into a queue, and
the scheduler periodically checks the list of jobs in the queue trying to
find resources for them so they can run. Even if the cluster is lightly
loaded and there are no other jobs in the queue, this might take a minute or
two, but since jobs on the HPC typically run for hours this is a minor
overhead. If the cluster is heavily loaded, the jobs might spend hours or
even days in the queued state.
The scheduler basically goes through the list in a FIFO (first in, first
out) fashion, that is, jobs are more or less processed in the order in which
they are submitted. But this is only a first approximation. Jobs will have
differing priorities; jobs submitted via high-priority allocations
run at a higher priority than jobs being charged
against standard priority allocations. Jobs submitted to the debug
partition also run at a higher priority than normal, since these are short
jobs and that partition is for people trying to debug stuff.
Jobs also have varying resources. The scheduler does not know what resources your job actually requires; it only knows what you requested. If you do not request enough of a resource, e.g. memory, it is possible that your job will run, but will fail at some point because the scheduler only gave the job the amount of memory that was requested. (If you do not specify an amount of memory that is required, the scheduler will assume that it does not need to worry about it and whatever resources it assigns to your job will meet your memory needs.) But if you specify such resources, the scheduler will only assign you resources meeting your specifications. And if such resources are not currently available, other jobs that were submitted after yours might get scheduled before yours.
For example, the Deepthought2 cluster has a small number of large (1 TB) memory nodes. If you specify in your job that you need that much memory, the scheduler will only assign one of those nodes to your job. But if all of those nodes are in use, your job cannot be scheduled until one of them becomes free. Meanwhile, other jobs which do not have such strict memory requirements can still run, and might be scheduled before your job (on nodes with insufficient memory for your job) even though they are of lower priority and/or were submitted after your job was.
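The interaction between priority, submission order and resource requests described above can be sketched as a toy scheduling pass in Python. This is a deliberately simplified model, not how slurm is implemented: the Job class, the schedule_pass function, the priority numbers and the 128 GB node size are all made up for illustration, and each job is assumed to need exactly one node.

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    priority: int   # larger values are considered first
    submitted: int  # submission order; smaller means earlier
    mem_gb: int     # memory requested on the node


def schedule_pass(pending, free_node_mem_gb):
    """One simplified pass over the queue.

    Jobs are considered by priority, then by submission order. A job
    starts only if some free node has at least the requested memory.
    Returns (names of started jobs, names of jobs still waiting).
    """
    free = sorted(free_node_mem_gb)
    started, waiting = [], []
    for job in sorted(pending, key=lambda j: (-j.priority, j.submitted)):
        # Pick the smallest free node that satisfies the request.
        fit = next((m for m in free if m >= job.mem_gb), None)
        if fit is None:
            waiting.append(job.name)
        else:
            free.remove(fit)
            started.append(job.name)
    return started, waiting
```

With two free 128 GB nodes and all 1 TB nodes busy, a high-priority job requesting 1024 GB stays in the queue while two later, lower-priority 64 GB jobs start, which is the situation described in the example above.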
There is also the question of whether multiple jobs can share a node or
not. By default on the Deepthought clusters, jobs requesting only a single
CPU core can share a node with other such jobs, and jobs requesting multiple
CPU cores will not share nodes with other jobs. You can explicitly control
this with flags such as
--share, but the defaults
are not unreasonable in many cases. But this can impact the time it
takes to schedule jobs; if all the nodes in the cluster are in use, even if
some nodes are only running jobs with shared mode set, if your job has the
exclusive mode set, there will be no nodes available to it. But a job with
shared mode set might be able to run on one of the shared nodes if there are
sufficient resources that are not being used.
The scheduler might also keep nodes in reserve for a large job. If the job at the top of the queue requires a large number of nodes, e.g. 50, chances are that there are not 50 nodes idle when it reaches the top of the queue. So as nodes become free, if they would be suitable for the large job to run on, the scheduler will earmark them for the large job. These nodes might therefore be kept idle even though there are other jobs, behind the large job in the queue, which could run immediately on them. This is required, or else the large job would never run, because every time a few nodes were freed a narrower job would gobble them up.
There is an exception to the above: the scheduler knows the walltime requested for all jobs. So let's return to our 50 node job, and let's say the scheduler has 30 nodes earmarked for it. All the other nodes that could be used by the large job currently have other jobs running on them. But looking at the walltimes of those jobs, the scheduler might compute that the next 20 nodes (to complete the set of 50 needed by the job) will be available in 6 hours. If there are jobs which have a walltime under 6 hours and which can make use of the earmarked nodes, the scheduler can let them use those nodes, since the nodes will be idle again BEFORE the large 50 node job can make use of them. This is referred to as backfill. This way the large 50 node job is not delayed longer than necessary, but the smaller, shorter jobs can run ahead of when they otherwise would have: a "win-win" situation. If there are not enough smaller, shorter jobs to make use of those windows, the scheduler can also run scavenger partition jobs in them, since it can kill those jobs at any time.
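The backfill reasoning above amounts to two small calculations, sketched here in Python. This is an illustrative model under simplifying assumptions, not slurm's actual backfill algorithm: the function names are hypothetical, times are in hours, and candidate jobs are taken greedily in queue order.

```python
def backfill_window(needed, earmarked, remaining_hours):
    """Hours until enough busy nodes free up for the large job.

    remaining_hours holds the requested walltime still left on each
    busy node that the large job could use.
    """
    missing = needed - earmarked
    if missing <= 0:
        return 0.0  # the large job can start right away
    if missing > len(remaining_hours):
        raise ValueError("not enough suitable nodes")
    # The window closes when the last of the missing nodes becomes free.
    return sorted(remaining_hours)[missing - 1]


def backfill_candidates(jobs, earmarked, window):
    """Pick jobs, in queue order, that fit on the idle earmarked nodes
    and whose requested walltime ends before the window closes.

    jobs is a list of (name, nodes, walltime_hours) tuples.
    """
    chosen, free = [], earmarked
    for name, nodes, walltime in jobs:
        if nodes <= free and walltime <= window:
            chosen.append(name)
            free -= nodes
    return chosen
```

With 30 of 50 nodes earmarked and the 20th busy node due to free up in 6 hours, the window is 6 hours. A 4 node job requesting 5 hours is picked for backfill, while the same job requesting 8 hours is not.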
Again, this is another reason why specifying accurate resource requirements
will help your job get scheduled more quickly. If your job should take about
4 hours to run, and so you request 5 hours so that there is some buffer
(since the scheduler will terminate your job as soon as its walltime
runs out, even if it was 99.99% finished), then your job could be considered
for that 6 hour window when the scheduler is trying to schedule the 50 node
job in the previous example. But if you said "I may as well request 8 hours
since that is the limit on the QoS," the scheduler is just going by what you
requested, and your job will NOT be considered for backfill in a 6 hour
window. If your job really needs around 8 hours, that is one thing, but if
you just are inaccurate in your request, it will spend needless time in the