Diagnosing and Fixing Problems
Jobs not running/Deferred jobs
If your job doesn't run and ends up in the BLOCKED JOBS
section, you can use the
checkjob command to get more
information about why your job isn't running.
login-1:~: checkjob 4195
[ ... deleted for brevity ... ]
job is deferred. Reason: NoResources (cannot create reservation for job '4195' (intital reservation attempt))
Holds: Defer (hold reason: NoResources)
PE: 232.00 StartPriority: 200
cannot select job 4195 for partition DEFAULT (job hold active)
In this example, we see that the job was deferred because there are insufficient resources available to run the job. Once sufficient resources become available, the job will run automatically.
If, instead, you see the following as part of the checkjob output, the job you are trying to run would exceed your remaining allocation. This may simply be because you did not specify a walltime in your job specification. If your specifications are correct, you can resubmit the job to your standard-priority account or to the free serial queue, or you can request an additional allocation from the committee.
login-1:~: checkjob 4204
[ ... deleted for brevity ... ]
job is deferred. Reason: BankFailure (cannot debit job account)
Holds: Defer (hold reason: BankFailure)
PE: 32.00 StartPriority: 200
cannot select job 4204 for partition DEFAULT (job hold active)
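Both checkjob examples report the cause on the "job is deferred" line. As a minimal sketch, assuming the output keeps the layout shown here, a small filter can pull out just the reason. The function name deferral_reason is hypothetical and not part of checkjob.

```shell
# Hypothetical helper (not part of checkjob): print the deferral reason
# found in checkjob output read from standard input.
deferral_reason() {
    sed -n 's/.*job is deferred\. *Reason: *\([A-Za-z]*\).*/\1/p'
}

# Sample text copied from the example above. On the cluster you would
# pipe the real command instead, e.g.:  checkjob 4204 | deferral_reason
sample="job is deferred. Reason: BankFailure (cannot debit job account)"
printf '%s\n' "$sample" | deferral_reason
```

If the job is not deferred, the filter prints nothing.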
If none of the above conditions apply, and your job is listed in the IDLE JOBS section, keep the following in mind:
- When a user runs a job, their job will never share a node with a different user. This is to prevent one user's job from interfering with another user's job. Once a user has access to a node, there's no way to prevent them from using all of the available memory, disk, or processors. This means that some processors on a given node may not always be used. However, if a single user submits multiple jobs, those jobs will be packed onto nodes if the job resource requirements allow this.
- In addition to the number of processors, a user may also request a certain amount of memory or disk space when submitting a job. For example, a user may know that their job needs 4G of RAM to run. Even though a node may have 4 processors, if it has only 4G of RAM in total, only one such job can run on it. To someone viewing the queue, the node may appear to be used "inefficiently", but in reality it is not.
- Remember that this is a shared system. There's no guarantee
that jobs submitted will run immediately. Jobs submitted using
high-priority accounts will run first, followed by the standard
priority accounts, with the 'free' serial jobs getting whatever's
left over. The squeue command lists jobs in priority order, with the highest-priority jobs listed first.
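The memory example in the list above comes down to simple arithmetic: whichever resource runs out first limits how many jobs fit on a node. The sketch below illustrates this for identical jobs from a single user (jobs from different users never share a node). The function jobs_per_node is hypothetical and only models the reasoning. It is not how the scheduler is implemented.

```shell
# Hypothetical helper: how many identical jobs fit on one node, given the
# node's cores and memory (GB) and each job's cores and memory (GB).
jobs_per_node() {
    node_cores=$1 node_mem_gb=$2 job_cores=$3 job_mem_gb=$4
    by_cores=$(( node_cores / job_cores ))
    by_mem=$(( node_mem_gb / job_mem_gb ))
    # The scarcer resource sets the limit.
    if [ "$by_cores" -lt "$by_mem" ]; then
        echo "$by_cores"
    else
        echo "$by_mem"
    fi
}

# The case from the text: a 4-core node with 4 GB of RAM, and
# single-core jobs that each need 4 GB.
jobs_per_node 4 4 1 4
```

For the node described in the text this prints 1, leaving three of the four processors idle.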
The debug partition is available for running short tests with reasonably fast turnaround. This is useful for checking that your code and your submit scripts work as intended, especially if your "real job" would require a fair amount of time and/or nodes and might sit in the queue for a while.
The individual compute nodes do not allow direct shell access except when the node is allocated to a job owned by you. If you need shell access to one or more nodes, you can request the scheduler assign some to you. This request gets put in the queue with all the batch job requests, and depending on the cluster usage at the moment you might get a prompt in seconds, or in minutes, or it might take hours or days.
The sinteractive command is provided to assist with this for most basic cases. It takes the following optional arguments:
-c NUMCPUS specifies the number of CPU cores to request. Default is 1.
-a ACCOUNT specifies the account to charge. Default is your default account.
-J NAME specifies the name to use for the job. Default is "interactive".
-s SHELL specifies the shell to start up on the assigned node. Default is your default login shell.
-t MINUTES specifies the wall time limit for your interactive session, in minutes. Default is 60 (1 hour). You cannot request more than 8 hours (480 minutes) with this utility.
-d If given, use the debug partition. The -t parameter is ignored, and the wallclock limit is set to 15 minutes.
-g GRES specifies a generic resource required. The resulting salloc command has --gres=GRES added if given.
-h Help. No interactive shell will be granted, but an explanation of these and some less common options will be given.
-D Dry-run. No interactive shell will be granted, but the salloc command that would have been run is printed out. Useful if you need to go beyond what the sinteractive script can do but want to use it as a starting point.
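Because -t takes minutes and is capped at 480, it can help to convert and check the value before submitting. The sketch below assumes you think in whole hours. The function sinteractive_minutes is a hypothetical helper and not part of the sinteractive script.

```shell
# Hypothetical helper: convert a whole number of hours to the MINUTES
# value expected by `sinteractive -t`, refusing anything over the
# 8-hour (480-minute) limit stated above.
sinteractive_minutes() {
    hours=$1
    minutes=$(( hours * 60 ))
    if [ "$minutes" -gt 480 ]; then
        echo "requested $minutes minutes; sinteractive allows at most 480" >&2
        return 1
    fi
    echo "$minutes"
}

# Two hours expressed as the value to pass to -t.
sinteractive_minutes 2
```

The result can be passed straight to the command, for example sinteractive -t "$(sinteractive_minutes 2)".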
An example of using sinteractive:
login-2:~: sinteractive -t 120 -a test-hi
salloc: Granted job allocation 1561831
salloc: Waiting for resource configuration
salloc: Nodes compute-b19-14 are ready for job
DISPLAY is login-2.deepthought2.umd.edu:15.0
Try re-authenticating(K5). You have no Kerberos tickets
compute-b19-14:~: [ do some work interactively ]
compute-b19-14:~: exit
logout
salloc: Relinquishing job allocation 1561831
salloc: Job allocation 1561831 has been revoked.
login-2:~:
The warning message
Try re-authenticating(K5). You have no Kerberos tickets
can be ignored. The batch mechanism will not forward your Kerberos tickets
to the compute node, but you probably do not need them there anyway. Should
you need them, you can issue the renew command and enter your
password to obtain Kerberos tickets.
Issuing the command
sinteractive -t 120 -a test-hi -g gpu
will behave similarly, but you will get assigned a node with a GPU.
Although the sinteractive command can cover many of the more
common situations, it is limited, and if you need more control over the
request you will have to manually request an allocation and start an
interactive shell on it. The -D flag runs sinteractive in dry-run mode:
in this mode the script will print out what it would have run, but not
run it. This might be a useful starting place. Basically, you need to
request the assignment of resources from the scheduler with the
salloc command, and then use the srun command to start
a process (typically a shell, if you wish to use the nodes interactively)
on the allocated nodes.
For example, if you want to request two separate nodes, try this:
login-1:~: salloc -N 2 -t 00:15:00 -p debug --qos=debug
salloc: Granted job allocation 13225
login-1:~: echo $SLURM_JOB_NODELIST
compute-b20-47,compute-b20-49
login-1:~: srun hostname
compute-b20-47.deepthought2.umd.edu
compute-b20-49.deepthought2.umd.edu
login-1:~: exit
exit
salloc: Relinquishing job allocation 13225
Remember to exit the spawned subshell when you are done, to relinquish the nodes that you have requested for the job.
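Note that the salloc example gives its time limit as hours:minutes:seconds (00:15:00), whereas sinteractive takes plain minutes. If you are adapting a sinteractive request to a manual salloc call, a conversion along these lines may be handy. The function to_hms is a hypothetical helper, shown only as a sketch.

```shell
# Hypothetical helper: format a number of minutes as HH:MM:SS, one of the
# time formats accepted by salloc's -t option.
to_hms() {
    minutes=$1
    printf '%02d:%02d:00\n' $(( minutes / 60 )) $(( minutes % 60 ))
}

# The 15-minute limit used in the salloc example above.
to_hms 15
```

This prints 00:15:00, matching the value passed to -t in the example.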