Diagnosing and Fixing Problems

Jobs Not Running / Deferred Jobs

If your job doesn't run and ends up in the BLOCKED JOBS section, you can use the checkjob command to get more information about why your job isn't running.

login-1:~: checkjob 4195

[ ... deleted for brevity ... ]

job is deferred.  Reason:  NoResources  (cannot create reservation for job '4195' (initial reservation attempt))
Holds:    Defer  (hold reason:  NoResources)
PE:  232.00  StartPriority:  200
cannot select job 4195 for partition DEFAULT (job hold active)

In this example, we see that the job was deferred because there are insufficient resources available to run the job. Once sufficient resources become available, the job will run automatically.

If, instead, you see the following as part of the checkjob output, it means that the job you are trying to run would exceed the allocation you have remaining. This may simply be because you did not specify a walltime in your job submission. If your specifications are correct, you can resubmit your job to your standard-priority account or to the free serial queue, or you can request an additional allocation from the committee.

login-1:~: checkjob 4204

[ ... deleted for brevity ... ]

job is deferred.  Reason:  BankFailure  (cannot debit job account)
Holds:    Defer  (hold reason:  BankFailure)
PE:  32.00  StartPriority:  200
cannot select job 4204 for partition DEFAULT (job hold active)
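If the BankFailure was caused by a missing walltime, resubmitting with an explicit limit is often all that is needed. Assuming Slurm's sbatch is the submission command on this cluster (as the salloc/srun examples elsewhere on this page suggest), a minimal submit-script sketch might look like the following; the account name, walltime, and task count are placeholders, not site recommendations:

```shell
#!/bin/bash
# Hypothetical submit script -- the account name, walltime, and task
# count below are placeholders; substitute your own project's values.
#SBATCH -t 02:00:00      # explicit walltime, so the scheduler can debit your allocation
#SBATCH -A test-hi       # account to charge (or your standard-priority account)
#SBATCH -n 32            # number of tasks

echo "Job running on $(hostname)"
```

Saved as, say, myjob.sh, this would be submitted with sbatch myjob.sh; running checkjob on the new job ID should no longer report BankFailure if your remaining allocation covers the request.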

If none of the above conditions apply, and your job is listed in the IDLE JOBS section, keep the following in mind:

  1. A user's job will never share a node with another user's job. This prevents one user's job from interfering with another's: once a user has access to a node, there is no way to stop them from using all of the available memory, disk, or processors. As a result, some processors on a given node may go unused. However, if a single user submits multiple jobs, those jobs will be packed onto nodes when their resource requirements allow it.
  2. In addition to the number of processors, a user may request a certain amount of memory or disk space when submitting a job. For example, a user may know that their job needs 4G of RAM to run. To someone viewing the queue, the node may then appear to be used "inefficiently", but in reality it is not: even though the node may have 4 processors, if it has only 4G of RAM total, only one 4G job will run on it.
  3. Remember that this is a shared system; there is no guarantee that submitted jobs will run immediately. Jobs submitted under high-priority accounts run first, followed by standard-priority accounts, with the 'free' serial jobs getting whatever is left over. The showq command lists jobs in priority order, with the highest-priority jobs first.
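Point 2 can be made concrete with Slurm-style directives, where a memory request sits alongside the processor request. The figures below are illustrative only:

```shell
#!/bin/bash
# Hypothetical submit script requesting 4G of RAM for a single task;
# all values are illustrative, not recommendations.
#SBATCH -n 1             # one processor
#SBATCH --mem=4g         # 4G of RAM on the node
#SBATCH -t 01:00:00

# On a 4-processor node with 4G of RAM total, this job would tie up the
# whole node's memory even though it uses only one processor.
echo "Requested memory: ${SLURM_MEM_PER_NODE:-not set (outside a job)}"
```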

Debug Partition

The debug partition is available for running short tests with reasonably fast turnaround. This is useful for verifying that your code and your submit scripts work as intended, especially if your "real" job would require a substantial amount of time and/or nodes and would otherwise sit in the queue for a while.

To use it, specify the debug partition and its matching QoS when you submit your job.
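A sketch of a debug-partition submit script follows, assuming the debug partition and QoS names shown in the salloc example elsewhere on this page; check your site's limits before relying on these values:

```shell
#!/bin/bash
# Hypothetical short test job for the debug partition; the partition and
# QoS names follow the salloc example elsewhere on this page.
#SBATCH -p debug         # debug partition
#SBATCH --qos=debug      # matching QoS
#SBATCH -t 00:10:00      # keep the walltime short
#SBATCH -n 1

echo "debug partition test on $(hostname)"
```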

Interactive Jobs

The individual compute nodes generally do not allow direct shell access. This can be problematic if you want to test your code on the exact type of processor on which it will run. If you only need a single node for compiling and debugging purposes, the compute nodes:
  • compute-f19-0.deepthought.umd.edu
  • compute-g20-0.deepthought.umd.edu
are always available for remote shell access from the login nodes.

If you need shell access to additional nodes, you can request the scheduler assign some to you. This request gets put in the queue with all the batch job requests, and depending on the cluster usage at the moment you might get a prompt in seconds, or in minutes, or it might take hours or days.

The script sinteractive is provided to assist with this for most basic cases. It takes the following optional arguments:

  • -c NUMCPUS specifies the number of CPU cores to request. Default is 1.
  • -a ACCOUNT specifies the account to charge. Default is your default account.
  • -J NAME specifies the name to use for the job. Default is "interactive".
  • -s SHELL specifies the shell to start up on the assigned node. Default is your default login shell.
  • -t MINUTES specifies the wall time limit for your interactive session, in minutes. Default is 60 (1 hour). You cannot request more than 8 hours (480 minutes) with this utility.
  • -d If given, use the debug partition. The -t parameter is ignored, and the wallclock limit is set to 15 minutes.
  • -g GRES specifies a generic resource required. The resulting salloc will have --gres=GRES added if given.
  • -h Help. No interactive shell will be granted, but an explanation of these and some less common options will be given.

An example of using sinteractive:

login-2:~: sinteractive -t 120 -a test-hi
salloc: Granted job allocation 1561831
salloc: Waiting for resource configuration
salloc: Nodes compute-b19-14 are ready for job
DISPLAY is login-2.deepthought2.umd.edu:15.0
Try re-authenticating(K5).  You have no Kerberos tickets

[ do some work interactively ]

compute-b19-14:~: exit
salloc: Relinquishing job allocation 1561831
salloc: Job allocation 1561831 has been revoked.

The warning message Try re-authenticating(K5). You have no Kerberos tickets can be ignored. The batch mechanism will not forward your Kerberos tickets to the compute node, but you probably do not need them there anyway. Should you need them, you can issue the renew command and enter your password to obtain Kerberos tickets.

Issuing the command sinteractive -t 120 -a test-hi -g gpu will behave similarly, but you will be assigned a node with a GPU.

Although the sinteractive command covers many of the more common situations, it is limited, and if you need more control over the request you will have to manually request an allocation and start an interactive shell on it. The -D flag runs sinteractive in dryrun mode: the script prints out what it would have run, without running it, which can be a useful starting place. Basically, you request the assignment of resources from the scheduler with the salloc command, and then use srun to start a process (typically a shell, if you wish to work interactively) on the node.

For example, to request two separate nodes, try this:

login-1:~: salloc -N 2 -t 00:15:00 -p debug --qos=debug
salloc: Granted job allocation 13225
login-1:~:  echo $SLURM_JOB_NODELIST
login-1:~:  srun hostname
login-1:~:  exit
salloc: Relinquishing job allocation 13225

The salloc command requires that you specify a partition and QoS. You can use debug for both if your session is under 15 minutes; otherwise, you will need the standard partition (or high-priority for the high-priority accounts) with one of the available QoSes.

Remember to exit the spawned subshell when you are done, to relinquish the nodes that you have requested for the job.
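If the goal of the manual allocation is an interactive shell rather than a single command, srun's --pty option (a standard Slurm flag) attaches a pseudo-terminal to the process it starts. A sketch of such a session, with illustrative times and the debug partition/QoS from the example above:

login-1:~: salloc -N 1 -t 00:15:00 -p debug --qos=debug
login-1:~: srun --pty $SHELL

[ you now have a shell on the allocated node; exit it when done ]

login-1:~: exit

The first exit leaves the shell that srun started on the compute node; the final exit ends the salloc subshell and relinquishes the allocation.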