Monitoring and Managing Your Jobs

  1. Seeing what jobs are running/queued
  2. When will my job start?
  3. Detailed information about your jobs
  4. Viewing output of jobs in progress
  5. Cancelling your jobs

Seeing what jobs are running/queued

To check whether your job is running or still queued, use the showq command. For example:
login-1:~: showq
ACTIVE JOBS--------------------
JOBNAME            USERNAME      STATE  PROC   REMAINING            STARTTIME

4178                  kevin    Running     4    00:01:00  Mon Jan 22 11:13:09

     1 Active Job        4 of  236 Processors Active (1.69%)
                         1 of   59 Nodes Active      (1.69%)

IDLE JOBS----------------------
JOBNAME            USERNAME      STATE  PROC     WCLIMIT            QUEUETIME


0 Idle Jobs

BLOCKED JOBS----------------
JOBNAME            USERNAME      STATE  PROC     WCLIMIT            QUEUETIME


Total Jobs: 1   Active Jobs: 1   Idle Jobs: 0   Blocked Jobs: 0

If your job shows up in the ACTIVE JOBS section as shown above, it is running.
If your job shows up in the IDLE JOBS section, there are currently not enough free resources to run it. Check that you haven't requested more processors than you need and that you've specified a reasonable walltime. If you see lots of jobs in the ACTIVE JOBS section, you probably just need to wait for someone else's job to finish before yours can start.
If your job shows up in the BLOCKED JOBS section, you most likely do not have enough time remaining in your CPU allocation to run the job. Either specify a smaller walltime, or obtain an additional allocation. See the section Diagnosing Job Problems for further information.
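
The three cases above can also be checked from a script. The following is a minimal sketch, not a site-provided tool: job_section is a hypothetical helper name, and it assumes job rows are laid out as in the sample output above (job ID in the first column, a time value containing ":" in the fifth).

```shell
# Sketch: report which showq section lists a given job ID.
# Reads showq output on stdin, so it also works on saved output.
# Assumption: job rows have the job ID in column 1 and a time
# (containing ":") in column 5, as in the sample output above.
job_section() {
    awk -v id="$1" '
        /^ACTIVE JOBS/  { section = "ACTIVE";  next }
        /^IDLE JOBS/    { section = "IDLE";    next }
        /^BLOCKED JOBS/ { section = "BLOCKED"; next }
        # The column 5 check skips summary lines such as "1 Active Job ...".
        $1 == id && $5 ~ /:/ { print section; found = 1; exit }
        END { if (!found) exit 1 }
    '
}
```

It could be used as "showq | job_section 4178". The function prints ACTIVE, IDLE or BLOCKED, and returns a non-zero status if the job ID is not listed at all.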

When will my job start?

The scheduler tries to start every job as quickly as possible, subject to cluster policies, available hardware, allocation priority (contributors to the cluster get higher-priority allocations), etc. Jobs typically start within a day or so, but wait times can vary widely with cluster usage.

The showstart command can give you a general idea of when your job will start. Pass it your job ID:

login-2:~/work/hpcc-tests/test: showstart 1770833
job 1770833 requires 16 procs for 8:00:00

Estimated Rsv based start in                 3:15:58 on Fri Feb  7 21:50:13
Estimated Rsv based completion in           11:15:58 on Sat Feb  8 05:50:13

Best Partition: deepthought

The times given are only estimates. The job could start earlier if other jobs ahead of it in the queue do not use their full walltime, or could be delayed if jobs with a higher priority than yours are submitted before your start time.
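
If you only want the estimated start date, for instance to log it from a script, the relevant line can be picked out of the showstart output. This is a small sketch: estimated_start is a hypothetical helper name, and it assumes the "Estimated Rsv based start in ... on DATE" line format shown in the sample above.

```shell
# Sketch: print the estimated start date from showstart output on stdin.
# Assumption: the line format matches the sample output above, i.e.
# "Estimated Rsv based start in <time> on <date>".
estimated_start() {
    sed -n 's/^Estimated Rsv based start in *[0-9:]* on //p'
}
```

It could be used as "showstart 1770833 | estimated_start". As with showstart itself, the result is only an estimate.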

Detailed information about your jobs

To find out more detailed information about your job, use the checkjob command. This command will show you which specific nodes were allocated to your job, and it will also show you the job requirements you specified when you submitted the job.

login-1:~: checkjob 4209

checking job 4209

State: Running
Creds:  user:kevin  group:wheel  account:kevin  class:serial  qos:serial
WallTime: 00:00:00 of 00:01:00
SubmitTime: Tue Jan 23 10:33:55
  (Time Queued  Total: 00:00:01  Eligible: 00:00:01)

StartTime: Tue Jan 23 10:33:56
Total Tasks: 1

Req[0]  TaskCount: 1  Partition: DEFAULT
Network: [NONE]  Memory >= 0  Disk >= 0  Swap >= 0
Opsys: [NONE]  Arch: [NONE]  Features: [prod]
Allocated Nodes:
[compute-2-39.deeptho:1]


IWD: [NONE]  Executable:  [NONE]
Bypass: 0  StartCount: 1
PartitionMask: [ALL]
Flags:       RESTARTABLE PREEMPTEE PREEMPTOR
Attr:        PREEMPTEE

Reservation '4209' (00:00:00 -> 00:01:00  Duration: 00:01:00)
PE:  1.00  StartPriority:  200
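
The checkjob output is long, so it can be handy to pull out just the job state or the allocated nodes. The sketch below uses two hypothetical helper names (job_state and job_nodes) and assumes the "State:" and "Allocated Nodes:" layout shown in the sample above, with node entries written as [name:tasks].

```shell
# Sketch: print the State field from checkjob output on stdin.
job_state() {
    awk '$1 == "State:" { print $2; exit }'
}

# Sketch: print one node name per line from the "Allocated Nodes:" list.
# Assumption: entries look like [name:tasks], as in the sample above.
job_nodes() {
    awk '
        /^Allocated Nodes:/ { in_nodes = 1; next }
        in_nodes && /^\[/ {
            gsub(/\[/, " ")                 # drop opening brackets
            gsub(/\]/, " ")                 # drop closing brackets
            for (i = 1; i <= NF; i++) {
                sub(/:[0-9]+$/, "", $i)     # strip the ":tasks" suffix
                print $i
            }
            next
        }
        in_nodes { exit }                   # first other line ends the list
    '
}
```

They could be used as "checkjob 4209 | job_state" and "checkjob 4209 | job_nodes".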

Viewing output of jobs in progress

If you want to view the output of your job while it is running, you can use the command qpeek. This command can be used to view both the standard output and standard error streams from your job, and can also be used to follow the output as it occurs.

login-1:~: qpeek
qpeek:  Peek into a job's output spool files

Usage:  qpeek [options] JOBID

Options:
  -c      Show all of the output file ("cat", default)
  -h      Show only the beginning of the output file ("head")
  -t      Show only the end of the output file ("tail")
  -f      Show only the end of the file and keep listening ("tail -f")
  -f Show only the last  lines and keep listening ("tail -f")
  +0f     Show all of the file and keep listening ("tail +0f")
  -#      Show only # lines of output ("tail -")
  -e      Show the stderr file of the job
  -o      Show the stdout file of the job (default)
  -?      Display this help message

login-1:~: qpeek 4209

...this is sample output from job 4209...

login-1:~: qpeek -e 4209

...this is sample error messages from job 4209...

login-1:~: qpeek -f 4209

...this is sample output from job 4209, the command will not exit, and
will continue to show output as it is generated...
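
Because qpeek writes the job's output to your terminal, it can be combined with ordinary text tools. As a minimal sketch, the hypothetical helper below counts the lines in a job's standard error stream that mention "error". It relies only on the -e option documented above.

```shell
# Sketch: count lines mentioning "error" (case-insensitive) in the
# stderr output a running job has produced so far.
# Usage: job_error_count JOBID
job_error_count() {
    qpeek -e "$1" | grep -ci 'error'
}
```

It could be used as "job_error_count 4209". A result of 0 means no matching lines have been written yet.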

Cancelling your jobs

To cancel your job before it completes, use the canceljob command.

login-1:~: canceljob 7274


job '7274' cancelled
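
To cancel several jobs at once, canceljob can be called in a loop. The sketch below uses a hypothetical helper name (cancel_jobs) and assumes that canceljob returns a non-zero exit status when it fails, which this page does not confirm.

```shell
# Sketch: cancel each job ID given as an argument, one canceljob call
# per ID, and report any ID that could not be cancelled.
# Assumption: canceljob exits non-zero on failure.
cancel_jobs() {
    status=0
    for id in "$@"; do
        if ! canceljob "$id"; then
            echo "could not cancel job $id" >&2
            status=1
        fi
    done
    return "$status"
}
```

It could be used as "cancel_jobs 7274 7275". The function returns 0 only if every canceljob call succeeded.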