Frequently Asked Questions about the University's High-Performance Computing Clusters

Introduction

Below are answers to some commonly asked questions about the University of Maryland's high-performance computing (HPC) clusters (HPCCs). More in-depth information on a given topic can be found by following the links within each answer.

Table of Contents

  1. I ) Introduction to the Clusters/General Issues
    1. I-1 ) What are the high-performance computing clusters?
    2. I-2 ) Who owns/runs the cluster?
    3. I-3 ) What is the Allocations and Advisory Committee (AAC)?
    4. I-4 ) Where can I find detailed documentation on the clusters?
    5. I-5 ) What are the advantages of joining one of the campus clusters as opposed to starting my own?
    6. I-6 ) How do I contribute to one of the campus clusters?
    7. I-7 ) How can I get help using the clusters?
    8. I-8 ) How should I acknowledge the use of one of the clusters in papers, etc.?
    9. I-9 ) I just received email about my home directory being over quota. What does this mean?
  2. II ) Access to the system/Issues connecting, etc
    1. II-1 ) How do I get access to the system?
    2. II-2 ) My research group contributed to one of the clusters. How do I get access to the system?
    3. II-3 ) My advisor or research group already has an allocation from the AAC. How do I get access to the system?
    4. II-4 ) Can I get access to the system, even if I/my research group did not contribute to one of the campus clusters?
    5. II-5 ) What is a TerpConnect/Glue account and how do I get one?
    6. II-6 ) How do I get an associate/colleague/student/etc added to my allocation?
    7. II-7 ) I cannot connect to the system. What is wrong?
    8. II-8 ) I cannot transfer files. What gives?
    9. II-9 ) I am getting warnings about keys and fingerprints when I try to ssh. Should I be concerned?
    10. II-10 ) How do I change my password?
    11. II-11 ) I forgot my password. What can I do?
  3. III ) Slurm issues/error messages/warnings/etc
    1. III-1 ) What does "sbatch: error: This does not look like a batch script" mean?
    2. III-2 ) Sbatch errors with "Batch job submission failed: Job violates accounting/QOS policy (job submit limit, user's size and/or time limits)". What does that mean?
    3. III-3 ) What does "(AssocGrpCPUMinsLimit)" or "(AssociationJobLimit)" mean?
    4. III-4 ) My job failed with an error like 'slurmstepd: error: *** JOB NUMBER ON NodeName CANCELLED AT Time DUE TO TIME LIMIT ***'. What does that mean?
    5. III-5 ) My job failed with an error 'slurmstepd error: Exceeded step memory limit at some point'. What does that mean?
    6. III-6 ) What does "(QOSResourceLimit)" mean?
    7. III-7 ) My openmpi job complains about CUDA libraries. What does this mean?
    8. III-8 ) What does "Quota" mean in the status field for a job when using the showq command?
  4. IV ) Issues running jobs
    1. IV-1 ) My job is spending a lot of time in the queue. Why? When will it start?
    2. IV-2 ) How can I reduce the amount of time my job spends waiting in the queue?
    3. IV-3 ) My OpenMPI performance is much less than expected. What gives?
    4. IV-4 ) What does "(AssocGrpCPUMinsLimit)" or "(AssociationJobLimit)" mean?
    5. IV-5 ) What does "(QOSResourceLimit)" mean?
    6. IV-6 ) My openmpi job complains about CUDA libraries. What does this mean?
    7. IV-7 ) My OpenMPI job has a warning about 'N more processes have sent' some message. What does that mean?
    8. IV-8 ) What does "Quota" mean in the status field for a job when using the showq command?
  5. V ) Questions about job accounting
    1. V-1 ) What is an SU? Or a kSU?
    2. V-2 ) Which allocations do I have access to?
    3. V-3 ) Which allocation should I charge my job against?

FAQ I) Introduction to the Clusters/General Issues


FAQ I-1) What are the high-performance computing clusters?

The Division of Information Technology provides several high-performance computing (HPC) clusters (hereafter referred to as HPCCs) for general campus use. These are Beowulf clusters consisting of hundreds of x86_64-based compute nodes configured for running large-scale calculations in support of the campus research community. They are especially designed for parallel computation.

The clusters are:


FAQ I-2) Who owns/runs the cluster?

Both the Deepthought and Deepthought2 clusters were purchased with funds from the Division of Information Technology at the University of Maryland along with contributions from various colleges, departments and research groups. Contributing groups receive high priority allocations, replenished quarterly, based on the amount of CPU time their contribution added to the cluster.

Both Deepthought clusters are managed by the Division of Information Technology, and campus IT staff handle all the hardware and system issues, and maintain a large software collection for the users.

The Bluecrab cluster was funded by the State of Maryland and is located at the Maryland Advanced Research Computing Center (MARCC for short). MARCC is jointly managed by Johns Hopkins University and the University of Maryland. The Bluecrab cluster is managed and operated by MARCC systems staff.

A portion of the resources of the Bluecrab cluster has been allocated to the University of Maryland. Those resources, as well as the resources on the Deepthought clusters arising from the Division of Information Technology's contributions, are made available to the campus community by the HPC Allocations and Advisory Committee (or AAC for short) for use by researchers investigating HPC, or who need some HPC time but do not have the funds to invest in a cluster of their own.


FAQ I-3) What is the Allocations and Advisory Committee (AAC)?

The Allocations and Advisory Committee (or AAC for short) is composed of faculty representing colleges and/or departments which have contributed hardware to the Deepthought clusters. This group provides oversight, sets policy, and allocates computational resources to campus researchers.


FAQ I-4) Where can I find detailed documentation on the clusters?

Cluster documentation, including hardware configurations, available software, status, reports, and more, is available at /hpcc

Detailed information about each of the various clusters can be found at:


FAQ I-5) What are the advantages of joining one of the campus clusters as opposed to starting my own?

HPC clusters take a significant amount of work to set up, and after the initial procurement and installation, they also take a fair amount of time to maintain. They also need to be housed in spaces capable of supporting their demanding power and cooling needs. Because the Division of IT takes care of all of this for you, you can focus exclusively on your research needs without the added burden of managing your own IT environment.

Joining one of the campus clusters also provides flexibility with regard to running jobs that you might not otherwise have. For example, if you have already contributed several nodes and would like to see if your applications would benefit from greater parallelization, you could run a multi-core job larger than your contribution. (You wouldn't want to do that indefinitely, since you would likely exhaust your allocation, but you could certainly do it on occasion should the need arise.) Similarly, if you need to run a large number of jobs within a short time period to meet a particular deadline, you can "borrow ahead" on your allocation and obtain additional compute power when you need it. When you purchase your own computing environment, you cannot exceed its maximum compute capacity, and idle cycles cannot be reclaimed, as they can be under the flexible allocation scheme provided by the campus clusters.

Researchers who are unsure about HPC and whether it will improve their throughput are encouraged to apply for a developmental allocation from the AAC. This will enable you to determine whether the use of HPC resources can benefit your research without having to invest in hardware.


FAQ I-6) How do I contribute to one of the campus clusters?

The Allocations and Advisory Committee (AAC) and the Division of Information Technology can help. Send email to the Division of Information Technology to let us know of your interest. We will discuss your research requirements with you and work with AAC members to determine whether the cluster, or high-performance computing in general, is appropriate for your type of research. Test allocations are also available to help determine whether and how much your application would benefit from running in a high-performance computing environment.

Once everyone is in agreement that an investment in the cluster makes sense, we will initiate discussions with the AAC and cluster administrators to iron out the specifications and associated costs of the hardware contribution.

See the section on contributing to the UMD HPC environment for more information about the benefits of contributing to the cluster and how to start a dialog about doing so.


FAQ I-7) How can I get help using the clusters?

The systems staff for the HPC clusters will try to assist you with basic issues of accessing and using the system. However, all but the most basic questions regarding the use of the various research software applications, and especially questions involving details of your field of study, are likely to be beyond our expertise; you are best off directing such questions to your colleagues.

We hope that our usage documentation will answer most questions, and other pages provide further mechanisms for getting assistance.

Basically, for system type questions, you can open a help or trouble ticket, and for application type questions you might get help from our hpcc-discuss mailing list.


FAQ I-8) How should I acknowledge the use of one of the clusters in papers, etc.?

Maintaining a first-class HPC environment is expensive, and the Division of Information Technology requests that you acknowledge your use of our clusters in papers or other publications of research which benefited from this campus resource. Such acknowledgements assist us in demonstrating the value of this resource to campus, and help us to obtain funding for its continued maintenance and/or future expansion.

To acknowledge your use of the cluster, we request that you use this wording.


FAQ I-9) I just received email about my home directory being over quota. What does this mean?

Home directories on the Deepthought HPC clusters are not intended for the storage of large data sets. They are located on disks chosen for reliability over speed, so they are not optimized for heavy I/O, and they are backed up. As such, the available space is more limited than in, e.g., lustre data stores.

Because of this, we have a soft quota policy on these directories. You should keep the size of your home directory under 10 GB. However, because we recognize the large data demands of our HPC users, this is not a hard quota: if your usage is at 7 GB and you copy a 5 GB file into your home directory, the transfer will not be killed when 10 GB is hit. Instead, we allow you to exceed the 10 GB soft quota by reasonable amounts for up to a week. While you are exceeding the soft quota, however, you will receive a daily email informing you of that and asking you to remedy the matter before the week is up.

This policy is designed to try to give you the flexibility of storing large amounts of data in your home directory temporarily, without taxing the system unduly and interfering with the work of your colleagues on the cluster.

There are two types of email warnings you will get if the usage on your home directory goes over the 10 GB soft quota.

The first occurs while you are still in the 7 day grace period, and has a subject like Friendly notice: Your home directory on deepthought/deepthought2 is over quota. This email is to alert you to the fact that you have gone over the 10 GB soft quota, and that your 7 day grace period countdown has started. At this point, you are still in accordance with the policies on the clusters, but you should look into reducing the disk usage in your home directory before the grace period runs out. You should, at your earliest convenience, delete unneeded files, transfer data off the cluster, or move data needed for active research to lustre or other storage. These emails will occur for as long as you are over the 10 GB soft quota.
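To see how your usage compares to the soft quota, a couple of standard Unix commands can help. A minimal sketch (the 10 GB threshold is the soft quota discussed above):

```shell
# Total size of your home directory (compare against the 10 GB soft quota)
du -sh "$HOME"

# The five largest top-level items in your home directory, biggest last,
# to help decide what to delete, transfer off, or move to lustre
du -h --max-depth=1 "$HOME" 2>/dev/null | sort -h | tail -n 5
```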

If you fail to reduce usage by the requested date, you will get the second email, with a subject like URGENT: You are OVER your quota on Deepthought/Deepthought2 homespace. This is more serious: if you receive this email, you are in violation of cluster policy and MUST reduce your homespace usage ASAP.

Unlike the "Friendly notice" emails, the emails when you exceed the grace period go to system staff as well, and if we do not see prompt action to rectify the matter we will contact your advisor and/or suspend your privilege to use the HPC clusters, as you are in violation of policy and negatively impacting the use of the cluster by other users.


FAQ II) Access to the system/Issues connecting, etc


FAQ II-1) How do I get access to the system?

There are basically two methods for getting access to the system:

These are explained in more detail below.

Note: All access to the MARCC/Bluecrab cluster is via application to the AAC.


FAQ II-2) My research group contributed to one of the clusters. How do I get access to the system?

Contact the person responsible for your research group's cluster allocation. Your colleagues and/or advisor should be able to direct you to that individual. Have the allocation owner send email to hpcc-help@umd.edu requesting that your account be granted access. The message should contain your name, your University ID, the name of the allocation group, and the cluster (Deepthought or Deepthought2). Requests must come from a recognized point of contact for the allocation; any other requests will be ignored.

An email will be sent to you within two business days of the request confirming that access to the requested allocation group has been granted. All cluster-related communication is sent to your @umd.edu account, so please monitor all communications and honor any requests from systems staff sent to that address.


FAQ II-3) My advisor or research group already has an allocation from the AAC. How do I get access to the system?

For the Deepthought clusters, this is the same process as for research groups that contributed to the cluster. Basically, have the point of contact submit a request, as described above.

For the MARCC/Bluecrab cluster, please see the section on Getting access to the MARCC/Bluecrab cluster.


FAQ II-4) Can I get access to the system, even if I/my research group did not contribute to one of the campus clusters?

The Allocations and Advisory Committee (AAC) considers all applications for access to the HPC clusters for general campus use. These requests are most often awarded to researchers investigating HPC computations (e.g. is the application suited to parallelization, how much speed up would it get at various levels of parallelization, etc), or for projects which would benefit from HPC methodology but are limited enough in scale that building a HPCC for the project is not cost effective.

Small, 4 kSU developmental allocations are also available for researchers who wish to investigate whether their research could benefit from HPC resources. This allows you to "test drive" an HPC cluster without the monetary investment.

There is no monetary charge to the applicant for the AAC granted allocations. (NOTE: Software licensing, etc. is NOT included in the allocation, even if mentioned in the application. If restrictively licensed software is required, you must provide the licenses. Please contact the Division of Information Technology BEFORE making any software purchases to ensure your license is compatible with the HPC cluster.)

Please see the section on Requesting an Allocation from the AAC for more information.


FAQ II-5) What is a TerpConnect/Glue account and how do I get one?

TerpConnect is the University of Maryland's Unix environment for students, faculty and staff. It is part of a larger Unix environment (named Glue) maintained by the Division of Information Technology. The Deepthought HPC clusters are also part of the Glue environment.

To access any of these Unix environments (including the Deepthought HPC environments) you need to have a TerpConnect/Glue account. The username and password for this will be the same as your campus Directory ID and password, but it might need to be activated separately.

Detailed instructions on how to activate your TerpConnect account are in the campus knowledge base, but basically you just go to https://cgi.oit.umd.edu/cgi-bin/account/activation.cgi , log in with your directory ID and password, and you will see a table of Service Names (in the left column) and descriptions (in the right column). There should be one service labelled "TerpConnect"; if it says "Activated", your TerpConnect account has been activated. If it does not say "Activated", check the box and submit the form. It might take a day to activate.

NOTE: If you have "affiliate" status and you do NOT see a box for the TerpConnect service, this means that your affiliate status was not granted permissions for the required service. Please contact your sponsor and/or PHR person, and look at the list of required services.

If you are not a member (faculty, currently registered student, or staff) of the University of Maryland, you might still be able to get a TerpConnect account if you are working with a faculty member who is willing to sponsor you as an affiliate.


FAQ II-6) How do I get an associate/colleague/student/etc added to my allocation?

WARNING
DO NOT SHARE YOUR PASSWORD with them, or anyone.

The procedure to follow depends on the cluster.

  • For the two Deepthought clusters:

    If you are the owner or point-of-contact for an allocation, you can just send email to hpcc-help@umd.edu requesting that the person be granted access. The message should contain your name and University ID, the name and University ID (e.g. their @umd.edu email address) of the person to be added, the name of your allocation, and the cluster it is on (Deepthought or Deepthought2). Requests must come from a recognized point of contact for the allocation; any other requests will be ignored. You can add points of contact to your allocation in the same way; just be sure to state that you wish the person added as a point of contact for the allocation (and not just as a user of the allocation).

    Email will be sent to you within two business days of the request confirming that access to the requested allocation group has been granted. If you do not see such an email, feel free to follow up to confirm that the request was processed.

    Before someone can be added to your allocation, they must have an active TerpConnect/Glue account. These are readily available to members of the University community, and there are procedures to get such accounts for colleagues, etc., who are not formally affiliated with the university.

  • For the MARCC/Bluecrab clusters:

    Please follow the instructions in the section on Getting access to the MARCC/Bluecrab cluster.


FAQ II-7) I cannot connect to the system. What is wrong?

Make sure you are trying to connect to the login node for the desired cluster, as described in the table found in the section on logging into the clusters. Note: if you drop the login part at the front of the login node names for the Deepthought clusters, you get a different machine, which users are NOT allowed to log into.

Accounts will be disabled if your association with the university ends, e.g. if you graduate, stop registering for classes, or your appointment ends. If you are an affiliate, remember that affiliate status needs to be renewed (by your sponsor) annually.

If none of the above explain your issue, then contact us. To help us diagnose and resolve your issue, please include the following information:

  • The host you are trying to connect to
  • The exact command you are using. For windows applications, etc., any settings would be helpful.
  • The exact error messages if any
  • Your username. DO NOT INCLUDE PASSWORDS
  • As accurately as possible, the time of the failed login attempts
  • If possible, the IP address you are connecting from. The web page http://noc.net.umd.edu/cgi-bin/netmgr/whoami can provide this last bit of information.


FAQ II-8) I cannot transfer files. What gives?

The scp and sftp protocols are very sensitive to spurious output from your initialization scripts. If you can ssh into the box, check whether you see any errors or unusual output. You should only see the "Unauthorized access ..." warning, the "Last login" message, and perhaps a message starting like "DISPLAY is ...". If you see anything else, it is likely interfering with the scp and/or sftp programs, and you should edit your initialization scripts. Tap commands (on systems which have tap enabled, i.e. NOT the Deepthought clusters) must have the -q argument to suppress the help message. Tap commands must NOT appear in Deepthought or Deepthought2 initialization scripts. See also the section on suppressing output from dot files.
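One common fix is to guard any output-producing commands in your startup files so that they only run in interactive shells. A minimal bash sketch (tcsh users can get the same effect by wrapping output in if ($?prompt) then ... endif):

```shell
# Near the top of ~/.bashrc: only emit output in interactive shells.
# scp and sftp run a non-interactive shell, so they see nothing extra.
case $- in
    *i*)
        # Interactive login: messages, fortune, etc. are safe here
        echo "Welcome back, $USER"
        ;;
    *)
        # Non-interactive (scp, sftp, batch jobs): stay silent
        ;;
esac
```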


FAQ II-9) I am getting warnings about keys and fingerprints when I try to ssh. Should I be concerned?

The ssh protocol tries to protect you against a number of different threats. There are two possible warnings: that ssh cannot verify the key, or that the key has changed. The first is normal, especially on your first login attempt from a given system. The latter could signal an attack.

The section on Logging into the System gives more information, including showing samples of both such messages, and providing the key fingerprints for the login nodes of both Deepthought clusters so that you can manually verify the server in the former case.
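If a key-changed warning turns out to be legitimate (e.g. the new fingerprint matches the ones published in that section), you can clear the stale entry from your known_hosts file and reconnect. A sketch, using a hypothetical login node hostname; substitute the host you actually connect to:

```shell
# Remove the stale key for the login node from ~/.ssh/known_hosts
ssh-keygen -R login.deepthought2.umd.edu

# Reconnect; compare the fingerprint ssh displays against the
# published fingerprints before answering "yes"
ssh username@login.deepthought2.umd.edu
```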


FAQ II-10) How do I change my password?

The Deepthought clusters use your standard UMD campus directory ID and password. Information for changing this password can be found in the password change knowledge base article. This includes pictures and a video. You can also use the passwd command from one of the login nodes. NOTE: The aforementioned procedures will change your password on ALL university systems, as they are all part of one common authentication process.

EXCEPTION: If you are using one of the Deepthought clusters as part of a class via a temporary Glue class account (e.g. an account name like cmsc622-10xu), then your class account is distinct from your normal UMD campus directory ID. If you know the password, you can change it using the standard Unix passwd command: when logged into the class account, just type passwd. It will prompt you for your current password, then ask you to enter the new password twice. If you forgot your password, you will need to ask your instructor to reset it for you. (Instructors can find more information in this regard in the class access section of this documentation.)

You can change your password on the MARCC/Bluecrab cluster by visiting the MARCC password management page.


FAQ II-11) I forgot my password. What can I do?

Since the Deepthought clusters use your standard UMD campus directory ID and password, password resets are handled the same way as for this campus-wide password. Information for resetting this password can be found in the password reset knowledge base article. This includes pictures and a video. NOTE: The aforementioned procedures will reset your password on ALL university systems, as they are all part of one common authentication process.

EXCEPTION: If you are using one of the Deepthought clusters as part of a class via a temporary Glue class account (e.g. something like cmsc622-10xu account name), then your class account is distinct from your normal UMD campus directory ID. In this case, you will need to ask your instructor to reset it for you. Instructors can find more information in this regard in the class access section of this documentation.

You can reset your password on the MARCC/Bluecrab cluster by visiting the MARCC password reset page. You will be prompted to enter your username and the email address you registered with MARCC when you activated your account. Remember: on MARCC/Bluecrab, your username will include @umd.edu.


FAQ III) Slurm issues/error messages/warnings/etc


FAQ III-1) What does "sbatch: error: This does not look like a batch script" mean?

The Slurm sbatch command requires that your job scripts start with a shebang line, that is a line beginning with #! followed by the path to the shell to be used to process the script. For example, if you have a script written in tcsh shell ( with the set and setenv commands, etc.), your script should start like:

#!/bin/tcsh
#SBATCH -n 1
#SBATCH -t 30:00

setenv WORKDIR $SLURM_SUBMIT_DIR
...

A similar script in the bourne shell would be

#!/bin/bash
#SBATCH -n 1
#SBATCH -t 30:00

. ~/.profile
WORKDIR=$SLURM_SUBMIT_DIR
export WORKDIR

Slurm requires the shebang line because it searches for that and uses it to determine which shell it should use to run your script. (This differs from the PBS/Torque/Moab/Maui behavior, which ignored the shebang and just ran the script under your default shell unless you specified another shell with a flag to qsub.)

This error is telling you that Slurm could not find a valid shebang line. Generally, you just need to figure out which shell you were using (if you see setenv commands or variable assignments beginning with set, use /bin/tcsh; if you see export lines or variable assignments without the set command, you probably want /bin/bash) and add the corresponding shebang as the first line of your script.
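A quick way to check a script before submitting it is to look at its first line (myjob.sh is a hypothetical script name):

```shell
# sbatch only accepts scripts whose first line is a shebang
if head -n 1 myjob.sh | grep -q '^#!'; then
    echo "shebang present; sbatch should accept this script"
else
    echo "no shebang; add e.g. #!/bin/bash as the first line"
fi
```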


FAQ III-2) Sbatch errors with "Batch job submission failed: Job violates accounting/QOS policy (job submit limit, user's size and/or time limits)". What does that mean?

Although this cryptic error message can mean a number of things, most often on the Deepthought clusters it means that the allocation account you are trying to charge the job against is out of funds.

Normally, if there are insufficient funds for the completion of your job and all currently running jobs charging against the same allocation account, sbatch will accept the job and place it in the queue; it will simply not run until sufficient funds are available (either due to replenishment or due to currently running jobs using fewer SUs than anticipated by the scheduler). These jobs will remain queued with the reason "AssocGrpCPUMinsLimit" or "AssociationJobLimit".

However, if the anticipated cost (based on the walltime limit and CPU cores requested) of the job you are trying to submit by itself exceeds the limit on the allocation account, the job will not even be queued, and sbatch fails with the error

Batch job submission failed: Job violates accounting/QOS policy (job submit limit, user's size and/or time limits)

Typically on the Deepthought clusters this occurs when one tries to submit a job against the standard priority allocation in the final month of a quarter. At that point, the standard priority allocation can have a rather small SU limit, because most or all of its funds have been transferred to the high priority allocation, which can lead to this situation and error message.

If the allocation account you are charging the job against is not nearly empty (see the sbalance command), then please contact system staff. Please include the exact sbatch command you gave, and the path to the job submission script.


FAQ III-3) What does "(AssocGrpCPUMinsLimit)" or "(AssociationJobLimit)" mean?

If you see (AssocGrpCPUMinsLimit) or (AssociationJobLimit) in the NODELIST(REASON) field of the squeue output for your job, that means that the job is pending because there are insufficient funds in the account that the job is being charged against. If you are using the showq command, this manifests as a Quota message in the State column.

(Actually, there are in general a number of different factors that could be causing this message in Slurm, but on the Deepthought clusters the only one which is relevant is the amount of SUs granted to the account you are charging against.)

Note that for a job to start running, we require that the account have sufficient funds to complete it and all currently running jobs (the amount of funds required for completion is computed from the amount of walltime requested for the jobs). So even if your account has 3 kSU remaining and your job would only consume 0.5 kSU, if you (or others in your group) have other jobs currently running that are anticipated to need more than 2.5 kSU to finish, your new job will not run and will be left pending with this status. If those jobs finish earlier than expected, so that there are then enough funds for your job, it will be started when the scheduler next examines it.
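To see which of your jobs are pending and why, you can ask squeue for just the relevant fields. A sketch using standard squeue format codes (%i = job id, %j = job name, %T = state, %R = reason/nodelist):

```shell
# List your pending jobs with the scheduler's reason for each;
# look for (AssocGrpCPUMinsLimit) or (AssociationJobLimit)
squeue -u "$USER" -t PENDING -o "%.18i %.30j %.9T %R"
```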


FAQ III-4) My job failed with an error like 'slurmstepd: error: *** JOB NUMBER ON NodeName CANCELLED AT Time DUE TO TIME LIMIT ***'. What does that mean?

Every job has a time limit. You are strongly encouraged to explicitly state a time limit in your job (see the section on specifying the amount of time your job will run for more information) as the default time limit is rather small (about 15 minutes). If your job runs past the amount of time that was given to it, it will be killed with an error message like the above.

The time limit for the job is needed for the scheduler to efficiently schedule jobs. Your job will spend less time in the queue if you give a good value for this --- you need to specify a time within which you are sure the job will complete (because otherwise it will be killed), and you might want to add some modest padding to that time just to be safe. But you do not want to be excessive, either. If you expect your job will only run for an hour, specifying a walltime of 2 hours is not unreasonable (giving it some padding), but 10 hours is excessive and may delay the start of your job. E.g., if the next job in the queue to be scheduled needs 10 nodes, but only 8 nodes are currently free, and the scheduler estimates the remaining two nodes will only become available in 2.5 hours, it might decide to let your 2 hour job run on some of the free nodes, since your job will be finished before the 10 node job needs them. But if you specified 10 hours of walltime, your job will not fit into that window.
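As an illustration, a minimal job script for a job expected to run about an hour might request two hours of walltime (padding without being excessive); the program name here is hypothetical:

```shell
#!/bin/bash
#SBATCH -n 1
# Expected runtime is ~1 hour; request 2 hours as a safety margin
#SBATCH -t 2:00:00

./my_program    # replace with your actual command
```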


FAQ III-5) My job failed with an error 'slurmstepd error: Exceeded step memory limit at some point'. What does that mean?

Slurm monitors the memory usage of your jobs, and will cancel your job if it uses more memory than it requested. This is necessary because memory is a shared resource, and if your job tries to use more memory than was allocated to it, this could negatively impact other jobs on the same node. This error indicates that at some point your job used more memory than was allocated to it.

If you are getting this error, you can try increasing the amount of memory that you request. The standard Deepthought2 nodes all have a bit over 6 GiB of RAM per core, for 128 GiB per node. There are also a small number of nodes with 1024 GiB per node on the Deepthought2 cluster. The standard nodes on the Bluecrab cluster also have 128 GiB per node (but because of larger core count this comes to a bit over 5 GiB/core), as well as a number of 1024 GiB large memory nodes. See the section on specifying memory requirements of your job for more information.

You should also try to estimate how much memory your code should need, to ensure that you are not hitting a memory leak, which will consume however much memory you throw at the job.
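For example, you can request more than the default per-core share of memory with a directive in your job script (the flag value and program name below are illustrative):

```shell
#!/bin/bash
#SBATCH -n 1
#SBATCH -t 1:00:00
# Request 8 GiB per core instead of the default ~6 GiB share
#SBATCH --mem-per-cpu=8192

./my_program    # replace with your actual command
```

After the job finishes, running `sacct -j <jobid> -o JobID,ReqMem,MaxRSS` shows the memory you requested versus the peak actually used, which helps refine the estimate for the next run.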


FAQ III-6) What does "(QOSResourceLimit)" mean?

If you see (QOSResourceLimit) in the NODELIST(REASON) field of the squeue output for your job, that means that the job has hit against a limit imposed at the QoS level. On the Deepthought clusters, this usually would mean that you have exceeded the maximum number of jobs by a single user that can run at a given QoS level at the same time.

Some users on the cluster legitimately submit hundreds or thousands of single-core jobs at the same time, which can run for several days. These jobs can have the adverse effect of blocking more parallel jobs from running. To balance this, we have imposed limits on the number of jobs from a given user that can be running simultaneously at a given QoS level. You can submit jobs over this limit, but they will remain in a pending state (with QOSResourceLimit in the NODELIST(REASON) field of the squeue command) until additional run slots become available (i.e. one of your currently running jobs completes).

The exact value of this limit is still subject to some tuning as we try to find the best number to ensure the cluster serves all of our diverse user base well. Currently, the limit is in the thousands, so most users will not be impacted by it at all.


FAQ III-7) My openmpi job complains about CUDA libraries. What does this mean?

Many OpenMPI jobs may see a warning like the following near the start of their Slurm output file:

--------------------------------------------------------------------------
The library attempted to open the following supporting CUDA libraries,
but each of them failed.  CUDA-aware support is disabled.
libcuda.so.1: cannot open shared object file: No such file or directory
/usr/lib64/libcuda.so.1: cannot open shared object file: No such file or directory
If you are not interested in CUDA-aware support, then run with
--mca mpi_cuda_support 0 to suppress this message.  If you are interested
in CUDA-aware support, then try setting LD_LIBRARY_PATH to the location
of libcuda.so.1 to get passed this issue.
--------------------------------------------------------------------------

These are generally harmless warnings. The message states that one of the MPI tasks was not able to open certain CUDA-related libraries. CUDA is an API primarily used for interfacing with GPUs --- unless your code was specifically designed to use GPUs and you wanted to use GPUs, you can ignore this message. This warning typically shows up in non-GPU jobs because they usually run on nodes that do not have GPUs and therefore do not have the hardware-specific CUDA libraries being referred to.

If you were planning to use GPUs and you get this warning, then there is a problem. Most likely you did not request the scheduler to assign you GPU enabled nodes. If that is not the case, contact systems staff.

As stated in the warning message, if you are not planning on using GPUs and you wish to suppress the warning, you can add the flags --mca mpi_cuda_support 0 to your mpirun or equivalent command.
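
As a sketch (the executable name, core count, and module name are placeholders), the relevant part of a non-GPU MPI job script might look like:

```shell
#!/bin/bash
# Hypothetical job script fragment: suppress the CUDA-aware support warning
# for a non-GPU OpenMPI job.  ./my_mpi_program is a placeholder.
#SBATCH -n 32
#SBATCH -t 01:00:00

module load openmpi          # assumes an environment-modules setup
mpirun --mca mpi_cuda_support 0 ./my_mpi_program
```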

You might also wish to see the question regarding messages about error/warning aggregation.


FAQ III-8) What does "Quota" mean in the status field for a job when using the showq command?

Please see the question about "(AssocGrpCPUMinsLimit)". That condition manifests with the Quota message with the showq command.


FAQ IV) Issues running jobs


FAQ IV-1) My job is spending a lot of time in the queue. Why? When will it start?

The HPC clusters have a large number of nodes and many, many cores. However, we also have many users, and some of them submit very large jobs. We make no promises about the maximum amount of time a job can be queued. If the system is lightly loaded, most jobs will start within a few minutes of being submitted (there is some small overhead in the batch system). When the cluster is heavily used, wait times of hours to a significant fraction of a day are not unexpected.

If your job is taking a while to run, first check its status with the squeue command. Pay attention to the NODELIST(REASON) field. If it shows (Resources) or (Priority), that means that either there are no nodes available for it to run on (i.e. other jobs are running on the nodes it needs), or that there are higher priority jobs before it in the queue. These conditions usually indicate that the cluster is busy, and that it will start your job when resources are available and it is your job's turn. See the section on getting an estimate of when your job will start for more information.

If the NODELIST(REASON) field shows something else, there might be something wrong that is preventing your job from starting. Typical cases are (AssociationJobLimit) and (QOSResourceLimit), which indicate that you have exceeded either your allocation's funds or the maximum number of jobs a single user is allowed to run at one time.


FAQ IV-2) How can I reduce the amount of time my job spends waiting in the queue?

Please see the following for a more detailed overview of the scheduling process, and the factors that affect scheduling. But for a quick answer, to minimize the amount of time your jobs spend in the queue:

  • For short (under 15 minute) debugging jobs, use the debug partition.
  • Use your high-priority allocation if you have one (unless you are already using the debug partition).
  • Do NOT use the scavenger partition, as it has the lowest priority of all partitions.
  • Do not request an excessive walltime. Your job will be killed, finished or not, when the walltime runs out, so a bit of padding is recommended, but be reasonable. If your job should finish within 4 hours, don't request a walltime of 8 hours.
  • If your job might not use all the cores on a node, consider using the shared flag. This is especially true on the original Deepthought cluster where node sizes vary a lot.
  • While you should always request the resources you expect to need, and maybe with some padding since these are often estimates, do not be excessive. E.g., requesting 3 GB/core of RAM when in reality only 2 GB/core are needed can result in your job spending more time in the queue. Similarly, while the original Deepthought cluster has some nodes with feature tags related to processor speed, etc., their use is deprecated. Requesting such a tag limits the number of nodes your job can run on, and you might end up spending more time waiting in the queue than was saved in run time. (Plus, due to the wide range of chip architectures, sometimes the higher clock speeds do NOT correspond to faster chips).
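
Putting the tips above together, a job script header might look like the following sketch. All values are purely illustrative (substitute your own estimates and allocation name), and the exact spelling of the shared flag (--share vs. --oversubscribe) depends on your Slurm version:

```shell
#!/bin/bash
# Hypothetical job script header applying the queue-time tips above.
#SBATCH -t 05:00:00            # expected 4 hour runtime plus modest padding
#SBATCH -n 8                   # only the cores actually needed
#SBATCH --mem-per-cpu=2048     # realistic memory estimate (MiB/core)
#SBATCH --share                # allow node sharing if not using a whole node
#SBATCH -A foo-hi              # charge your high-priority allocation, if any

./my_program                   # placeholder for your actual command
```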


FAQ IV-3) My OpenMPI performance is much worse than expected. What gives?

We have been seeing poor performance when using OpenMPI versions below 1.8.1 and running on more than about 15 cores per node. The problem appears to be related to how OpenMPI maps tasks to cores.

This is predominantly an issue for Deepthought2 users, as most nodes on the original Deepthought do not have sufficient core counts to hit this issue. The MARCC/Bluecrab cluster does not at this time support OpenMPI versions below 1.8.

We are still investigating the issue, but so far it seems to be fixed by either:

  1. Adding the -bind-to-core option to your MPI run command on versions 1.6 and 1.6.5 of OpenMPI.
  2. Using version 1.8.1 or newer of OpenMPI, which sets -bind-to-core by default.

See e.g. http://blogs.cisco.com/performance/open-mpi-binding-to-core-by-default/ for a discussion of this change. The above settings/new default settings are NOT the best in all cases, but should give better performance in the most common situations. The most notable exception is if memory bandwidth constraints prevent your job from being able to perform well with all the cores on a node.
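
For the older versions, the workaround in option 1 is a single extra flag on the launch command (the program name and task count here are placeholders):

```shell
# Hypothetical invocation for OpenMPI 1.6.x: explicitly bind each MPI task
# to a core.  With OpenMPI >= 1.8.1 this binding is the default.
mpirun -bind-to-core -np 40 ./my_mpi_program
```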


FAQ IV-4) What does "(AssocGrpCPUMinsLimit)" or "(AssociationJobLimit)" mean?

See the answer to this question in a different section.


FAQ IV-5) What does "(QOSResourceLimit)" mean?

See the answer to this question in a different section.


FAQ IV-6) My openmpi job complains about CUDA libraries. What does this mean?

This is likely a harmless warning. See this question and answer for more details.


FAQ IV-7) My OpenMPI job has a warning about 'N more processes have sent' some message. What does that mean?

OpenMPI job output will sometimes include output similar to this:

[compute-b28-48.deepthought2.umd.edu:38553] 20 more processes have sent help message help-mpi-common-cuda.txt / dlopen failed
[compute-b28-48.deepthought2.umd.edu:38553] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages

This message states that a number of other MPI tasks (20 in this case) generated the same error. OpenMPI normally assumes you only want to see each error or warning once per job, not once per MPI task. So if more than one MPI task produces essentially the same error message, it prints the message once, and then prints a message like the one above to let you know that it occurred multiple times. This is usually what you want --- typically the message occurs on every MPI task in the job, and most people do not want to be inundated with the same error message hundreds or thousands of times.

This message usually can just be ignored, since it only says that the previous warning/error occurred multiple times. You should instead focus on the previous warning/error message itself.

As stated in the warning, if you add the flag --mca orte_base_help_aggregate 0 to your mpirun command, OpenMPI will not aggregate messages and you will see the messages from each and every MPI task that generated them.
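
As a sketch (program name and task count are placeholders), disabling aggregation for debugging looks like:

```shell
# Hypothetical invocation: disable OpenMPI's help-message aggregation so
# each MPI task reports its own copy of a warning (useful when checking
# whether all tasks, or only some, hit the problem).
mpirun --mca orte_base_help_aggregate 0 -np 40 ./my_mpi_program
```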


FAQ IV-8) What does "Quota" mean in the status field for a job when using the showq command?

See the answer for (AssocGrpCPUMinsLimit).


FAQ V) Questions about job accounting


FAQ V-1) What is an SU? Or a kSU?

Accounting on the HPC clusters is done in units of Service Units, typically abbreviated as SUs. One SU is equal to 1 hour of walltime on a single core of a CPU. So a job running for 4 hours on all the cores of 3 nodes, with each node having two 8-core CPUs, would accumulate a charge of 4 hours * 3 nodes * 2 CPUs/node * 8 cores/CPU = 192 core-hours = 192 SU.
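
The arithmetic above can be checked with a quick shell computation, using the numbers from the example:

```shell
# SU charge = walltime (hours) x nodes x CPUs/node x cores/CPU,
# using the numbers from the example above.
hours=4; nodes=3; cpus_per_node=2; cores_per_cpu=8
su=$(( hours * nodes * cpus_per_node * cores_per_cpu ))
echo "${su} SU"    # prints "192 SU"
```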

Both job charges and the funds in allocations are measured in SUs. The actual low-level accounting is done in core-seconds, but those numbers are unwieldy. Indeed, even SUs are unwieldy for many purposes, and we often talk in terms of kSUs, with 1 kSU = 1000 SU.

See the section on allocations and job accounting for more information.


FAQ V-2) Which allocations do I have access to?

The sbalance command with no arguments will list the balances for all allocations to which you have access on the Deepthought HPC clusters. Use the mybalance command on the MARCC/Bluecrab cluster.

If you see any allocations which you do not believe you should have access to, please contact us. If you believe you should have access to allocations which are not listed, contact the point of contact/owner of the allocation and have them request that you be granted access.

Some allocations on the Deepthought clusters have a -hi suffix. These are high priority allocations. Groups which have contributed hardware to the cluster will get one or more projects with standard and high priority allocations. E.g., if the Foo group contributed hardware to the cluster, they might get a foo project consisting of the foo (normal priority) and foo-hi (high priority) allocations. We refer to the normal and high priority allocation with the same base name as a project.


FAQ V-3) Which allocation should I charge my job against?

If you are asking this question, presumably you have more than one allocation to which you have access. If not, you do not have a choice, and the system will automatically charge against the single allocation you have access to.

If the research groups, etc., that you belong to have local policies on which allocation you should charge, those take precedence over the advice in this FAQ. We are just providing some useful guidelines in the absence of any policies set by those in charge of the particular allocations.

If you belong to more than one project (i.e., you have access to multiple allocations with different base names, after any -hi suffix is removed), then you should choose the project that best fits the work of the job you are submitting. I.e., if you belong to an allocation for Dr. Jones and one for Dr. Smith, work for Dr. Jones should be charged against Dr. Jones's allocation and not Dr. Smith's, and vice versa.

If you know which project you should charge against, but the project has both normal and high priority allocations (e.g. has a foo and foo-hi allocation), you generally should be charging against the high priority allocation.

The two main exceptions to the above are:

  • You are running in the debug partition. Jobs in the debug partition run with a fixed priority which is the same whether they are charged against a normal or high priority allocation, so running a debug job against a high priority allocation is effectively wasting the "high priority" status.
  • You have exceeded your monthly high-priority allotment.

This latter case is the primary reason for the dual-allocation setup: it allows you to effectively borrow SUs from the next month in the quarter (or use unspent SUs from the previous month), but such borrowed SUs can only run at normal priority.

See the section on specifying the account to be charged for more information on specifying which allocation account your job should be charged against.

See the section on the allocation replenishment process, for more information, which might help clarify the above statements.