
Getting Access to the High-Performance Computing Clusters

Table of Contents

  1. Overview
    1. List of Colleges/Departments with resource pools on the HPC clusters
    2. Allocation Limits and Pricing
  2. How do I get access?
  3. For PIs/Allocation Managers: How do I grant access to my allocation?
    1. To add/delete user from existing allocation
    2. To create suballocation
  4. How do I get an Allocation from the AAC?
  5. How do I get access for a class I am teaching?



Overview

The Division of Information Technology (DIT) of the University of Maryland (UMD) maintains a number of High Performance Computing (HPC) clusters for use by researchers at UMD. These include the Zaratan and Juggernaut clusters.

The Zaratan cluster is the University's flagship cluster, replacing the previous Deepthought2 cluster, and the initial hardware purchases were funded by the Office of the Provost at UMD, with contributions from the Engineering and CMNS colleges. Access to the Zaratan cluster is available to all faculty at the University, along with their students, post-docs, etc., through a proposal process.

The Juggernaut cluster consists primarily of hardware contributed by various research units --- access to this cluster is generally restricted to members of the contributing units. It was built primarily to support researchers who needed additional HPC resources beyond what was available on the Deepthought2 cluster, and which, because of sundry data center issues, could not be added to that cluster. We are not currently planning to expand this cluster, and will instead redirect any expansion requests to the new Zaratan cluster. Indeed, the Juggernaut hardware will likely be merged into the Zaratan cluster in the near future.

How do I get access?

To get access to one of the HPC clusters, you need to be granted access to a project and an associated allocation on that cluster. If you were granted an allocation on one of the HPC clusters, you should already have access to the cluster, and should have received email welcoming you to the cluster and giving basic instructions. For more detailed instructions you can view the instructions on using the web portal or instructions on using the command line interface.

If you do not have an allocation of your own, but are working with a faculty member who has an allocation, any manager for that allocation (e.g. the allocation owner or someone they delegated management rights to) can grant you access to their allocation. After that is done, you should be able to log into the cluster either via the web portal or via the command line.

Allocations on the Juggernaut cluster are basically only for those units/research groups which have contributed hardware to the cluster.

Allocations on the Zaratan cluster are available to all faculty at UMD. If you (or your faculty advisor, for students, etc.) do not have an allocation on the cluster, the rest of this section will explain how to obtain one. For students, post-docs, and other non-faculty members, please have the faculty member you are working with apply for the allocation, and then grant you access to it.

Allocations on the HPC clusters consist of allotments of compute time and storage on the cluster. Compute time is measured in Service Units (SU). Essentially, one SU is the use of one CPU core for one hour, with some additional factors coming in to account for differing CPU speeds, excessive memory use, and/or the use of GPUs. Typically, we use units of kSU, where 1 kSU = 1000 SU. So a job running on 1 CPU core for 4 days would usually consume 1 core * 4 days * 24 hr/day = 96 SU. Another job running on 16 cores for 6 hours would also consume 96 SU.
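The SU arithmetic above can be sketched as a small helper. Note that the `factor` parameter is purely illustrative: the actual charging multipliers for CPU speed, memory use, and GPUs are set by DIT and are not documented here.

```python
def job_su(cores, hours, factor=1.0):
    """Estimate Service Units for a job: one SU is one CPU-core hour.

    `factor` is a hypothetical stand-in for the cluster's charging
    adjustments (CPU speed, excess memory, GPU use); the real
    multipliers are determined by DIT.
    """
    return cores * hours * factor

# The two examples from the text:
print(job_su(1, 4 * 24))   # 1 core for 4 days -> 96.0 SU
print(job_su(16, 6))       # 16 cores for 6 hours -> 96.0 SU
```

Both jobs cost the same 96 SU despite very different shapes, which is the point of charging by core-hours.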

The total number of SUs available on the cluster in a given time period is limited; e.g. the total number of SUs per quarter can basically be computed by multiplying the total number of CPU cores by the number of hours in a quarter. Any time a CPU core sits idle for an hour represents an SU which is forever lost. We limit the number of SUs allotted to allocations in order to keep wait times reasonable while still keeping the cluster well utilized. Although the large number of users on the clusters and the law of large numbers tend to make the distribution of usage somewhat uniform over time, for larger allocations we dole out the SUs quarterly to further encourage this.

Allocations also get an allotment of storage. All of the HPC clusters have a high performance file system (HPFS) or scratch tier, which is designed for the temporary storage of data being actively used by jobs. This storage tier is highly optimized so that it can, when used properly, support thousands of processes doing heavy I/O against it. The Zaratan cluster also supports a second, larger storage tier, the SHELL medium-term storage tier, which allows for the storage of large data files which are inputs to or outputs from jobs that are not actively running, e.g. so that you do not need to spend days downloading large files just before submitting a job.

All faculty at UMD are eligible for a basic allocation from the Allocations and Advisory Committee (AAC) consisting of 50 kSU per year, with 0.1 TB of storage on the HPFS/scratch tier and 1 TB of storage on the SHELL medium-term storage tier. This basic allocation is available at no cost to the faculty member; all one has to do to obtain it is fill out an application.

Because we use a single application both for this basic allocation level and for requesting additional resources from the AAC, there are a fair number of questions on the application. For the basic allocation, you can leave many of the questions blank (or put "N/A" in the answer) --- faculty members will be awarded the base allocation just by requesting it. However, it is helpful if you take a few moments and answer these questions to the best of your ability --- your answers will give us some insight into what you are trying to do, and we might be able to offer useful suggestions. In addition, looking at the questions will be helpful if and when you need to request additional resources from the AAC; once you go beyond the "basic" allocation, we will require satisfactory answers to all of the questions in the form, and the answers for some fields require information about your jobs and their performance that you should be collecting while using your basic allocation. So being at least aware of the questions you will need to answer when applying for more resources is useful. As always, if you need assistance with either the basic allocation or when requesting additional resources, please do not hesitate to contact the HPC team.

Although this basic allocation might suffice for some small projects, and is useful if you wish to explore whether high performance computing would benefit your research, we expect most users will need more resources for their work. There are several ways to obtain additional resources.

Generally, the next step is to apply for additional compute and storage resources from the campus Allocations and Advisory Committee (AAC). This committee consists of a number of UMD faculty members with extensive experience in research involving the use of high performance computing, who will evaluate such requests to ensure proper and efficient use of the university's valuable HPC resources.

The AAC can authorize additional resources, up to 500 kSU of compute time, 10 TB of high performance/scratch storage, and 50 TB of SHELL/medium term storage, at no cost to the faculty member.

If even more resources are needed, there are basically two options, which we elaborate on below:

  1. Some colleges, departments, etc. have pools of resources on the HPC cluster in return for contributions they made toward the construction of the cluster. If you belong to one of these units, you might be able to get additional resources from your unit.
  2. All users are able to purchase additional resources from DIT.

The units which have pools of HPC resources to allocate, along with their contacts and which clusters they have pools on, are as follows:

| Unit | Contact Person | Zaratan? | Notes |
| --- | --- | --- | --- |
| A. James Clark School of Engineering | Jim Zahniser | X | ENGR is doing some cost recovery |
| College of Computer, Mathematical and Natural Sciences (CMNS) | Mike Landavere | X | Delegated to Departmental Level (see below) |
| CMNS: Atmospheric and Oceanic Science | Kayo Ide (?) | X | |
| CMNS: Astronomy | Benedikt Diemer | X | |
| CMNS: Biology | Wan Chan | X | |
| CMNS: CBMG | Wan Chan | X | |
| CMNS: Chemistry | Caedmon Walters | X | |
| CMNS: Computer Science | Jeanine Worden | X | |
| CMNS: Entomology | Greg Hess | X | |
| CMNS: Geology | Phil Piccoli | X | |
| CMNS: IPST | Alfredo Nava-Tudela | X | |
| CMNS: Joint Quantum Institute | Jay Sau | X | |
| CMNS: Physics | Jay Sau | X | |
WARNING
Please note that all policies/procedures/etc. related to the allocation of these resource pools are completely up to the units above; the Division of IT is not involved in the policies or decision-making process. Also note that while we try to present accurate and up-to-date information regarding these matters, the units are not required to inform us before making changes, so for the most accurate and definitive information we suggest you contact the relevant people in the unit. To our knowledge, Engineering is doing some cost recovery on allocations from their pool, but the other units above are not directly charging faculty members for allocations granted from their resource pools.

All faculty are eligible to purchase additional HPC resources. Please see the cost sheet for pricing. Monies from these charges will be used to maintain, enhance, and expand the cluster.

You are not limited to just one of the above options. Indeed, the same application form is used for all allocation types managed by the Division of IT (i.e. everything but the allocations from college/departmental pools), and you only need to apply once for the full amount of resources you require (we will automatically grant the base allocation, submit any additional requested resources (up to the cap) for review by the AAC, and provide a quote for the remainder). Compute time from a purchased allocation will not be available until arrangements for payment have been made; if you request an amount of resources which would require payment by mistake, it is not a problem, as it will be corrected when we contact you to arrange for payment.

All allocations have an expiration date, which is at most one year from the date of approval. The allocations from DIT/AAC can be renewed, but renewal requires the submission of a renewal application. For "base" allocations, renewals are essentially automatic; for AAC allocations, the AAC will want to see more detail, including a summary of what was accomplished with the previously awarded resources. We also request that all PIs update the list of publications in ColdFront.

Allocation Limits and Pricing

The following table summarizes the various options for obtaining allocations and the limits which apply:

| Allocation Class | From | Compute Time | Scratch/HPFS | SHELL/MTS | URL for applying |
| --- | --- | --- | --- | --- | --- |
| "Free" Base allocation | DIT | 50 kSU/year | 0.1 TB | 1 TB | Campus AAC application form |
| "Free" AAC allocation | AAC | up to 500 kSU/year | up to 10 TB | up to 50 TB | Campus AAC application form |
| College/Departmental allocations | College/Department | up to the College/Department | up to the College/Department | up to the College/Department | See College/Departmental pool table for a list of units and contacts |

In addition to the "free" allocations from the Allocations and Advisory Committee (AAC), it is also possible to purchase additional resources from the Division of Information Technology. The pricing model depends on a number of factors, including the amount of resources being requested, the amount of excess capacity currently on the cluster, and the time frame for the request.

We strive to keep the capacity of the cluster at or above the total resource commitment as otherwise there could be serious issues with shortages of resources (e.g. long wait times in the queues for jobs, disks running out of space, etc). If your requested resources can be met from existing excess capacity (i.e. capacity on the cluster which has not been allocated to other users via their purchases or any of the allocation avenues described above), we can typically grant the request rather quickly.

Currently (Jan 2023) we have an excess capacity of about 15,000 kSU/quarter, along with about 200 TB of scratch and 1 PB of SHELL storage. For requests that can be satisfied from our excess capacity, the pricing model for additional resources is as follows:

| Resource | Unit | Price |
| --- | --- | --- |
| Compute | 1 kSU for 1 quarter | $2.32 |
| Scratch storage | 1 TB for 1 quarter | $8.28 |
| SHELL storage | 1 TB for 1 quarter | $5.57 |

These resources are all tied to a specific quarter in which they are valid (this determination will be made at the time of purchase). Any resources which are not used in the specified quarter do not roll over; they simply vanish. You can make a purchase of e.g. 1000 kSU for a year, but this will be broken down into a set number of kSU for each quarter in that year; by default we will allot 250 kSU/quarter for each of the four quarters, but you can request a different quarterly allotment.
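The quarterly breakdown above can be sketched as follows. The function name and the weights interface are illustrative, not an actual DIT tool; they simply model splitting an annual purchase across four quarters.

```python
def quarterly_allotment(total_ksu, weights=(1, 1, 1, 1)):
    """Split an annual kSU purchase into per-quarter allotments.

    By default the split is even (the DIT default); pass custom
    weights to model a different requested schedule. Remember that
    unused kSU in a quarter do not roll over.
    """
    scale = total_ksu / sum(weights)
    return [w * scale for w in weights]

# The example from the text: 1000 kSU purchased for a year.
print(quarterly_allotment(1000))  # [250.0, 250.0, 250.0, 250.0]

# A requested uneven schedule, back-loading later quarters:
print(quarterly_allotment(1000, (1, 1, 2, 4)))  # [125.0, 125.0, 250.0, 500.0]
```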

Similarly, you can purchase storage resources for a year or longer, and specify how you wish the storage to be allotted quarter by quarter (although since files tend to be more permanent than jobs, we strongly encourage either dividing equally among the quarters, or having the allotted amount increase with each successive quarter). Thus to get an additional 1 TB of scratch space for 3 years (or 3 years * 4 quarters/year = 12 quarters), you would need to pay 12 times the $8.28 quarterly price. At the end of the contracted period, unless you purchased additional space in another contract, the added space will go away (i.e., your project's quota on the relevant storage tier will return to the value it was prior to the purchase); this will likely result in your project being over quota unless you deleted or transferred data elsewhere. As per HPC policy, in such cases you will be warned of the overage and asked to resolve the matter in a timely fashion (typically a week or so) --- failure to do so may result in your project being charged for additional quarters of storage use (at whatever the going rate at the time is, which might be more than the rates in the initial contract).
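A small cost estimator, using the quarterly list prices from the table above (Jan 2023; subject to change), reproduces the 3-year scratch example. The dictionary keys and function name are our own illustrative choices, not a DIT interface.

```python
# Quarterly list prices from the pricing table (Jan 2023).
PRICE_PER_QUARTER = {
    "compute_ksu": 2.32,   # per kSU per quarter
    "scratch_tb": 8.28,    # per TB of scratch per quarter
    "shell_tb": 5.57,      # per TB of SHELL per quarter
}

def contract_cost(resource, amount, quarters):
    """Total cost of holding `amount` units of `resource` for `quarters` quarters."""
    return round(PRICE_PER_QUARTER[resource] * amount * quarters, 2)

# The worked example from the text: 1 TB of scratch for 3 years (12 quarters).
print(contract_cost("scratch_tb", 1, 12))  # 99.36
```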

Note: Data stored on the various storage tiers is still subject to the HPC policies on the use of the respective storage tier, even if your project has purchased additional storage. For example, the policy that all data on the scratch tier must be in support of "active" jobs on the cluster (i.e. jobs that just finished, are actually running or in the queue, or jobs to be submitted in the near future) still applies to all data on the cluster, even if your project purchased additional scratch space.

In order to use the cluster, all allocations need some mix of compute and storage. Since many users are a bit unsure as to the relative amounts, we offer a "balanced" package of compute and storage resources, for $2.86 per "balanced unit", with a balanced unit consisting of:

Again, this is just a convenient ratio in which to buy compute and storage, reflecting the ratio of those resources in the initial cluster configuration. The terms are the same as previously mentioned, and there is no "discount" for buying in "balanced units" (the prices listed above are actually our fully discounted prices, so we cannot go lower). This is just a recommended ratio for would-be purchasers who know they need a bit of all resources but are not sure how much of each they should purchase --- if you happen to know that your intended research will need more of one resource and less of another, you should adjust accordingly.

If you request more resources than what we have in our current excess capacity, additional hardware will need to be purchased to accommodate your request. We will need to obtain quotes from the vendor in order to work out what the costs will be. The need to purchase hardware will also mean that there will be some delay before the hardware actually arrives and can be integrated into the cluster. Furthermore, if you are only requesting the additional resources for a short time (compared to the estimated usable lifespan of the hardware involved), surcharges will be applied to cover the estimated overhead before we can sell the resources as excess capacity. This is standard industry practice; even large cloud providers like Google and AWS charge substantially more for short term purchases compared to purchases with a longer commitment, and we are small compared to them.

You can submit a single application for a base allocation, AAC allocation, and paid allocation, and if you are ready to request all of them at the same time, that would be preferred. Of course, if your needs (or your awareness of your needs) change over time, you can submit multiple applications to adjust the allocation sizes. Please note that although the form does not enforce limits, we do track and enforce the annual limits on compute time, etc.

Although the compute time requested in the application and as awarded is on an annual basis, the actual compute time might be meted out either annually or quarterly. This decision is made by DIT based on the size of the allocation; smaller allocations will have compute time doled out annually, and larger ones quarterly --- this is to encourage the use of allocations to be spread out over time. E.g., a 50 kSU "base" allocation will typically be meted out annually; when it is awarded you will receive 50 kSU to use within 365 days from the date of award. But if you purchase 1000 kSU (or receive such from a departmental/college pool), typically you will get 250 kSU/quarter for the 4 quarters following the date of the award. SUs that are not used at the end of a quarter (or the end of the year, for annually meted out allocations) will simply disappear; they will not carry over into the next time period.

If you have a "base" and an "AAC" allocation, we will typically try to consolidate these into a single allocation, representing a single Slurm allocation account; this will generally be more useful than multiple Slurm allocation accounts. We cannot consolidate allocations with different sources, schedules (quarterly or annual), or expirations, so generally college, departmental, and paid allocations will remain as distinct allocations.

Unlike CPU resources, disk space does not "regenerate" with time. Once a file is placed on the file system, it remains there, consuming disk space, until someone removes it. Allocations come with a limited amount of storage. Typically, for each storage resource on a given cluster we sum up the allowances for each allocation into a single limit for the project. E.g., if your AAC allocation granted you 2 TB of HPFS storage, your college allocation granted an additional 1 TB, and you purchased an allocation granting another 1 TB, normally this would be combined to give a 4 TB storage limit for all members of any of those allocations. The storage allotment remains until the allocation expires (or the storage allotment for the allocation changes); if such events reduce your storage allotment, causing your usage to exceed your allotment, the PIs and managers for the project will be contacted to inform them of the issue and request that it be rectified in a timely fashion (typically a week or so). If you receive such a notification and need assistance in rectifying the matter, please contact the HPC team.
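The pooling of storage allowances can be sketched as below. The function names are illustrative; this simply models summing per-allocation allowances into one project-wide limit and what happens when one allocation expires.

```python
def project_storage_limit(allotments_tb):
    """Combined project limit: the sum of each allocation's storage allowance."""
    return sum(allotments_tb)

def over_quota(used_tb, allotments_tb):
    """True if current usage exceeds the combined limit, e.g. after an
    allocation expires and its allowance drops out of the sum."""
    return used_tb > project_storage_limit(allotments_tb)

# The example from the text: AAC (2 TB) + college (1 TB) + purchased (1 TB).
print(project_storage_limit([2, 1, 1]))  # 4 (TB, shared by all members)
print(over_quota(3.5, [2, 1, 1]))        # False: within the 4 TB limit
print(over_quota(3.5, [2, 1]))           # True: the purchased 1 TB expired
```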

For the scratch tier, a true quota system is imposed, which limits the amount of data which members of the allocation can store on the file system. By default, we do not impose quotas on individual users, although such can be done upon request.

The SHELL storage tier is volume based. Typically one volume will be created for the root of the SHELL project directory, along with a volume for each member of the project. Additional volumes may be created on request. Each volume has a limit on the amount of data that can be stored in it; there is a default value, but within reason both the default and the limits on specific volumes can be changed. Since often the amount of data on a volume will be only a fraction of its limit, we allow for oversubscription within reason (i.e. the total of the limits on all of the volumes can exceed the limit for the project); this is fine as long as the total amount of space used on all the volumes fits within the project's limit.
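The oversubscription rule can be expressed as a small check. The volume names and the dict shape are purely illustrative, not an actual DIT interface; the point is that per-volume limits may sum past the project limit as long as actual usage does not.

```python
def shell_ok(volumes, project_limit_tb):
    """Check a SHELL project under oversubscription: the per-volume
    limits may sum to more than the project limit, as long as the
    data actually stored across all volumes fits the project limit.

    `volumes` maps volume name -> (used_tb, limit_tb).
    """
    total_used = sum(used for used, _limit in volumes.values())
    return total_used <= project_limit_tb

vols = {
    "project_root": (0.5, 2.0),
    "alice":        (1.0, 2.0),
    "bob":          (0.2, 2.0),
}
# Volume limits total 6 TB against a 4 TB project limit (oversubscribed),
# but only 1.7 TB is actually in use, so the project is within policy.
print(shell_ok(vols, 4.0))  # True
```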

Resources, paid or otherwise, are allocated ahead of time, and so for the most part you will not get billed for resources previously consumed. The exceptions are for storage. If an allocation expires and is not renewed, or has its storage allotment adjusted downward in a renewal or otherwise, then it is possible that the total amount of data the project stores on the resource can exceed what was allotted to the project. Also, the volume design of the SHELL storage tier, combined with oversubscription, means that it is possible to store more data under the project's SHELL directory than was allocated to the project. Whenever the storage resources used by a project exceed what was allocated to it, warning emails will be sent to the PI and all members of the project informing them of the quota violation and requesting that it be resolved within one week. Resolution can be made by reducing the amount of data stored, renewing the expired allocation, purchasing additional storage, etc. If you need help figuring out how to rectify the situation, or if there are extenuating circumstances which might warrant an extension of the time limit to resolve the matter, please let us know. While our responsibilities towards other users on the cluster will not allow us to ignore such overages, we are willing to work with you to find a mutually acceptable solution. If you are unable to resolve the matter in a reasonable time and are not working with us towards finding an acceptable resolution, we will be forced to bill you for the additional storage used.

For PIs and/or Allocation Managers: How do I grant access to my allocation?

This section is for managers of projects on one of the DIT maintained HPC clusters. If you are NOT a designated manager, i.e. if you are only a member, or not even a member of the allocation, do NOT follow these steps. We will not honor requests made by people who are not managers of the project the allocation is in. If you are not the manager of the project, find the manager and have them make the request.

To add/delete users from an existing allocation

This can be done by designated allocation/project managers in either of two ways: directly via the ColdFront web portal, or by requesting that HPC staff make the change. Both are described below.

Either way, if there are multiple allocations within the project, it is strongly recommended that the membership lists for all such allocations be the same. Membership in any of the allocations for a project grants the user full access to all of the scratch and SHELL storage allotted to the project, and generally it is recommended that users have access to the compute time for all of the allocations belonging to a project as well.

Adding or deleting users from an existing allocation using ColdFront

PIs and managers of projects are now able to view and modify the membership lists of their allocations directly using the ColdFront web-based allocation management GUI.

To add a user to your allocation, there are only a few steps:

  1. Open up your web browser and log into ColdFront.
  2. Find your project and add the user to your project.
  3. Find your allocation(s) underneath the project and add the user to your allocation(s).
  4. Repeat the last step for all of the allocations under the project.

Note: You must add the user to both the project and at least one allocation for them to get access to the HPC cluster. Adding the user to the project does not do much by itself; it basically only makes them eligible to be added to allocations for the project. It is adding the user to the allocation which actually grants them access to the cluster and allocation resources.

To delete users from your allocation(s), the process is basically the reverse of the add user process:

  1. Open up your web browser and log into coldfront
  2. Find your allocation(s) underneath the project and delete the user from your allocation(s).
  3. Repeat the last step for all of the allocations under the project.
  4. Find your project and delete the user from your project.

Please note that it takes an hour or two for the provisioning process to complete. The deprovisioning process is currently somewhat manual, so that might normally take a couple of days. Please submit a ticket to HPC staff if the removal of user access is more urgent.

Requesting HPC staff add or delete the users

Basically, one of the points of contact for the allocation just needs to send email to hpcc-help@umd.edu requesting that the user be added to the allocation. The email should come from the point of contact's official @umd.edu email address, and should also specify:

Note that certain subdomains of umd.edu (e.g. cs.umd.edu, astro.umd.edu) are NOT part of the unified campus username space, and as those subdomains are NOT maintained by DIT, are not usable by us to uniquely identify people. E.g., jsmith@cs.umd.edu might or might not be jsmith@umd.edu, so we cannot reliably map jsmith@cs.umd.edu to a specific person.

The DIT maintained HPC clusters currently require all users on the cluster to have active Glue/TerpConnect accounts. This condition should generally be true for most if not all users automatically, but if you are unsure or need to manually activate your Glue/TerpConnect account, please see this Knowledge Base article (KN0010073). If you submit a request for users without a TerpConnect account, you will just get email back telling you they need to get a TerpConnect account first.

Requests to delete users from the allocation can be handled similarly. Here it does not matter whether the user's TerpConnect account is still active. If the user is not associated with any allocations other than yours, their access to the cluster (as well as to charge against your allocation(s)) will be revoked, and all access to their HPC home directory and any directories on lustre or data volumes will be revoked and those directories slated for deletion. If there is data which should be retained, you should mention that in the email so we can look into reassigning ownership. If the user has access to other allocations, only their ability to charge against your allocation will be revoked, and we will by default not do anything with respect to their home or data files. You should contact the user about any transfer of data that is required (and you can contact us if assistance is needed).

To create suballocations on the HPC clusters

Certain contributors (e.g. Engineering, CMNS and some of its departments) have not allocated all of the resources they are entitled to from their contribution to the Zaratan cluster, and are instead periodically creating suballocations carved from these unallocated resources.

To create new suballocations, or modify the resources granted to existing suballocations, the points of contact for the contributions with unassigned allocations should send email (from their official @umd.edu email address) to hpcc-help@umd.edu including the following information:

Again, all points of contact and members of the suballocation MUST already have active Glue/TerpConnect accounts before submitting the request. See here for information and instructions on activating TerpConnect accounts.

Also, all such requests MUST come from a designated point of contact for the parent contribution.

How do I get an Allocation from the AAC?

All applications for allocations from DIT and/or the AAC ("base" allocations, AAC allocations, and purchased allocations, but not allocations from colleges/departments/etc) are made via this form. This section discusses the various fields. If you are requesting an HPC allocation for use with a class you are teaching, please contact the HPC team directly, including the name of the class, the semester, and a brief discussion of what the allocation will be used for (e.g. class demonstrations, student projects, etc).

As noted earlier, we use a single application both for the basic allocation level and for requesting additional resources from the AAC, so there are a fair number of questions on the application. For the basic allocation, you can leave many of the questions blank (or put "N/A" in the answer) --- faculty members will be awarded the base allocation just by requesting it. Once you go beyond the "basic" allocation, however, we will require satisfactory answers to all of the questions in the form, and the answers for some fields require information about your jobs and their performance that you should be collecting while using your basic allocation.

Applications which are seeking resources beyond the "basic" allocation levels will need to provide satisfactory answers to all of the questions in the form. In general, the more resources being requested, the more carefully the AAC will scrutinize the application. If any of the answers are found lacking, or if questions arise when looking at your application, we will get back to you requesting elaboration. Typically it is this requesting and waiting for additional information that causes the most delays in the processing of the application. However, historically, if you respond to the requests for additional information, the AAC will approve the application.

As always, if you need assistance with the form, either for the basic allocation or when requesting additional resources, please do not hesitate to contact the HPC team. We would be happy to assist you.

The following list gives all of the labels for the various fields on the form to request an allocation from the AAC; click on a label to be taken to a discussion of what is being requested by that field. The list is alphabetical; in the detailed discussion, the fields are presented more or less in the order they appear on the form.

  1. Additional High-performance Scratch Disk Space (TB)
  2. Additional SHELL (Medium-Term) Disk Space (TB)
  3. Additional Software Needs
  4. Code Use and Scalability
  5. Desired Allocation Name
  6. Desired End Date
  7. Desired Start Date
  8. Disk space Justification
  9. Estimated Ram Per CPU Core (in GB)
  10. Faculty Advisor
  11. HPC Experience
  12. Milestones
  13. Past Results
  14. Processor Need
  15. Publications
  16. Renewal Cluster
  17. Request Type
  18. Requested Allocation Type
  19. Requested Cluster
  20. Requested for
  21. Requested kSU
  22. Research (Lay) Abstract
  23. Research Title
  24. Software Requested
  25. SU Justification
  26. Unix Experience
Requested for:
This should list the person who is filling out the form. If a student, etc. is filling out the form on behalf of their faculty advisor, this should be the student's name. (The faculty advisor's name goes in the Faculty Advisor field.)
Faculty Advisor:
This field should list the faculty member on whose behalf you are filling out the form. If the requestor is a faculty member filling out the form on their own behalf, this can be left blank.
Request Type:
This specifies the type of request, either New Allocation or Renewal. If this is the first project/allocation for you (or your faculty advisor, if requesting on their behalf), choose New Allocation. If it is a request to renew an existing allocation, or to add new resources to an existing allocation, select Renewal.
Desired Allocation Name
This is the desired name of the project/allocation.
Research Title
This is the title you want for your project/allocation.
Research (Lay) Abstract
This should be a paragraph or two discussing the research you are proposing to accomplish with the HPC cluster. This should discuss the scientific aspects of what you are trying to do; computational and algorithmic details belong in other sections (SU Justification, Disk Space Justification, and/or Code Use and Scalability). Even for "base" allocations, we request that you provide this information so that we better know what research is being conducted on the HPC resources.
Past Results
(Only for Renewal type applications). For renewal applications, please list what was accomplished with your previous allocation. Publications, if any, should go in the Publications section. In this section, please describe what was achieved --- usually these would be related to the Milestones listed in the previous application. Even for the renewal "base" allocations, we would appreciate having this information.
Publications
(Only for Renewal type applications). For renewal applications, please list what publications were at least in part made possible based on the computations done in your previously awarded HPC allocations. Even for the renewal "base" allocations, we would appreciate having this information. We recommend that you enter this information in ColdFront first and then cut and paste here.
Desired Start Date
If you are requesting a project/allocation ahead of time, please specify the date when you want the allocation to start. It defaults to a week from the current date; if left at the default, the allocation will start as soon as possible, often within a day or so.
Desired End Date
All allocations have an expiration date, which will default to one year after the allocation is created/renewed. If you do not need the resources for a full year, please adjust the date accordingly. For quarterly allotted allocations, this will be adjusted to the date at which a quarter ends.
Requested Allocation Type
This is a drop-down of predefined allocation sizes for the allocation being requested. For new allocations, or when renewing an expiring allocation, this should be the size of the allocation you are requesting. If you are seeking to add SUs to an existing allocation, this should be the size of the new allocation, i.e. the sum of what you currently have plus the additional SUs being requested (in this case, it does not hurt to break this down in the SU Justification field).

Development is for "base" allocations. Small and Medium are for allocations which can be awarded by the AAC; Large is for allocations which require payment.

Whatever you select here will set the default for the Requested kSU field.

Requested kSU
Please enter the requested kSU. As described in the Requested Allocation Type section, this should be the combined size of all allocations you are seeking --- i.e. if requesting additional SUs on top of what was already allocated to you, this should be the sum of the current SU levels plus the amount of new compute time being requested. This should be within the range determined by the Requested Allocation Type. You will also need to justify the requested kSU in the SU Justification section.
Additional High-performance Scratch Disk Space (TB)
All allocations get a base allotment of 0.1 TB of scratch/HPFS storage. If that is all the scratch disk space you expect to need, you can just leave this at 0 TB. If you expect to need more scratch space, please enter your expected need (in TB). If you have an existing allocation and are requesting more space, this number should be the total space needed (i.e. the currently allotted space plus any additional space you are requesting). Round to the nearest TB. You will also need to justify the space in the Disk Space Justification section.
Additional SHELL (Medium-Term) Disk Space (TB)
All allocations get a base allotment of 0.1 TB of SHELL/medium-term storage. If that is all the SHELL disk space you expect to need, you can just leave this at 0 TB. If you expect to need more SHELL space, please enter your expected need (in TB). If you have an existing allocation and are requesting more space, this number should be the total space needed (i.e. the currently allotted space plus any additional space you are requesting). Round to the nearest TB. You will also need to justify the space in the Disk Space Justification section.
Estimated RAM per core (in GB)
Please give the estimated amount of RAM that your jobs will need, in GB, per CPU core. For sequential jobs, this should be your estimate of the total memory needed by the job; for multithreaded or MPI jobs, this should be your estimate of the total memory needed by the job divided by the number of cores/tasks. If unsure, 4 GB/core is a standard value.
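The arithmetic described above can be sketched as follows (the helper function and the job sizes are hypothetical illustrations, not part of the form or the cluster tooling):

```python
def ram_per_core_gb(total_ram_gb, n_cores=1):
    """Estimate RAM per CPU core in GB.

    For a sequential job, pass n_cores=1 (the total memory is the answer);
    for multithreaded or MPI jobs, divide the total memory by the number
    of cores/tasks.
    """
    return total_ram_gb / n_cores

# Hypothetical examples:
print(ram_per_core_gb(8))        # sequential job needing 8 GB -> 8.0 GB/core
print(ram_per_core_gb(256, 64))  # 64-task MPI job needing 256 GB -> 4.0 GB/core
```

In the second example, 4 GB/core matches the standard value, so no extra memory multiplier would come into play.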
"Requested Cluster" or "Renewal Cluster"
This is a drop-down list in which you can state the HPC cluster you wish the allocation to be made on. The decision as to which cluster the allocation is placed on is up to DIT and the AAC; however, your request will be considered. If there are specific considerations you think the AAC should be aware of when deciding on a cluster, please include those in the SU Justification section.
Processor Need
This is a drop-down giving different processor options. Currently, it only asks whether you intend to use GPUs. While we do not currently restrict your usage to either CPUs only or GPUs only on the basis of your answer, an honest answer helps our understanding of how the cluster is being used. I.e., if you do not think you will be using GPUs, answer CPU only. If some months later you discover that GPUs are useful, your answer here does not prevent you from using them --- just please change your answer when you renew the allocation.
Software Requested
This section allows you to list the software packages that you intend to use in carrying out your research. This helps us better understand how people are using the cluster. The drop-down list includes many of the packages installed on the cluster, but as we are frequently adding to our software library, this list is not always up to date. Please glance at the list and select any packages you expect to use. If you require a software package that is not listed, please select Other and fill in the Additional Software Needs field as well.

Your answers here help us better evaluate your application as well as improve our understanding of how people are using the clusters, so accurate answers are appreciated. However, we do not restrict your access to software based on this answer; if there is an application "foo" on the list that you did not select when filling out the application, but after the allocation is awarded you discover that it would be helpful to your research, the fact that you did not select it will not prevent you from using it. We do ask that if you continue using it, you include it when you renew the application.

Additional Software Needs
This section, like the Software Requested field, helps us better understand your application and how the cluster is used. This field is mainly intended so that you can elaborate on what is meant by your choice of Other in the Software Requested field. But you can also use it to specify version requirements for software in the drop-down list.

Please note that the presence of a package in either this or the Software Requested field does not constitute a promise on the part of the HPC staff to install said software, even if the application is approved. The HPC team strives to make a large library of software packages available to our users, and will make reasonable attempts to install packages on request, but not all packages install nicely, or are even suitable for system-wide installation.

Also note that the AAC and DIT do not generally provide licenses for licensed software. The HPC team will attempt to install licensed packages on request, assuming the requester can provide proof of license (and likely installation media as well). Some of the packages in the drop-down list are proprietarily licensed --- a few such cases are covered by a campus-wide site license, but many are only covered by licenses granted to certain departments and/or research groups. Including such a package in your application, even an approved application, does not grant you access under these licenses --- we will open a discussion with you about licensing in such cases.

WARNING
Do NOT purchase licenses for software you intend to use on one of the UMD maintained HPC clusters without consulting with the HPC team first. Not all licenses are suitable for use on an HPC cluster, and we do not wish you to spend money on a license you cannot use on the cluster. Please contact us before making any such purchases.
SU Justification
In this field, you are requested to provide a justification for the amount of compute time that you are requesting. In order to keep the average length of time jobs wait in the queue acceptable, we limit the total amount of compute time awarded over all allocations to minimize oversubscription. Thus compute time awarded to one group reduces the pool available to award to future groups, so the AAC gives much weight to this field in its decision-making process.

This section is where you describe your computational strategy, in contrast with the Research (Lay) Abstract section, which is more about the science. In particular, for allocations from the AAC, the AAC wishes to see a quantitative justification of the compute time requested. In many cases, this can be as simple as an estimate of the number of jobs that will be required to achieve your stated research Milestones, and an estimate of the SU cost for each job. If there are several different types of jobs to be run, break this down by job type. Remember to include any multipliers if you are using GPUs or more than the average memory per core. When renewing an application or requesting additional time, the AAC typically would like to see estimates of SU consumption by job based on actual runs on the cluster where possible.
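One way to sketch such an estimate is below. All job counts, sizes, and multipliers here are hypothetical, and the helper function is our own illustration; check the cluster's actual charging policy for the multipliers that apply to your jobs.

```python
def estimate_ksu(job_types):
    """Estimate total kSU from a per-job-type breakdown.

    Each entry is (number of jobs, cores per job, wall-clock hours per job,
    charging multiplier). 1 SU = 1 CPU core for 1 hour of wall clock time,
    and 1 kSU = 1000 SU. The multiplier stands in for whatever surcharge
    applies for GPUs or above-average memory per core.
    """
    total_su = sum(n_jobs * cores * hours * mult
                   for n_jobs, cores, hours, mult in job_types)
    return total_su / 1000.0

# Hypothetical breakdown by job type for a renewal request:
jobs = [
    (200, 64, 12, 1.0),   # 200 equilibration runs: 64 cores x 12 h each
    (50, 128, 48, 1.5),   # 50 production runs: 128 cores x 48 h, 1.5x memory multiplier
]
print(f"About {estimate_ksu(jobs):.0f} kSU requested")
```

Presenting the breakdown itself (as in the `jobs` list above), rather than just the final number, is exactly the kind of quantitative justification the AAC is looking for.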

If you are requesting an increase to an existing allocation, it would be helpful to mention such, along with the existing allotment of compute time and the amount of additional compute time being requested. Generally, you do not need to re-justify the SUs already allotted to you; i.e. if the additional compute time is being requested to explore areas not included in the original request, you only need to discuss the new computations. If however the additional time is needed because you need to revise your previous estimates of compute time needed (e.g. you underestimated the memory consumption so a larger memory factor for CPU time is needed, or you discovered that you need to increase the detail in the calculations), it is probably best to justify the whole amount.

If you are only requesting an increase in disk space, not compute time, you can just enter "No change to SUs requested". Be sure to complete the Disk space Justification section.

For "base" level allocations, you do not need to provide much here, although even here any information you do wish to volunteer is useful in helping us to understand how the cluster is being used. You likely cannot leave the field blank, but if you want to be minimalist you can just enter "base".

Similarly, for "paid" allocations you again do not need to provide much here (although if you are requesting resources from the AAC in addition to what you plan to purchase, you need to follow the guidelines for AAC allocations) --- if you are willing to pay for the resources they are presumably needed. Again, any information you are willing to volunteer is useful to us and appreciated.

Allocations from the AAC do require this field to be properly filled out. That includes applications which combine multiple award types whenever one of the award types is AAC (e.g. base + AAC, or base + AAC + paid, or AAC + paid). In these cases, the AAC will expect a good justification of the compute time needed, and the more time requested the better the justification required.

If you have reason to prefer a specific HPC cluster over another, it is recommended that you include the reasons for your preference here. Although DIT and the AAC have the final say as to which cluster your allocation is granted on, your preference (as indicated in the "Requested Cluster" or "Renewal Cluster" fields), together with your justification for that request here, will be considered and honored if feasible and reasonable.

Disk Space Justification
If you requested additional disk space in either the Additional High-performance Scratch Disk Space (TB) or Additional SHELL (Medium-Term) Disk Space (TB) fields, this is where you are requested to provide a justification for those requests. Please justify each type of storage separately.

If you are only requesting the base amounts (i.e. 0.1 TB each of scratch and SHELL storage), or you are just requesting additional compute time (but no additional storage) for an existing allocation, you can just enter "No additional storage".

Otherwise, please state how you arrived at the amount of disk space that you are requesting. For scratch space, this should be related to your estimate of the amount of scratch space required for running a single job, multiplied by the number of jobs you expect to be running more or less at the same time (remember that you are expected to delete unneeded output files, etc. after a job finishes, and move precious input/output elsewhere (e.g. SHELL storage) when it is no longer needed for running or soon-to-run jobs). For SHELL storage, remember that SHELL storage is not intended as a long-term archive.
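The scratch-space arithmetic above can be sketched as follows (the function and the numbers are hypothetical illustrations; substitute your own per-job footprint and concurrency):

```python
import math

def scratch_request_tb(per_job_tb, concurrent_jobs):
    """Estimate scratch space to request: the per-job scratch footprint
    times the number of jobs expected to run more or less simultaneously,
    rounded up (conservatively) to a whole number of TB, since the form
    asks for whole-TB figures."""
    return math.ceil(per_job_tb * concurrent_jobs)

# Hypothetical: each job writes ~0.3 TB of intermediate files,
# and ~10 jobs run at once.
print(scratch_request_tb(0.3, 10))  # -> 3
```

Stating this derivation in the justification (per-job footprint, expected concurrency, and the resulting total) is usually sufficient.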

If you have an existing allocation and are requesting additional storage, please explain why the additional storage is needed, and try to estimate the total storage going forward. Remember that after some threshold, we will need to start charging for disk space (in order that we can grow the total amount of disk storage on the cluster).

Code Use and Scalability
This section is where you should discuss some technical aspects of your computational strategy, in particular the extent to which your calculations are parallelizable. You should at least discuss whether your codes can be parallelized across multiple nodes, are restricted to using multiple cores on a single node, or can only run sequentially.

For "base" level allocations, you need not put much here, e.g. "unknown" or "base" or similar. However, any information you do volunteer will help us better understand how you are using the cluster. If this application is just to increase the storage allotment for an existing allocation, you can just enter "no additional CPU time" or similar. However, for all other applications (e.g. renewal applications or other applications requesting more than the 50 kSU/year base level), this section is required and will be looked at closely. The larger the amount being requested, the more detail will be required.

In particular, for renewal applications the AAC will want to see a discussion of how the performance of the codes being used scale with the amount of resources being allocated to the job. Typically, the performance of jobs will increase significantly as additional resources are made available to the job, up to some threshold value. Increasing the resources beyond that value yields little if any increase in performance, and indeed in some cases degrades performance. This threshold value is dependent on many factors, including details of the problem being solved, details of the algorithm and the specific coding of the algorithm being used, as well as details of the cluster it is being run on. Because of this, this threshold is generally best determined empirically. The AAC will want to see that you can show that you are running your jobs in the optimal range.

For CPU-only jobs, this is generally a matter of running jobs (either production jobs or test jobs which are expected to behave similarly to production jobs for this purpose) with different numbers of cores and looking at the observed parallel speedup. You should be collecting some data about this while using your original award of compute time. Basically, compare the runtimes of one of your jobs as you vary the number of CPU cores available to the code. Traditionally this is compared against the runtime when using only a single core. (Ideally you would be running the same job over and over again, but often you can get decent results running comparable jobs. In that case, you might wish to take more than one data point for each number of cores to minimize effects due to differences in the jobs.) Typically, the code will speed up a bit less than linearly as the number of cores increases, to a point. After that, the code still speeds up, but with diminishing returns, and at some point the performance either levels off or possibly even degrades as more cores are added. Generally, you can stop running tests with more cores once you detect significant levelling off. The goal of these tests is to determine what the "sweet spot" is, i.e. the ideal number of cores to use to maximize efficiency and throughput.
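A minimal sketch of this analysis is below. The timings are made up for illustration, and the helper function is our own, not a tool provided on the clusters; speedup is t(1)/t(n) and parallel efficiency is speedup divided by the core count.

```python
def scaling_table(runtimes):
    """Given {cores: wall-clock seconds} for comparable runs, return
    {cores: (speedup, efficiency)} relative to the single-core run.
    Speedup = t(1)/t(n); efficiency = speedup/n. A sharp drop in
    efficiency marks the point of diminishing returns (the "sweet spot"
    lies just before it)."""
    t1 = runtimes[1]
    return {n: (t1 / t, (t1 / t) / n) for n, t in sorted(runtimes.items())}

# Hypothetical timings of one representative job:
times = {1: 3600, 8: 500, 16: 280, 32: 190, 64: 180}
for n, (speedup, eff) in scaling_table(times).items():
    print(f"{n:3d} cores: speedup {speedup:5.1f}, efficiency {eff:4.2f}")
```

In this made-up data, efficiency collapses between 32 and 64 cores while the speedup barely improves, so something near 16-32 cores would be the range to report and to use for production runs.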

The above discussion applies to both pure multithreaded jobs (i.e. the job only uses multithreading for parallelization) and pure MPI jobs (i.e., each MPI task is single-threaded). For hybrid jobs using both multithreading and MPI, you should modify the above to do a two-parameter search for the "sweet spot", varying both the number of cores per MPI task and the number of MPI tasks. (This assumes that neither is constrained by the nature of the problem being solved; if such constraints come into play, briefly discuss them.)

For jobs that rely primarily on GPUs for computation, you should discuss whether the code can use multiple GPUs or not, and if it can use multiple GPUs whether it is restricted to GPUs on the same node. If multiple GPUs can be used, you should do a similar process of determining the performance of the code as a function of the number of GPUs being used in order to find the "sweet spot".

Also for GPU jobs, you should discuss which of the different types of GPUs available on the system can be used for your job, with a brief comment on why any GPU types are unacceptable (e.g. insufficient GPU memory, etc). If your job will run on multiple types of GPUs, you should show the performance as a function of GPU type. If using a more powerful (and more expensive) GPU does not provide a significant gain in performance, then using the more powerful GPU would be wasteful.

Milestones
This section is where you should list some specific research goals you hope to accomplish with the HPC resources you are requesting in the application. While it is understood that your research and use of HPC resources is likely to be an ongoing, years-long process, the goals you list here should be specific, measurable ones that you hope to realize before the expiration of this allocation (typically a year from the request). When you apply for a renewal of your allocation, you should hopefully be able to carry the Milestones from the application for the existing allocation over into the Past Results section of the renewal application (assuming they were actually accomplished).
HPC Experience
This checkbox simply asks if you have experience using HPC clusters and methods. If you select yes, we will also request you enter the approximate number of years of experience you have. If you have more than a few years of experience with using HPC clusters, feel free to round off to the nearest 5 or 10 years; we do not need a precise number.

This field is just to help us collect some data about who is using the campus HPC resources, i.e. how many novice vs experienced users. Answering "No" will not cause your application to be rejected (although it will raise some questions if this is a renewal application). We encourage researchers who have not used HPC resources in the past to explore whether HPC techniques could benefit their research.

Unix Experience
This checkbox simply asks if you have experience using Unix or Unix-like (e.g. Linux, FreeBSD, etc) operating systems. This field is just for us to better understand our user base; answering "No" will not cause your application to be rejected.

The UMD HPC clusters, like most HPC clusters, run Unix-like operating systems (specifically Linux at UMD), and some advanced functionality is facilitated by (if not requires) some fluency with Unix. However, we have added an OnDemand Portal for interfacing with the cluster, which greatly reduces the amount of Unix familiarity needed to use the cluster. We hope this helps make HPC techniques more accessible to researchers at UMD.

Allocations of compute time are provided as service units (SUs), each of which represents one hour of wall clock time on one CPU core. Different categories of allocations provide cycles for newcomers (development: 20K SUs), for moderately demanding jobs (small: 60K SUs), and for compute-intensive research (large: 100K SUs). The larger allocations are naturally scrutinized more, and generally require the applicant to have shown reasonable knowledge of HPC and its issues, either from previous development grants or other experience on this or other clusters.

AAC allocations on the Zaratan cluster are one-time grants of SUs with a one-year (by default) expiration. SUs can be used as needed over the course of that year. You can apply to the AAC to renew your AAC allocation to extend the expiration another year (this can be done each year).

If an application is approved by the AAC, the allocation will be created, by default, shortly after approval - typically within about one business day. If you would prefer a later starting date (e.g. you will not be able to start using the cluster immediately due to other priorities or because you are awaiting data), please specify such in the proposal, especially if there will be a significant delay. The time between submission and approval can vary; if the application has sufficient detail that the AAC has no follow-up questions, approval is typically within one or two business days. If there are follow-up questions, an HPC administrator will contact you (typically via email) with the questions, and forward your replies back to the committee. Again, you should usually receive notification of approval or follow-up questions within about one or two business days after a submission.

WARNING
Students are only allowed ONE allocation from the AAC for the duration of their time at the university, and that will be a developmental allocation and not renewable. If more CPU cycles are required, their faculty advisor must apply for the allocation.

Criteria used in making such determinations include appropriateness of the clusters for the intended computation, the specific hardware and/or software requested, a researcher's prior experience with high-performance computing, the track record of a requestor who has received HPCC allocations in the past, and the overall merits of the research itself.

The AAC will determine which of the HPC clusters is most appropriate for the request. If the requestor has a specific cluster in mind, that should be explicitly mentioned in the proposal. In addition, the proposal should provide enough information to justify the use of a specific cluster (e.g. the need for Matlab DCS or other cluster-specific licenses, or GPUs, or large memory nodes). While the committee will consider requests for a specific cluster, the committee decides which cluster to grant for a proposal based on which cluster is most appropriate for the request.

To submit an application, go to the HPC AAC application page, and select under the "Forms" menu item on the top menu bar the desired form. If you already have an allocation and you need to request additional time (either because more time is needed to complete the research than originally thought, or because the scope of research is expanding), then please select "Renew an allocation". Otherwise, select "New Allocation" for a new allocation.

When applying to the AAC for an allocation, remember that the AAC generally prefers to award allocations for specific projects. It is best to make a proposal for specific projects, with milestones that can be achieved within one year (or whatever time frame of requested allocation is), and if needed make a renewal request for more time for a second set of goals. In addition, it is useful to include the following in your proposal:

The AAC is unlikely to grant large allocations without a good discussion of most of the above points. However, it is recognized that not all applicants are experienced High Performance Computing (HPC) experts. Indeed, one of the aims of the AAC with this allocation process is to allow researchers who are not even sure if HPC techniques will work for their research an opportunity to try HPC out without a monetary investment. So if you are unable to address all of the above points, you can still apply for an allocation. It is likely that the AAC will, at least initially, only approve your application for a developmental (20 kSU) allocation, but that should not be viewed as a setback. The 20 kSU allocation might even be enough for some small projects, but at minimum it should allow you to collect the information regarding SUs required per job, scalability of code, etc. needed to address the above points when you request additional time in a renewal application.

If you have questions regarding the application process, or what information is requested or how to obtain such, or any other issues, please feel free to contact us.

Several samples of approved applications have been made available, with the kind consent of the applicants, to assist others who wish to apply.

How do I get access for a class I am teaching?

As befits an institute of learning, DIT is willing to make a reasonable amount of HPC cluster resources available to classes in most cases. If you are teaching a course and wish to use HPC resources, please submit a request to HPC admins for class access. Such requests should come from the instructor of record for the class, and should include the following information:

Please provide the above information to the best of your ability when making a request. If you are unsure what is meant or how to answer something, let us know and we will try to clarify. The more completely you answer things the faster the process will go.





