Getting Access to the High-Performance Computing Clusters

Overview
1. List of Colleges/Departments with resource pools on the HPC clusters
2. Allocation Limits and Pricing
How do I get access?
For PIs/Allocation Managers: How do I grant access to my allocation?
1. To add/delete user from existing allocation
2. To create suballocation
How do I get an Allocation from the AAC?
1. AAC Application process: Overview
2. AAC Application process: Quickstart for Development Allocations/First time applicants
3. AAC Application process: Quickstart for Renewal Applications to extend the expiration of an existing allocation, or to request additional resources for an existing allocation.
4. AAC Application process: Detailed explanation of fields on the form
How do I get access for a class I am teaching?

Overview

Division of Information Technology (DIT) of the University of Maryland (UMD) maintains a number of High Performance Computing (HPC) clusters for use by researchers at UMD. These include the Zaratan and the Juggernaut clusters.

The Zaratan cluster is the University's flagship cluster, replacing the previous Deepthought2 cluster, and the initial hardware purchases were funded by the Office of the Provost at UMD, with contributions from the Engineering and CMNS colleges. Access to the Zaratan cluster is available to all faculty at the University, along with their students, post-docs, etc., through a proposal process.

The Juggernaut cluster consists primarily of hardware which was contributed by various research units --- access to this cluster is generally restricted to members of the contributing units. It was built primilarly to support researchers who needed additional HPC resources beyond what was available on the Deepthought2 cluster, and which because of sundry data center issues could not be added to that cluster. We are not currently planning to expand this cluster, and instead will be redirecting any expansion requests to the new Zaratan cluster. Indeed, the Juggernaut hardware likely will be merged into the Zaratan cluster in the near future.

How do I get access?

To get access to one of the HPC clusters, you need to be granted access to a project and an associated allocation on that cluster. If you were granted an allocation on one of the HPC clusters, you should already have access to the cluster, and should have received email welcoming you to the cluster and giving basic instructions. For more detailed instructions you can view the instructions on using the web portal or instructions on using the command line interface.

If you do not have an allocation of your own, but are working with a faculty member who has an allocation, any manager for that allocation (e.g. the allocation owner or someone they delegated management rights to) can grant you access to their allocation. After that is done, you should be able to log into the cluster either via the web portal or via the command line.

Allocations on the Juggernaut cluster are basically only for those units/research groups which have contributed hardware to the cluster.

Allocations on the Zaratan cluster are available to all faculty at UMD. If you (or your faculty advisor for students, etc) do not have an allocation on the cluster, the rest of this section will explain how to obtain an allocation. For students, post-docs, and other non-faculty members, please have the faculty member you are working with apply for the allocation, and then grant you access to it..

Allocations on the HPC clusters consist of allotments of compute time and storage on the cluster. Compute time is measured in Service Units or SU . Essentially, one SU is the use of one CPU core for one hour, with some additional factors coming in to account for differing CPU speeds, excessive memory use, and/or the use of GPUs . Typically, we use units of kSU, where 1 kSU = 1000 SU. So a job running on 1 CPU core for 4 days would usually consume 1 core * 4 days * 24 hr/day = 96 SU. Another job running on 16 cores for 6 hours would also consume 96 SU.

The total number of SUs available on the cluster in a given time period is limited; e.g. the total number of SUs per quarter can basically be computed by multiplying the total number of CPU cores by the number of hours in a quarter. Anytime a CPU core sits idle for an hour represents an SU which is forever lost. We limit the number of SUs allotted to allocations in order to try to keep the wait times reasonable while still keeping the cluster well utilized. Although the large number of users on the clusters and the law of large numbers tend to cause the distribution of usage to be somewhat uniform over time, but for larger allocations we dole out the SUs quarterly to further encourage this.

Allocations also get an allotment of storage. All of the HPC clusters have a high performance file system (HPFS) or scratch tier, which is designed for the temporary storage of data being actively used by jobs. This storage tier is highly optimized so that it can, when used properly, support thousands of processes doing heavy I/O against it. The Zaratan cluster also supports a second, larger storage tier, the SHELL medium term storage tier , which allows for the storage of large data files which are inputs to or outputs from jobs, but for jobs which are not actively running, e.g. so that you do not need to spend days downloading large files just before submitting a job.

All faculty at UMD are eligible for a basic allocation from the Allocations and Advisory Committee (AAC) consisting of 50 kSU per year, with 0.1 TB of storage on the HPFS/scratch tier, and 1 TB of storage on the SHELL medium term storage tier. This basic allocation is available at no cost to the faculty member. All one has to do to obtain such an allocation is fill out an application. Because we use a single application for both this basic application level and for requesting additional resources from the AAC, there are a fair number of questions on the application. For the basic application, you can leave many of the questions blank (or put "N/A" in the answer) --- faculty members will be awarded the base allocation just by requesting it. However, it is helpful if you take a few moments and try to answer these questions to the best of your ability --- your answers will provide us with some insight into what you are trying to do and we might be able to offer useful suggestions. In addition, looking at the questions will be helpful if and when you need to request additional resources from the AAC; once you go beyond the "basic" allocation, we will require satisfactory answers to all of the questions in the form, and the answers for some fields require information about your jobs and their performance that you should be collecting while using your basic allocation. So being at least aware of the questions that you will need to answer when applying for more resources is useful. As always, if you need assistance with either the basic allocation or when requesting additional resources, please do not hesitate to contact the HPC team.

Atlhough this basic allocation might suffice for some small projects, and is useful if you wish to explore whether high performance computing would benefit your research, we expect most users will need more resources for their work. There are several ways to obtain additional resources.

Generally, the next step is to apply for additional compute and storage resources from the campus Allocations and Advisory Committee (AAC). This committee consists of a number of UMD faculty members with extensive experience in research involving the use of high performance computing, who will evaluate such requests to ensure proper and efficient use of the university's valuable HPC resources.

The AAC can authorize additional resources, up to 500 kSU of compute time, 10 TB of high performance/scratch storage, and 50 TB of SHELL/medium term storage, at no cost to the faculty member.

If even more resources are needed, there are basically two options, which we elaborate on below:

Some colleges, departments, etc. have pools of resources on the HPC cluster in return for contributions they made toward the construction of the cluster. If you belong to one of these units, you might be able to get additional resources from your unit.
All users are able to purchase additional resources from DIT.

The units which have pools of HPC resources to allocate, along with their contacts and which clusters they have pools on, are as follows:

Unit		Contact Person	Zaratan?	Notes
A. James Clark School of Engineering		Jim Zahniser	X	ENGR is doing some cost recovery
College of Computer, Mathematical and Natural Sciences (CMNS)		Mike Landavere	X	Delegated to Departmental Level (see below)
CMNS	Atmospheric and Oceanic Science	Kayo Ide?	X
CMNS	Astronomy	Benedikt Diemer	X
CMNS	Biology	Wan Chan	X
CMNS	CBMG	Wan Chan	X
CMNS	Chemistry	Caedmon Walters	X
CMNS	Computer Science	Jeanine Worden	X
CMNS	Entomology	Greg Hess	X
CMNS	Geology	Phil Piccoli	X
CMNS	IPST	Alfredo Nava-Tudela	X
CMNS	Joint Quantum Institute	Jay Sau	X
CMNS	Physics	Jay Sau	X

Please note that all policies/procedures/etc related to the allocation of these resource pools by the units above are completely up to the unit; the Division of IT is not involved in the policies or decision-making process. Also note that while we try to present accurate and up-to-date information regarding these matters, the units are not required to inform us before making changes, and so for the most accurate and definitive information we suggest you contact the relevant people in the unit. To our knowledge, Engineering is doing some cost recovery on allocations from their pool, but all of the other units above are not directly charging faculty members for allocations granted from their resource pools.

All faculty are eligible to purchase additional HPC resources. Please see the cost sheet for pricing. These monies from these charges will be used to maintain, enhance, and expand the cluster.

You are not limited to just one of the above options. Indeed, the same application form is used for all allocation types managed by the Division of IT (i.e. everything but the allocations from college/departmental pools), and you only need to apply once for the full amount of resources you require (we will automatically grant the base allocation, submit for review by the AAC any additional resources requested (up to the cap), and provide a quote for the remainder. Compute time from a purchased allocation will not be available until arrangements for payment have been made; if you request an amount of resources which would require payment by mistake, it is not a problem as it will be corrected when we contact you to arrange for payment.

All allocations have an expiration date, which is at most one year from the date of approval. The allocations from DIT/AAC can be renewed, but such will require the submission of a renewal application. For "base" allocations, renewals are essentially automatic; for AACallocations the AAC will want to see more detail, including a summary of what was accomplished with the previously awarded resources. We also request all PIs update the list of publications in ColdFront.

Allocation Limits and Pricing

The following table summarizes the various options for obtaining allocations and the limits which apply:

Allocation Class	From	Compute Time	Scratch/HPFS	SHELL/MTS	URL for appyling
"Free" Base allocation	DIT	50 kSU/year	0.1 TB	1 TB	Campus AAC application form
"Free" AAC allocation	AAC	up to 500 kSU/year	up to 10 TB	up to 50 TB	Campus AAC application form
College/Deparmental Allocations	CMNS	up to the College/Department			See College/Departmental pool table for a list of units and contacts

In addition to the "free" allocations from the Allocations and Advisory Committee (AAC), it is also possible to purchase additional resources from the Division of Information Technology. The pricing model depends on a number of factors, including the amount of resources being requested, the amount of excess capacity currently on the cluster, and the time frame for the request.

We strive to keep the capacity of the cluster at or above the total resource commitment as otherwise there could be serious issues with shortages of resources (e.g. long wait times in the queues for jobs, disks running out of space, etc). If your requested resources can be met from existing excess capacity (i.e. capacity on the cluster which has not been allocated to other users via their purchases or any of the allocation avenues described above), we can typically grant the request rather quickly.

Currently (Jan 2023) we have an excess capacity about 15,000 kSU/quarter and about 200 TB of scratch and 1 PB of SHELL storage. For requests that can be satisfied from our excess capacity, the pricing model for additional resources is as follows:

Resource	Unit	Price
Compute	1 kSU for 1 quarter	$2.32
Scratch storage	1 TB for 1 quarter	$8.28
SHELL storage	1 TB for 1 quarter	$5.57

These resources are all tied to a specific quarter in which they are valid (this determination will be made at the time of purchase). Any resources which are not used in the quarter specified do not roll over, they simply vanish. You can make a purchase of e.g. 1000 kSU for a year, but this will be broken down into a set number of kSU for each quarter in that year; by default we will allocate 250 kSU/qu for each of the four quarters, but you can request a different quarterly allotment.

Similarly, you can purchase storage resources for a year or longer, and specify how wish the storage to be allocated quarter by quarter (although since files tend to be more permament than jobs, we strongly encourage either equally dividing among the quarters, or having the allotted amount increase with each successive quarter). Thus to get an additional 1 TB of scratch space for 3 years (or 3 years * 4 quarters/year = 12 quarters), you would need to pay 12 times the $8.28 quarterly price. At the end of the contracted period, unless you purchased additional space in another contract, the added space will go away (i.e., your project's quota on the relevant storage tier will return to the value it was prior to the purchase); this will likely result in your project being overquota unless you deleted data or transferred data elsewhere. As per HPC policy, in such cases you will be warned of the overage and asked to resolve the matter in a timely fashion (typically a week or so) --- failure to do so may result in your project being charged for additional quarters of storage use (at whatever the going rate at the time is, which might be more than the rates in the initial contract).

Note: Data stored on the various storage tiers is still subject to the HPC policies on the use of the respective storage tier, even if your project has purchased additional storage. For example, the policy that all data on the scratch tier must be in support of "active" jobs on the cluster (i.e. jobs that just finished, are actually running or in the queue, or jobs to be submitted in the near future) still applies to all data on the cluster, even if your project purchased additional scratch space.

In order to use the cluster, all allocations need some mix of compute and storage. Since many users are a bit unsure as to the relative amounts, we offer a "balanced" package of compute and storage resource, for $2.86 per "balanced unit", with a balanced unit consisting of:

1 kSU of compute (for 1 quarter)
15.8 GB of scratch storage (for 1 quarter)
73.0 GB of SHELL (for 1 quarter)

Again, this is just a convenient ratio to buy compute and storage in, reflecting the ratio of those resources in the initial cluster configuration. The terms are the same as previously mentioned; and there is no "discount" for buying in "balanced units" (the prices listed above our actually our fully discounted prices so we cannot go lower). This is just a recommended ratio for would be purchasers who know they need a bit of all resources but are not sure how much of each they should purchase --- if you happen to know that your intended research will be need more of one resource and less of another, you should adjust accordingly.

If you request more resources than what we have in our current excess capacity, additional hardware will need to be purchased to accommodate your request. We will need to obtain quotes from the vendor in order to work out what the costs will be. The need to purchase hardware will also mean that there will be some delay before the hardware actually arrives and can be integrated into the cluster. Furthermore, if your are only requesting the additional resources for a short time (compared to the estimated usable lifespan of the hardware involved), there will be surcharges applied to cover the estimated overhead before we can sell the resources as excess capacity. This is standard industry practice; even large cloud providers like Google and AWS charge substantially more for short term purchases compared to purchases with a longer commitment, and we are small compared to them.

You can submit a single application for an allocation, AAC allocation, and paid allocation, and if you are ready to request all of them at the same time that would be preferred. Of course, if your needs (or your awareness of your needs) change over time, you can submit multiple applications to adjust the allocation sizes. Please note that although the form does not enforce limits, we do track and enforce the annual limits on compute time, etc.

Although the compute time requested in the application and as awarded is on an annual basic, the actual compute time might be meted out either annually or quarters. This decision is made by DIT based on the size of the allocation; smaller allocations will have compute time doled out annually, and larger ones quarterly --- this is to encourage the use of allocations to be more spread out over time. E.g., a 50 kSU "base" allocation will typcially be meted out annually; when it is awarded you will receive 50 kSU to use within 365 days from the date of award. But if you purchase 1000 kSU (or receive such from a departmental/college pool), typically you will get 250 kSU/quarter for the 4 quarters following the date of the award. SUs that are not used at the end of a quarter (or the end of the year for annually meted out allocations) will simply disappear, they will not carry over into the next time period.

If you have a "base" and an "AAC" allocation, we will typically try to consolidate these into a single allocation, representing a single Slurm allocation account; this will generally be more useful than multiple Slurm allocation accounts. We cannot consolidate allocations with different sources, schedules (quarterly or annual), or expirations, so generally college, departmental, and paid allocations will remain as distinct allocations.

Unlike CPU resources, disk space does not "regenerate" with time. Once a file is placed on the file system, it remains there, consuming disk space, until someone removes it. Allocations come with a limited amount of storage. Typically, for each storage resource on a given cluster we sum up the allowances for each allocation into a single limit for the project. E.g., if your AAC allocation granted you 2 TB of HPFS storage, your college allocation granted you an additional 1 TB, and your purchased an allocation granting another 1 TB, normally this would be combined to give a 4 TB storage limit for all members of any of those allocations. The storage allotment remains until the allocation expires (or the storage allotment for the allocation changes); it such events reduce your storage allotment causing your usage to exceed your allotment, the PIs and managers for the project will be contacted to inform them of the issue and request that it be rectified in a timely fashion (typically a week or so). If you receive such a notification and need assistance, etc. in rectifying the matter, please contact the HPC team.

For the scratch tier , a true quota system is imposed, which limits the amount of data which members of the allocation can store on the file system. By default, we do not impose quotas on individual users, although such can be done upon request.

The SHELL storage tier is volume based. Typically one volume will be created for the root of the SHELL project directory for the project, along with a volume for each member of the project. Additional volumes may be created on request. Each volume will have a limit on the amount of data that can be stored in it; this has a default value but within reason the default can be changed as well as the limit for specific volumes. Since often the amount of data on a volume will be only a fraction of its limit, we allow for oversubscriptions within reason (i.e. the total of the limits on all of the volumes above can exceed the limit for the project); this is fine as long as the total amount of space used on all the volumes fits within the project's limit.

Resources, paid or otherwise, are allocated ahead of time, and so for the most part you will not get billed for resources previously consumed. The exceptions are for storage. If an allocation expires and is not renewed, or has its storage allotment adjusted downward in a renewal or otherwise, then it is possible that the total amount of data for the project stored on the resource can exceed what was allotted to the project. Also, the volume design of the SHELL storage tier , combined with oversubscription, means that it is it is possible to store more data under the project's SHELL directory than was allocated to the project. Whenever the storage resources used by a project exceeds what was allocated to the project, warning emails will be sent to the PI and all members of the projects informing them of the quota violation and requesting that it be resolved within one week. Resolution can be made by reducing the amount of data stored, renewing the expired allocation, purchasing additional storage, etc. If you need help figuring out how to rectify the situation, or if there are extenuating circumstances which might warrant an extension of the time limit to resolve the matter, please let us know. While our responsibilities towards other users on the cluster will not allow us to ignore such overages, we are willing to work with you to find a mutually acceptable solution. If you are unable to resolve the matter in a reasonable time and are not working with us towards finding an acceptable resolution, we will be forced to bill you for the additional storage used.

For PIs and/or Allocation Managers: How do I grant access to my allocation?

This section is for managers of projects on one of the DIT maintained HPC clusters. If you are NOT a designated manager, i.e if you are only a member, or not even a member of the allocation, do NOT follow these steps. We will not honor requests made from people who are not managers for the project the allocation is in. If you are not the manager for the project, find the manager and have them make the request.

To add/delete users from an existing allocation

This can be done by designated allocation/project managers in either of two ways:

Directly via the ColdFront web GUI.
By requesting HPC staff process the request.

Either way, if there are multiple allocations within the project, it is strongly recommended that the membership lists for all such allocations be the same. Membership in any of the allocations for a project grant the user full access to all of the scratch and SHELL storage allotted to the project, and generally it is recommended that users have access to the compute time for all of the allocations belonging to a project as well.

Adding or deleting users from an existing allocation using ColdFront

PIs and managers of projects are now able to view and modify the membership lists of their allocations directly using the ColdFront web-based allocation management GUI.

To add an user to your allocation, there are only a few of steps:

Open up your web browser and log into coldfront<\li>
Find your project and add the user to your project.
Find your allocation(s) underneath the project and add the user to your allocation(s).

Note: You must add the user to both the project and at least one allocation for them to get access to the HPC cluster. Adding the user to the project does not do much, it basically only makes them eligible to be added to allocations for the project. It is adding the user to the allocation which actually grants them access to the cluster and allocation resources.

To delete users from your allocation(s), the process is basically the reverse of the add user process:

Open up your web browser and log into coldfront
Find your allocation(s) underneath the project and delete the user from your allocation(s).
Find your project and delete the user from your project.

Please note that it takes an hour or two for the provisioning process to complete. The deprovisioning process is currently somewhat manual, so that might normally take a couple of days. Please submit a ticket to HPC staff if the removal of user access is more urgent.

Requesting HPC staff add or delete the users

Basically, one of the points of contact for the allocation on the just needs to send email to hpcc-help@umd.edu requesting that the user be added to the allocation. The email should come from the point-of-contact's official @umd.edu email address), and also specify:

the name of allocation
the Glue/Terpconnect username (i.e. campus LDAP directory ID, or the left part of their @umd.edu email address)

Note that certain subdomains of umd.edu (e.g. cs.umd.edu, astro.umd.edu) are NOT part of the unified campus username space, and as those subdomains are NOT maintained by DIT, are not usable by us to uniquely identify people. E.g., jsmith@cs.umd.edu might or might not be jsmith@umd.edu, so we cannot reliably map jsmith@cs.umd.edu to a specific person.

The DIT maintained HPC clusters currently require all users on the cluster to have active Glue/TerpConnect accounts. This condition should generally be true for most if not all users automatically, but if you are unsure or need to manually activate your Glue/TerpConnect account, please seethis Knowledge base article (KN0010073) . If you submit a request for users without a TerpConnect account, you will just get email back telling you they need to get a TerpConnect account first.

Requests to delete users from the allocation can be handled similarly. Here it does not matter whether the user's TerpConnect account is still active. If the user is not associated with any allocations other than yours, their access to the cluster (as well as to charge against your allocation(s)) will be revoked, and all access to their HPC home directory and any directories on lustre or data volumes will be revoked and those directories slated for deletion. If there is data which should be retained, you should mention that in the email so we can look into reassigning ownership. If the user has access to other allocations, only their ability to charge against your allocation will be revoked, and we will by default not do anything with respect to their home or data files. You should contact the user about any transfer of data that is required (and you can contact us if assistance is needed).

To create suballocations on the HPC clusters

Certain contributors (e.g. Engineering, CMNS and some of its departments) have not allocated all of the resources they are entitled to from their contribution to the Zaratan cluster, and are instead periodically creating suballocations carved from these unallocated resources.

To create new suballocations, or modify the resources granted to existing suballocations, the points of contact for the contributions with unassigned allocations should send email (from their official @umd.edu email address) to hpcc-help@umd.edu including the following information:

Name of the contribution the suballocation should be carved from
Proposed name of the suballocation
Quarterly size of the allocation in kSU (1000 core-hours)
The amount of scratch storage (in TB) to allot to the allocation
The amount of SHELL storage (in TB) to allot to the allocation
TerpConnect/Glue username of point of contact for suballocation (i.e. their @umd.edu email address)
You can also give additional users to be made members (or additional points of contact) for the suballocation (please specify members vs points of contact)
If the suballocation should have a shorter lifetime than the parent contribution, you must specify its expiration date. Note that since our accounting is essentially quarterly, expiration dates on quarter boundaries (Jan 1, Apr 1, Jul 1, Oct 1) probably make the most sense.

Again, all points of contact and members of the suballocation MUST already have active Glue/TerpConnect accounts before submitting the request. See here for information and instructions on activating TerpConnect accounts.

Also, all such requests MUST come from a designated point of contact for the parent contribution.

How do I get an Allocation from the AAC?

AAC Application process: Overview
AAC Application process: Quickstart for Development Allocations/First time applicants
AAC Application process: Quickstart for Renewal Applications to extend the expiration of an existing allocation, or to request additional resources for an existing allocation.
AAC Application process: Detailed explanation of fields on the form

Overview of AAC application process

All applications for allocations of HPC resources from the Allocations and Advisory Committee (AAC) are made via this form. Do not use that form for:

requests for allocations from your college/department or other unit. Such requests should be made to the contact person for the appropriate college/dept/etc as decisions re the allocation of such resources are made at the college and/or departmental level.
requests to purchase HPC resources from DIT. For such requests, please open a ticket with HPC staff to request a quote and MOU.
requets for an allocation for use for a class you are teaching. For such requests, please open a ticket with HPC staff requesting a class allocation. Please include the name of the class, the semester, and a brief discussion of what you intend to use the allocation for (e.g. class demonstrations, student projects, etc). We ask that you make such requests at least a month before the start of the relevant semester.
requests to be added as a member to an existing allocation. Please have the PI or designated manager of the allocation either grant you access to the allocation directly via ColdFront or open a ticket requesting HPC staff grant you access to the allocation. This form is for requesting a new allocation, or requesting additional resources for or renewing an existing allocation only.

All requests for allocations from the AAC must come from faculty members, although we allow students, etc. to submit the form on behalf of their faculty advisor as long as the advisor updates the request to indicate their approval. Note: while most students will have access to use this form by default, some students, especially those without graduate or research assistanceships, might not. If you fall into this category, please open a ticket with HPC staff indicating such, and we will update records to provide you access. Faculty members are eligible for a single allocation from the AAC, although this allocation may be renewed.

Note:Faculty members are eligible for at most a single allocation from the AAC (excluding allocations for classes they are teaching). Students, etc. should communicate with their advisor and coordinate with their colleagues before applying to the AAC for an allocation for their advisor. If there are multiple projects, perhaps one for each student, these should still be consolidated into a single proposal to the AAC. If the AAC receives multiple applications from different students of the same advisor will appear as a lack of coordination and planning and will not leave a favorable impression with the AAC.

Allocation requests are reviewed, first by HPC staff and then by the AAC, with the level of scrutiny given increasing as the cumulative amount requested by (or on behalf of) the faculty member (for the year) increases. If there is information missing or insufficient detail, the application will be sent back to the applicant for more information. For the first 50 kSU requested by a faculty member in a given year, the application is almost automatically approved. Requests for resources beyond that will generally performance benchmarks based on actual runs on the cluster. Therefore, new users to the cluster should start with a 50 kSU developmental allocation request, and then use the awarded resources to start their research and simultaneously start collecting the performance benchmarks for any future request for additional resources.

Allocations from the AAC are one-time grants of resources, valid for at most one year. They do not get replenished automatically. If needed, you can submit a "renewal" application requesting additional resources before the expiration of your current allocation; generally in such cases the amount of the new request will be added to the previous request(s) for the cumulative amount requested from the AAC for the year, and the expiration date will remain unchanged (i.e. one year after the initial request was granted). The AAC tries to not oversubscribe the computational resources of the cluster; because of this, any award made to one research group correspondingly decreases the amount of resources available for the AAC to award to other research groups, so it is incumbent on the AAC to ensure that resources are being used well and efficiently, hence the application process and the scrutiny on renewal applications. There are limits on the cumulative amount of resources a faculty member is able to receive from the AAC in a given year; if resources beyond those limits are required you must either:

request resources from your college/department/etc (note that depending on the policy of the college/department/etc, there might be costs associated with such)
purchase additional resources from DIT. See our standard pricing list, and open a ticket to request a quote and/or MOU. Please give the amount of resources your are considering and the timeframe for which you are requesting such, even if only estimates.

If this is the first time you (or your advisor if you are requesting an allocation on behalf of your advisor) is requesting an allocation from the AAC, please see the quickstart section for developmental allocations. You should also go there if you are requesting a renewal of an expired allocation at the 50 kSU/developmental level (but not if you are requesting an additional 50 kSU for a non-expired allocation). If you are requesting additional resources for a non-expired allocation, or seeking to renew an expired allocation for more than 50 kSU, please see the quickstart section for renewal applications.

There also is a section providing detailed descriptions of each field in the form and what is expected to go in it which you can refer to if you need more information than is provided in the "quickstart" sections. And as always, if you need assistance with the form, either for the basic allocation or when requesting additional resources, please do not hesitate to open a ticket with the HPC team requesting assistance. We would be happy to assist you.

AAC Application Quickstart: Developmental Allocations

All faculty members are eligible to receive a free "developmental" allocation from the AAC essentially just for the asking. These developmental allocations consist of 50 kSU , 100 GB of scratch storage , and 1 TB of SHELL storage . Faculty members are eligible to receive a developmental allocation once, and can renew it annually.

These developmental allocations are intended for users to explore if HPC techniques are useful for their research, and might even suffice for research with small computational needs. For research with larger computational needs, the AAC will usually first award a developmental allocation to allow the research to start, and to collect data showing that the research program is well-considered and using HPC resources efficiently; such data can be used in a follow-up "renewal" application for additional resources.

Because we use a single application for both this basic application level and for requesting additional resources from the AAC, there are a fair number of questions on the application. For the basic application, you can leave many of the questions blank (or put "N/A" in the answer) --- faculty members will be awarded the base allocation just by requesting it. However, it is helpful if you take a few moments and try to answer these questions to the best of your ability --- your answers will provide us with some insight into what you are trying to do and we might be able to offer useful suggestions. In addition, looking at the questions will be helpful if and when you need to request additional resources from the AAC; once you go beyond the "basic" allocation, we will require satisfactory answers to all of the questions in the form, and the answers for some fields require information about your jobs and their performance that you should be collecting while using your basic allocation. So being at least aware of the questions that you will need to answer when applying for more resources is useful.

For a developmental allocation, we ask that you please focus on the following fields of the form:

Request Type: if this is your first allocation, select New Allocation. If you are requesting a renewal of or additional resources for an existing allocation, select "Renewal Allocation".
Faculty Advisor: If you are not a Professor, please enter your faculty advisor here.
Research Title: Please give a one-line title for your research.
Research (Lay) Abstract: Please provide several paragraphs describing your research. Please include both an overview of your general research program, along with a somewhat detailed account of the research you are proposing to use HPC resources for, and how the HPC resources will be used in the furtherance of that research. The AAC is comprised of faculty members from various disciplines with extensive experience in the use of HPC for advancing research, but do not assume that have extensive experience in your particular discipline. Please address your comments here accordingly.
Requested Allocation Type and Requested kSU: These should be "Development" and "50", respectively, as that is what this quickstart is addressing.
Additional High-performance Scratch Disk Space and >Additional SHELL (Medium-Term) Disk Space: If the default amounts of 100 GB scratch and 1 TB SHELL will not be enough for your initial work, please estimate the amount of each you will need. Otherwise, leave at 0. If you enter something other than 0, you must fill out the Disk Space Justification field.
Disk Space Justification: If you entered something other than 0 for either Additional High-performance Scratch Disk Space or Additional SHELL (Medium-Term) Disk Space, you must fill out this section. For each storage tier you requested additional space for, please explain why additional storage is needed. This should be a quantitative estimate, based on the actual number and sizes of files that will needed to be stored. Note that scratch space is only intended for the storage of data pertaining to "active" jobs, you are expected to delete data on scratch which is no longer needed, and move to SHELL data which should be retained but is no longer needed for active jobs. Your discussion of storage needed should include such.
If you are requesting the renewal of an existing developmental allocation for another year, please fill in the Past Results and Publications. The Past Results fields should list what you achieved with the previous award of HPC resources. The Publications field should list any publications produced since the initial award of resources (or last renewal of the allocation) --- this is useful when we need to show University management the value of the cluster. It is understood that you might not always have something to put in these fields.

Again, you are encouraged to fill out any other fields to the best of your ability. This is useful to give you a better idea of what will be required in any applications for additional resources, as some fields (especially the Code Use and Scalability and SU Justification sections, both of which require some performance benchmarks that the AAC assumes you will be collecting with this initial award).

As always, if you need assistance with the form, either for the basic allocation or when requesting additional resources, please do not hesitate to open a ticket with the HPC team requesting assistance.

AAC Application Quickstart: Renewal Applications

If you have an existing allocation and are requesting to extend it for another year, or are requesting additional resources for that allocation, you need to submit a "renewal" application to the AAC. This section gives a quick overview of the fields you need to pay attention to.

NOTE: if you are simply seeking to renew an expired (or about to expire) developmental (50 kSU) allocation for another year at the same developmental (50 kSU) level, please see the AAC Application Quickstart: Developmental Allocations section as it is more relevant.

If the allocation being renewed is expired or about to expire shortly, then (if approved) the renewal will cause the expiration date to be set to a year in the future, and cumulative resource count for the allocation for the year will be set to whatever was granted for the renewal.

If the allocation is not expired/about to expire, then the expiration date is not changed (i.e. it will still be one year from when the allocation was first granted or the last time it was renewed after it expired or just before it was going to expiration), and any resources awarded are added to the cumulative amount awarded for the year. This will impact the amount that can be requested in future renewal requests until the allocation expires; see for limits on the cumulative amount of resources a faculty can receive in a given year. I.e., if a faculty member makes an initial request for 50 kSU in January, then a renewal request for 250 kSU in March, and another request for 250 kSU in August of the same year, this will be a cumulative request of 550 kSU, causing the faculty member to have "max-ed" out the amount available from the AAC until the following January.

For a renewal allocations, you must fill out all of the fields in the form. A discussion of all of the fields can be found below, but we suggest you focus your efforts on the following fields in order:

Request Type: This should be set to Renewal Allocation. This will add the Past Results and Publications fields to the form, which you will need to address.
Research (Lay) Abstract: Please provide several paragraphs describing your research. Please include both an overview of your general research program, along with a somewhat detailed account of the research you are proposing to use HPC resources for, and how the HPC resources will be used in the furtherance of that research. The AAC is comprised of faculty members from various disciplines with extensive experience in the use of HPC for advancing research, but do not assume that have extensive experience in your particular discipline. Please address your comments here accordingly.
Milestones: This should be set to a list of concrete, verifiable scientific goals you hope to accomplish with the awarded HPC resources. These should essentially be a list of criteria for the success of the project. These should directly relate to the estimates of resources required given in the SU Justification and Disk Space Justification sections. And ideally, you should be able to use this as the basis of the Past Results section in the next application you make.
Code Use and Scalability: In this section you should discuss the codes you expect to use and the degree and manner to which they support parallelization. While the UMD HPC clusters are designed primarily for the use of parallelized codes, we recognize that not all uses of the cluster are parallelizable --- if your intended use is not parallelizable simply state so.
The AAC will also wish to see that resources are being used efficiently. Parallel codes typically show close to linear improvements to performance as more resources are made available to the job, but only up to some threshold. Adding resources beyond that threshold yields little if any performance gains. This threshold is best determined empirically. The AAC will wish to see that you have run tests to determine the scalability of the codes you are using by running performance benchmarks, and that your proposal will be for jobs using an optimal amount of resources.
SU Justification: The AAC wishes that you use this field to detail the calculations you intend to perform, and give detailed, quantitative estimates of the amount of resources that will be needed to accomplish the research goals you listed in the Milestones section. Typically this is most easily done by estimating the number and type of jobs needed to accomplish each milestone in the list (please give a sentence or two describing how you arrived at that number), along with the specifications for each job (number of CPUs per job, the amount of CPU RAM per job, the number and type of GPUs per job, and the average runtime and SU cost of a typical job). All of this should be based on actual jobs run on the cluster you are requesting resources for.
Note: If you have departmental or college allocations in addition to the allocation from the AAC, you will need to justify your total research plan across all such allocations. The AAC will not award additional resources if it concludes that your research program can be readily satisfied with your existing resources from other allocations. If those allocations or allocation levels are expected to change over the course of the year, explain this in your application. If the resources for the other allocations is committed to other projects, then you must detail the resource requirements of those projects in your AAC application as well.
Disk Space Justification: If you entered something other than 0 for either Additional High-performance Scratch Disk Space or Additional SHELL (Medium-Term) Disk Space, you must fill out this section. For each storage tier you requested additional space for, please explain why additional storage is needed. This should be a quantitative estimate, based on the actual number and sizes of files that will needed to be stored. Note that scratch space is only intended for the storage of data pertaining to "active" jobs, you are expected to delete data on scratch which is no longer needed, and move to SHELL data which should be retained but is no longer needed for active jobs. Your discussion of storage needed should include such.
Past Results: In this section, you should list what you accomplished with the previous award of HPC resources. Ideally, you should be able to take the Milestones section from your previous application to the AAC, and just copy it here (assuming you achieved all of the previous milestones). If some milestones were not achieved, you may wish to explain why not, either here or in the SU Justification section if you are requesting additional SUs for such --- the AAC is comprised of experienced researchers, and while habitual failure to accomplish goals will be frowned upon, they do understand that
Publications: In this section you should list any publications produced since the last award of resources to which the use of HPC resources at UMD contributed. Please note that we request that you acknowledge the use of the UMD HPC resources in such papers. Building and maintaining a first class HPC cluster is expensive, and providing us with a list of papers and acknowledgements of the use of UMD HPC resources is useful when we need to justify the need for continued funding of these resources. It is understood that producing publications takes time, and you will not always have publications to list. Please also < a href="/hpcc/help/coldfront.html#proj-update-research">update your publications in the ColdFront portal

Again, you are encouraged to fill out any other fields to the best of your ability. This is useful to give you a better idea of what will be required in any applications for additional resources, as some fields (especially the Code Scalability and SU Justification sections, both of which require some performance benchmarks that the AAC assumes you will be collecting with this initial award).

Detailed field list for AAC application form

The following list gives all of the labels for the various fields for the form to request an Allocation from the AAC. This section click to be taken to a discussion of what is being requested by each field. This list is alphabetical, in the detailed discussion fields are presented more or less in order they appear on the form.

Additional High-performance Scratch Disk Space (TB)
Additional SHELL (Medium-Term) Disk Space (TB)
Additional Software Needs
Code Use and Scalability
Desired Allocation Name
Desired End Date
Desired Start Date
Disk space Justification
Estimated Ram Per CPU Core (in GB)
Faculty Advisor
HPC Experience
Milestones
Past Results
Processor Need
Publications
Renewal Cluster
Request Type
Requested Allocation Type
Requested Cluster
Requested for
Requested kSU
Research (Lay) Abstract
Research Title
Software Requested
SU Justification
Unix Experience

Requested for:

This should list the person who is filling out the form. If a student, etc. is filling out the form on behalf of their faculty advisor, this should be the student's name. (The faculty advisor's name goes in the Faculty Advisor field.

Faculty Advisor:

This field should list the faculty member you are filling out the form on behalf of. It the requestor is a faculty member (e.g. Professor) filling out the form on their own behalf, this can be left blank.

Request Type:

This specifies the type of request, either New Allocation or Renewal. If this is the first project / allocation . for you (or your faculty advisor if requesting on their behalf), choose New Allocation. If it is a request to renew an existing allocation, or to add new resources to an existing allocation, select Renewal.

Desired Allocation Name

This is the desired name of the project / allocation .

Research Title

This should be an one-line title for your research.

Research (Lay) Abstract

Please provide several paragraphs describing your research. Please include both an overview of your general research program, along with a somewhat detailed account of the research you are proposing to use HPC resources for, and how the HPC resources will be used in the furtherance of that research. This section should discuss the scientific aspects of what you are trying to do; computational and algorthmic details belong in other sections (e.g SU justification, Disk space justification, and/or Code Use and scalability).

The AAC is comprised of faculty members from various disciplines with extensive experience in the use of HPC for advancing research, but do not assume that have extensive experience in your particular discipline. Please address your comments here accordingly.

Past Results

(Only for Renewal type applications). For renewal applications, please list what was accomplished with your previous allocation. Publications, if any, should go in the Publications section. In this section, please describe what was achieved --- usually these would be related to the Milestones listed in the previous application. Even for the renewal "base" allocations, we would appreciate having this information.

Publications

(Only for Renewal type applications). For renewal applications, please list what publications were at least in part made possible based on the computations done in your previously awarded HPC allocations. Even for the renewal "base" allocations, we would appreciate having this information. We recommend that you enter this information in ColdFront first and then cut and paste here.

Desired Start Date

If you are requesting a project / allocation ahead of time, please specify the date when you want the allocation to start. It will default to a week from the current date (i.e. the allocation will start as soon as possible, often in a day or so).

Desired End Date

All allocations have an expiration date, which will default to one year after the allocation is created/renewed. aIf you do not need/require the resources for a full year, please adjust the date correspondingly. For quarterly allotted allocations, this will be adjusted to the date at which a quarter ends.

Requested Allocation Type

This is a drop-down of predefined allocation sizes for the allocation being requested. For new allocations, or renewing an expiring allocation, this should be the size of the allocation you are requesting. If you are seeking to add SUs to an existing allocation, this should be the size of the new allocation, i.e. the sum of what you currently have plus the additional SUs being requested (in this case, it does not hurt to break this down in the SU Justifaction field).

Development is for "base" allocations Small and Medium are for allocations which can be awarded by the AAC. Large is for allocations which require payment.

Whatever you set here will default the Requested kSU field.

Requested kSU

Please enter the requested kSU. As described in the Requested Allocation Type section, this should be the combined size of all allocations you are seeking --- i.e. if requesting additional SUs on top of what was already allocated to you, this should by the sum of the current SU levels plus the amount of new compute time being requested. This should be within the range determined by the Requested Allocation Type. You will also need to justify the requested kSU in this section in the Disk Space Justification section.

Additional High-performance Scratch Disk Space (TB)

All allocations get a base of allotment of 0.1 TB of scratch/HPFS storage. If that is all the scratch disk space you expect to need, you can just leave this at 0 TB. If you expect to need more scratch space, please enter your expected need (in TB). If you have an existing allocation and are requesting more space, this number should be the total space needed (i.e. the currently allotted space plus any additional space you are requesting). Round to the nearest TB. You will also need to justify the space in this section in the Disk Space Justification section.

Additional SHELL (Medium-Term) Disk Space (TB)

All allocations get a base of allotment of 0.1 TB of SHELL/medium-term storage. If that is all the SHELL disk space you expect to need, you can just leave this at 0 TB. If you expect to need more SHELL space, please enter your expected need (in TB). If you have an existing allocation and are requesting more space, this number should be the total space needed (i.e. the currently allotted space plus any additional space you are requesting). Round to the nearest TB. You will also need to justify the space in this section in the Disk Space Justification section.

Estimated RAM per core (in GB)

Please give the estimated amount of RAM that your jobs will need, in GB, per CPU core . For sequential jobs, this should be your estimate of the total memory needed by the jobs; for multithreaded or MPI jobs, this should be your estimate of the total memory needed by the jobs divided by the number of cores/tasks. If unsure, 4 GB/core is a standard value.

"Requested Cluster" or "Renewal Cluster"

This is a drop down list in which you can state the HPC cluster you wish the allocation to be made on. The decision as to what cluster the allocation is placed on is up to DIT and the AAC However, your request will be considered. If there are specific considerations you think the AAC should be aware of when making a decision as to what cluster, please include those in the SU Justification section.

Processor Need

This is a drop down giving different processor options. Currently, it only asks if you intend to use

CPU, i.e. you only plan to use CPUs
GPU, i.e. you only plan to use GPUs (graphical processing units)
CPU and GPU, i.e. you plan to use both CPUs and GPUs.

While we do not currently restrict your usage to either CPUs only or GPUs only on the basis of your answer to the above, an honest answer helps our understanding of how the cluster is being used. I.e., if you do not think you will be using GPUs, then answer CPU only. An answer of "CPU only" will not prevent you from submitting GPU jobs, or exploring if your codes can benefit from using GPUs. If you do such tests and discover that GPUs are useful for your research, your answer here does not prevent you from using them --- just please change your answer when you renew the allocation.

Software Requested

This section allows you to list the software packages that you intend to use in carrying out your research. This is useful our better understanding of how people are using the cluster. The drop down list includes many of the packages installed on the cluster, but as we are frequently adding to our software library on the clusters, this list is not always up to date. But please glance at the list and select any packages you expect to use. If you require a software package that is not listed, please select Other and fill in the Additional Software Needs field as well.

Your answers here helps us to better evaluate your application as well as improve our understanding of how people are using the clusters, so accurate answers are appreciated. However, we do not restrict your access to software based on this answer; if there is an application "foo" on the list that you did not select when filling out the application, but after the allocation awarded you discover that it would be helpful to your research, the fact that you did not select it will not prevent you from using it. We do ask that if you continue using it that you include it when you renew the application.

Additional Software Needs

This section, like the Software Requested field, helps us better understand your application and the use of the cluster by users. This field is mainly intended so that you can elaborate on what is meant by your choice of Other in the Software Requested field. field. But you can also use it to specify version requirements of software in the drop-down list.

Please note that the presence of a package in either this or the Software Requested fields does not constitute a promise on the part of the HPC staff to install said software, even if the application is approved. The HPC team strives to make a large library of software packages available to our users, and will make reasonable attempts to install packages on request, but not all packages install nicely, or even are suitable for system wide installations.

Also note that the AAC and DIT do not generally provide licenses for licensed software. The HPC team will attempt to install licensed packages on request, assuming the requestee can provide proof of license (and likely they will need to provide installation media). Some of the packages in the drop down list are proprietarily licensed --- in a few such cases they are covered by a campus wide site license, but many such cases are only covered by licenses granted to certain departments and/or research groups. Your inclusion on the application, even an approved application, does not grant you access under these licenses --- we will open a discussion with you about licensing in such cases.

Do NOT purchase licenses for software you intend to use on one of the UMD maintained HPC clusters without consulting with the HPC team first. Not all licenses are suitable for use on an HPC cluster, and we do not wish you to spend money on a license you cannot use on the cluster. Please contact us any such purchases.

SU Justification

In this field, you are requested to provide a justification for the amount of compute time that you are requesting. In order to try to keep the average length of time jobs wait in the queue acceptable, we limit the total amount of compute time over all allocations awarded to minimize the amount of oversubscription. Thus compute time awarded to one group reduces the available pool of compute time to award to future groups. So the AAC gives much weight to this field in their decision making process.

This section is where you describe your computational strategy, contrasting with the Research (Lay) Abstract section which is more about the science. In particular, for allocations from the AAC, the AAC wishes to see a detailed quantitative justification of the compute time requested.

In many cases, this can be as simple as an estimate of the number of jobs that will be required to achieve your stated research Milestones, and an estimate of the SU cost for each job. For the estimate of the number of jobs, please give a sentence or two describing how you arrived at that number. For the SU cost of each job, please give the specifications of a typical job, including the number of CPU cores requested, the amount of CPU memory requested, the number and type of GPUs requested, and the average walltime and SU cost for the job. The last two should be based on actual runs of similar jobs performed on Zaratan. If there are several different types of jobs to be run, break this down by job type. Remember to include any multipliers if you are using GPUs or more than the average memory per core. When renewing an application or requesting additional time, the AAC typically would like to see estimates of SU consumption by job based on actual runs on the cluster where possible.

If you are requesting an increase to an existing allocation, it would be helpful to mention such, along with the existing allotment of compute time and the amount of additional compute time being requested. Generally, you do not need to re-justify the SUs already allotted to you; i.e. if the additional compute time is being requested to explore areas not included in the original request, you only need to discuss the new computations. If however the additional time is needed because you need to revise your previous estimates of compute time needed (e.g. you underestimated the memory consumption so a larger memory factor for CPU time is needed, or you discovered that you need to increase the detail in the calculations), it is probably best to justify the whole amount. It would also be useful to explain why the previous estimates were off --- while habitually underestimating the resources needed will likely not be viewed favorably, the AAC is comprised of experienced researchers who understand that the unexpected sometimes occurs while performing advanced research.

If you are only requesting an increase in disk space, not compute time, you can just enter "No change to SUs requested". Be sure to complete the Disk space Justification section.

If you are not requesting any additional scratch or SHELL storage, you do not need to provide much here, although even in that case any information you do wish to volunteer is useful in helping us to understand how the cluster is being used. You likely cannot leave the field blank, but if you want to be minimalist you can just enter "base".

If you are requesting additional scratch and/or SHELL space, this field this field must be properly filled out. For each storage tier (scratch/SHELL) for which you are requesting additional space, give a quantitative justification for this request. This should be a quantitative estimate of the amount of storage needed to complete the research goals listed in the Milestones section, based on the actual number and sizes of files that will needed to be stored. Note that scratch space is only intended for the storage of data pertaining to "active" jobs, you are expected to delete data on scratch which is no longer needed, and move to SHELL data which should be retained but is no longer needed for active jobs. Your discussion of storage needed should include such.

If you have reason to prefer a specific HPC cluster over another, it is recommended that you include the reasons for your preference here. Although DIT and the AAC have the final as to which cluster your allocation is granted on, your preference (as indicated in the "Requested Cluster" or "Renewal Cluster" fields, together with your justification for that request here, will be considered and honored if feasible and reasonable.

Disk Space Justification

If you requested additional disk space in either the Additional High-performance Scratch Disk Space (TB) or Additional SHELL (Medium-Term) Disk Space (TB) fields, this is where you are requested to provide a justification those requests. Please justify each type of storage separately.

If you are only requesting the base amounts (e.g. 0.1 TB of scratch and 1 TB of SHELL storage), or you are just requesting additional compute time (but no additional storage) for an existing allocation, you can just enter "No additional storage".

Otherwise, please state how you arrived at the amount of disk space that you are requesting. For scratch space, this should be related to your estimate of the amount of scratch space required for running a single job multiplied by the number of jobs you expect to be running more or less at the same time (remember that you are expected to delete unneeded output files, etc. after the job finishes, and move precious input/output elsewhere (e.g. SHELL storage) when they are no longer needed for running or soon to run jobs. For SHELL storage, remember that SHELL storage is not intended as a long term archive.

If you have an existing allocation and are requesting additional storage, please explain why the additional storage is needed, and try to estimate the total storage going forward. Remember that after some threshold, we will need to start charging for disk space (in order that we can grow the total amount of disk storage on the cluster).

Code Use and scalability

This section is where you should discuss some technical aspects of your computation strategy. In particular, the extent to which you calculations are parallelizable . In particular, you should at least discuss whether your codes are capable of being parallelized across multiple nodes or are restricted to only using multiple cores on a single node , or even can only run sequentially .

For "base" level allocations, you need not put much here, e.g. "unknown" or "base" or similar. However, any information you do volunteer will help us better understand how you are using the cluster. If this application is just to increase the storage allotment for an existing allocation, you can just enter "no additional CPU time" or similar. However, for all other applications (e.g. renewal applications or other applications requesting more than the 50 kSU/year base level), this section is required and will be looked at closely. The larger the amount being requested, the more detail will be required.

In particular, for renewal applications the AAC will want to see a discussion of how the performance of the codes being used scale with the amount of resources being allocated to the job in order to show to the AAC that you are using the cluster resources efficiently. Typically, the performance of jobs will increase significantly as additional resources are made available to the job, up to some threshold value. Increasing the resources beyond that value yields little if any increase in performance, and indeed in some cases degrades performance. This threshold value is dependent on many factors, including details of the problem being solved, details of the algorithm and the specific coding of the algorithm being used, as well as details of the cluster it is being run on. Because of this, this threshold is generally best determined empirically. The AAC will want to see that you can show that you are running your jobs in the optimal range.

For CPU only jobs, this is generally a matter of running jobs (either production jobs or test jobs which are expected to behave similarly to production jobs for this purpose) with different numbers of cores, and look at the observed parallel speedup . You should be collecting some data about this while using your original award of compute time. Basically, compare the runtimes of one of your jobs as you vary the number of CPU cores available to the code. Traditionally this is compared against the runtime when using only a single core. (Ideally you would be running the same job over and over again, but often you can get decent results running comparable jobs. In that case, you might wish to take more than one data point for each number of cores to minimize effects due to differences in the jobs.) Typically, the code will speedup a bit less than linearly as the number of cores increases, to a point. After that, the code still speeds up, but with diminishing returns, and at some point the performance either levels off or possibly even degrades as more cores are added. Generally, you can stop running tests with more cores once you detect significant levelling off. The goal of these tests is to determine what the "sweet spot" is, i.e. the ideal number of cores to use to maximize efficiency and throughput.

The above discussion applies well to either pure multithreaded jobs (i.e. the job only does multithreading for parallelization) and for pure MPI jobs (i.e., each MPI task is single threaded). For hybrid jobs using both multithreading and MPI, you should modify the above to do a two parameter search for the "sweet spot" by varying both the number of cores used per MPI tasks and the number of MPI tasks. (This is assuming that neither are constrained by the nature of the problem being solved; if such constraints come into play, then briefly discuss them.)

If your CPU jobs are requesting more than the default amount of memory (usually 4 GB per CPU core), the AAC want to see that you have looked at the jobs to make sure they really require the requested memory. This is usually best determined by examining the MaxRSS value from jobs that successfully completed using the sacct command. In cases where there is significant and unpredicatable variations in the amount of memory used by jobs, you might wish to look for a value of the memory setting which satisfies almost all of the jobs, and then rerunning the few cases which need more with a higher memory setting, as this might be more efficient even after accounting for the rerunning of some jobs. E.g., let's assume you need to run 100 jobs, all single core, CPU only jobs which run for 20 hours, and 90 of these jobs require more than 10 GB but less than 12 GB of RAM, but the other 10 jobs require between 22 and 24 GB of RAM, but you cannot predict beforehand how much memory a job will need. If you were to run all 100 jobs requesting 24 GB of RAM (the amount needed by the most memory-hungry job), the hourly SU cost for each job would be 0.25 SU/GB/hour * 24 GB = 6 SU/hour, so the cost of all 100 jobs would be 100 jobs * 6 SU/hour * 20 hour/job = 12 kSU. But if instead we ran all jobs requesting only 12 GB of RAM, this would cost 0.25 SU/GB/hour * 12 GB = 3 GB/hour, and running all 100 jobs would cost 100 jobs * 3 SU/hour * 20 hour/job = 6 kSU. Of course, we would still have 10 jobs that failed and would need to be rerun with 12 GB, costing 10 jobs * 6 SU/hour * 20 hour/job = 1.2 kSU, for a total of 7.2 kSU, or only 60% of the SU cost compared to running all jobs requesting 24 GB.

For jobs that rely primarily on GPUs for computation, you should discuss whether the code can use multiple GPUs or not, and if it can use multiple GPUs whether it is restricted to GPUs on the same node. If multiple GPUs can be used, you should do a similar process of determining the performance of the code as a function of the number of GPUs being used in order to find the "sweet spot".

Also for GPU jobs, you should discuss which of the different types of GPUs available on the system can be used for your job, with a brief comment on why any GPU types are unacceptable (e.g. insufficient GPU memory, etc). If your job will run on multiple types of GPUs, you should show the performance as a function of GPU type. If using a more powerful (and more expensive) GPU does not provide a significant gain in performance, then using the more powerful GPU would be wasteful.

Milestones

This section is where you should list some specific research goals you hope to accomplish with the HPC resources you are requesting in the application. While it is understood that your research and use of HPC resources is likely to be an ongoing, years-long process, the accomplishments you list here should be specific, concrete, measurable scientific goals that you hope to realize before the expiration of this allocation (typically a year from the request). When you apply for a renewal of your allocation, you should hopefully be able to list the Milestones from the application for the existing allocation into the Past Results section of the renewal application (assuming they were actually accomplished). The AAC assumes that you will be publishing any significant scientific accomplishments, so it is redundant to include things like proposed publications, etc. --- the AAC wants to see here the science that you hope to achieve. The goals listed here should directly relate to the what you list in the SU justification and Disk Justification sections.

HPC Experience

This checkbox simply asks if you have experience using HPC clusters and methods. If you select yes, we will also request you enter the approximate number of years of experience you have. If you have more than a few years of experience with using HPC clusters, feel free to round off to the nearest 5 or 10 years; we do not need a precise number.

This field is just to help us collect some data about who is using the campus HPC resources, i.e. how many novice vs experienced users. Answering "No" will not cause your application to be rejected (although will raise some questions if this is a renewal application). We encourage researchers who have not used HPC resources in the past to explore whether HPC techniques could benefit their research.

Unix Experience

This checkbox simply asks if you have experience using Unix or Unix-like (e.g. Linux, FreeBSD, etc) operating systems. This field is just for us to better understand our user base; answering "No" will not cause your application to be rejected.

The UMD HPC clusters , like most HPC clusters, run on Unix-like (specifically Linux at UMD) systems, and some advanced functionality is facilitated by (if not requires) some fluency with Unix. However, we have added a OnDemand Portal for interfacing with the cluster which greatly reduces the amount of Unix familiarity needed to use the cluster. We hope that such can help make HPC techniques more available to researchers at UMD.

Allocations of compute time are provided as service units (SUs), each of which represents one hour of wall clock time on one CPU core. Different categories of allocations provide cycles for newcomers (development: 20K SUs), for moderately demanding jobs (small: 60K SUs), and for compute-intensive research (large: 100K SUs). The larger allocations are naturally scrutinized more, and generally require the applicant to have shown reasonable knowledge of HPC and its issues, either from previous development grants or other experience on this or other clusters.

AAC allocations on the Zaratan cluster are one time grants of SUs with an one year (by default) expiration. SUs can be used as needed over the course of that year. You can apply to the AAC to renew your AAC allocation to extend the expiration another year (this can be done each year).

If an application is approved by the AAC, the allocation will be created, by default, shortly after approval - typically within about one business day. If you would prefer a later starting date (e.g. you will not be able to use start using the cluster immediately due to other priorities or because you are awaiting data), please specify such in the proposal, especially if there will be significant delay. The time between submission and approval can vary; if the application has sufficient detail that the AAC has no follow up questions, approval is typically within one or two business days. If there are follow up questions, an HPC administrator will contact you (typically via email) with the questions, and forward your replies back to the committee. Again, usually you should receive notification of approval or follow up questions within about one or two business days after a submission.

Students are only allowed ONE allocation from the AAC for their duration with the university, and that will be a developmental allocation and not renewable. If more CPU cycles are required, their faculty advisor must apply for the allocation.

Criteria used in making such determinations include appropriateness of the clusters for the intended computation, the specific hardware and/or software requested, a researcher's prior experience with high-performance computing, the track record of a requestor who has received HPCC allocations in the past, and the overall merits of the research itself.

The AAC will determine which of the HPC clusters is most appropriate for the request. If the requestor has a specific cluster in mind, that should be explicitly mentioned in the proposal. In addition, the proposal should provide enough information to justify the use of a specific cluster (e.g. the need for Matlab DCS or other cluster specific licenses, or GPUs, or large memory nodes). While the committee will consider requests for a specific cluster, the committee decide which cluster to grant for a proposal based on which cluster is most appropriate for the request.

To submit an application, go to the HPC AAC application page, and select under the "Forms" menu item on the top menu bar the desired form. If you already have an allocation and you need to request additional time (either because more time is needed to complete the research than originally thought, or because the scope of research is expanding), then please select "Renew an allocation". Otherwise, select "New Allocation" for a new allocation.

When applying to the AAC for an allocation, remember that the AAC generally prefers to award allocations for specific projects. It is best to make a proposal for specific projects, with milestones that can be achieved within one year (or whatever time frame of requested allocation is), and if needed make a renewal request for more time for a second set of goals. In addition, it is useful to include the following in your proposal:

A quantitative estimate of the total number of SUs (where 1 SU = 1 CPU-core-hour) needed to complete the project. This can be as simple as an estimate of the number of jobs that need to be run times the estimated average SUs required per job (if running several different classes of jobs, maybe break the calculations down by class of job).
What, if any, parallelization strategies are used by your codes? Do they use distributed memory parallelization techniques (e.g. MPI) that allow the code to run over multiple nodes? Do they use shared memory parallelization techniques (e.g. OpenMP, threading) that allows for use of multiple cores on the same node? A hybrid of the two? Although the clusters are not restricted/reserved for parallel applications, it is hard to run large, parallel applications except on HPC clusters, so they do get some preference.
If the codes are parallel, how well do they scale? Generally, as the number of processors given to a program increases, the incremental benefit from each new processor added diminishes (and at some point might even go negative). Thus there usually is an optimal value for the number of processors to use in the calculation. A discussion of how the code scales (if known), and/or how you plan to determine the optimal degree of parallelization would be of value.
If this is a renewal of an existing application, you should discuss what has been accomplished with the original grant of SUs. Publications, etc. are always nice to include (and are useful when we need to convince budget directors of the value of investing in HPC resources), but less formal milestones can be included as well. If there was a miscalculation in the amount of compute time needed to complete the original goals, a discussion of such is worthwhile.

The AAC is unlikely to grant large allocations without a good discussion of most of the above points. However, it is recognized that not all applicants are experienced High Performance Computing (HPC) experts. Indeed, one of the aims of the AAC with this allocation process is to allow researchers who are not even sure if HPC techniques will work for their research an opportunity to try HPC out without a monetary investment. So if you are unable to address all of the above points, you can still apply for an allocation. It is likely that the AAC will, at least initially, only approve your application for a developmental (20 kSU) allocation, but that should not be viewed as a set back. The 20 kSU allocation might even be enough for some small projects, but at minimum it should allow you to collect the information regarding SUs required per job, scalability of code, etc. to address the above points when you request additional time in a renewal allocation.

If you have questions regarding the application process, or what information is requested or how to obtain such, or any other issues, please feel free to contact us.

Several samples of approved applications have been made available with the kind consent of the applicant to assist others who wish to apply.

How do I get access for a class I am teaching?

As befits an institute of learning, DIT is willing to make a reasonable amount of HPC cluster resources available to classes in most cases. If you are teaching a course and wish to use HPC resources, please submit a request to HPC admins for class access. Such requests should come from the instructor of record for the class, and should include the following information:

Please indicate whether this is an official UMD course (i.e. in the official UMD schedule of classes) or not. If not, please explain what the course is for and how it connects with the academic mission of UMD.
the course id (e.g. AMSC 663 or BIOE 658B)
the title of the course (e.g. "Advanced Scientific Computing" or "Introduction to Medical Image Analysis")
the estimated number of students to be enrolled in the course
the semester for the course. If this is part of a multi-semester course sequence and you wish to have the course allocation continue across semesters, please provide the above information for both the first and subsequent semesters, and discuss that here.
Please describe how the HPC resources are intended to be used in the class. In particular, discuss the following points (but feel free to expand beyond these points):
- Will the HPC be used for in class "live" demonstrations by the instructor? How frequently? Will you need reservations of resources for such, and what resources?
- Will the HPC be used for in class "labs" wherein students access the cluster in a "lab"-like setting (with faculty and/or TAs present for assistance). If so, how frequently and will you need reservations for such?
- Will the cluster be used for regular homework assignments? How frequently? How much resources do you expect will be needed for this (per student and total)?
- Will the cluster be used for long term (i.e. once per semester?) projects? How much resources do you expect will be needed for this (per student and total)?
For all of the above, please indicate whether you will require CPU only, GPU only, or a mix of CPU and GPUs. What GPUs types are you able to use --- in general we would ask that you consider using the a100_1g.5gb GPUs (a virtual GPU made by splitting an A100 into 7 pieces) when possible.
Please summarize the total amount of resources that will be needed for the class throughout the semester. Include both CPU, GPU, memory, and storage (for both scratch and SHELL tiers) in these calculations. Generally, requests for 20 - 50 kSU will be readily granted. Larger requests will be considered.
It is assumed that all students for "official" UMD courses will be formally registered and be in the UMD databases. For less official courses, please describe the expected student base --- will they be exclusively members of the UMD community, a mix of UMD and non-UMD people, or (almost) exclusively non-UMD people? Please give estimates of UMD vs non-UMD people. There are added difficulties associated with non-UMD access to the cluster.

Please provide the above information to the best of your ability when making a request. If you are unsure what is meant or how to answer something, let us know and we will try to clarify. The more completely you answer things the faster the process will go.