Gypsum Cluster Policy Document

On this page:

  1.  Introduction
  2.  Cluster Specifications
  3.  Accounts
  4.  Jobs
  5.  Disk Space

1.  Introduction

We expect that the Gypsum cluster will have many competing priorities for GPU computation. This policy document is written with the aim of minimizing conflicts. Many of the policies are automatic and enforced by software, which we hope will minimize the load on the people administering the cluster.

Requests should be sent to gypsum-admin@cs.umass.edu. The Gypsum policy managers, Erik Learned-Miller, Evangelos Kalogerakis, and Dan Parker, will resolve conflicts and decide on special requests.

2.  Cluster Specifications

  • 824 GPUs (4-8 GPUs per node)
  • 2472 CPU cores (12-24 cores per node, 24-48 threads per node, 256-384 GB RAM per node)
  • 325 TB centralized storage

Each node has about 256 GB of local storage. The nodes run Linux. The cluster file system is ZFS. Slurm is used to submit jobs.
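
For reference, the sketch below shows the basic Slurm commands for submitting and monitoring batch jobs (the script name and job ID are placeholders):

    # Submit a batch job script (see the Jobs section for an example script)
    sbatch my_job.sh

    # List your own pending and running jobs
    squeue -u $USER

    # Show the available partitions (queues) and their state
    sinfo

    # Cancel a job by its job ID
    scancel 123456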

3.  Accounts

  • Each research laboratory will be a group and will be assigned a group ID (gid). Resources will be allocated according to groups.
  • New users/groups will be added upon request (contact gypsum-admin@cs.umass.edu). The request should come from, or be approved by, a faculty member.
  • Accounts for course work do not have space allocated on the work1 partition. If the user's home directory is not enough, the user can store files on the scratch1 partition, but should be aware that it is not backed up. Course work accounts also have lower limits on the number of jobs that can be run simultaneously on each queue.
  • The procedure for non-CS individuals (who are part of the research laboratories of senior researchers on the grant) may be more complicated, and we will try to work this out as we proceed.

4.  Jobs

Jobs will be run in batch mode using queues. There will be 3 queues:

Short Jobs: Each job (i.e., process) is allowed to run for 4 hours. Jobs that run more than 4 hours will be automatically killed. Note that a user can submit many jobs, each of up to 4 hours in duration. Thus, by subdividing their work, users can run many jobs. An individual user is limited to 35 Tesla M40 GPUs, 110 TitanX GPUs and 150 1080 Ti GPUs.
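
As a rough sketch, a short-queue batch script might look like the following; the partition name and the final command are assumptions for illustration (run sinfo on Gypsum to see the actual partition names):

    #!/bin/bash
    # Sketch of a short-queue GPU job; the partition name below is an assumption.
    #SBATCH --partition=titanx-short   # assumed name of a short-job partition
    #SBATCH --gres=gpu:1               # request 1 GPU (stay within the per-user limits)
    #SBATCH --time=04:00:00            # short jobs are killed after 4 hours
    #SBATCH --mem=16G                  # example memory request
    #SBATCH --output=job_%j.out        # %j expands to the Slurm job ID

    # Replace with the actual command for your experiment (placeholder)
    python train.py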

Long Jobs: The long job queue will be a smaller portion of the cluster. An individual user is limited to 15 Tesla M40 GPUs, 30 TitanX GPUs and 40 1080 Ti GPUs. Long jobs will be allowed to run for at most 1 week and will be automatically killed at the end of this time.
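
For the long queue, only the partition and time limit in the sketch above would change; again, the partition name is an assumption:

    #SBATCH --partition=titanx-long    # assumed name of a long-job partition
    #SBATCH --time=7-00:00:00          # long jobs are killed after 1 week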

Special Requests: Special requests for running long jobs on many cores will need to be made in advance. We recommend at least a month's notice, although shorter notice may be sufficient if the cluster is not heavily used. An example of a special request: a user may want to run a job on 4 GPUs for two weeks. Note that it is very unlikely that special requests for use of the entire cluster for an extended period of time will be granted, since this would preempt all other users.

Special requests should be made to gypsum-admin@cs.umass.edu.

5.  Disk Space

There are three key file systems on Gypsum that users should know about:

/home/<user>

	* Backed Up
	* Uncompressed
	* User Quota: 100 GB

/mnt/nfs/work1/<faculty>/<user>

	* Backed Up
	* Compressed
	* Group Quota: 5 TB

/mnt/nfs/scratch1/<user>

	* NOT Backed Up
	* Compressed
	* User Quota: 1 TB

Similar to Swarm2 (and the previous Swarm cluster), quotas on work1 are based on group rather than user. Each user's primary group is set to the group of the faculty member they work under.
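
To check which group's work1 quota applies to you and how much space is in use, standard commands along these lines can help (<faculty> is a placeholder, as in the paths above):

    # Show your primary group (the faculty group whose work1 quota applies to you)
    id -gn

    # Show the size and free space of the file systems backing your directories
    df -h /home/$USER /mnt/nfs/scratch1/$USER

    # Summarize how much space your own work1 directory is using
    du -sh /mnt/nfs/work1/<faculty>/$USER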