Gypsum Cluster User Document
On this page:
- 1. Quick Start
- 2. Slurm - Job Scheduler
- 3. Software
Important: do not run jobs directly on the head node; the head node should be used only for submitting jobs.
To log into Gypsum, ssh to gypsum.cs.umass.edu. The compute nodes are only accessible from within the Gypsum local network and should only be used through Slurm. They are named node001-node100 and node104-node156.
To begin submitting jobs to the Gypsum cluster, you must use Slurm and have disk space on one of the work directories.
Slurm will automatically start your job on the cluster nodes with available resources.
Place your commands in a shell script. For batch jobs, use:
sbatch -p <partition> --gres=gpu:<num> your_script.sh your_argument_1 your_argument_2 ...
... where <partition> can be one of:
- titanx-short - TitanX GPUs, short queue
- titanx-long - TitanX GPUs, long queue
- m40-short - Tesla M40 GPUs, short queue
- m40-long - Tesla M40 GPUs, long queue
- 1080ti-short - GTX 1080 Ti GPUs, short queue
- 1080ti-long - GTX 1080 Ti GPUs, long queue
Please see the policy document regarding the use of queues. The <num> argument specifies the number of GPUs to be used by your job. For example, to launch a batch job in the TitanX short queue using 2 GPUs, use:
sbatch -p titanx-short --gres=gpu:2 your_script.sh your_argument_1 your_argument_2 ...
To launch an interactive job instead, replace 'sbatch' with 'srun' above.
We want people to use the cluster as much as possible, i.e., fill the nodes with jobs! However, things will not work well if everybody piles their jobs into the long queues. Almost all deep learning packages allow you to save snapshots of your program state (e.g., the deep net and solver state) every specified number of iterations. Unless something prevents you from doing so, please use the short queues as much as possible: save snapshots regularly and relaunch your job every four hours. If this is impossible for a particular job, you may use the long queues; however, only a limited number of nodes is allocated to them. Please see the policy document regarding the job limits per queue.
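The snapshot-and-relaunch pattern can be sketched as a self-resubmitting short-queue job script. Everything here is illustrative, not a supported recipe: train.py, its --resume flag, the snapshot path, and the DONE marker all stand in for your own program's checkpoint mechanism.

```shell
#!/bin/bash
#SBATCH --job-name=train
#SBATCH --partition=titanx-short   # 4-hour limit per run
#SBATCH --gres=gpu:1
#SBATCH --output=train_%j.txt

# Hypothetical program and snapshot path; substitute your own.
CKPT=snapshots/latest.ckpt

if [ -f "$CKPT" ]; then
    # Resume from the snapshot left by the previous 4-hour run.
    python train.py --resume "$CKPT"
else
    python train.py
fi

# Hypothetical convention: train.py creates a DONE file when finished.
# Until then, resubmit this script (by its own file name) to continue.
if [ ! -f DONE ]; then
    sbatch train_short.sh
fi
```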
Slurm (Simple Linux Utility for Resource Management) is a workload manager that provides a framework for job queues, allocation of compute nodes, and the start and execution of jobs.
The cluster compute nodes are available in Slurm queues (Slurm actually calls them partitions but we'll use the term 'queue' in this documentation). Users submit jobs to request node resources in a queue. The Slurm partitions on Gypsum are titanx-short (the default queue), titanx-long, m40-short, m40-long, 1080ti-short and 1080ti-long.
<queue>-short jobs are restricted to 4 hours. <queue>-long jobs are restricted to 7 days. When the time limit is reached, each task in each job step is sent SIGTERM followed by SIGKILL.
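Because SIGTERM arrives before SIGKILL, a job script can trap it and flush state before being killed. A minimal bash sketch, with made-up names (save_state, STATE_FILE) standing in for a real snapshot routine:

```shell
#!/bin/bash
# Sketch: trap the SIGTERM that Slurm sends at the time limit and save
# state before the follow-up SIGKILL. save_state and STATE_FILE are
# illustrative names, not part of Slurm.
STATE_FILE="${STATE_FILE:-state.txt}"
step=0

save_state() {
    # A real job would write a snapshot here; this records the last step.
    echo "saved after step $step" > "$STATE_FILE"
}
trap 'save_state; exit 143' TERM    # 143 = 128 + 15 (SIGTERM)

# Stand-in for the real workload loop.
for step in 1 2 3; do
    sleep 0.1
done
```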
- sbatch - submit a job script
- srun - run a command on allocated compute nodes
- squeue - show status of jobs in queue
- scancel - delete a job
- sinfo - show status of compute nodes
- salloc - allocate compute nodes for interactive use
A job consists of two parts: resource requests and job steps. Resource requests specify the number of CPUs, the expected duration, the amount of RAM, etc. Job steps describe the tasks to be done and the software to be run. A sample submission script requesting one CPU for 10 minutes, along with 100 MB of RAM, in the m40-long partition would look like:
#!/bin/bash
#
#SBATCH --job-name=test
#SBATCH --output=res_%j.txt    # output file
#SBATCH -e res_%j.err          # File to which STDERR will be written
#SBATCH --partition=m40-long   # Partition to submit to
#
#SBATCH --ntasks=1
#SBATCH --time=10:00           # Runtime in D-HH:MM
#SBATCH --mem-per-cpu=100      # Memory in MB per cpu allocated

hostname
sleep 1
exit
Job flags are used with the sbatch command. The syntax for a Slurm directive in a script is:
#SBATCH <flag>
Some of the flags that can be used with the sbatch, srun, and salloc commands:
| Flag | Example | Description | Notes |
|---|---|---|---|
| partition | --partition=titanx-short | Partition is a queue for jobs | Default is titanx-short |
| time | --time=02-01:00:00 | Time limit for the job | 2 days and 1 hour; default is MaxTime for the partition |
| nodes | --nodes=2 | Number of compute nodes for the job | Default is 1 |
| cpus/cores | --ntasks-per-node=8 | Number of cores on the compute node | Default is 1 |
| memory | --mem=2400 | Memory limit per compute node for the job; do not use with --mem-per-cpu | Memory in MB; default limit is 4096 MB per core |
| memory | --mem-per-cpu=4000 | Per-core memory limit; do not use with --mem | Memory in MB; default limit is 4096 MB per core |
| output file | --output=test.out | Name of file for stdout | Default is the JobID |
In the Slurm context, a task is to be understood as a process. So a multi-process program is made of several tasks. By contrast, a multithreaded program is composed of only one task, which uses several CPUs.
Tasks are requested/created with the --ntasks option, while CPUs, for the multithreaded programs, are requested with the --cpus-per-task option. Tasks cannot be split across several compute nodes, so requesting several CPUs with the --cpus-per-task option will ensure all CPUs are allocated on the same compute node. By contrast, requesting the same amount of CPUs with the --ntasks option may lead to several CPUs being allocated on several, distinct compute nodes.
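As a sketch, the two request styles look like this in a job script; both ask for 8 CPUs in total, but with different placement guarantees:

```shell
# Multithreaded program: one task with 8 CPUs,
# guaranteed to be allocated on a single compute node.
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8

# Multi-process program: 8 single-CPU tasks,
# which Slurm may spread across several compute nodes.
#SBATCH --ntasks=8
```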
Though batch submission is best, interactive foreground jobs can also be run. These should be initiated with the srun command instead of sbatch.
srun --pty --mem 500 -t 0-01:00 /bin/bash
will start a command line shell (/bin/bash) in the default queue (titanx-short) with 500 MB of RAM for 1 hour. The --pty option allows the session to act like a standard terminal. For interactive logins which last longer than 4 hours, remember to use <queue>-long.
After you enter the srun command, you will be placed in the normal queue to wait for nodes to become available. When they are, you will get an interactive session on a compute node, in the directory from which you launched the session. You can then run commands.
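For example, an interactive session with one GPU could be requested as follows; the partition, GPU count, memory, and time limit are placeholders to adapt to your job:

```shell
srun -p titanx-short --gres=gpu:1 --mem 4000 -t 0-01:00 --pty /bin/bash
```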
The Gypsum cluster uses Environment Modules which make it easy to maintain multiple versions of compilers, libraries and applications for different users on the cluster. Each module file contains the information needed to configure the shell for an application. When a user loads an environment module for an application, all environment variables are set correctly for that particular application.
Use the following commands to adjust your environment:
- module avail - show available modules
- module add <module> - adds a module to your environment for this session
- module initadd <module> - configures a module to be loaded at every login
MATLAB is available on one node through a special queue named matlab. After logging into the node, load the desired version of MATLAB with the 'module load' command. You can see what versions are available with:
module avail matlab
You can also compile your MATLAB program into a standalone application on your machine and run the executable on Gypsum with MATLAB Runtime. Currently, there are two MATLAB Runtime versions available:
These can be loaded as environment modules:
module load mcr/v84
module load mcr/v901
Make sure the MATLAB version you use for compilation is consistent with the runtime version you use, and that library versions are consistent. For example, if you want to run matconvnet with cudnn, use cuda7.5 and cudnn v5.
The following links provide several examples, from simple to complex, of running MATLAB programs on Gypsum:
The default version of Python is the one that comes with CentOS 7:
$ which python
/usr/bin/python
$ python -V
Python 2.7.5
This is rarely the Python you'll want to use. There are some useful modules but the vendor that packaged them rarely issues updates and we can't manually update things without breaking parts of the OS.
Python via Environment Modules
Several versions of Python are available via Environment Modules. To see what is currently available:
$ module avail python
A specific version can be loaded with:
$ module load python2/2.7.14-1710
NOTE: Using Python this way loads an entire self-contained distribution of Python, including whatever Python modules are installed. To see what Python modules are included with the version you just loaded (in Python 3, the command is 'pip3' rather than 'pip'):
$ pip list
Updating Python (and its modules)
Python modules, particularly those with GPU support, are updated frequently. Unfortunately, with the constant use of the cluster, there is no good way to upgrade modules without disrupting someone currently using them.
New versions of Python will be compiled every six months and suffixed with the year and month (-YYMM) in which they were created. All modules will be carried over but installed at their latest versions. This way users always have access to software that is no more than six months old.
To save the trouble of remembering to switch every six months, module aliases will be set up for Python 2 and Python 3:
Loading one of those modules rather than a specific version will ensure you always have the latest version.
Adding New Modules
Users can add or upgrade Python modules by installing them in their home directories with the '--user' flag. For example, to upgrade the 'wheel' package:
$ module load python/2.7.13
$ pip list | grep wheel
wheel 0.29.0
$ pip install --upgrade --user wheel
Collecting wheel
  Using cached wheel-0.30.0-py2.py3-none-any.whl
Installing collected packages: wheel
Successfully installed wheel-0.30.0
$ pip list | grep wheel
wheel 0.30.0
You can use TensorFlow in several ways on Gypsum.
Using the TensorFlow Module Managed by Bright Cluster Manager
$ module load tensorflow/1.5.0
(Optional) To avoid doing this again, add modules at login automatically:
$ module initadd tensorflow/1.5.0
A major limitation though is that you can only use this module in Python 2.7.
Using the TensorFlow Module in a Pre-Compiled Python
Several versions of Python are available as environment modules:
$ module avail python
Each of these includes TensorFlow. To see what version of TensorFlow is included, load the desired version of Python and run pip:
$ pip list | grep tensorflow
or, for Python 3:
$ pip3 list | grep tensorflow
Install and use TensorFlow in User Space
You can have a custom installation of TensorFlow in your user space, with the flexibility of using the Python and TensorFlow versions of your choice. It is recommended to use a virtual environment manager for this purpose. Sample installation scripts are provided here:
- Python 2.7, tensorflow 0.10: http://maxwell.cs.umass.edu/hsu/install-tf-py2.sh
- Python 3.5, tensorflow 0.10: http://maxwell.cs.umass.edu/hsu/install-tf.sh
The script will install conda (a package and virtual-environment management tool) and set up a virtual environment with TensorFlow installed. You only need to run the script once. Activate the virtual environment with: source activate tf-py2 or source activate tf (depending on the installation script you used). To leave the environment, use: source deactivate.
While in the environment, you can import the TensorFlow Python module and use it normally.
Note: Please use only one of the two scripts provided above. Running both may cause conflicts.