High-Performance Computing
The Miami Redhawk cluster is available for use by faculty, staff, and students. The RCS group provides support for using the cluster in research and teaching efforts. Faculty can request account activation using the button below. Students can ask a faculty member to sponsor an account for them.
Resources and Information
Access
Faculty can request account activation on the cluster management website or using the button above. Students can ask a faculty member to sponsor an account for them.
Once you have an account, there are different ways to access the Redhawk cluster. Note that when you are accessing the cluster, you are connecting to the head or login node of the cluster. See the section on using the cluster for information about accessing the compute nodes.
Command line access is available using SSH, with SFTP/SCP used to transfer files.
Access to the cluster with a full Linux desktop is also available using a tool called NoMachine or NX.
To get more information on how to connect to the Redhawk cluster, submit a Connection Instruction Request.
Redhawk Cluster Hardware
Miami's current HPC cluster consists of:
- 2 login nodes – 24 cores, 384 GB of memory each. Machine names:
- mualhplp01
- mualhplp02
- 26 compute nodes – 24 cores, Intel Xeon Gold 6126 2.6 GHz processors, 96 GB of memory each. Machine names:
- mualhpcp10.mpi-mualhpcp26.mpi
- mualhpcp28.mpi-mualhpcp35.mpi
- mualhpcp37.mpi
- 5 compute nodes - 24 cores, Intel Xeon Gold 6226 2.7 GHz processors, 96 GB memory each. Machine names:
- mualhpcp42.mpi-mualhpcp45.mpi
- mualhpcp47.mpi
- 2 large memory nodes – 24 cores, Intel Xeon Gold 6126 2.6 GHz processors, 1.5 TB of memory each. Machine names:
- mualhpcp27.mpi
- mualhpcp36.mpi
- 4 GPU nodes – 24 cores, Intel Xeon Gold 6126 2.6 GHz processors, 96 GB of memory, and 2 Nvidia Tesla V100-PCIE-16GB GPUs each. Machine names:
- mualhpcp38.mpi-mualhpcp41.mpi
- Shared storage system with approximately 120 TB of storage, expandable.
Redhawk Cluster Software
Most software on the cluster is managed with the modules tool.
The HPC Software table lists software installed on the Redhawk cluster.
Contact the RCS group to request the installation of additional packages or with other questions about software on the cluster.
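Software environments are loaded and unloaded with module commands. A brief example is shown below; the module name anaconda-python3 is the one used in the sample job script later on this page, and other module names will vary by installation.
module avail                      # list available software modules
module load anaconda-python3      # load a module into the current session
module list                       # show currently loaded modules
module unload anaconda-python3    # remove the module when finished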
Using the Redhawk Cluster
Cluster usage is broken into two categories – interactive and batch. Interactive use allows the user to interact with a cluster node, running software and receiving output in real time. In batch usage, work is submitted to the cluster and executes when the needed resources are available, with optional e-mail notifications to the user when the job starts and ends. The sections below provide details on using the Slurm resource manager and on batch and interactive usage.
Using the Slurm Resource Manager
We have switched the scheduler and resource manager from Moab/Torque to Slurm. Below is a side-by-side comparison for translating between Torque and Slurm. We also provide wrapper scripts for the Slurm commands, which allow users to continue to use basic Moab/Torque commands and scripts during a transitional period.
Side-by-Side Comparison of Slurm and Moab/Torque
Slurm is different from Torque in several ways. These include the commands used to submit and monitor jobs, the syntax used to request resources, and the way environment variables behave.
Some specific ways in which Slurm is different from Torque include:
- What Torque calls queues, Slurm calls partitions
- In Slurm, environmental variables of the submitting process are passed to the job by default
Submitting jobs
To submit jobs in Slurm, replace qsub with one of the commands from the table below.
Task | Torque Command | Slurm Command |
---|---|---|
Submit a batch job to the queue | qsub <job script> | sbatch <job script> |
Start an interactive job | qsub -I | salloc |
where <job script> needs to be replaced by the name of your job submission script (e.g. slurm_job.sh). See below for changes that need to be made when converting the Torque syntax into that of Slurm.
Job Submission Options
In Slurm, as with Torque, job options and resource requests can either be set in the job script or at the command line when submitting the job. Below is a summary table.
Option | Torque (qsub) | Slurm (sbatch) |
---|---|---|
Script directive | #PBS | #SBATCH |
Job name | -N <name> | --job-name=<name> |
Queue | -q <queue> | --partition=<partition> |
Wall time limit | -l walltime=<hh:mm:ss> | --time=<hh:mm:ss> |
Node count | -l nodes=<count> | --nodes=<count> |
Process count per node | -l ppn=<count> | --ntasks-per-node=<count> |
Core count (per process) | – | --cpus-per-task=<count> |
Memory limit | -l mem=<limit> | --mem=<limit> |
Minimum memory per processor | -l pmem=<limit> | --mem-per-cpu=<limit> |
Request GPUs | -l gpus=<count> | --gres=gpu:<count> |
Request specific nodes | -l nodes=<node name> | --nodelist=<node name> |
Request node feature | -l nodes=<count>:<feature> | --constraint=<feature> |
Standard output file | -o <file path> | --output=<file path> |
Standard error file | -e <file path> | --error=<file path> |
Combine stdout/stderr to stdout | -j oe | (default behavior if --error is not specified) |
Copy environment | -V | --export=ALL (default) |
Copy environment variable | -v <variable[=value]> | --export=<variable[=value]> |
Job dependency | -W depend=after:<job ID> | --dependency=after:<job ID> |
Request event notification | -m <events> | --mail-type=<events> |
Email address | -M <email address> | --mail-user=<email address> |
Defer job until the specified time | -a <time> | --begin=<time> |
Node exclusive job | -l naccesspolicy=singlejob | --exclusive |
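As an illustration of how several of these sbatch options combine, here is a hedged example of a job-script preamble. The partition name batch matches the sample script later on this page; the job name, output file, memory request, and e-mail address are placeholders to adapt.
#!/bin/bash
#SBATCH --job-name=example              # job name shown by squeue
#SBATCH --partition=batch               # partition (formerly queue)
#SBATCH --time=1:00:00                  # wall time limit hh:mm:ss
#SBATCH --nodes=1                       # number of nodes
#SBATCH --ntasks-per-node=12            # processes per node
#SBATCH --mem-per-cpu=2G                # memory per processor
#SBATCH --output=example_%j.out         # standard output file (%j expands to the job ID)
#SBATCH --mail-type=BEGIN,END           # notification events
#SBATCH --mail-user=<your email address>   # notification address (placeholder)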
Common Job Commands
Task | Torque Command | Slurm Command |
---|---|---|
Submit a job | qsub <job script> | sbatch <job script> |
Delete a job | qdel <job ID> | scancel <job ID> |
Hold a job | qhold <job ID> | scontrol hold <job ID> |
Release a job | qrls <job ID> | scontrol release <job ID> |
Start an interactive job | qsub -I | salloc |
Start an interactive job with X forwarding | qsub -I -X | salloc --x11 |
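A short hedged example of the Slurm commands from this table in sequence, using the script name from the batch example later on this page and a made-up job ID of 162:
# submit the job script; sbatch prints the assigned job ID
sbatch slurm_job.txt
# hold and release the job (replace 162 with your own job ID)
scontrol hold 162
scontrol release 162
# cancel the job if it is no longer needed
scancel 162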
Monitoring Resources on the Cluster
Task | Torque Command | Slurm Command |
---|---|---|
Queue list / info | qstat -q | sinfo |
Node list | pbsnodes -l | sinfo -N |
Node details | pbsnodes <node name> | scontrol show node <node name> |
Cluster status | qstat -B | sinfo |
Monitoring Jobs
Info | Torque Command | Slurm Command |
---|---|---|
Job status (all) | qstat | squeue |
Job status (by job) | qstat <job ID> | squeue --job <job ID> |
Job status (by user) | qstat -u <user> | squeue --user=<user> |
Job status (only own jobs) | – | squeue --me |
Job status (detailed) | qstat -f <job ID> | scontrol show job <job ID> |
Show expected start time | showstart <job ID> | squeue --start --job <job ID> |
Monitor or review a job’s resource usage | qstat -f <job ID> | sstat -j <job ID> (running jobs), sacct -j <job ID> (completed jobs) |
View job batch script | – | scontrol write batch_script <job ID> |
Valid Job States
Code | State | Meaning |
---|---|---|
CA | Canceled | Job was canceled |
CD | Completed | Job completed |
CF | Configuring | Job resources being configured |
CG | Completing | Job is completing |
F | Failed | Job terminated with non-zero exit code |
NF | Node Fail | Job terminated due to failure of node(s) |
PD | Pending | Job is waiting for compute node(s) |
R | Running | Job is running on compute node(s) |
Job Environment and Environment Variables
Slurm sets its own environment variables within a job, as does Torque. A summary is in the table below.
Info | Torque | Slurm | Notes |
---|---|---|---|
Version | PBS_VERSION | – | Can extract from sbatch --version |
Job name | PBS_JOBNAME | SLURM_JOB_NAME | |
Job ID | PBS_JOBID | SLURM_JOB_ID | |
Batch or interactive | PBS_ENVIRONMENT | – | |
Submit directory | PBS_O_WORKDIR | SLURM_SUBMIT_DIR | Slurm jobs start from the submit directory by default. |
Submit host | PBS_O_HOST | SLURM_SUBMIT_HOST | |
Node file | PBS_NODEFILE | – | A filename and path that lists the nodes a job has been allocated. |
Node list | cat $PBS_NODEFILE | SLURM_JOB_NODELIST | The Slurm variable has a different format to the Torque/PBS one. To get a list of nodes use: scontrol show hostnames $SLURM_JOB_NODELIST |
Walltime | PBS_WALLTIME | – | |
Queue name | PBS_QUEUE | SLURM_JOB_PARTITION | |
Number of nodes allocated | PBS_NUM_NODES | SLURM_JOB_NUM_NODES | |
Number of processes | PBS_NP | SLURM_NTASKS | |
Number of processes per node | PBS_NUM_PPN | SLURM_TASKS_PER_NODE | |
List of allocated GPUs | GPU_DEVICE_ORDINAL | – | |
Requested tasks per node | – | SLURM_NTASKS_PER_NODE | |
Requested CPUs per task | – | SLURM_CPUS_PER_TASK | |
Scheduling priority | – | SLURM_PRIO_PROCESS | |
Job user | – | SLURM_JOB_USER | |
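As a small hedged illustration, the lines below could appear inside a job script to record where and what the job is running; they use only the standard Slurm variables from the table above.
echo "Job $SLURM_JOB_ID ($SLURM_JOB_NAME) in partition $SLURM_JOB_PARTITION"
echo "Submitted from $SLURM_SUBMIT_DIR on $SLURM_SUBMIT_HOST"
echo "Running $SLURM_NTASKS task(s) on $SLURM_JOB_NUM_NODES node(s):"
scontrol show hostnames $SLURM_JOB_NODELIST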
Slurm Documentation
Extensive documentation on Slurm is available at https://slurm.schedmd.com/documentation.html
Batch Cluster Usage
When you log into the cluster, you are connecting to the cluster’s head node. For computational work, users should access the compute nodes by using the batch scheduling system.
The batch scheduling system Slurm allows users to submit job requests using the sbatch command. The jobs will run when resources become available. All output is captured to file by the batch job system, and users can request e-mail notifications when jobs start or end.
To submit a batch job, a user must create a short text file called a batch job script. The batch job script contains instructions for running the batch job and the specific commands the batch job should execute. Batch jobs run as a new user session, so the batch job script should include any commands needed to set up the user session and navigate to the location of data files, etc.
Below is a sample batch job script for a job that will use a single CPU core.
#!/bin/bash
#Comment - to be submitted by: sbatch slurm_job.txt
#SBATCH --time=1:00:00
#SBATCH --nodes=1 --ntasks-per-node=1
#SBATCH --partition=batch
#SBATCH --mail-type=BEGIN,END
#SBATCH --job-name=hello
#Comment - batch job setup complete
#Comment - load a program module, for example Python
module load anaconda-python3
#Comment - path to execution directory, for example
cd $HOME/Desktop
#Comment - program execution line
python helloWorld.py
To submit a batch job script, execute:
sbatch slurm_job.txt
The output of this will include a unique job identifier in the form of a job number.
The batch job script is a Linux shell script, so the first line specifies the shell interpreter to use when running the script.
Lines starting with #SBATCH are instructions to the batch scheduling system.
MPI parallel execution
Parallel program execution in Slurm is launched with the srun command, which replaces mpiexec. Here is an example of an execution line that can be used in a job script:
srun --ntasks=28 ./a.out
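For context, a complete MPI job script might look like the following hedged sketch. The partition name batch comes from the sample script above; the module name openmpi, the project directory, and the executable a.out are placeholders to replace with your own.
#!/bin/bash
#SBATCH --time=1:00:00
#SBATCH --ntasks=28
#SBATCH --partition=batch
#SBATCH --job-name=mpi_hello
#Comment - load an MPI module; the name openmpi is a placeholder, check module avail
module load openmpi
#Comment - change to the directory containing the executable (placeholder path)
cd $HOME/mpi_project
#Comment - srun starts all 28 MPI processes across the allocated nodes
srun --ntasks=28 ./a.out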
GPU and Big Memory Node Access
GPU nodes and the big memory nodes are part of separate partitions (formerly queues).
A GPU node can be accessed by giving the --partition=gpu option:
#SBATCH --partition=gpu
A big memory node can be accessed by giving the --partition=bigmem option:
#SBATCH --partition=bigmem
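A hedged sketch of a GPU batch job is shown below. The --gres=gpu:1 line uses standard Slurm GRES syntax to request one of a node's two V100 GPUs and is an assumption about this cluster's configuration; the module name cuda, the directory, and the executable are placeholders.
#!/bin/bash
#SBATCH --time=1:00:00
#SBATCH --nodes=1 --ntasks-per-node=1
#SBATCH --partition=gpu
#Comment - request one GPU (standard Slurm GRES syntax; assumed for this cluster)
#SBATCH --gres=gpu:1
#SBATCH --job-name=gpu_test
#Comment - module name cuda is a placeholder, check module avail
module load cuda
cd $HOME/gpu_project
./my_gpu_program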
Job Environment and Environment Variables
Environment variables are passed to your job by default in Slurm. The sbatch command can be run with one of these options to override the default behavior:
sbatch --export=NONE
sbatch --export=MYVAR=2
sbatch --export=ALL,MYVAR=2
Job Monitoring and Status Check
Upon submission the scheduler reports back a job ID, for example: Submitted batch job 35.
The job’s progress in the queue can be monitored with the squeue command (see below). If cancellation of the job is required,
scancel <jobid>
will do the trick.
To check the status of a job, the squeue command can be used:
squeue --job <jobid>
The displayed information includes the status of the job.
Code | State | Meaning |
---|---|---|
CA | Canceled | Job was canceled |
CD | Completed | Job completed |
CF | Configuring | Job resources being configured |
CG | Completing | Job is completing |
F | Failed | Job terminated with non-zero exit code |
NF | Node Fail | Job terminated due to failure of node(s) |
PD | Pending | Job is waiting for compute node(s) |
R | Running | Job is running on compute node(s) |
The squeue command also shows the length of time the job has been running.
Details on a specific job can be seen using the command
scontrol show job <jobid>
where <jobid> is the numeric portion of the name returned by the squeue command.
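If job accounting is enabled on the cluster, a completed job’s resource usage can also be reviewed with the sacct command. The line below is a hedged example using standard sacct format fields and a made-up job ID of 162:
sacct -j 162 --format=JobID,JobName,Partition,Elapsed,MaxRSS,State,ExitCode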
#SBATCH OR sbatch Option | Usage |
---|---|
--job-name=<name> -or- -J <name> | Name of the batch job, e.g. --job-name=hello |
--output=<filename> | Slurm collects all job output in a single file rather than having separate files for standard output and error. |
--mail-type=<events> | Mail options: BEGIN, END, FAIL, ALL |
--mail-user=<email address> | Job status E-mail address specification. |
--time=<hh:mm:ss>, --nodes=<count>, --ntasks-per-node=<count> | Resource specifications. There are three main types of resources: nodes, CPU cores on a node, and time. Multiple #SBATCH lines can be used to request these separately. |
--partition=<partition> -or- -p <partition> | Destination - which batch partition (formerly queue) to use. |
--export=<variable=value> | Exports an environment variable to the job, e.g. --export=MYVAR=2 |
Interactive Cluster Usage
Interactive use of the cluster can take place on the head/login node, but since this node is shared by all users and also supports some of the system administration activities on the cluster, interactive use requiring significant CPU time (more than 1 hour) or memory (more than 10 GB) should take place on a compute node.
Interactive Jobs
An interactive batch job is a process that allows a user to get interactive, command line access to a compute node or nodes. Interactive batch jobs use the same syntax as regular batch jobs, but no job script is specified. Slurm allows for different ways to access compute nodes interactively. The following command requests an interactive batch job with the default parameters for all resource requests (1 hour of run time on 1 processor):
/usr/bin/salloc --x11
Environment variables defined in the terminal where the request is submitted will automatically be transferred to the interactive node(s). To request a longer run time or additional nodes, add specific resource requests:
/usr/bin/salloc -t 0:30:00 -N 1 --ntasks-per-node=12 --x11
The above requests one interactive node and 12 compute cores on that node for half an hour.
Additional options to salloc can be seen by typing:
/usr/bin/salloc --help
The big memory nodes can be accessed by typing:
/usr/bin/salloc -p bigmem --x11
The GPU nodes can be accessed by typing:
/usr/bin/salloc -p gpu --x11
When an interactive batch job is submitted, it is evaluated by the scheduling system to determine if the requested resources are available. The user will see output similar to:
salloc: Pending job allocation 162
salloc: job 162 queued and waiting for resources
When a node becomes available and the job starts, the user’s command prompt will change to uniqueID@mualhpcp<n>, where n is the allocated node’s number. When you have finished working on the compute node, type exit to end the job and return to the login node. The interactive job will automatically end once the requested time has elapsed.
If no nodes are available, the command will wait until a node becomes available. If you want to stop waiting and abandon your job request, use Control-C to terminate the request or type
scancel <jobid>
in a different terminal.
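Once the interactive shell opens on a compute node, work proceeds just as in a batch script. A brief hedged example, reusing the Python module and script from the batch section above:
module load anaconda-python3
cd $HOME/Desktop
python helloWorld.py
exit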