ORION, GPU, and Leo User Notes

Our Starlight Cluster is made up of several partitions (or queues) that can be accessed via SSH to hpc.uncc.edu. This will connect the user to one of the Interactive/Submit hosts. "Orion" is the general compute partition and "GPU" is the general GPU partition, both of which are available to all of our researchers for job submission. The Orion and GPU partitions use Slurm for job scheduling. More information about the computing resources available in our various Slurm partitions can be found on the Research Clusters page.

Things to keep in mind

  • Jobs should always be submitted to the "Orion" or "GPU" partition unless otherwise directed by URC Support
  • Users can have a max of 512 CPU cores active in the Orion partition
  • Users can submit a max of 5000 jobs to the Orion partition
  • If a user submits several jobs that total more than 512 CPU cores, only a maximum of 512 cores will become active while the remaining jobs stay queued. Once the active jobs exit and free up enough cores, the scheduler will release the queued jobs until the 512-core user limit is reached again.
  • If a single job requests >256 CPU cores, it will never run.
  • Users may run interactively on hpc.uncc.edu to perform tasks such as transferring data* using SCP or SFTP (see the example following this list), code development, and executing short test runs of up to about 10 CPU minutes. Tests that exceed 10 CPU minutes should be run as scheduled jobs.
  • When using MobaXterm to connect, do not use the "Start local terminal" option. Instead, create and save a new session for HPC and connect via the left menu. The "Start local terminal" option will prevent the Duo prompt from displaying and will result in continuous prompting for the password.

* For transferring larger amounts of data, please take a look at URC's Data Transfer Node offering.
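
For example, a small data set could be copied to the cluster with SCP from your local machine; the file name and destination path below are only placeholders, so substitute your own:

$ scp mydata.tar.gz <username>@hpc.uncc.edu:/path/to/destination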

Create a Submit Script for your Compute Job

You can find examples of Slurm submit scripts in the following location: /apps/slurm/examples. This is a good starting point if you do not already have a submit script for your job. Make a copy of the one that most closely resembles your job. If none of the examples match your application, copy any of them and modify the execution line to run your application or code. Edit the script using the information below:
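
For example, you might list the available examples and copy one into your home directory to edit (the example file name below is only a placeholder; pick whichever script fits your job):

$ ls /apps/slurm/examples
$ cp /apps/slurm/examples/<example-script> ~/my-job.slurm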

To direct a job to the general compute partition:

#SBATCH --partition=Orion        # (directs job to the general partition, Orion)

Orion Defaults

To make more efficient use of the resources, user jobs are now submitted with a set of default resource requests, which can be overridden on the sbatch command line or in the job submit script via #SBATCH directives. If not specified by the user, the following defaults are set:

#SBATCH --time=8:00:00           # (Max Job Run time is 8 hours)
#SBATCH --mem-per-cpu=2GB        # (Allow up to 2GB of memory per CPU core requested)
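
For example, to override these defaults at submission time instead of in the script (the script name here is only a placeholder):

$ sbatch --time=24:00:00 --mem-per-cpu=4GB my_script.sh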

Requesting Nodes, CPUs, Memory, and Wall Time

To request 1 node and 16 CPUs (tasks) on that node for your job:

#SBATCH --nodes=1
#SBATCH --ntasks-per-node=16

For memory, you can request memory per node or memory per CPU. To request memory per node, the syntax is as follows:

#SBATCH --mem=64GB

To request memory per CPU (for example, 16 tasks per node at 4GB per CPU gives the job 64GB on that node):

#SBATCH --mem-per-cpu=4GB

Walltime

This sets the maximum amount of wall-clock time your job will be allowed to run. For example:

#SBATCH --time=48:30:00      # Requests 48 hours, 30 minutes, 0 seconds for your job
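
Putting these directives together, a minimal Orion submit script might look like the sketch below. The job name and program path are only placeholders; replace them with your own:

#! /bin/bash
#SBATCH --job-name=my_orion_job
#SBATCH --partition=Orion
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=16
#SBATCH --mem=64GB
#SBATCH --time=48:30:00

cd "$SLURM_SUBMIT_DIR"     # run from the directory the job was submitted from
./myprogram                # replace with your own application or code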

Parallel Processing with OpenMPI

Slurm supports parallel processing via message passing. To access OpenMPI, load the desired module, e.g.:

$ module load openmpi

$ mpicc -o myprogram myprogram.c

And include a request for multiple processes in the submit script:

#! /bin/bash

#SBATCH --job-name="MyMPIJob"
#SBATCH --partition=Orion
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=4
#SBATCH --time=00:01:00

module load openmpi/4.0.3
srun --mpi=pmix_v3 /users/<username>/myprogram

and submit with sbatch:

$ sbatch my_script.sh

The Slurm options may also be set on the sbatch command line, as follows:

$ sbatch --job-name=MyMPIJob --partition=Orion --nodes=4 --ntasks-per-node=4 my_script.sh

In this example, the resource request is for 4 cores (or processes) on each of 4 compute nodes for a total of 16 processes.

Submitting a GPU Job

Our Starlight cluster has a separate GPU partition, so if you have a job that requires a GPU, you must first remember to set the partition accordingly.

To submit a job to the GPU partition:

#SBATCH --partition=GPU        # (Submits job to the GPU partition)

To request 1 node, 8 CPU cores, and 4 GPUs, you would use the following syntax:

#SBATCH --nodes=1
#SBATCH --ntasks-per-node=8
#SBATCH --gres=gpu:4

Request a particular type of GPU

You can specify the GPU type by modifying the "gres" directive, like so:

#SBATCH --gres=gpu:TitanV:4       # (will reserve 4 Titan V GPUs)
#SBATCH --gres=gpu:TitanRTX:2     # (will reserve 2 Titan RTX GPUs)
#SBATCH --gres=gpu:V100:1         # (will reserve 1 Tesla V100 GPU)

To find out which types of GPUs, and how many of each, are available on each node in the GPU partition, you can use the "sinfo" command, for example:

$ sinfo -p GPU -o "%15N  %10c  %10m  %15f  %20G"
NODELIST         CPUS        MEMORY      AVAIL_FEATURES   GRES
str-gpu[1-2]     16          189364      gpu,stdmem       gpu:TitanRTX:4
str-gpu3         16          189364      gpu,stdmem       gpu:TitanV:8
str-gpu4         16          189364      gpu,stdmem       gpu:V100:4
str-gpu5         16          189364      gpu,stdmem       gpu:V100:8
str-gpu[13-14]   32          256899      gpu,stdmem       gpu:A100:4
str-gpu[15-20]   32          256899      gpu,stdmem       gpu:A40:4

When we add new GPU nodes to the partition, they may contain a new model of GPU, so the list above may change as we add new GPU compute nodes to (and retire old ones from) the cluster.
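
As a complete example, a GPU submit script that requests a specific GPU type might look like the sketch below. The job name, module, and program path are only placeholders; load whatever modules your application actually needs:

#! /bin/bash
#SBATCH --job-name=my_gpu_job
#SBATCH --partition=GPU
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=8
#SBATCH --gres=gpu:V100:1
#SBATCH --time=04:00:00
#SBATCH --mem=32GB

module load cuda           # placeholder: load the modules your application requires
./my_gpu_program           # replace with your own GPU-enabled application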

Submitting a Job

Once you are satisfied with the contents of your submit script, save it, then submit it to the Slurm Workload Manager. Here are some helpful commands to do so:

Submit Your Job: sbatch submit-script.slurm
Check the Queue: squeue -u <username>
Show a Job's Detail: scontrol show job -d [job-id]
Cancel a Job: scancel [job-id]
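
For example, a typical submit-and-monitor workflow might look like the following (the job ID shown is only illustrative):

$ sbatch submit-script.slurm
Submitted batch job 123456
$ squeue -u <username>
$ scontrol show job -d 123456
$ scancel 123456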

Submitting GPU-based Jobs to Leo

We have also added a new GPU node that has AMD EPYC CPUs and the latest NVIDIA A100 Tensor Core GPUs, all connected via 8-way NVLink.

For submitting GPU-based compute jobs to the Leo partition:

  • This NVIDIA A100 Tensor Core GPU node is in its own Slurm partition named "Leo". Make sure you update your job submit script for the new partition name prior to submitting it.
  • The new GPU node has 128 CPU cores, and 8 x NVIDIA A100 GPUs. One user may take up the entire node.
  • The new GPU node has 1TB of RAM, so adjust your "--mem" value if need be.
  • You do NOT have to take up the entire node. If you submit a 64-core + 4-GPU job, then two such jobs from different researchers can run side by side.
  • Here are some example #SBATCH directives for a GPU job on the A100 node (32 CPU cores, 2 GPUs, 256gb RAM):

#SBATCH --job-name=test_leo_job
#SBATCH --partition=Leo
#SBATCH --gres=gpu:2
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=32
#SBATCH --mem=256gb
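
As with the GPU partition, you can check the resources available on the Leo node with sinfo (output will vary as hardware changes), for example:

$ sinfo -p Leo -o "%15N  %10c  %10m  %15f  %20G"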

 

More information