Orion, Nebula, and GPU User Notes
Our Starlight Cluster is made up of several partitions (or queues) that can be accessed via SSH to “hpc.charlotte.edu”. This connects the user to one of the Interactive/Submit hosts. “Orion” is the general compute partition and “GPU” is the general GPU partition; both are available to all of our researchers for job submission. “Nebula” is also a general compute partition available to all researchers on our system. It is a virtual partition that runs as a lower-priority overlay, using cores from multiple partitions (including faculty-sponsored systems) to ensure maximum system efficiency. The resources in this partition will vary over time. The Orion, Nebula, and GPU partitions use Slurm for job scheduling. More information about the computing resources available in our various Slurm partitions can be found on the Research Clusters page.
Things to keep in mind
- Jobs should always be submitted to the “Orion”, “Nebula”, or “GPU” partition, unless you have been granted access to additional Faculty partition(s)
- Limits for the Orion partition:
  - 512 active CPU cores
  - 512 active jobs
  - 5000 jobs submitted/queued
  - 30 day max job time
  - If a user submits several jobs that total more than 512 CPU cores, at most 512 cores will become active and the remaining jobs stay queued. Once active jobs exit and free up enough cores, the scheduler releases queued jobs until the 512-core user limit is reached again.
  - If a single job requests more than 512 CPU cores, it will never run.
- Limits for the Nebula partition:
  - 256 active CPU cores
  - 256 active jobs
  - 5000 jobs submitted/queued
  - 48 hour max job time
- Limits for the GPU partition:
  - 8 active GPUs
  - 64 active CPU cores
  - 64 active jobs
  - 5000 jobs submitted/queued
  - 30 day max job time
- If a memory request is not specified, all jobs will default to 2GB per task requested
- Users may run interactively on hpc.charlotte.edu to perform tasks such as transferring data* using SCP or SFTP, code development, and executing short test runs up to about 10 CPU minutes. Tests that exceed 10 CPU minutes should be run as scheduled jobs.
- When using MobaXterm to connect, do not use the “Start local terminal” option. Instead, create and save a new session for HPC and connect via the left menu. The “Start local terminal” option prevents the Duo prompt from displaying and results in continuous prompting for the password.
* For transferring larger amounts of data, please take a look at URC’s Data Transfer Node offering.
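As a rough illustration of the per-user core limit on Orion, the sketch below estimates how many equally sized jobs could be active at once. The job count and cores-per-job values are hypothetical, and the real scheduler also depends on which nodes are free, so treat this as arithmetic only.

```shell
#!/bin/bash
# Hypothetical workload: 40 jobs, each requesting 16 CPU cores.
core_limit=512      # per-user active core limit on Orion (from the limits list)
cores_per_job=16    # assumed for illustration
jobs_submitted=40   # assumed for illustration

# At most floor(core_limit / cores_per_job) such jobs can be active together.
max_active=$(( core_limit / cores_per_job ))
if [ "$max_active" -gt "$jobs_submitted" ]; then
    max_active=$jobs_submitted
fi
queued=$(( jobs_submitted - max_active ))

echo "Active: ${max_active} jobs, queued: ${queued} jobs"
```

With these assumed numbers, 32 jobs would be active and 8 would wait in the queue until cores free up.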
Create a Submit Script for your Compute Job
You can find examples of Slurm submit scripts in /apps/slurm/examples. This is a good starting point if you do not already have a submit script for your job. Make a copy of one that closely resembles your job. If you don’t find one for your application, you can copy one written for another application and modify the execution line to run your application or code. Edit the script using the information below:
To direct a job to one of the two general compute partitions, Orion or Nebula:
#SBATCH --partition=Orion # (directs a job to the Orion partition)
-or-
#SBATCH --partition=Nebula # (directs a job to the Nebula partition)
To make more efficient use of resources, user jobs are submitted with a set of default resource requests, which can be overridden on the sbatch command line or in the job submit script via #SBATCH directives. If not specified by the user, the following defaults are set:
#SBATCH --time=8:00:00 # (Max Job Run time is 8 hours)
#SBATCH --mem-per-cpu=2GB # (Allow up to 2GB of Memory per CPU core requested)
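For instance, a job that needs more than the defaults could override them in its submit script. The 24 hour and 4GB values below are illustrative assumptions, not recommendations.

```shell
#SBATCH --time=24:00:00      # override the 8 hour default (illustrative value)
#SBATCH --mem-per-cpu=4GB    # override the 2GB per core default (illustrative value)
```

The same options can instead be passed on the command line, for example: sbatch --time=24:00:00 --mem-per-cpu=4GB my_script.sh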
Requesting Nodes, CPUs, Memory, and Wall Time
To request 1 node and 16 CPUs (tasks) on that node for your job (please make sure your code is either multithreaded, or MPI before requesting >1 CPU):
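A minimal sketch of directives matching this request, using the --nodes and --ntasks-per-node options that appear elsewhere in these notes:

```shell
#SBATCH --nodes=1             # request one compute node
#SBATCH --ntasks-per-node=16  # request 16 tasks (CPU cores) on that node
```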
For memory, you can request memory per node or memory per CPU. To request memory per node:
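A minimal sketch using the --mem option, which applies to each node in the job. The 64GB figure is only an assumed value for illustration.

```shell
#SBATCH --mem=64GB   # memory per node (illustrative value)
```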
To request memory per CPU:
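A minimal sketch using the --mem-per-cpu option shown in the defaults above. The 4GB figure is only an assumed value for illustration.

```shell
#SBATCH --mem-per-cpu=4GB   # memory per requested CPU core (illustrative value)
```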
The --time option sets the maximum wall-clock time a job will be allowed to run. For example:
#SBATCH --time=48:30:00 # (requests 48 hours, 30 minutes, 0 seconds for your job)
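Note that the requested time must fit within the limit of the target partition (for example, 48 hours on Nebula). The sketch below, built around a hypothetical helper named time_to_seconds, converts an HH:MM:SS request to seconds and compares it with that limit. It assumes the plain HH:MM:SS form only, not the other time formats Slurm accepts.

```shell
#!/bin/bash
# Hypothetical helper: convert an HH:MM:SS string to a number of seconds.
time_to_seconds() {
    echo "$1" | awk -F: '{ print $1 * 3600 + $2 * 60 + $3 }'
}

requested=$(time_to_seconds "48:30:00")
nebula_max=$(( 48 * 3600 ))   # 48 hour max job time on Nebula

if [ "$requested" -gt "$nebula_max" ]; then
    echo "Request exceeds the Nebula limit"
else
    echo "Request fits within the Nebula limit"
fi
```

Here the 48:30:00 request is 30 minutes over the Nebula limit, so such a job would need a shorter --time or the Orion partition.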
Intel Xeon vs AMD EPYC
There is a mix of Intel Xeon and AMD EPYC based compute nodes in Orion. If your code requires a particular CPU, you can specify that as a constraint within your submit script. For example:
#SBATCH --constraint=xeon # (will run your job on an Intel Xeon based compute node)
-or-
#SBATCH --constraint=epyc # (will run your job on an AMD EPYC based compute node)
Here are some additional points regarding the Intel Xeon vs AMD EPYC compute nodes:
- If you are planning to submit a multi-core or multi-node job, you may want to adjust your “--nodes” and/or “--ntasks-per-node” accordingly. Remember: the new AMD EPYC nodes have 64 cores per node, while Orion has a mix of 36-core and 48-core Intel Xeon compute nodes.
- The AMD compute nodes have ~500 GB of RAM each, so adjust your “--mem” value if need be.
- There is also an AMD EPYC based large memory node in Orion, which has 64 cores and 4TB of RAM, in case you would like to test a large memory job on an AMD EPYC-based system.
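As an illustration of the points above, a single-node job meant to fill one standard AMD EPYC node might request the following. The 400GB memory value is an assumption chosen to stay below the ~500 GB available per node; adjust it to what your job actually needs.

```shell
#SBATCH --partition=Orion
#SBATCH --constraint=epyc       # run on an AMD EPYC based compute node
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=64    # all 64 cores of one EPYC node
#SBATCH --mem=400GB             # illustrative value, below the ~500 GB per node
```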
There are a couple of families of Xeon and EPYC CPUs within our partitions. You can direct your job to a particular family of CPU by adding such a constraint to your submit script. For example:
#SBATCH --constraint=caslake # (will run your job on an Intel Xeon "Cascade Lake" based compute node)
-or-
#SBATCH --constraint=rome # (will run your job on an AMD EPYC "Rome" based compute node)
To find out what “Features” are available to use as constraints in your submit script, you can issue the following “sinfo” command:
$ sinfo -p Orion -o "%20N %10c %10m %50f"
NODELIST             CPUS       MEMORY     AVAIL_FEATURES
str-bm1              16         1546051    bigmem,intel,xeon,caslake
str-bm5              64         4127816    bigmem,intel,xeon,broadwell
str-abm1             64         4127515    bigmem,amd,epyc,milan
str-c[1-36,128-167]  48         385092     stdmem,intel,xeon,caslake
str-c[49-69]         36         385092     stdmem,intel,xeon,skylake
str-ac[1-10]         64         515101     stdmem,amd,epyc,rome
This is not exclusive to the “Orion” partition. You can find out the available features of any compute node in any partition that you have access to by simply changing the partition name in the above command.
Parallel Processing with OpenMPI
Slurm supports parallel processing via message passing. To access OpenMPI, load the desired module, for example:
$ module load openmpi/4.1.0
$ mpicc myprogram.c
And include a request for multiple processes in the submit script:
#!/bin/bash
#SBATCH --job-name="MyMPIJob"
#SBATCH --partition=Orion
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=4
#SBATCH --time=00:01:00
module load openmpi/4.1.0
srun --mpi=pmix_v3 /users//myprogram
and submit with sbatch:
$ sbatch my_script.sh
The Slurm options may also be set on the sbatch command line as follows:
$ sbatch --job-name=MyMPIJob --partition=Orion --nodes=4 --ntasks-per-node=4 my_script.sh
In this example, the resource request is for 4 cores (or processes) on each of 4 compute nodes for a total of 16 processes.
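To double-check this arithmetic from inside a job, a submit script could print the expected process count before launching the program. This is only a sketch: it relies on the SLURM_JOB_NUM_NODES and SLURM_NTASKS_PER_NODE environment variables that Slurm sets for batch jobs, and falls back to the values from the example when run outside a job.

```shell
#!/bin/bash
# Fall back to the example's values (4 nodes, 4 tasks per node) outside Slurm.
nodes=${SLURM_JOB_NUM_NODES:-4}
tasks_per_node=${SLURM_NTASKS_PER_NODE:-4}

# Total MPI processes = nodes x tasks per node.
total=$(( nodes * tasks_per_node ))
echo "Expecting ${total} MPI processes (${nodes} nodes x ${tasks_per_node} tasks per node)"
```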
Submitting a GPU Job
Our Starlight cluster has a separate GPU partition, so if you have a job that requires a GPU, you must first remember to set the partition accordingly.
To submit a job to the GPU partition:
#SBATCH --partition=GPU # (Submits job to the GPU partition)
To request 1 node, 8 CPU cores, and 4 GPUs, you would use the following syntax:
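A minimal sketch of directives matching this request, combining the --nodes, --ntasks-per-node, and --gres options used elsewhere in these notes:

```shell
#SBATCH --partition=GPU
#SBATCH --nodes=1              # request one GPU compute node
#SBATCH --ntasks-per-node=8    # request 8 CPU cores on that node
#SBATCH --gres=gpu:4           # request 4 GPUs of any type
```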
Request a particular type of GPU
You can specify the GPU type by modifying the “gres” directive, like so:
#SBATCH --gres=gpu:TitanV:4 # (will reserve 4 Titan V GPUs)
#SBATCH --gres=gpu:TitanRTX:2 # (will reserve 2 Titan RTX GPUs)
#SBATCH --gres=gpu:V100:1 # (will reserve 1 Tesla V100 GPU)
Request a single- or double-precision GPU
You can request a single-precision (FP32) or a double-precision (FP64) GPU by specifying a constraint for your job. For example:
#SBATCH --gres=gpu:1
#SBATCH --constraint=FP32 # (will reserve 1 single-precision GPU)
-or-
#SBATCH --gres=gpu:1
#SBATCH --constraint=FP64 # (will reserve 1 double-precision GPU)
To find out the type and count of GPUs on each node (in the GRES column) and their FP32/FP64 precision (in the AVAIL_FEATURES column), you can use the following “sinfo” command:
$ sinfo -p GPU -o "%15N %10c %10m %15f %20G"
NODELIST        CPUS       MEMORY     AVAIL_FEATURES        GRES
str-gpu[13-14]  32         256899     gpu,FP32,FP64,stdmem  gpu:A100:4
str-gpu[1-2]    16         189364     gpu,FP32,stdmem       gpu:TitanRTX:4
str-gpu3        16         189364     gpu,FP64,stdmem       gpu:TitanV:8
str-gpu4        16         189364     gpu,FP64,stdmem       gpu:V100:4
str-gpu5        16         189364     gpu,FP64,stdmem       gpu:V100:8
str-gpu[15-20]  32         256899     gpu,FP32,stdmem       gpu:A40:4
New GPU nodes may contain a new model of GPU, so the above list may change as we add new (and retire old) GPU compute nodes in the cluster.
Submitting a Job
Once you are satisfied with the contents of your submit script, save it, then submit it to the Slurm Workload Manager. Here are some helpful commands to do so:
Submit Your Job: sbatch submit-script.slurm
Check the Queue: squeue -u [username]
Show a Job’s Detail: scontrol show job -d [job-id]
Cancel a Job: scancel [job-id]
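These commands can be chained in a small wrapper, sketched below. It uses the --parsable option of sbatch, which prints the job ID (optionally followed by a semicolon and the cluster name) instead of the usual message. The helper name parse_job_id is hypothetical, and the Slurm commands only run where sbatch is available.

```shell
#!/bin/bash
# Hypothetical helper: keep only the job ID from "jobid" or "jobid;cluster".
parse_job_id() {
    echo "${1%%;*}"
}

# Only attempt a submission on a host where Slurm is installed.
if command -v sbatch >/dev/null 2>&1; then
    jobid=$(parse_job_id "$(sbatch --parsable submit-script.slurm)")
    squeue -j "$jobid"                # check this job in the queue
    scontrol show job -d "$jobid"     # show the job's details
    # scancel "$jobid"                # uncomment to cancel the job
fi
```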