Orion, Nebula, and GPU User Notes

Our Starlight cluster is made up of several partitions (or queues) that can be accessed via SSH to “hpc.charlotte.edu.” This connects the user to one of the interactive/submit hosts. “Orion” is the general compute partition and “GPU” is the general GPU partition, both of which are available to all of our researchers for job submission. “Nebula” is also a general compute partition available to all researchers on our system. Nebula is a virtual partition running as a lower-priority overlay that ensures maximum system efficiency by using cores from multiple partitions, including faculty-sponsored systems; the resources in this partition will vary over time. The Orion, Nebula, and GPU partitions use Slurm for job scheduling. More information about the computing resources available in our various Slurm partitions can be found on the Research Clusters page.

Things to keep in mind

  • Jobs should always be submitted to the “Orion”, “Nebula”, or “GPU” partition, unless you have been granted access to additional faculty partition(s)
  • Limits for the Orion partition:
    • 512 active CPU cores
    • 512 active jobs
    • 5000 jobs submitted/queued
    • 30 day max job time
    • If a user submits several jobs that total more than 512 CPU cores across all jobs, only a maximum of 512 cores will become active while the remaining jobs stay queued. Once the active jobs exit and free up enough cores, the scheduler will release the queued jobs until the 512-core per-user limit is reached again.
    • If a single job requests >512 CPU cores, it will never run.
  • Limits for the Nebula partition:
    • 256 active CPU cores
    • 256 active jobs
    • 5000 jobs submitted/queued
    • 48 hour max job time
  • Limits for the GPU partition:
    • 8 active GPUs
    • 64 active CPU cores
    • 64 active jobs
    • 5000 jobs submitted/queued
    • 30 day max job time
  • If a memory request is not specified, all jobs will default to 2GB per task requested
  • Users may run interactively on hpc.charlotte.edu to perform tasks such as transferring data* using SCP or SFTP (see the SCP example below), code development, and executing short test runs of up to about 10 CPU minutes. Tests that exceed 10 CPU minutes should be run as scheduled jobs.
  • When using MobaXterm to connect, do not use the “Start local terminal” option. Instead, create and save a new session for HPC and connect via the left menu. The “Start local terminal” option will prevent the Duo prompt from displaying and will result in continuous prompting for the password.

* For transferring larger amounts of data, please take a look at URC’s Data Transfer Node offering.
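
For smaller transfers from your own machine, a basic SCP example looks like the following; the file name and destination are placeholders for your own data and home directory:

$ scp results.tar.gz username@hpc.charlotte.edu:~/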

Theory of Operation

Slurm is a batch scheduling system designed to schedule the computing resources in a cluster of computers. When users of the URC Research Computing system submit a prepared job script using the sbatch command from the interactive nodes (hpc.charlotte.edu), the Slurm scheduler analyzes information about the job from the command line and the #SBATCH directives in the job script to determine the resources required for the job. Slurm also tracks the time of submission and the amount of compute time used recently by each user.
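
For reference, below is a minimal sketch of the kind of job script sbatch accepts; the job name, resource values, and the hostname command are illustrative placeholders (complete examples appear later in this article):

#! /bin/bash
#SBATCH --job-name=example
#SBATCH --partition=Orion
#SBATCH --ntasks=1
#SBATCH --time=00:10:00

# The commands below run on the allocated compute node
hostname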

When a given job will be executed is determined by five things:

  1. The resources requested: Obviously, Slurm can only execute a job when it has the requested resources available.
  2. Time spent in queue: Jobs that have been in the queue longer have higher priority.
  3. Fair share: A user that has received more execution time recently will have lower priority than a user that has used less compute time recently.
  4. Backfilling: If Slurm sees an opportunity to execute a shorter job on resources that would otherwise be idle, it will do so even if it means starting that shorter job before a longer job that has higher priority. This practice is called backfilling. Slurm will not do this if it predicts that the start time of the longer running job will be delayed relative to what it would have been without backfilling. In other words, backfilling helps users running short jobs without hurting users running long jobs.
  5. Per user core limit: In Orion, we limit any one user to 512 cores at a time, which is less than 10% of the total system capacity.
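
To see how factors such as queue age and fair share combine for your own pending jobs, you can use Slurm’s sprio command (the exact columns reported depend on the site’s priority configuration):

$ sprio -u $USER   # list the per-factor priority contributions for your pending jobs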

Some users have expressed concerns that their jobs spend more time in the queue than they should, or that jobs do not start in the order they expect. This behavior becomes clearer when you keep the above principles in mind. If you have been using more resources lately than another user, the other user’s jobs may start before yours (fair share). If some of your jobs request large amounts of memory or all of the cores on multiple nodes, it may take Slurm a long time to find the necessary resources.

Slurm, like any other system, is not perfect. However, we have done our best to set up the system such that resource allocation will be as fair as possible. The most important thing you can do to reduce the time your jobs spend in the queue is to make your resource request as accurate as possible. This includes the execution time, as this is used for backfilling.

The next section covers other miscellaneous tips, and at the end of the article we have an example that demonstrates how these principles work in practice.

Best Practices

Again, the best way to ensure that your job runs as soon as possible is to make sure that the job requests only the resources it needs to execute efficiently. Do not request more cores, memory, GPUs, or wall-clock time than you need. You may also benefit from breaking down large tasks into smaller jobs, especially if your workflow includes elements that can be run multiple times or includes convenient checkpoints.

You can try these things out for yourself to optimize your use of Slurm:

  1. Know what resources you need for a job. If there is an application that you use regularly with a range of input files or parameters, you can use the seff command to examine the resources requested, the resources used, and the percentage of requested resources consumed. Experimenting with different resource requests, or researching how well the application can use multiple cores, in combination with the seff output enables you to tune your resource requests so they are efficient and give the job the best chance of running sooner. For example:
    $ seff 12345678
    Job ID: 12345678
    Cluster: starlight
    User/Group: joeuser/joeuser
    State: COMPLETED (exit code 0)
    Cores: 1
    CPU Utilized: 00:00:15
    CPU Efficiency: 3.67% of 00:06:49 core-walltime
    Job Wall-clock time: 00:06:49
    Memory Utilized: 1.73 GB
    Memory Efficiency: 10.82% of 16.00 GB
    
  2. Experiment with wall clock time and request 10-20% longer than the maximum time you predict it will take to successfully complete the job.
  3. Use care when requesting memory very close to the total memory of a node. The operating system and service processes use a few GB of memory. Note that the maximum amount of memory available for Slurm jobs on a node can vary slightly over time. To obtain the current maximum for the nodes in a partition, you can run “pestat -p <partition_name>” and look at the numbers in the Memsize column.
  4. Consider breaking down your workflow into small, productive chunks. For example, Job Arrays ( https://slurm.schedmd.com/job_array.html ) provide an automated way to structure repetitive runs of an application across multiple variations of input parameters; each element of the array can be scheduled independently as resources become available (a sketch of a job-array script appears after this list). If you have a workflow with a large parallel component preceded or followed by a long sequential component, consider breaking the workflow into multiple jobs with an execution dependency. Below is a simple example of specifying a job dependency. After typing these two commands, both jobs (job1.sh and job2.sh) will have been submitted to Slurm and will gain priority over time, but job2.sh will not begin execution until job1.sh has completed successfully.
    
    $ sbatch job1.sh
    Submitted batch job 12254323
    $ sbatch --dependency=afterok:12254323 job2.sh
  5. Research your application, by looking at documentation and discussion forums or by experimenting, to determine whether a speedup is expected from providing more memory or more cores, and look for the sweet spot beyond which increasing the resource request will not improve run time.
  6. Do not request more than one node for an application unless it can run in parallel on multiple nodes. Generalizing this recommendation, do not request more nodes than the application will use.
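
As mentioned in item 4 above, here is a minimal sketch of a job-array submit script; the array range, input-file naming scheme, and program name are hypothetical and should be adapted to your own workflow:

#! /bin/bash
#SBATCH --job-name=array_example
#SBATCH --partition=Orion
#SBATCH --ntasks=1
#SBATCH --time=02:00:00
#SBATCH --array=1-10

# Each array element receives its own SLURM_ARRAY_TASK_ID, used here to select an input file
./myprogram input_${SLURM_ARRAY_TASK_ID}.dat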

Create a Submit Script for your Compute Job

You can find examples of Slurm submit scripts in the following location: /apps/slurm/examples. This is a good starting point if you do not already have a submit script for your job. Make a copy of one that closely resembles your job. If you don’t find one for your application, you can always copy a script for another application and modify the execution line to run your application or code. Edit the script using the information below:

To direct a job to one of the two general compute partitions, Orion or Nebula:

#SBATCH --partition=Orion # (directs a job to the Orion partition)

-or-

#SBATCH --partition=Nebula # (directs a job to the Nebula partition)

Orion Defaults

To make more efficient use of the resources, user jobs are now submitted with a set of default resource requests, which can be overridden on the sbatch command line or in the job submit script via #SBATCH directives. If not specified by the user, the following defaults are set:

#SBATCH --time=8:00:00 # (Max Job Run time is 8 hours)
#SBATCH --mem-per-cpu=2GB # (Allow up to 2GB of Memory per CPU core requested)
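
These defaults apply only when you do not set the corresponding options yourself; they can be overridden with #SBATCH directives in your script or directly on the sbatch command line. For example (the script name and values are illustrative):

$ sbatch --time=24:00:00 --mem-per-cpu=4GB myjob.sh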

Requesting Nodes, CPUs, Memory, and Wall Time

To request 1 node and 16 CPUs (tasks) on that node for your job (please make sure your code is either multithreaded or MPI-capable before requesting more than 1 CPU):

#SBATCH --nodes=1
#SBATCH --ntasks-per-node=16

For memory, you can request memory per node or memory per CPU. If you would like to request memory per node, the syntax is as follows:

#SBATCH --mem=64GB

To request memory per CPU:

#SBATCH --mem-per-cpu=4GB

Walltime

This determines the maximum amount of wall-clock time a job will be allowed to run. For example:

#SBATCH --time=48:30:00 # (requests 48 hours, 30 minutes, 0 seconds for your job)
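
Putting the directives above together, a complete Orion submit script might look like the following sketch; the job name and executable are placeholders for your own application:

#! /bin/bash
#SBATCH --job-name=MyOrionJob
#SBATCH --partition=Orion
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=16
#SBATCH --mem=64GB
#SBATCH --time=48:30:00

# Replace with the modules and commands your application needs
./myprogram input.dat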

Intel Xeon vs AMD EPYC

There is a mix of Intel Xeon and AMD EPYC based compute nodes in Orion. If your code requires a particular CPU, you can specify that as a constraint within your submit script. For example:

#SBATCH --constraint=xeon # (will run your job on an Intel Xeon based compute node)

-or-

#SBATCH --constraint=epyc # (will run your job on an AMD EPYC based compute node)

Here are some additional points regarding the Intel Xeon vs AMD EPYC compute nodes:

  • If you are planning to submit a multi-core or multi-node job, you may want to adjust your “--nodes” and/or “--ntasks-per-node” values accordingly. Remember: the newer AMD EPYC nodes have 64 cores/node (we have a mix of 36-core and 48-core Intel Xeon compute nodes in Orion)
  • The AMD compute nodes have ~500 GB of RAM each, so adjust your “--mem” value if need be
  • There is also an AMD EPYC based large-memory node in Orion, which has 64 cores and 4 TB of RAM, in case you would like to test a large-memory job on an AMD EPYC-based system
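
If you need to combine node attributes, Slurm allows constraints to be AND-ed together with “&”. For example, assuming the feature names shown in the sinfo output below, the following would direct a job to the AMD EPYC large-memory node:

#SBATCH --constraint="amd&bigmem" # (requests a node with both the "amd" and "bigmem" features)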

There are a couple of families of Xeon and EPYC CPUs within our partitions. You can direct your job to a particular family of CPU by adding such a constraint to your submit script. For example:

#SBATCH --constraint=caslake # (will run your job on an Intel Xeon "Cascade Lake" based compute node)

-or-

#SBATCH --constraint=rome # (will run your job on an AMD EPYC "Rome" based compute node)

To find out what “Features” are available to use as constraints in your submit script, you can issue the following “sinfo” command:

$ sinfo -p Orion -o "%20N %10c %10m %50f"
NODELIST             CPUS       MEMORY     AVAIL_FEATURES
str-bm1              16         1546051    bigmem,intel,xeon,caslake
str-bm5              64         4127816    bigmem,intel,xeon,broadwell
str-abm1             64         4127515    bigmem,amd,epyc,milan
str-c[1-36,128-167]  48         385092     stdmem,intel,xeon,caslake
str-c[49-69]         36         385092     stdmem,intel,xeon,skylake
str-ac[1-10]         64         515101     stdmem,amd,epyc,rome

This is not exclusive to the “Orion” partition. You can find out the available features of any compute node in any partition that you have access to by simply changing the partition name in the above command.

Parallel Processing with OpenMPI

Slurm supports parallel processing via message passing. To use OpenMPI, load the desired module and compile your program, e.g.:

$ module load openmpi/4.1.0
$ mpicc myprogram.c -o myprogram

And include a request for multiple processes in the submit script:

#! /bin/bash

#SBATCH --job-name="MyMPIJob"
#SBATCH --partition=Orion
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=4
#SBATCH --time=00:01:00

module load openmpi/4.1.0
srun --mpi=pmix_v3 /users//myprogram

and submit with sbatch:

$ sbatch my_script.sh

The Slurm options may also be set on the sbatch command line, as follows:

$ sbatch --job-name=MyMPIJob --partition=Orion --nodes=4 --ntasks-per-node=4 my_script.sh

In this example, the resource request is for 4 cores (or processes) on each of 4 compute nodes for a total of 16 processes.

Submitting a GPU Job

Our Starlight cluster has a separate GPU partition, so if you have a job that requires a GPU, you must first remember to set the partition accordingly.

To submit a job to the GPU partition:

#SBATCH --partition=GPU # (Submits job to the GPU partition)

To request 1 node, 8 CPU cores, and 4 GPUs, you would use the following syntax:

#SBATCH --nodes=1
#SBATCH --ntasks-per-node=8
#SBATCH --gres=gpu:4
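
A complete GPU submit script combining these directives might look like the following sketch; the job name and executable are placeholders for your own GPU application:

#! /bin/bash
#SBATCH --job-name=MyGPUJob
#SBATCH --partition=GPU
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=8
#SBATCH --gres=gpu:4
#SBATCH --time=12:00:00

# Load any modules your application needs, then run it (placeholder executable)
./my_gpu_program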

Request a particular type of GPU

You can specify the GPU type by modifying the “gres” directive, like so:

#SBATCH --gres=gpu:TitanV:4   # (will reserve 4 Titan V GPUs)
#SBATCH --gres=gpu:TitanRTX:2 # (will reserve 2 Titan RTX GPUs)
#SBATCH --gres=gpu:V100:1     # (will reserve 1 Tesla V100s GPU)

Request a single- or double-precision GPU

You can request a single-precision (FP32) or a double-precision (FP64) GPU by specifying a constraint for your job. For example:

#SBATCH --gres=gpu:1
#SBATCH --constraint=FP32 # (will reserve 1 single-precision GPU)

-or-

#SBATCH --gres=gpu:1
#SBATCH --constraint=FP64 # (will reserve 1 double-precision GPU)

To find out the type and count of GPUs on each node (in the GRES column) and their FP32/FP64 precision features (in the AVAIL_FEATURES column), you can use the following “sinfo” command:

$ sinfo -p GPU -o "%14N %6c %8m %34f %20G"
NODELIST       CPUS   MEMORY   AVAIL_FEATURES                     GRES
str-gpu4       16     189364   gpu,FP64,stdmem                    gpu:V100:4
str-gpu5       16     189364   gpu,FP64,stdmem                    gpu:V100:8
str-gpu[13-14] 32     256899   gpu,FP32,FP64,stdmem               gpu:A100:4
str-gpu[15-20] 32     256899   gpu,FP32,stdmem                    gpu:A40:4
str-gpu21      128    1031654  gpu,FP32,FP64,amd,epyc,rome,bigmem gpu:A100:8
str-gpu[24-25] 64     515178   gpu,FP32,stdmem                    gpu:L40S:4

When we add new GPU nodes to the partition, they may contain a newer model of GPU, so the list above may change as we add new (and retire old) GPU compute nodes.

Submitting a Job

Once you are satisfied with the contents of your submit script, save it, then submit it to the Slurm Workload Manager. Here are some helpful commands to do so:

Submit Your Job: sbatch submit-script.slurm
Check the Queue: squeue -u [username]
Show a Job’s Detail: scontrol show job -d [job-id]
Cancel a Job: scancel [job-id]
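
A typical sequence might look like the following; the job ID shown is illustrative:

$ sbatch submit-script.slurm
Submitted batch job 12345678
$ squeue -u $USER                 # check the state of your queued and running jobs
$ scontrol show job -d 12345678   # show detailed information about the job
$ scancel 12345678                # cancel the job if needed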

More information