Overview
Slurm is an open-source cluster management and job scheduling system, originally developed at Lawrence Livermore National Laboratory. Commercial support is now provided by SchedMD. This document is a basic guide to some of the most useful commands, along with information specific to the RDHPCS systems. The SchedMD site maintains full documentation and basic tutorials.
Some common Slurm commands are summarized in the table below.
Command | Action/Task |
---|---|
squeue | Show the current queue |
sbatch | Submit a batch script |
salloc | Submit an interactive job |
srun | Launch a parallel job |
sinfo | Show node/partition info |
sacct | View accounting information for jobs/job steps |
sacctmgr | View account information |
scancel | Cancel a job or job step |
scontrol | View or modify job configuration |
All Slurm commands have online manual pages viewable via the man command (e.g., man sbatch) and extensive usage information using the --help option (e.g., sinfo --help). See References for links to the SchedMD documentation.
Running a Job
Computational work on the RDHPCS is performed by jobs. Jobs typically consist of several components:
A batch submission script
A binary executable
A set of input files for the executable
A set of output files created by the executable
In general, the process for running a job is to:
Prepare executables and input files.
Write a batch script.
Submit the batch script to the batch scheduler.
Optionally monitor the job before and during execution.
In addition, users can perform interactive work on the compute nodes using the salloc command.
Batch Scripts
The most common way to interact with the batch system is via batch scripts. A batch script is simply a shell script with added directives to request various resources from or provide certain information to the scheduling system. Aside from these directives, the batch script is simply the series of commands needed to set up and run your job.
To submit a batch script, use the command sbatch myjob.sl.
Consider the following batch script:
1 #!/bin/bash
2 #SBATCH -A ABC123
3 #SBATCH -J RunSim123
4 #SBATCH -o %x-%j.out
5 #SBATCH -t 1:00:00
6 #SBATCH -p hera
7 #SBATCH -N 1024
8
9 cd $MEMBERWORK/abc123/Run.456
10 cp $PROJWORK/abc123/RunData/Input.456 ./Input.456
11 srun ...
12 cp my_output_file $PROJWORK/abc123/RunData/Output.456
In the script, Slurm directives are preceded by #SBATCH, making them appear as comments to the shell. Slurm looks for these directives through the first non-comment, non-whitespace line. Options after that will be ignored by Slurm (and the shell).
Line | Description |
---|---|
1 | Shell interpreter line |
2 | Project to charge |
3 | Job name |
4 | Job standard output file (%x is replaced by the job name and %j by the job ID) |
5 | Walltime requested (in HH:MM:SS format) |
6 | Partition (queue) to use |
7 | Number of compute nodes requested |
8 | Blank line |
9 | Change into the run directory |
10 | Copy the input file into place |
11 | Run the job (add layout details) |
12 | Copy the output file to an appropriate location |
Note
The environment variables used in the example script above only indicate locations as described in Summary of Storage Areas; they are not defined on any RDHPCS system.
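The rule that Slurm stops reading directives at the first non-comment, non-whitespace line can be illustrated with a short sketch. This only approximates how sbatch scans a script and is not Slurm's actual parser; the function name active_directives and the demo script contents are hypothetical.

```shell
#!/bin/bash
# Hypothetical demo script: the last #SBATCH line comes after a command,
# so Slurm would not treat it as a directive.
demo_script=$(mktemp)
cat > "$demo_script" <<'EOF'
#!/bin/bash
#SBATCH -A ABC123
#SBATCH -J RunSim123

echo "first command"
#SBATCH -t 1:00:00
EOF

# Print the #SBATCH lines that precede the first non-comment, non-blank line.
active_directives() {
    awk '
        /^[[:space:]]*$/ { next }         # blank line: keep scanning
        /^#SBATCH/       { print; next }  # directive in the header
        /^#/             { next }         # other comment, e.g. the shebang
                         { exit }         # first command: stop scanning
    ' "$1"
}

active_directives "$demo_script"
```

Only the account and job name lines are printed; the time limit placed after the echo command is skipped, which mirrors why all directives belong at the top of the script.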
Interactive Jobs
Most users will find batch jobs an easy way to use the system, as they allow you to “hand off” a job to the scheduler and focus on other tasks while the job waits in the queue and eventually runs. Occasionally, it is necessary to run interactively, especially when developing, testing, modifying, or debugging a code.
Since all compute resources are managed and scheduled by Slurm, it is not possible to simply log into the system and immediately begin running parallel codes interactively. Rather, you must request the appropriate resources from Slurm and, if necessary, wait for them to become available. This is done through an “interactive batch” job. Interactive batch jobs are submitted with the salloc command. Resources are requested via the same options that are passed via #SBATCH in a regular batch script (but without the #SBATCH prefix). For example, to request an interactive batch job with the same resources that the batch script above requests, you would use salloc -A ABC123 -J RunSim123 -t 1:00:00 -p hera -N 1024. Note that there is no option for an output file: you are running interactively, so standard output and standard error are displayed in the terminal.
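As a rough illustration of how batch directives map onto an interactive request, the sketch below turns the #SBATCH lines of a script header into an salloc command line. The header is copied from the batch script shown earlier; the function name salloc_command is hypothetical, and the sketch only recognizes the short -o form of the output option.

```shell
#!/bin/bash
# Header lines taken from the example batch script in this document.
header=$(mktemp)
cat > "$header" <<'EOF'
#!/bin/bash
#SBATCH -A ABC123
#SBATCH -J RunSim123
#SBATCH -o %x-%j.out
#SBATCH -t 1:00:00
#SBATCH -p hera
#SBATCH -N 1024
EOF

# Turn #SBATCH lines into an salloc command line, skipping the output file
# option (-o), which is not needed for an interactive session.
salloc_command() {
    awk '
        BEGIN { printf "salloc" }
        /^#SBATCH/ && $2 != "-o" {
            for (i = 2; i <= NF; i++) printf " %s", $i
        }
        END { printf "\n" }
    ' "$1"
}

salloc_command "$header"
```

The function only prints the command; review it and run it by hand when you want to start the interactive session.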
Note
At times it is useful to use a graphical interface (GUI) while running an interactive job, for example a graphical debugger. To let the interactive job display the graphical interface, supply the --x11 option to salloc.
Common sbatch Options
There are two ways to specify sbatch options. The first is on the command line when using the sbatch command.
$ sbatch --clusters=<cluster> --account=abc123 myrunScript.sh
The second method is to insert directives at the top of the batch script using #SBATCH syntax. For example,
#SBATCH --clusters=<cluster>
#SBATCH --account=abc123
The two methods can be mixed together. However, options specified on the command line always override options specified in the script.
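The precedence rule can be modeled with a small sketch. This is a toy model of the behavior described above, not a call into Slurm: the function effective_option is hypothetical and only understands long options written as --name=value.

```shell
#!/bin/bash
# Hypothetical script header requesting one hour.
script=$(mktemp)
cat > "$script" <<'EOF'
#!/bin/bash
#SBATCH --account=abc123
#SBATCH --time=1:00:00
EOF

# Report the value that would win for a long option such as "time":
# a --name=value argument on the command line, else the script directive.
effective_option() {
    local name="$1" file="$2" arg
    shift 2
    for arg in "$@"; do
        case "$arg" in
            "--$name="*) echo "${arg#*=}"; return ;;
        esac
    done
    sed -n "s/^#SBATCH[[:space:]]*--$name=//p" "$file" | head -n 1
}

effective_option time "$script" --time=0:30:00     # command line wins
effective_option account "$script" --time=0:30:00  # falls back to the script
```

This corresponds to submitting with sbatch --time=0:30:00 myrunScript.sh: the job would get a 30-minute limit while still being charged to the account named in the script.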
The table below summarizes options for submitted jobs. Check the Slurm Man Pages for a more complete list.
Option | Example Usage | Description |
---|---|---|
-A, --account | | Specifies the project to which the job should be charged. |
-t, --time | | Specify a maximum wall clock limit. |
-J, --job-name | | Set the name of the job. |
-N, --nodes | | Request the number of nodes to be allocated to a job. |
-n, --ntasks | | Request a total number of tasks. |
--mem | | Specify the real memory required per node. |
-q, --qos | | Request a quality of service for the job. |
-o, --output | | File where the job’s STDOUT will be directed. |
-e, --error | | File where the job’s STDERR will be directed. |
--mail-user | | Email address to be used for notifications. |
-M, --clusters | | Clusters to submit the job to. |
Note
Gaea uses a federation of clusters that includes the login and DTN cluster (es), the compute clusters (e.g., c5, c6), and the GFDL post-processing and analysis cluster (gfdl). On Gaea, the --clusters option must be specified, and should be specified for many of the Slurm commands.
Slurm Environment Variables
Slurm reads a number of environment variables, many of which can provide the same information as the job options noted above. We recommend using the job options rather than environment variables to specify job options, as it allows you to have everything self-contained within the job submission script (rather than having to remember what options you set for a given job).
Slurm also provides a number of environment variables within your running job. The following table summarizes those that may be particularly useful within your job (e.g. for naming output log files):
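Variable | Description |
---|---|
SLURM_SUBMIT_DIR | The directory from which the batch job was submitted. By default, a new job starts in your home directory. You can get back to the directory of job submission with cd $SLURM_SUBMIT_DIR. |
SLURM_JOB_ID | The job’s full identifier. A common use for SLURM_JOB_ID is to append the job’s ID to the standard output and error files. |
SLURM_JOB_NUM_NODES | The number of nodes requested. |
SLURM_JOB_NAME | The job name supplied by the user. |
SLURM_JOB_NODELIST | The list of nodes assigned to the job. |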
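A minimal sketch of using these variables inside a job script follows. It relies on the standard Slurm variables SLURM_JOB_NAME, SLURM_JOB_ID, SLURM_SUBMIT_DIR and SLURM_JOB_NUM_NODES; the function name job_log_name and the fallback values are assumptions added so the snippet also runs outside a job.

```shell
#!/bin/bash
# Build a log file name from Slurm-provided variables. The fallbacks after
# ":-" are assumptions so the snippet also runs outside a job.
job_log_name() {
    printf '%s-%s.log\n' "${SLURM_JOB_NAME:-interactive}" "${SLURM_JOB_ID:-0}"
}

# Return to the submission directory when running under Slurm.
cd "${SLURM_SUBMIT_DIR:-$PWD}" || exit 1

echo "Logging to $(job_log_name) on ${SLURM_JOB_NUM_NODES:-1} node(s)"
```

Inside a job named RunSim123 with ID 42, job_log_name would print RunSim123-42.log.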
State Codes
A job will transition through several states during its lifetime. Common ones include:
State Code | State | Description |
---|---|---|
CA | Cancelled | The job was explicitly cancelled by the user or system administrator. |
CD | Completed | The job has terminated all processes on all nodes with an exit code of zero. |
F | Failed | The job terminated with a non-zero exit code or other failure condition. |
R | Running | The job currently has an allocation. |
TO | Timeout | The job terminated upon reaching its time limit. |
PD | Pending | The job is awaiting resource allocation. |
OOM | Out Of Memory | The job experienced an out-of-memory error. |
NF | Node Fail | The job terminated due to the failure of one or more allocated nodes. |
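When post-processing job lists in a script, it can help to group these codes. The sketch below is one possible grouping, assuming only the codes listed in this section; the function name classify_state and the three category names are illustrative choices, not Slurm terminology.

```shell
#!/bin/bash
# Classify a state code from the table in this section.
# The categories (active, success, needs_attention) are an illustrative choice.
classify_state() {
    case "$1" in
        PD|R)            echo "active" ;;
        CD)              echo "success" ;;
        CA|F|TO|OOM|NF)  echo "needs_attention" ;;
        *)               echo "unknown" ;;
    esac
}

for code in PD R CD TO; do
    printf '%s -> %s\n' "$code" "$(classify_state "$code")"
done
```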
Job Reason Codes
Reason | Meaning |
---|---|
InvalidQOS | The job’s QOS is invalid. |
InvalidAccount | The job’s account is invalid. |
NonZeroExitCode | The job terminated with a non-zero exit code. |
NodeDown | A node required by the job is down. |
TimeLimit | The job exhausted its time limit. |
SystemFailure | Failure of the Slurm system, a file system, the network, etc. |
JobLaunchFailure | The job cannot be launched. This may be due to a file system problem, invalid program name, etc. |
WaitingForScheduling | No reason has been set for the job yet; it is waiting for the scheduler to determine the appropriate reason. |
Job Dependencies
Slurm supports submitting a job with dependencies that keep it from running until they are met. A simple example is where job X cannot execute until job Y completes. Dependencies are specified with the -d (--dependency) option.
Flag | Meaning |
---|---|
after:job_id[[+time][:job_id[+time]...]] | The job can start after the specified jobs start or are cancelled. The optional +time argument is a number of minutes. If specified, the job cannot start until that many minutes have passed since the listed jobs start/are cancelled. If not specified, there is no delay. |
afterany:job_id[:job_id...] | The job can start after the specified jobs have ended, regardless of exit state. |
afternotok:job_id[:job_id...] | The job can start after the specified jobs terminate in a failed (non-zero) state. |
afterok:job_id[:job_id...] | The job can start after the specified jobs complete successfully. |
singleton | The job can begin after all previously launched jobs with the same name and from the same user have completed. In other words, serialize the running jobs based on username+jobname pairs. |
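A common pattern is to capture the ID of one job and make the next job depend on it. The sketch below shows the string handling involved, assuming job IDs are obtained with sbatch --parsable (which prints the job ID, followed by a semicolon and the cluster name on federated systems). The helper names and the script name postprocess.sl are hypothetical, and the job ID is sample data.

```shell
#!/bin/bash
# Join a dependency type and one or more job IDs into the argument for -d,
# e.g. "afterok:101:102".
dependency_arg() {
    local kind="$1" id out
    shift
    out="$kind"
    for id in "$@"; do
        out="$out:$id"
    done
    echo "$out"
}

# Extract the job ID from "sbatch --parsable" output, which is either
# "jobid" or "jobid;cluster".
parsable_job_id() {
    echo "${1%%;*}"
}

first_id=$(parsable_job_id "12345;c5")   # sample output, not a real job
echo "sbatch -d $(dependency_arg afterok "$first_id") postprocess.sl"
```

In a real workflow the sample string would be replaced by the output of an actual submission, for example first_id=$(parsable_job_id "$(sbatch --parsable myjob.sl)").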
Srun
Your job scripts will usually call srun to run an executable on multiple nodes.
$ srun [OPTIONS... [executable [args...]]]
srun accepts the following options:
Option | Description |
---|---|
-N | Number of nodes to use. |
-n | Total number of MPI tasks (default is 1). |
-c | Logical cores per MPI task (default is 1). When used with |
--threads-per-core | In task layout, use the specified maximum number of hardware threads per core. Must also be set in |
 | If used without |
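As a small worked example of how these options combine, the sketch below composes an srun command from a node count, a number of tasks per node, and a number of cores per task, using the standard -N, -n and -c options. The function name srun_command and the executable ./a.out are placeholders.

```shell
#!/bin/bash
# Compose an srun command for a given node count, tasks per node, and
# cores per task. The executable name ./a.out is a placeholder.
srun_command() {
    local nodes="$1" tasks_per_node="$2" cpus_per_task="$3"
    local total_tasks=$(( nodes * tasks_per_node ))
    echo "srun -N $nodes -n $total_tasks -c $cpus_per_task ./a.out"
}

srun_command 2 8 4
```

With 2 nodes and 8 tasks per node, the total passed to -n is 16, so the function prints srun -N 2 -n 16 -c 4 ./a.out.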
Heterogeneous Jobs
A heterogeneous job is a job in which each component has virtually all job options available including partition, account and QOS (Quality Of Service). For example, part of a job might require four cores and 4 GB for each of 128 tasks while another part of the job would require 16 GB of memory and one CPU.
To run a heterogeneous job, use srun and separate the different components with the colon (:) character. This is similar to mpirun.
srun --ntasks=1 --cpus-per-task=32 ./executable : --ntasks=128 --cpus-per-task=1 ./executable
Monitoring Jobs
The commands squeue, scontrol and scancel from the common Slurm commands table allow users to view, monitor, cancel, and discover information about their jobs on the system.
Show Pending and Running Jobs
Use the squeue command to view a list of current jobs in the queue. See man squeue for more information.
$ squeue -a
To list jobs that belong to a specific user:
$ squeue -u <userid>
Show Completed Jobs
Slurm does not keep completed jobs in squeue. Use sacct to view them:
$ sacct -S 2019-03-01 -E now -a
If you don’t specify the -S and -E options, sacct gives you data from the current day.
Use the sacct command to list jobs that have run within the last 24 hours and to see their statuses (State). A full list of sacct options and job states can be found on the sacct man page.
$ sacct --user $USER --starttime `date --date="yesterday" +%F` -X --format=JobID,JobName%30,Partition,Account,AllocCPUS,State,Elapsed,QOS
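For scripted summaries, sacct output is easier to process in its pipe-delimited form. The sketch below counts jobs per state from JobID|State lines, assuming they were produced with the standard --noheader and --parsable2 options; the function name count_by_state is hypothetical and the sample lines are made up for illustration.

```shell
#!/bin/bash
# Count jobs per state from pipe-delimited "JobID|State" lines on stdin.
count_by_state() {
    awk -F'|' '{ n[$2]++ } END { for (s in n) print s, n[s] }' | sort
}

# Made-up sample in the shape produced by
#   sacct --user "$USER" -X --noheader --parsable2 --format=JobID,State
sample='1001|COMPLETED
1002|FAILED
1003|COMPLETED
1004|TIMEOUT'

printf '%s\n' "$sample" | count_by_state
```

On a system with Slurm, the sample would be replaced by piping the sacct command from the comment directly into count_by_state.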
Getting Details About a Job
Slurm only keeps information about completed jobs available via scontrol for 5 minutes after a job completes. After that time, sacct is the way to get information about completed jobs.
$ scontrol show job <jobid>
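The output of scontrol show job is a series of Key=Value fields, so a single field can be pulled out with standard text tools. The sketch below is one way to do that; the function name job_field is hypothetical, the sample text is abbreviated and made up, and values that contain spaces are not handled.

```shell
#!/bin/bash
# Pull one Key=Value field out of "scontrol show job" style text on stdin.
job_field() {
    tr ' ' '\n' | sed -n "s/^$1=//p" | head -n 1
}

# Abbreviated, made-up sample of scontrol output.
sample='JobId=12345 JobName=RunSim123
   JobState=RUNNING Reason=None Dependency=(null)
   RunTime=00:10:42 TimeLimit=01:00:00'

printf '%s\n' "$sample" | job_field JobState
```

With a real job, the same function could be used as scontrol show job <jobid> | job_field JobState.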
Getting Information About Your Projects
The RDHPCS system administrators have supplied additional tools to help users gather information about their jobs, fairshare, and allocation usage. The tools listed in this section may not be available on all RDHPCS systems.
saccount_params
The saccount_params command shows your current:
Home File System usage/quota (MB)
For each of your projects:
  Compute: FairShare priority value (FairShare rank vs. all other projects), partition access, and available QOSes for all your projects. Include -l (for long) if you want to see the current 30-day allocation, last 30-day usage, and FairShare to 6 digits (saccount_params -l).
  Scratch disk usage/quota (GB), files on disk, and file count quota.
Note
Projects with a windfall allocation of 1 will show an allocation of 0, but you will see the correct Available QOS: windfall. Projects with an allocation of 2 will show an allocation of 1, but you will see the correct Available QOS: Batch, debug, etc.
Note
saccount_params is only available on Hera, Jet, and Orion.
$ saccount_params
Account Params -- Information regarding project associations for userid
Home Quota (/home/userid) Used: 4149 MB Quota: 5120 MB
Project: projid
FairShare=1.000 (91/91)
Partition Access: ALL
Available QOSes: gpuwf,windfall
Directory: /scratch[12]/[portfolio]/projid DiskInUse=206372 GB, Quota=255000 GB, Files=5721717, FileQUota=51000000
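The Directory line in this output can be turned into a quick usage percentage. The sketch below assumes the line keeps the DiskInUse=... GB, Quota=... GB layout shown in the sample above; the function name scratch_percent_used is hypothetical.

```shell
#!/bin/bash
# Percent of scratch quota in use, computed from a saccount_params
# "Directory:" line read on stdin (result is rounded down).
scratch_percent_used() {
    sed -n 's/.*DiskInUse=\([0-9]*\) GB, Quota=\([0-9]*\) GB.*/\1 \2/p' |
        awk '{ print int($1 * 100 / $2) }'
}

# The Directory line from the sample output above.
line='Directory: /scratch[12]/[portfolio]/projid DiskInUse=206372 GB, Quota=255000 GB, Files=5721717, FileQUota=51000000'
echo "$line" | scratch_percent_used
```

For the sample line this prints 80, since 206372 GB of a 255000 GB quota is just under 81 percent. On a system where the tool is installed, saccount_params | scratch_percent_used would print one value per project directory.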
Generating Reports
Several of the Slurm utilities can be used to generate usage reports. Slurm supplies Sreport. The RDHPCS team supplies Shpcrpt. Both tools can generate similar reports.
See the following for more information:
References
Further information about Slurm and its commands is available using man <command> on all RDHPCS systems, or on the SchedMD documentation site.