Overview
Slurm is an open-source cluster management and job scheduling system, originally developed at Lawrence Livermore National Laboratory. Commercial support is now provided by SchedMD. This document is a basic guide to some of the most useful commands, along with information specific to the RDHPCS systems. The SchedMD site maintains full documentation and basic tutorials.
Some common Slurm commands are summarized in the table below.
Command | Action/Task |
---|---|
squeue | Show the current queue |
sbatch | Submit a batch script |
salloc | Submit an interactive job |
srun | Launch a parallel job |
sinfo | Show node/partition info |
sacct | View accounting information for jobs/job steps |
sacctmgr | View account information |
scancel | Cancel a job or job step |
scontrol | View or modify job configuration |
All Slurm commands have online manual pages viewable via the man command (e.g., man sbatch) and extensive usage information using the --help option (e.g., sinfo --help). See References for links to the SchedMD documentation.
Running a Job
Computational work on the RDHPCS is performed by jobs. Jobs typically consist of several components:
A batch submission script
A binary executable
A set of input files for the executable
A set of output files created by the executable
In general, the process for running a job is to:
Prepare executables and input files.
Write a batch script.
Submit the batch script to the batch scheduler.
Optionally monitor the job before and during execution.
In addition, users can perform interactive work on the compute nodes using the salloc command.
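The submit-and-monitor steps above can be sketched as a small shell function. This is only a sketch under stated assumptions, not an RDHPCS-provided tool: the function name submit_and_show is hypothetical, and it relies on sbatch --parsable, which prints just the job ID (optionally followed by ;cluster) instead of a full sentence.

```shell
#!/bin/bash
# submit_and_show SCRIPT [ARGS...]   (hypothetical helper)
# Submits a batch script, then shows the job's entry in the queue.
submit_and_show() {
    local out jobid
    # --parsable prints "jobid" or "jobid;cluster"
    out=$(sbatch --parsable "$@") || return 1
    jobid=${out%%;*}              # drop the optional ";cluster" suffix
    echo "Submitted job ${jobid}"
    squeue -j "${jobid}"
}

# Usage: submit_and_show myjob.sl
```

Capturing the job ID at submission time avoids having to search the queue for it afterwards.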
Batch Scripts
The most common way to interact with the batch system is via batch scripts. A batch script is simply a shell script with added directives to request various resources from or provide certain information to the scheduling system. Aside from these directives, the batch script is simply the series of commands needed to set up and run your job.
To submit a batch script, use the command sbatch myjob.sl.
Consider the following batch script:
1 #!/bin/bash
2 #SBATCH -A ABC123
3 #SBATCH -J RunSim123
4 #SBATCH -o %x-%j.out
5 #SBATCH -t 1:00:00
6 #SBATCH -p hera
7 #SBATCH -N 1024
8
9 cd $MEMBERWORK/abc123/Run.456
10 cp $PROJWORK/abc123/RunData/Input.456 ./Input.456
11 srun ...
12 cp my_output_file $PROJWORK/abc123/RunData/Output.456
In the script, Slurm directives are preceded by #SBATCH, making them appear as comments to the shell. Slurm reads these directives up to the first non-comment, non-whitespace line. Options after that line are ignored by Slurm (and the shell).
Line | Description |
---|---|
1 | Shell interpreter line |
2 | Project to charge |
3 | Job name |
4 | Job standard output file (%x is replaced by the job name and %j by the job ID) |
5 | Walltime requested (in HH:MM:SS format) |
6 | Partition (queue) to use |
7 | Number of compute nodes requested |
8 | Blank line |
9 | Change into the run directory |
10 | Copy the input file into place |
11 | Run the job (add layout details) |
12 | Copy the output file to an appropriate location |
Note
The environment variables in the example script above only stand in for the locations described in Summary of Storage Areas; they are not defined on any RDHPCS system.
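The rule that Slurm stops reading directives at the first non-comment, non-whitespace line can be checked before submitting. The helper below is a rough sketch, not part of Slurm: list_directives is a hypothetical name, and the awk rules only approximate how sbatch scans a script.

```shell
#!/bin/bash
# list_directives FILE   (hypothetical helper)
# Prints the #SBATCH lines that appear before the first line that is
# neither blank nor a comment, i.e. the ones Slurm would still read.
list_directives() {
    awk '
        /^[[:space:]]*$/ { next }          # blank line: keep scanning
        /^#SBATCH/       { print; next }   # directive: report it
        /^[[:space:]]*#/ { next }          # shebang or ordinary comment
        { exit }                           # first command: stop scanning
    ' "$1"
}

# Usage: list_directives myjob.sl
```

A directive that is missing from the output sits below the first command in the script and needs to move up.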
Loading Modules in a batch script
If you loaded modules when building your code, they must be loaded when the job runs as well. This means that you must put the same module commands in your batch scripts that you ran before building your code.
Loading modules in a batch script requires one additional line to make the module commands available in the script:
1 #!/bin/bash
2 #SBATCH ( put directives here )
3
4 source $MODULESHOME/init/bash
5 module load THING1 THING2
Line | Description |
---|---|
1 | Shell interpreter line |
2 | A placeholder for needed SBATCH directives |
4 | The command that will make the module commands available |
5 | An example line for module load |
Module Loading Best Practices
Note
Do Not Load Modules at Shell Initialization.
A shell is invoked whenever you log in interactively, run a batch job, run a cron script, or run a command-line script.
Loading modules in shell initialization scripts can lead to unintended consequences, as the shell’s environment may be different than the one expected. The wrong libraries can be loaded, the wrong tools or tool versions can be used, and even tools provided with the operating system may no longer work properly or may produce strange error messages. For these reasons, we highly recommend that you do not add module loads to your shell’s initialization scripts.
Instead, we recommend that you remove module loads from shell initialization scripts and do one or more of the following:
Add the module loads directly to your batch script or cron scripts
Create a separate script responsible for loading the desired modules and environment. This script can then be invoked/sourced any time you want to set up this specific environment. A command “alias” can also be added to your shell’s initialization scripts. You can then run the alias command to invoke the desired shell environment.
Create a script as described above and have all members of your project invoke/source the exact same script. This will ensure that the exact same modules are used by all users. You can even add “module purge” to the beginning of the script to ensure that only the desired modules are being loaded.
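A shared environment script of the kind described above might look like the following sketch. The function name load_project_env is hypothetical, and THING1 and THING2 are the placeholder module names used earlier in this document, not real modules.

```shell
#!/bin/bash
# load_project_env   (hypothetical function; keep it in a file that all
# project members source from their batch and cron scripts)
load_project_env() {
    # Batch and cron shells may not define the module command yet
    if ! command -v module >/dev/null 2>&1; then
        source "$MODULESHOME/init/bash"
    fi
    module purge                 # start from a clean environment
    module load THING1 THING2    # placeholder module names
}

# Usage, after sourcing the file that defines the function:
#   load_project_env
```

Because the function runs in the calling shell, the loaded modules remain in effect for the rest of the script.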
If you need help implementing these methods, open an RDHPCS help ticket. See Getting Help for details.
Interactive Jobs
Most users will find batch jobs an easy way to use the system, as they can “hand off” a job to the scheduler, allowing them to focus on other tasks while their job waits in the queue and eventually runs. Occasionally, it is necessary to run interactively, especially when developing, testing, modifying or debugging a code.
Since all compute resources are managed and scheduled by Slurm, it is not possible to simply log into the system and immediately begin running parallel codes interactively. Rather, you must request the appropriate resources from Slurm and, if necessary, wait for them to become available. This is done through an “interactive batch” job. Interactive batch jobs are submitted with the salloc command. Resources are requested via the same options that are passed via #SBATCH in a regular batch script (but without the #SBATCH prefix). For example, to request an interactive batch job with the same resources that the batch script above requests, you would use salloc -A ABC123 -J RunSim123 -t 1:00:00 -p hera -N 1024. Note that there is no option for an output file: you are running interactively, so standard output and standard error are displayed on the terminal.
Note
At times it is useful to use a graphical interface (GUI) while running an interactive job, for example a graphical debugger. To let the interactive job display the graphical interface, supply the --x11 option to salloc.
Submitting an Interactive Job
An interactive job is useful for tasks, such as debugging, that require interactive access to a program as it runs. With Slurm there are two ways to run jobs interactively: srun or salloc. We recommend that you use salloc.
For example, to request two nodes for 30 min (with X11 forwarding so that you can use X-windows based tools) you can do the following:
salloc --x11=first -q debug -t 0:30:00 --nodes=2 -A xxxxx-cpu
When you run the salloc command, you won’t get a prompt back until the batch system scheduler can run the job. At that point, the scheduler will drop you into a login session on the head node allocated to your interactive job. You will have a prompt and may run commands, such as your code or debuggers, as desired. Within the allocation, use srun to launch parallel commands on the allocated nodes. salloc is similar to sbatch in that it creates an allocation for you to run in. However, only interactive jobs can be run inside the salloc allocation.
If you need to display X windows back to your desktop screen from within an interactive job, you must use ssh -X when you log in.
If you are using x2go and need to use X windows-based tools, then also do an ssh -X localhost before you issue the salloc command.
Submitting a Job to Run a Command on a Compute Node
Please note that compute-intensive commands put a heavy load on the login nodes and affect all interactive users as a result. The command wgrib is one such example.
A better approach is to request interactive access to a compute node, or simply to submit a job to a compute node without the need for a script.
Instead of running the command on a login node interactively as shown below:
wgrib2 grib_file -bin out.bin
one can simply do:
sbatch -A <acct> -n 1 -t 30 -q debug --wrap "wgrib2 grib_file -bin out.bin"
Note
If this command needs more memory than the default, you may need to add something like --mem=4g (or whatever memory is appropriate).
To run a command that interacts with the user or generates graphical output, you can use srun to run the command on a compute node. For example, to run a Python script that generates an image on a compute node, you can use the following method:
srun --pty --x11 -A nesccmgmt -N 1 -t 30 python myplot.py
See the previous section regarding commands for X11 forwarding.
Submitting a Job with Arguments
If you want to submit a script that accepts arguments, add the arguments after the job file name on the sbatch command line. This follows the Unix convention for passing arguments to a script, as shown in the example below:
sbatch batch.job arg1 arg2
The command above passes arg1 as $1 and arg2 as $2.
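Inside the job file, the arguments are read like any other positional parameters. The sketch below shows one possible batch.job; the job name, variable names, and default values are illustrative assumptions, not taken from an RDHPCS example.

```shell
#!/bin/bash
#SBATCH -J args-demo    # hypothetical job name

# main receives the script's positional parameters unchanged.
main() {
    local input_file="${1:-Input.456}"   # illustrative default
    local n_steps="${2:-10}"             # illustrative default
    echo "input=${input_file} steps=${n_steps}"
}

main "$@"
```

Submitted as sbatch batch.job arg1 arg2, this script would report input=arg1 steps=arg2 in the job's output file.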
Common sbatch Options
There are two ways to specify sbatch options. The first is on the command line when using the sbatch command.
$ sbatch --clusters=<cluster> --account=abc123 myrunScript.sh
The second method is to insert directives at the top of the batch script using #SBATCH syntax. For example,
#SBATCH --clusters=<cluster>
#SBATCH --account=abc123
The two methods can be mixed together. However, options specified on the command line always override options specified in the script.
The table below summarizes options for submitted jobs. Check the Slurm Man Pages for a more complete list.
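Option | Example Usage | Description |
---|---|---|
-A, --account | #SBATCH -A ABC123 | Specifies the project to which the job should be charged. |
-t, --time | #SBATCH -t 1:00:00 | Specify a maximum wall clock limit. |
-J, --job-name | #SBATCH -J RunSim123 | Set the name of the job. |
-N, --nodes | #SBATCH -N 1024 | Request that a number of nodes be allocated to the job. |
-n, --ntasks | #SBATCH -n 1 | Request a total number of tasks. |
--mem | #SBATCH --mem=4g | Specify the real memory required per node. |
-q, --qos | #SBATCH -q debug | Request a quality of service for the job. |
-o, --output | #SBATCH -o %x-%j.out | File where the job’s STDOUT will be directed (%x is replaced by the job name and %j by the job ID). |
-e, --error | #SBATCH -e %x-%j.err | File where the job’s STDERR will be directed (%x is replaced by the job name and %j by the job ID). |
--mail-user | #SBATCH --mail-user=<email> | Email address to be used for notifications. |
-M, --clusters | #SBATCH --clusters=<cluster> | Clusters to submit the job to. |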
Note
Gaea uses a federation of clusters that includes the login and dtn cluster (es), the compute clusters (e.g., c5, c6), and the GFDL post-processing and analysis cluster (gfdl). On Gaea, the --clusters option must be specified for job submission, and should be specified for many of the other Slurm commands.
Specifying Partitions and QOS
RDHPCS systems generally have a default partition and QOS. If you do not specify these parameters, your job will be submitted to the default partition and QOS.
If you wish to choose a different partition or QOS, you will need to specify them as per the table above. Different partitions and QOS are available, depending on the resources you need. For example, jobs that require external network connectivity or HPSS access will generally need to be submitted to the service partition.
QOS is used to specify a priority for the job.
Each partition and QOS has limits, such as the number of nodes and tasks and the wall time. To see what those limits are, run the command sbatch-limits.
The output of this command also shows which combinations of partition and QOS are permitted, as not all combinations are allowed.
Run the command sbatch-limits -h to see the other available options.
Determine and specify a memory limit for your jobs
You can use the report-mem command in your job to get memory usage information from your batch job. Please note that this only works if your job runs successfully to completion.
If you don’t know how much memory your application needs, you can overestimate, or use an entire node to get a successful run, and include the report-mem command in the job. To request all the available memory on the node for a serial job, use the --mem=0 option on the sbatch command.
If your jobs fail with memory errors, your application may need more memory than the job requested.
Serial jobs may share a node with other jobs, so by default each gets only a certain amount of memory. If your application needs more memory than the default, specify the memory your job needs with the --mem= option (for example, --mem=2g specifies 2 GB of memory).
In general, you do not need to specify a memory limit for parallel jobs. When you do need one, you can specify it on the command line with the --mem option, or as an #SBATCH directive within the job file.
For parallel jobs it is not necessary to specify the memory requirement, but if each of your tasks requires more than its share of memory on the node, the only way to get more memory is to spread the same number of tasks on more nodes. The Memory High Water mark information will help you determine how many nodes you would have to use to satisfy the memory requirements of your job. There are a couple of different ways of getting the memory usage information about your job.
Specify a processor layout for your job (uniform layout example)
The simple method of laying out tasks, where all the cores on a node are used with one MPI task per core, works reasonably well for most applications. There are, however, cases where the default amount of memory available per core is insufficient.
In those instances, it is necessary to spread tasks out over more nodes, so that there are fewer MPI tasks on a node than there are cores. The other cores may be left idle, or could be used to speed up the code by using threads.
For example, on a machine with 12 cores per node, the default layout would use all 12 cores on each node. If each task needs twice the default per-core memory, you would place 6 MPI tasks on each node.
In the example below, even though there are 12 cores available on the node, only 6 MPI tasks are placed on each node. Each task then gets double the amount of memory.
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=6
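The arithmetic behind such a layout can be sketched in a few lines of shell. The helper name layout and the memory figures (96000 MB per node, 16000 MB per task) are illustrative assumptions; only the 12 cores per node come from the example above.

```shell
#!/bin/bash
# layout NTASKS CORES_PER_NODE NODE_MEM_MB TASK_MEM_MB   (hypothetical helper)
# Prints the --nodes and --ntasks-per-node directives for a uniform layout.
layout() {
    local ntasks=$1 cores=$2 node_mem=$3 task_mem=$4
    local per_node=$(( node_mem / task_mem ))   # tasks that fit by memory
    if (( per_node > cores )); then
        per_node=$cores                          # at most one task per core
    fi
    if (( per_node < 1 )); then
        echo "a single task does not fit in one node's memory" >&2
        return 1
    fi
    local nodes=$(( (ntasks + per_node - 1) / per_node ))   # round up
    echo "#SBATCH --nodes=${nodes}"
    echo "#SBATCH --ntasks-per-node=${per_node}"
}

# 24 tasks, 12 cores per node, illustrative memory figures
layout 24 12 96000 16000
```

With these inputs, six tasks fit on a node by memory, so the 24 tasks are spread over four nodes, which matches the directives shown above.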
The --cpus-per-task option can be used to specify the layout for a threaded job (e.g. OpenMP). For example, a hybrid MPI/OpenMP job where each MPI process uses 2 threads:
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=3
#SBATCH --cpus-per-task=2
export OMP_NUM_THREADS=2 # Note that this is needed too!
srun ./myexe
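To keep the thread count and the --cpus-per-task request from drifting apart, the thread count can be derived from the environment instead of being typed twice. This is a sketch that assumes Slurm exports SLURM_CPUS_PER_TASK, which it does only when --cpus-per-task is given; the fallback of 1 is an assumption for runs outside a job.

```shell
#!/bin/bash
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=3
#SBATCH --cpus-per-task=2

# Use the per-task CPU count chosen above; fall back to 1 thread when the
# variable is not set (for example, when testing outside a job).
export OMP_NUM_THREADS="${SLURM_CPUS_PER_TASK:-1}"
echo "OMP_NUM_THREADS=${OMP_NUM_THREADS}"
# srun ./myexe    # launch line from the example above, commented out here
```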
Other examples:
#SBATCH --nodes=12
#SBATCH --ntasks-per-node=1
This example will start the job on 12 nodes, with one task/thread per node.
Note
--nodes=20 is not the same as --nodes=20 --ntasks-per-node=12. By default, one task per node is used. It is best to always explicitly specify the --ntasks-per-node (or --ntasks) value that you need.
Note
You must specify a number of tasks, either with -n (--ntasks) or -N (--nodes) or both. If you do not specify the number of tasks, you will get a job submission error.
Using report-mem utility in batch jobs
To get the maximum amount of memory (also called “Memory High Water mark”) used up to a specific point in your job, you can add the following command to your job file:
report-mem
Typically, the best place to put this command is at the end of your job file, and also at any early exit points if your job is written such that it may exit before the end.
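One way to cover every exit point without repeating the command is a shell EXIT trap. This is a sketch rather than an RDHPCS recommendation: the function name report_on_exit is hypothetical, and the command -v guard is there only so the snippet stays harmless where report-mem is not installed.

```shell
#!/bin/bash
# Run report-mem when the script exits, at the end or at an early exit.
report_on_exit() {
    if command -v report-mem >/dev/null 2>&1; then
        report-mem
    fi
}
trap report_on_exit EXIT

# ... job commands go here ...
```

The trap fires on any exit from the script, so the memory report is produced once regardless of which branch ends the job.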
There may be instances where this is not feasible because you don’t have direct access to the job file. For example, you might be using other scripts to generate job files on the fly, where users can specify launch options in a “config” file. In those instances, you can get a memory report for your parallel jobs using the --epilog option of the srun command, as shown below:
srun --epilog=/apps/local/bin/report-mem wrf.exe
Using report-mem utility on a job that is currently running
If your job is currently running on the system and you would like to find the Memory High Water Mark up to that point, run the report-mem command for that job from a login node, as in the example below:
hfe03.% report-mem -j 4665051
Peak memory usage summary:
min = 11139788 KB
ave = 11181442 KB
max = 11261556 KB
All nodes sorted by peak memory as percentage of limit: (in KB)
% of user user user total total
Node limit max limit current current phys
h16c50 12.0 11261556 94208000 11259356 14455952 97609020
h25c22 11.9 11208488 94208000 11207184 14486172 97609020
h25c17 11.9 11178112 94208000 11177692 14508136 97609024
h25c40 11.8 11152296 94208000 11151424 14451696 97609024
h25c48 11.8 11148416 94208000 11147668 14445588 97609024
h25c20 11.8 11139788 94208000 11139672 14465660 97609024
hfe03.%
Determining the amount of memory used by a process
The techniques above give you the amount of memory used on each node, rather than the amount of memory used by each task.
To find the amount of memory used by each task, use this method:
Submit the job, but use a full node (using sbatch -N 1 ..., for example).
If your execute line is:
./myexe
replace it with
/usr/bin/time ./myexe
If you search the job output for the string “elapsed”, you will find a line resembling the following:
1.34user 15.57system 0:22.76elapsed 74%CPU (0avgtext+0avgdata 7822876 maxresident)k
which shows that this process used approximately 7.8 GB of memory.
Note
The calculation in this case is: 7,822,876 KB × 1,000 bytes/KB = 7,822,876,000 bytes ≈ 7.8 GB.
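The conversion from the maxresident figure to a memory request can also be scripted. The sketch below uses the hypothetical helpers peak_kb and mem_flag, and it assumes 1 MB = 1024 KB for the --mem value, so its result differs slightly from the decimal estimate above; add whatever headroom you consider appropriate.

```shell
#!/bin/bash
# peak_kb: read /usr/bin/time output on stdin, print the maxresident value (KB)
peak_kb() {
    sed -n 's/.*[^0-9]\([0-9][0-9]*\) *maxresident.*/\1/p'
}

# mem_flag KB: print a --mem option in whole megabytes, rounded up
mem_flag() {
    local kb=$1
    echo "--mem=$(( (kb + 1023) / 1024 ))M"
}

# Sample line copied from the output shown above
line='1.34user 15.57system 0:22.76elapsed 74%CPU (0avgtext+0avgdata 7822876 maxresident)k'
mem_flag "$(echo "$line" | peak_kb)"
```

Rounding up keeps the value an integer, which the --mem option requires.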
When you are ready to run the job in production you can request one task and the appropriate amount of memory by doing something like the following:
sbatch --ntasks=1 --mem=8000M ... jobfile
While the suffixes M and G both work, the number specified must be an integer. If you would prefer that the single-core job allocates the entire node, use one of the following options:
#SBATCH --exclusive
or
#SBATCH --nodes=1
The same technique is used for parallel jobs. The main difference will be that you need to replace the launch line as follows.
If your mpi launch command is:
srun ./wrf
you should change that to:
srun -l /usr/bin/time ./wrf
The report will list the amount of memory used by each task. You can calculate the memory used on each node by determining how many tasks were placed on each node.
Shown below is a sample report using the grep command to filter and show only output of interest, sorted by rank order:
hfe03.% grep maxresident osu-osu_mbw_mr-0002-04.o4885268 | sort
0: 15.98user 3.06system 0:19.67elapsed 96%CPU (0avgtext+0avgdata 23928maxresident)k
1: 16.23user 2.68system 0:19.67elapsed 96%CPU (0avgtext+0avgdata 23984maxresident)k
2: 16.42user 2.62system 0:19.67elapsed 96%CPU (0avgtext+0avgdata 23984maxresident)k
3: 16.35user 2.55system 0:19.67elapsed 96%CPU (0avgtext+0avgdata 23868maxresident)k
4: 15.99user 3.13system 0:19.64elapsed 97%CPU (0avgtext+0avgdata 21976maxresident)k
5: 16.24user 2.67system 0:19.64elapsed 96%CPU (0avgtext+0avgdata 23996maxresident)k
6: 16.45user 2.67system 0:19.64elapsed 97%CPU (0avgtext+0avgdata 21952maxresident)k
7: 16.40user 2.57system 0:19.64elapsed 96%CPU (0avgtext+0avgdata 24020maxresident)k
hfe03.%
In this example, each task used approximately 23900 KB (or 23 MB) of memory.
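Reading the per-task figures off by eye becomes tedious for larger jobs, so the same grep output can be summarized with awk. This is a sketch with a hypothetical function name, summarize_tasks; the total it prints is a per-node figure only if all listed tasks ran on the same node.

```shell
#!/bin/bash
# summarize_tasks: read "srun -l /usr/bin/time" output on stdin and print
# the task count, the largest per-task peak, and the total, all in KB.
summarize_tasks() {
    awk '
        match($0, /[0-9]+ *maxresident/) {
            kb = substr($0, RSTART, RLENGTH) + 0   # keep the leading digits
            n++; total += kb
            if (kb > max) max = kb
        }
        END { printf "tasks=%d max_kb=%d total_kb=%d\n", n, max, total }
    '
}

# Usage: grep maxresident <job output file> | summarize_tasks
```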
Slurm Environment Variables
Slurm reads a number of environment variables, many of which can provide the same information as the job options noted above. We recommend using the job options rather than environment variables to specify job options, as it allows you to have everything self-contained within the job submission script (rather than having to remember what options you set for a given job).
Slurm also provides a number of environment variables within your running job. The following table summarizes those that may be particularly useful within your job (e.g. for naming output log files):
Variable | Description |
---|---|
SLURM_SUBMIT_DIR | The directory from which the batch job was submitted. By default, a new job starts in your home directory. You can get back to the directory of job submission with cd $SLURM_SUBMIT_DIR. |
SLURM_JOB_ID | The job’s full identifier. A common use for it is naming output log files. |
SLURM_JOB_NUM_NODES | The number of nodes requested. |
SLURM_JOB_NAME | The job name supplied by the user. |
SLURM_JOB_NODELIST | The list of nodes assigned to the job. |
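A short sketch of how these values might be used inside a job script follows. The variable names are standard Slurm output variables; the fallbacks after ":-" and the log file naming scheme are assumptions that only matter when the snippet runs outside a job.

```shell
#!/bin/bash
# Return to the directory the job was submitted from
cd "${SLURM_SUBMIT_DIR:-$PWD}" || exit 1

# Build a log file name from the job name and job ID
log_name="${SLURM_JOB_NAME:-test}-${SLURM_JOB_ID:-0}.log"

echo "Job ${SLURM_JOB_ID:-0} on ${SLURM_JOB_NUM_NODES:-1} node(s): ${SLURM_JOB_NODELIST:-localhost}"
echo "Log file would be ${log_name}"
```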
State Codes
A job will transition through several states during its lifetime. Common ones include:
Code | State | Description |
---|---|---|
CA | Cancelled | The job was explicitly cancelled by the user or system administrator. |
CD | Completed | Job has terminated all processes on all nodes. Exit code of zero. |
F | Failed | Job terminated with non-zero exit code or other failure condition. |
R | Running | Job currently has an allocation. |
TO | Timeout | Job terminated upon reaching its time limit. |
PD | Pending | Job is awaiting resource allocation. |
OOM | Out Of Memory | Job experienced out of memory error. |
NF | Node Fail | Job terminated due to failure of one or more allocated nodes. |
Job Reason Codes
Reason | Meaning |
---|---|
InvalidQOS | The job’s QOS is invalid. |
InvalidAccount | The job’s account is invalid. |
NonZeroExitCode | The job terminated with a non-zero exit code. |
NodeDown | A node required by the job is down. |
TimeLimit | The job exhausted its time limit. |
SystemFailure | Failure of the Slurm system, a file system, the network, etc. |
JobLaunchFailure | The job cannot be launched. This may be due to a file system problem, invalid program name, etc. |
WaitingForScheduling | No reason has been set for the job yet; it is waiting for the scheduler to determine the appropriate reason. |
Job Dependencies
Slurm supports the ability to submit a job with constraints that will keep it from running until its dependencies are met. A simple example is where job X cannot execute until job Y completes. Dependencies are specified with the -d (--dependency) option to sbatch.
Flag | Meaning |
---|---|
-d after:jobid[+time] | The job can start after the specified jobs start or are cancelled. The optional +time argument is a number of minutes. If specified, the job cannot start until that many minutes have passed since the listed jobs start/are cancelled. If not specified, there is no delay. |
-d afterany:jobid | The job can start after the specified jobs have ended, regardless of exit state. |
-d afternotok:jobid | The job can start after the specified jobs terminate in a failed (non-zero) state. |
-d afterok:jobid | The job can start after the specified jobs complete successfully. |
-d singleton | The job can begin after all previously launched jobs with the same name and from the same user have completed. In other words, running jobs are serialized based on username+jobname pairs. |
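A two-step chain using afterok can be sketched as follows. The function name submit_chain and the script names in the usage line are hypothetical; the sketch relies on sbatch --parsable to capture the first job's ID.

```shell
#!/bin/bash
# submit_chain FIRST SECOND   (hypothetical helper)
# Submits FIRST, then SECOND so that it starts only if FIRST succeeds.
submit_chain() {
    local first=$1 second=$2 out jid
    out=$(sbatch --parsable "$first") || return 1
    jid=${out%%;*}                       # drop an optional ";cluster" suffix
    sbatch -d "afterok:${jid}" "$second"
}

# Usage: submit_chain preprocess.sl simulate.sl
```

If the first job fails, the second job never becomes eligible to run and may need to be cancelled by hand.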
Srun
Your job scripts will usually call srun to run an executable on multiple nodes.
$ srun [OPTIONS... [executable [args...]]]
srun accepts the following options:
Option | Description |
---|---|
-N, --nodes | Number of nodes to use. |
-n, --ntasks | Total number of MPI tasks (default is 1). |
-c, --cpus-per-task | Logical cores per MPI task (default is 1). |
--threads-per-core | In task layout, use the specified maximum number of hardware threads per core. |
--ntasks-per-node | If used without -n, requests that the given number of tasks be invoked on each node. |
Heterogeneous Jobs
A heterogeneous job is a job in which each component has virtually all job options available including partition, account and QOS (Quality Of Service). For example, part of a job might require four cores and 4 GB for each of 128 tasks while another part of the job would require 16 GB of memory and one CPU.
To run a heterogeneous job, use srun and separate the different components with the colon (:) character. This is similar to mpirun.
srun --ntasks=1 --cpus-per-task=32 ./executable : --ntasks=128 --cpus-per-task=1 ./executable
Monitoring Jobs
The commands squeue, scontrol, and scancel from the common Slurm commands table allow users to view, monitor, cancel, and discover information about their jobs on the system.
Show Pending and Running Jobs
Use the squeue command to view a list of current jobs in the queue. See man squeue for more information.
$ squeue -a
To list jobs that belong to a specific user:
$ squeue -u <userid>
Show Completed Jobs
Slurm does not keep completed jobs in squeue. Use sacct to list them instead:
$ sacct -S 2019-03-01 -E now -a
If you don’t specify the -S and -E options, sacct gives you data from the current day.
Use the sacct command to list jobs that have run within the last 24 hours and to see their statuses (State). A full list of sacct options and job states can be found on the sacct man page.
$ sacct --user $USER --starttime `date --date="yesterday" +%F` -X --format=JobID,JobName%30,Partition,Account,AllocCPUS,State,Elapsed,QOS
Getting Details About a Job
Slurm only keeps information about completed jobs available via scontrol for 5 minutes after a job completes. After that time, sacct is the way to get information about completed jobs.
$ scontrol show job <jobid>
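Because scontrol forgets a job shortly after it completes, a lookup that works in both cases has to fall back to sacct. The function below is a sketch with a hypothetical name, job_state; it assumes the scontrol output contains a JobState= field and uses the sacct options -X (allocation only), -n (no header), and -o (column selection).

```shell
#!/bin/bash
# job_state JOBID   (hypothetical helper)
# Asks scontrol first; if the job has aged out, falls back to sacct.
job_state() {
    local jobid=$1 state
    state=$(scontrol show job "$jobid" 2>/dev/null |
            sed -n 's/.*JobState=\([A-Z_]*\).*/\1/p' | head -n 1)
    if [ -z "$state" ]; then
        state=$(sacct -j "$jobid" -X -n -o State | awk 'NR==1 {print $1}')
    fi
    echo "${state:-UNKNOWN}"
}

# Usage: job_state <jobid>
```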
Getting Information About Your Projects
The RDHPCS system administrators have supplied additional tools to help the users gather information concerning their jobs, job’s fairshare, and allocation usage. The tools listed in this section may not be available on all RDHPCS systems.
saccount_params
The saccount_params command shows your current:
Home File System usage/quota (MB)
For each of your projects:
Compute: FairShare priority value (FairShare rank vs. all other projects), partition access, and available QOSes. Include -l (for long) if you want to see the current 30-day allocation, last 30-day usage, and FairShare to 6 digits (saccount_params -l).
Scratch disk usage/quota (GB), files on disk, and file count quota.
Note
Projects with a windfall allocation of 1 will show an allocation of 0, but you will see the correct Available QOS: windfall. Projects with an allocation of 2 will show an allocation of 1, but you will see the correct Available QOS: Batch, debug, etc.
Note
saccount_params is only available on Hera, Jet, and Orion.
$ saccount_params
Account Params -- Information regarding project associations for userid
Home Quota (/home/userid) Used: 4149 MB Quota: 5120 MB
Project: projid
FairShare=1.000 (91/91)
Partition Access: ALL
Available QOSes: gpuwf,windfall
Directory: /scratch[12]/[portfolio]/projid DiskInUse=206372 GB, Quota=255000 GB, Files=5721717, FileQUota=51000000
Generating Reports
Several of the Slurm utilities can be used to generate usage reports. Slurm supplies Sreport. The RDHPCS team supplies Shpcrpt. Both tools can generate similar reports.
See the following for more information:
References
Further information about Slurm and the Slurm commands is available using man <command> on all RDHPCS systems, or on the SchedMD documentation site.