Gaea Batch Job Overview

This simple overview explains the elements of a basic job in Gaea. It includes compiling, running, combining, data transfer, and allocation.

Attention

As of 2024, this is dated information referencing Gaea system

versions that have been retired. Use this only as reference material until this document has been revised for the C5 and C6 generations and the GPFS filesystems.

Compiling

Gaea offers PrgEnv-intel, Prg-Env-pgi, and several other modules that make it as easy as possible to get your programs running . PrgEnv-pgi is loaded by default. You compile by calling either cc or ftn, according to the language your code is written in. See Compilers for more detail, especially for compiling multithreaded applications.

You may either compile live in your login shell on a Gaea login node, or in a job in the eslogin queue in the es partition of Gaea’s batch system. To tell a job script to run on the login nodes, specify the following in your script:

#SBATCH --clusters=es
#SBATCH --partition=eslogin
#SBATCH --ntasks=1 `

or, from the sbatch command line:

sbatch --clusters=es --partition=eslogin --ntasks=1 /path/to/compile_script`

Running

Once your executable is compiled and in place with your data on F2, you are ready to submit your compute job. Please submit your compute job to either c3 or c4.

#SBATCH --clusters=c3
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=32 # Gaea charges for node use.  Nodes are 32 cores on c3 and 36 core on c4.

This example will get charged for 4 nodes.

or, from the sbatch command line:

sbatch --clusters=c3 --nodes=4 --ntasks-per-node=32 /path/to/run_script

Your compute job script will run on one of the compute nodes allocated to your job. To run your executable on them use the srun. Your executable and data must reside on F2 as only lustre filesystems are mounted on the compute nodes. Also, your job’s PWD should be in F2 when you run srun. A simple example is shown below:

cd /lustre/f2/scratch/$USER/
srun --nodes=128 --ntasks-per-node=32 /lustre/f2/scratch/$USER/path/to/executable`

Staging/Combining

Staging data to and from model run directories is a common task on Gaea. So is combining model output when your model uses multiple output writers for scalability of your MPI communications. The Local Data Transfer Nodes (LDTNs) are the resource provided for these tasks. Please keep these tasks off of the c3/c4 compute clusters and eslogin nodes. There is a NOAA-developed tool called gcp which is available for data transfers on Gaea. To tell a job script to run on the LDTN nodes, specify the following in your script:

#SBATCH --clusters=es
#SBATCH --partition=ldtn
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1 #set ntasks-per-node to the number of cores your job will need, up to 16

or, from the sbatch command line:

sbatch --clusters=es --partition=ldtn --nodes=1 --ntasks-per-node=1 /path/to/staging_script

Transferring Data to/from Gaea

Data transfers between Gaea and the world outside of Gaea should be performed on the Remote Data Transfer Nodes (RDTNs). There is a NOAA-developed tool called gcp, which is available for data transfers on Gaea. HPSS users are only able to access HPSS from jobs on the RDTNs. To tell a job script to run on the login nodes, specify the following in your script:

#SBATCH --clusters=es
#SBATCH --partition=rdtn
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1 #set ntasks-per-node to the number of cores your job will need, up to 8

or, from the sbatch command line:

sbatch --clusters=es --partition=rdtn --nodes=1 --ntasks-per-node=1 /path/to/trasfer_script

Allocation

Gaea users have default projects. If you are only a member of a single project, or if your experiments always run under your default project, you don’t need to do anything special to run. Users who are members of more than one project need to enter their preferred project via the –account option to sbatch to correctly charge to each experiment’s project.

You can use AIM to request access to new projects. Once access is granted in AIM it can take up to two days to be reflected in Gaea’s Slurm scheduler. If you still don’t have the granted access after two days, please put in a help desk ticket so admins can investigate your issue. To determine your Slurm account memberships, run the command: sacctmgr list associations user=First.Last

To submit jobs to the scheduler under a specific account do the following from the sbatch command line:

sbatch --account=gfdl_z

or add the following to your job script’s #SBATCH headers:

#SBATCH --account=gfdl_z

Running a Simple Job

Here’s an example of a basic script to run on Gaea. It is a skeleton script for c1:c2 to help users who don’t have access to, or prefer not to use, a workflow manager. This script copies everything in the experiment subdirectory from ltfs to fs, runs the experiment, and then copies the changed and new files from fs to ltfs.

Running the Script

This script assumes that the data and executable are staged to /lustre/ltfs/scratch/$USER/$experiment_subdir. The scripts and data are located at /usw/user_scripts/. Use gcp to get the skeleton script from /usw/user_scripts/c1_c2_skeleton to your local home directory:

gcp /usw/user_scripts/c1_c2_skeleton ~$USER/

Use gcp to get other files from /usw/user_scripts/ to your f1 directory

gcp -r /usw/user_scripts/ /lustre/f1/$USER/c1_c2_skeleton

Open the skeleton script. (The comments in the script will help you understand what each item does.)

vim ~$USER/c1_c2_skeleton

Users MUST modify the paths in the ‘#PBS -d’ line and the walltime in the ‘#PBS -l walltime’ line. (i.e /lustre/f1/First.Last/ for -d, and walltime can be set to 20 min for this tutorial) WARNING do not use environment variables like $USER in setting the directory as it will not be available at run time for the script now go to your home directory and submit your job msub c1_c2_skeleton

Once the job is submitted

Once the job is submitted, you can use these commands to check on your job: To view the status of your job

showq -u $USER

The -c flag will show jobs that have completed with exit codes showq -u $USER -c.

To to check a detailed status of your job replace “jobid” with your job’s id. For example: checkjob gaea.123456789. Additionally you add an option, -v, to get more information. checkjob jobid

Once the job is Finished

Once your job is finished you will have a output file in your directory /lustre/f1/$USER You should have a log file (ex. c1_c2_skeleton_gaea.8279963) You should have a folder with the output files (ex. 6307731.c2-sys0.ncrc.gov/c1_c2_skeleton_gaea.8279963/)