Hera User Guide

About NESCC

The NOAA Environmental Security Computing Center (NESCC), located in Fairmont, West Virginia, is the location of NOAA’s newest High Performance Computing Data Center. This site provides computing resources to support NOAA’s research in Weather and Climate modeling as well as its other environmental research areas.

There are currently two major systems at NESCC:

  • Hera: A 760 TFLOP Cray compute cluster high performance computing system.

  • HPSS: A 50 Petabyte IBM/Oracle hierarchical storage management system.

In 2025, two systems are scheduled for addition to NESCC:

  • Ursa, based on AMD 9654 with DDN Lustre, should be in place early in 2025.

  • Rhea, based on AMD Turin with DDN Lustre & VAST File Systems, is expected in the second half of the calendar year.

These slides present the schedule and configuration of Ursa and Rhea.

System Overview

  • Capacity of 3,270 trillion floating point operations per second – or 3.27 petaFLOPS

  • The Fine Grain Architecture (FGA) Graphics Processing Units have a total capacity of 2,000 trillion floating point operations per second, or 2.0 petaFLOPS

  • Approximately 45 million core hours per month across 63,840 cores, with a total scratch disk capacity of 18.5 Petabytes.

NESCC is also home to Niagara, a cloud-based computing resource. In addition, Test and Development systems are available through NESCC for system and application testing.

System Configuration

|                          | Hera TCA       | Hera FGA         | Juno TCA       | Juno FGA         |
|--------------------------|----------------|------------------|----------------|------------------|
| CPU Type                 | Intel SkyLake  | Intel Haswell    | Intel SkyLake  | Intel Haswell    |
| CPU Speed (GHz)          | 2.40           | 2.46             | 2.40           | 2.46             |
| Reg Compute Nodes        | 1,328          | 100              | 14             | 2                |
| Cores/Node               | 40             | 20               | 40             | 20               |
| Total Cores              | 53,120         | 2,000            | 560            | 40               |
| Memory/Core (GB)         | 96             | 256              | 90             | 256              |
| Peak FLOPS/Node          | 12             | N/A              | 12             | N/A              |
| Service Code Memory (GB) | 187            | N/A              | 187            | N/A              |
| Total BigMem Nodes       | 268            | N/A              | 268            | N/A              |
| BigMem Node Memory (GB)  | 384            | N/A              | 384            | N/A              |
| CPU FLOPS (TFLOPS)       | 2,672          | 83.1             | 28             | 1.6              |
| GPUs/Node                | N/A            | 8 x P100         | N/A            | 8 x P100         |
| Total GPUs               | N/A            | 800              | N/A            | 16               |
| GPU FLOPS/GPU            | N/A            | 4.7              | N/A            | 4.7              |
| Interconnect             | HDR-100 IB     | FDR-10 (40 Gbps) | HDR-100        | FDR-10 (40 Gbps) |
| Total GPU FLOPS (TFLOPS) | N/A            | 3,760            | N/A            | 75               |

Note

  • The Skylake 6148 CPU has two AVX-512 units and hence a theoretical peak of 32 double precision floating point operations per cycle with a base clock rate for floating point operations of 1.6 GHz.

  • Total FLOPS is a measure of peak, and doesn’t necessarily represent actual performance.

  • Juno is the Test and Development System. Users must be granted specific access to the system for use.

  • The nodes with GPUs are the same as those that were on Theia, but the network has been upgraded to EDR.

Hera Partitions

To specify a partition, use the -p option. For example:

sbatch -p batch ...

The following partitions are defined for Hera:

| Partition | QOS Allowed                    | Billable TRES per Core Performance Factor | Description |
|-----------|--------------------------------|-------------------------------------------|-------------|
| fge       | gpu, gpuwf                     | 158 | For jobs that require nodes with GPUs. See the Specifying QOS table below for more details. There are 100 Haswell nodes, each containing 8 P100 GPUs. Each P100 has 16 GB of memory. |
| hera      | batch, windfall, debug, urgent | 165 | General compute resource. Default if no partition is specified. |
| bigmem    | batch, windfall, debug, urgent | 165 | For large memory jobs; 268 nodes, each with 40 cores and 384 GB of memory. |
| novel     | novel                          | 165 | Partition for novel or experimental jobs where nearly the full system is required. If you need to run a novel job, please submit a help ticket describing what you want to do; we will normally have to arrange time for the job to go through and would like to plan the process with you. Note that if you use the novel partition you must also specify the novel QOS. |
| service   | batch, windfall, debug, urgent | 0   | Serial jobs (max 1 core) with a 24-hour limit. Jobs run on front end nodes that have external network connectivity, which is useful for data transfers or access to external resources like databases. If your workflow pushes or pulls data to/from the HSMS (HPSS), it should run in this partition. See the Login (Front End) Node Usage Policy for important information about using login nodes. |

To see a list of available partitions use the command

$ sinfo -O partition
fge
hera*
service
bigmem
novel

An asterisk (*) indicates the default partition, to which your job will be submitted if you do not specify a partition name at job submission.

General compute jobs: To ensure the systems are used most efficiently, specify all of the general compute resource partitions in your job. This allows the batch scheduler to put your job on the first available resource.
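For example, a job that can run on any general compute resource can list multiple partitions, letting the scheduler pick the first one with free nodes (a minimal sketch; the account and script name are placeholders):

sbatch -p hera,bigmem -A <account> myjob.sh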

Lustre File System Usage

Lustre is a parallel, distributed file system often used to support the requirements for high-performance I/O in large scale clusters by supporting a parallel I/O framework that scales to thousands of nodes and petabytes of storage. Lustre features include high-availability and POSIX compliance.

On the RDHPCS Hera cluster there are two Lustre file systems available for use: /scratch1 and /scratch2

The serial transfer rate of a single stream is generally greater than 1 GB/s but can easily increase to 6.5 GB/s from a single client, and more than 10 GB/s if performed in a properly configured parallel operation.

Lustre Volume and File Count

For efficient resource usage, Hera’s /scratch1 and /scratch2 Lustre file systems have project based volume and file count quotas. Each project has an assigned quota which is shared by all users on the project. File count quotas are implemented to preserve the increased performance of the 2-tier storage architecture, where the first 128 KB of each file is stored on SSD and the remainder, if any, on HDD. Historical data from Jet show that the average file count per GB is ~100. By default, projects on Hera are given a file count quota of 200 files per GB of volume quota or 100,000 files, whichever is higher. Users will receive warning emails when their quota is exceeded. When either the volume or file count quota is exceeded by more than 1.2x, writes will no longer be allowed.

Summary and detailed information on finding your project’s disk volume and file count quota and usage can be found in Getting Information about your Projects.

Volume Quota Increase

If you are approaching your quota, you should first delete old files and/or move files to HPSS tape systems as appropriate. If more volume is still needed, open a Help ticket to request a volume quota increase. Send email to rdhpcs.hera.help@noaa.gov, with the subject line Quota Increase, and a justification, including:

  • Project name.

  • Requested quota. Is the increase request temporary or permanent? If temporary, for how long?

  • Justification, including an analysis of your workload detailing the volume needed

File Count Quota Increase

If you are approaching your file count quota or are running over 200 files/GB, you should first delete old small files. If you want to keep files that are not accessed frequently, tar up many small files into one big file (see the example at the end of this section). If you have an exceptional situation and believe you need a quota increase, open a Help ticket. Send email to rdhpcs.hera.help@noaa.gov that includes the following information:

  • Project name.

  • Justification, including an analysis of your workload detailing the files/GB needed.

  • Requested quota. Is the increase request temporary or permanent? And if temporary, for how long?

It will save time if the request comes directly from the PI or Portfolio Manager. Once requests are approved by the PI they will be reviewed by the Hera resource manager.
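As noted above, combining many small files into a single archive is the quickest way to reduce your file count. A minimal sketch (the directory name is hypothetical; verify the archive before deleting the originals):

$ tar -czf old_run_logs.tar.gz old_run_logs/
$ tar -tzf old_run_logs.tar.gz > /dev/null && rm -rf old_run_logs/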

Lustre

Lustre functionality is divided among four primary components:

  • MDS: Metadata Server

  • MDT: Metadata Target

  • OSS: Object Storage Server

  • OST: Object Storage Target

An MDS assigns and tracks all of the storage locations associated with each file in order to direct file I/O requests to the correct set of OSTs and corresponding OSSs.

An MDT stores the metadata: filenames, directories, permissions, and file layout.

An OSS manages a small set of OSTs by controlling I/O access and handling network requests to them.

An OST is a block storage device, often several disks in a RAID configuration.

Hera Lustre Configuration

All nodes (login and compute) access the Lustre file systems mounted at /scratch1 and /scratch2. Each user has access to one or more directories based on the project(s) of which they are a member, such as:

/scratch[1,2]/${PORTFOLIO}/${PROJECT}/${TASK}

where ${TASK} is often, but not necessarily, the individual user’s login ID, as defined by the project lead.
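For illustration only (the portfolio, project, and task names are hypothetical):

$ cd /scratch1/myportfolio/myproject/First.Last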

The number of servers and targets on each of the two Hera file systems is:

  • 2 MDSs (active/active)

  • 2 MDTs

  • 16 OSSs (active/active, embedded in DDN SFA 18k storage controllers)

  • 122 OSTs (106 are HDDs, 16 are SSDs)

  • 9.1 PiB of usable disk space (df -hP /scratch{1,2})

Since each file system has two metadata targets, each project directory is configured to use one of the MDTs, and projects are spread roughly evenly between the two MDTs. This means that approximately 25% of all Hera projects share each MDT’s metadata resources.

File Operations

When a compute node needs to create or access a file, it requests the associated storage locations from the MDS and the associated MDT. I/O operations then occur directly with the OSSs and OSTs associated with the file, bypassing the MDS. For read operations file data flows from the OSTs to the compute node.

Types of file I/O

With Lustre, an application accesses data in the following ways:

  • Single stream

  • Single stream through a master

  • Parallel

File Striping

A file is split into segments, and consecutive segments are stored on different physical storage devices (OSTs).

Aligned vs Unaligned Stripes

Aligned stripes means that each segment fits fully onto a single OST, and processes accessing the file do so at corresponding stripe boundaries. Unaligned stripes means that some file segments are split across OSTs.

Progressive File Layouts

The /scratch1 and /scratch2 file systems are enabled with a feature called Progressive File Layouts (PFL), which is efficient for the vast majority of use cases. It uses a single stripe count for small files (reducing overhead) and increases the striping as the file gets bigger (increasing bandwidth and balancing capacity), all without any user involvement. These file systems are also augmented by a set of SSD OSTs (described above), and with the PFL capability they are further optimized for small file performance. By default, smaller files are stored completely on SSD, which further decreases random operation latency and allows the HDDs to run more efficiently for streaming reads and writes. The default configuration will automatically stripe and place files in a generally optimal fashion to improve I/O performance for varying file sizes, including the use of SSDs for better small-file performance. The defaults also attempt to make the best use of the SSD targets (which are faster, but have much less capacity than HDDs). More details on PFL are available in the Lustre documentation.

Note

The PFL feature makes much of the information documented below regarding customized striping unnecessary.

Users should not need to adjust stripe count and size on /scratch1 and /scratch2. With PFL enabled, setting your own stripe layout may reduce I/O performance for your files and the overall I/O performance of the file system. If you have already used lfs setstripe commands documented below, you should probably remove the striping that may have already been set.

Here are the steps you should follow if you have any directories that had explicitly set non-default striping:

  1. Remove all lfs setstripe commands from your scripts.

  2. Run the following command, which changes the striping back to the default, for each directory on which you may have set striping:

    $ lfs setstripe -d <dir>
    
  3. Open a help ticket with the subject line /scratchX/<portfolio>/<project> striped directories. We will examine the files and assist with migrating files to an optimal layout if necessary.

Userspace Commands

Lustre provides the lfs utility to query and set access to the file system. For a complete list of available options run lfs help. To get more information on a specific lfs option, run lfs help <option>.

Checking Diskspace

Hera file system allocations are project based. Lustre quotas are tracked and limited by Project ID (usually the same as group ID and directory name). The Project ID is assigned to top-level project directories and will be inherited for all new subdirectories. Tracking and enforcement includes maximum file count, not just capacity. To check your usage details:

  1. Look up your project ID number (not the name)

  2. Query your usage and limits using that number, for a given file system.

$ lfs quota -p <project ID number> /scratchX

User and Group usage (capacity and file count) is tracked but not limited. You can also find your usage and your Unix group’s usage:

$ lfs quota -u <User.Name> /scratchX
$ lfs quota -g <groupname> /scratchX

Note

This is the group that owns the data, regardless of where it is stored in the file system directory hierarchy.

For example, to get a summary of the disk usage for project rtfim:

$ id
uid=5088(rtfim) gid=10052(rtfim) groups=10052(rtfim)...
$ lfs quota -p 10052 /scratch1
Disk quotas for prj 10052 (pid 10052):
Filesystem  kbytes   quota   limit   grace   files   quota   limit   grace
/scratch1       4  1048576 1258291      *      1  100000  120000      *
("kbytes" = usage, "quota" = soft quota, "limit" = hard quota)

Finding Files

The lfs find command is more efficient than the standard find command and may also be faster. For example, to find Fortran source files accessed within the last day:

$ lfs find . -atime -1 -name '*.f90'

Striping Information

You can view the file striping layout with the command:

$ lfs getstripe <filename>

The Hera default configuration uses Progressive File Layout (PFL).

  • The first part of each file is stored on SSD:

    • Up to 256 KB, single stripe

  • As the file grows, it overflows onto HDDs and is striped across progressively more disks:

    • Up to 32 MB on HDD, single stripe

    • Up to 1 GB on HDD, 4-way stripe

    • Up to 32 GB on HDD, 8-way stripe

    • > 32 GB on HDD, 32-way stripe, larger object size

So small files reside on SSDs, and big files get striped progressively wider. The lfs getstripe command above shows the full layout. Typically not all components are instantiated: only the extents which have l_ost_idx (object storage target index) and l_fid (file identifier) listed have actually created objects on the OSTs.
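To see only a directory’s default layout, rather than the layout of every file in it, add the -d option (the path is a placeholder):

$ lfs getstripe -d /scratch1/<portfolio>/<project>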

Warning

Do not attempt to set striping! If you think the default is not working for you, submit a help ticket so we can investigate and assist.

Other lfs Commands

  • lfs cp – to copy files.

  • lfs ls – to list directories and files.

These commands are often quicker as they reduce the number of stat and remote procedure calls needed.

Read Only Access

If a file is only going to be read, open it as O_RDONLY. If you don’t care about the access time, open it as O_RDONLY or O_NOATIME. If you need access time information and you are doing parallel IO, let the master open it as O_RDONLY and all other ranks as O_RDONLY or O_NOATIME.

Avoid Wild Cards

The tar and rm commands are inefficient when operating on a large set of files on Lustre. The reason lies in the time it takes to expand the wildcard. Performing rm -rf * on millions of files could take days and impact all other users. (You should not use a bare * anyway; it is dangerous.) Instead, generate a list of files to be removed or tar-ed, and act on them one at a time or in small sets. For example:

$ lfs find /path/to/old/dir/ -t f -print0 | xargs -0 -P 8 rm -f

Broadcast Stat Between MPI or OpenMP Tasks

If many processes need the information from stat(), do it once, as follows:

  1. Have the master process perform the stat() call.

  2. Then broadcast it to all processes.

Tuning Stripe Count (not typically needed)

Note

The following steps are not typically needed on the Hera Lustre file systems. See the Progressive File Layouts description above. Please open a help ticket prior to changing stripe parameters on your /scratch1 or /scratch2 files.

General Guidelines

It is beneficial to stripe a file when…

  • Your program reads a single large input file and performs the input operation from many nodes at the same time.

  • Your program reads or writes different parts of the same file at the same time. You should stripe these files to prevent all the nodes from reading from the same OST at the same time; this avoids a bottleneck in which your processes try to read from a single set of disks.

  • Your program waits while a large output file is written. You should stripe this large file so the write can be performed in parallel; the write will complete sooner and the amount of time the processors sit idle will be reduced.

  • You have a large file that will not be accessed very frequently. You should stripe this file widely (with a larger stripe count) to balance the capacity across more OSTs. Note that (in the current Lustre version) this requires rewriting the file.

It is not always necessary to stripe files.

If your program periodically writes several small files from each processor, you don’t need to stripe the files because they will be randomly distributed across the OSTs.

Striping Best Practices
  • Newly created files and directories inherit the stripe settings of their parent directories.

  • You can take advantage of this feature by organizing your large and small files into separate directories, then setting a stripe count on the large-file directory so that all new files created in the directory will be automatically striped.

  • For example, to create a directory called dir1 with a stripe count of 8, run:

$ mkdir dir1
$ lfs setstripe -c 8 dir1

You can pre-create a file as a zero-length striped file by running lfs setstripe as part of your job script or as part of the I/O routine in your program. You can then write to that file later. For example, to pre-create the file bigdir.tar with a stripe count of 20, and then add data from the large directory bigdir, run:

$ lfs setstripe -c 20 bigdir.tar
$ tar cf bigdir.tar bigdir

Globally efficient I/O, from a system viewpoint, on a Lustre file system is similar to computational load balancing in a leader-worker programming model, from a user application viewpoint. The Lustre file system can be called upon to service many requests across a striped file system asynchronously, and this works best if best practices, outlined above, are followed. A very large file that is only striped across one or two OSTs can degrade the performance of the entire Lustre system by filling up OSTs unnecessarily. By striping a large file over many OSTs, you increase bandwidth to access the file and can benefit from having many processes operating on a single file concurrently. If all large files accessed by all users are striped, I/O performance levels can be enhanced for all users. Small files should never be striped with large stripe counts, if they are striped at all. A good practice is to make sure small files are written to a directory with a stripe count of 1, effectively no striping.

Increase Stripe Count for Large Files

Set the stripe count of the directory to a large value. This spreads the reads/writes across more OSTs, balancing the load and data.

$ lfs setstripe -c 30 /scratchN/your_project_dir/path/large_files/

Use a Small Stripe Count for Small Files

Place small files on a single OST. Small files will then not be spread out across OSTs.

$ lfs setstripe -c 1 /scratchN/your_project_dir/path/small_files/

Parallel IO Stripe Count

Single shared files should have a stripe count equal to, or a factor of, the number of processes which access the file. If the number of processes in your application is greater than 106 (the number of HDD OSTs), use ‘-c -1’ to use all of the OSTs. The stripe size should be set to allow as much stripe alignment as possible. Try to keep each process accessing as few OSTs as possible.

$ lfs setstripe -s 32m -c 24 /scratchN/your_project_dir/path/parallel/

You can specify the stripe count and size programmatically, by creating an MPI info object.

Single Stream IO
  • Set the stripe count to 1 on a directory.

  • Write all files in this directory.

  • Compute

  • Otherwise set the stripe count to 1 for the file.

$ lfs setstripe -s 1m -c 1 /scratchN/your_project_dir/path/serial/

Applications and Libraries

A number of applications are available on Hera. They should be run on a compute node. They are serial tasks, not parallel, and thus, a single core may be sufficient. If your memory demands are large, it may be appropriate to use an entire node even though you are using only a single core.
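For example, a hedged sketch of the two cases using standard Slurm options (the partition, account, and times are placeholders):

$ salloc -p hera -A <account> -n 1 -t 60              # single core on a shared node
$ salloc -p hera -A <account> -N 1 --exclusive -t 60  # whole node for a memory-hungry serial task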

Using Anaconda Python on Hera

See Installing Miniconda for installation instructions.

Warning

RDHPCS support staff does not have the available resources to support or maintain these packages. You will be responsible for the installation and troubleshooting of the packages you choose to install. Due to architectural and software differences some of the functionality in these packages may not work.

MATLAB

Information is available TBD

Using IDL on Hera

Running IDL can require considerable resources, so it should not be run on a front end node. It is recommended that you run IDL on a compute node, either in a batch job or via an interactive job. If you take a whole node there is no need to use the --mem=<memory> parameter. If you request a single task you will get a shared node, and in that case you should consider using the --mem=<memory> option (since IDL is memory intensive).
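For a shared-node (single-task) request, the memory limit can be given explicitly; a minimal sketch (the values are only illustrative):

$ salloc --ntasks=1 --mem=16G -t 60 -A <account>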

To run IDL on an interactive queue:

$ salloc --x11=first --ntasks=40 -t 60 -A <account>
$ cd <your working directory>
$ module load idl
$ idl      # or idled

IDL can be run from a normal batch job as well.

Multi-Threading in IDL

IDL is a multi-threaded program. By default, the number of threads is set to the number of CPUs present in the underlying hardware. The default number of threads for Hera compute nodes is 48 (the number of virtual CPUs). It should not be run as a serial job with the default thread number, as the threaded program will affect other jobs on the same node.

The number of threads needs to be set to 1 if a job is going to be submitted as a serial job, which can be achieved by setting the environment variable IDL_CPU_TPOOL_NTHREADS to 1, or by setting it with the CPU procedure in IDL: CPU, TPOOL_NTHREADS = 1. If a job requires more than 10 GB of memory, you should run the job on either a bigmem node or a whole node.
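For example, in a bash serial job script (use setenv for csh/tcsh):

export IDL_CPU_TPOOL_NTHREADS=1
idl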

Using ImageMagick on Hera

The ImageMagick module can be loaded on Hera with the following command:

$ module load imagemagick

The module sets an environment variable and adds paths to your environment so you can access the ImageMagick files.

$MAGICK_HOME:

is set to the base directory

$MAGICK_HOME/bin:

is added to your search path

$MAGICK_HOME/man:

is added to your MANPATH

$MAGICK_HOME/lib:

is added to your LD_LIBRARY_PATH

ImageMagick, and the utilities that are part of this package (including convert), should be run on a compute node when processing many files, either via a normal batch job or via an interactive job.
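As a brief illustration, converting files inside a batch or interactive job might look like the following (the file names are placeholders):

$ module load imagemagick
$ convert frame_001.png frame_001.jpg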

Using R on Hera

R is a software environment for statistical computing and graphics. It is available on Hera as a module within the Intel module families. The R module can be loaded on Hera with the following commands:

$ module load intel
$ module load R

R has many contributed packages that can be added to standard R. CRAN, the global repository of open-source packages that extend the capabilities of R, has a complete list of R packages as well as the packages for download.

Due to access restrictions from Hera to the CRAN repository, you may need to download an R package to your local workstation first, then copy it to your space on Hera to install the package as detailed below.

To install a package from the command line:

$ R CMD INSTALL <path_to_file>

To install a package from within R

> install.packages("path_to_file", repos = NULL, type="source")

where path_to_file would represent the full path and file name.

When you try to install a package for the first time, you may get a message similar to:

'lib = "/apps/R/3.2.0-intel-mkl/lib64/R/library"' is not writable
Would you like to use a personal library instead?  (y/n)

Reply with y and it will prompt you for a location.

Libraries

A number of libraries are available on Hera. The following command can be used to list all the available libraries and utilities:

module spider

Using Modules

Hera uses the LMOD hierarchical modules system. LMOD is a Lua based module system that makes it easy to place modules in a hierarchical arrangement. Because of the hierarchy, you may not see all of the available modules when you type the module avail command.
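For example, to locate a package in the hierarchy and expose its dependent modules (the package name is only an example):

$ module spider netcdf     # shows available versions and the modules that must be loaded first
$ module load intel impi   # load the prerequisite compiler and MPI modules
$ module avail             # modules built for that compiler/MPI combination are now listed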

See Modules

Using MPI

Loading the MPI module

There are two MPI implementations available on Hera: Intel MPI and MVAPICH2. We recommend one of the following two combinations:

  • IntelMPI with the Intel compiler

  • MVAPICH2 with the PGI compiler.

One of the MPI modules must be loaded before compiling and running MPI applications. The same modules must also be loaded in your batch jobs before executing a parallel job.

Working with Intel Compilers and IntelMPI

At least one of the MPI modules must be loaded before compiling and running MPI applications. This is done as follows:

$ module load intel impi

Compiling and Linking MPI applications with IntelMPI

For the primary MPI library, IntelMPI, the easiest way to compile applications is to use the appropriate wrappers: mpiifort, mpiicc, and mpiicpc.

$ mpiifort -o hellof hellof.f90
$ mpiicc -o helloc helloc.c
$ mpiicpc -o hellocpp hellocpp.cpp

Note

Please note the extra “i” in the mpiifort, mpiicc, and mpiicpc commands.

Launching MPI applications with IntelMPI

For instructions on how to run MPI applications please refer to Running and Monitoring Jobs.

Launching an MPMD application with intel-mpi-library-documentation

For instructions on how to run MPI applications please refer to Running and Monitoring Jobs.

Launching OpenMP/MPI hybrid jobs with IntelMPI

For instructions on how to run MPI applications please refer to Running and Monitoring Jobs.

Note about MPI-IO and Intel MPI

Intel MPI doesn’t detect the underlying file system by default when using MPI-IO. You have to pass the following variables on to your application:

export I_MPI_EXTRA_FILESYSTEM=on
export I_MPI_EXTRA_FILESYSTEM_LIST=lustre

Additional documentation on Intel MPI

The Intel documentation library has extensive documentation on Intel MPI; see it for specific guides that may be useful.

Using PGI and mvapich2

At least one of the MPI modules must be loaded before compiling and running MPI applications. This is done as follows:

module load pgi mvapich2

Compiling and Linking MPI applications with PGI and MVAPICH2

When compiling with the PGI compilers, please use the wrappers: mpif90, mpif77, mpicc, and mpicxx.

$ mpif90 -o hellof hellof.f90
$ mpicc -o helloc helloc.c
$ mpicpp -o hellocpp hellocpp.cpp

Launching MPI applications with MVAPICH2

For instructions on how to run MPI applications please refer to Running and Monitoring Jobs.

Launching OpenMP/MPI hybrid jobs with MVAPICH2 (TBD)

For instructions on how to run MPI applications please refer to Running and Monitoring Jobs.

Additional documentation on using MVAPICH2

See the MVAPICH User Guide.

Tuning MPI (TBD)

Several options can be used to improve the performance of MPI jobs.

Profiling an MPI application with Intel MPI

Add the following variables to get profiling information from your runs:

export I_MPI_STATS=<num>      # bash/sh; choose a value up to 10
export I_MPI_STATS_SCOPE=col  # bash/sh; statistics for collectives only
setenv I_MPI_STATS <num>      # csh/tcsh; choose a value up to 10
setenv I_MPI_STATS_SCOPE col  # csh/tcsh; statistics for collectives only

The Intel runtime library has the ability to bind OpenMP threads to physical processing units. The interface is controlled using the KMP_AFFINITY environment variable. Thread affinity can have a dramatic effect on the application speed. It is recommended to set KMP_AFFINITY=scatter to achieve optimal performance for most OpenMP applications. For details, review the information in the Intel documentation library.
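For example, in a bash job script (use setenv for csh/tcsh):

export KMP_AFFINITY=scatter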

Intel Trace Analyzer

Intel Trace Analyzer (formerly known as Vampir Trace) can be used for analyzing and troubleshooting MPI programs. Please refer to the documentation. Even though we have modules created for “itac” for this utility, it may be better to follow the vendor’s instructions, as the instructions for more recent versions may differ from when the module was created.

Debugging Codes

Debugging Intel MPI Applications

When troubleshooting MPI applications using Intel MPI, it may be helpful if the debug versions of the Intel MPI library are used. To do this, use one of the following:

$ mpiifort -O0 -g -traceback -check all -fpe0 -link_mpi=dbg ...             # if you are running non-multithreaded application
$ mpiifort -O0 -g -traceback -check all -fpe0 -link_mpi=dbg_mt -openmp ...  # if you are running multi-threaded application

Using the -link_mpi=dbg makes the wrappers use the debug versions of the MPI library, which may be helpful in getting additional traceback information.

In addition to compiling with the options mentioned above, you may be able to get additional traceback information and core files if you change the core file size limit to unlimited (the default core file size is zero, so core file generation is disabled). To enable it, add the following to the shell initialization file in your home directory (the file name and the syntax depend on your login shell):

ulimit -c unlimited            # bash/sh/ksh
limit coredumpsize unlimited   # csh/tcsh

Application Debuggers

A GUI based debugger named DDT by Linaro is available on Hera. Linaro provides detailed documentation.

Note

Since DDT is a GUI debugger, interactions over a wide area network can be extremely slow. You may want to consider using a Remote Desktop, which in our environment is X2GO.

Invoking DDT on Hera with Intel IMPI

Getting access to the compute resources for interactive use

For debugging you will need interactive access to the desired set of compute nodes using salloc with the desired set of resources:

$ salloc --x11=first -N 2 --ntasks=4 -A <project> -t 300 -q batch

At this point you are on a compute node.

Load the desired modules

$ module load intel impi forge

The following is a temporary workaround that is currently needed until it is fixed by the vendor.

$ export ALLINEA_DEBUG_SRUN_ARGS="%jobid% --gres=none --mem-per-cpu=0 -I -W0 --cpu-bind=none"    # bash/sh
$ setenv ALLINEA_DEBUG_SRUN_ARGS "%jobid% --gres=none --mem-per-cpu=0 -I -W0 --cpu-bind=none"    # csh/tcsh

Launch the application with the debugger

% ddt srun -n 4 ./hello_mpi_c-intel-impi-debug

This will open a GUI in which you can do your debugging. Please note that by default DDT saves your current state (breakpoints, etc. are saved for your next debugging session).

Using DDT

Some things should be intuitive, but we recommend you look through the vendor documentation links shown above if you have questions.

Profiling Codes

Linaro Forge

Linaro Forge allows easy profiling of applications. Very brief instructions are included below.

  • Compile with the debug flag

  • Do not move your source files; the path is hardwired and the sources will not be found if relocated

  • Load the forge module with module load forge

  • Run by prefixing the launch command with map --profile

#SBATCH ...
#SBATCH ...

module load intel impi forge

map --profile mpirun -np 8 ./myexe

Then submit the job as you normally do. Once the job has completed, you should find *.map files in your directory.

View those files using the map utility:

module load forge         # If not already loaded
map <map_file>.map

The above command will bring up a graphical viewer to view your profile.

perf-report is another tool that provides profiling capability:

perf-report srun ./a.out

TAU

The TAU Performance System® is a portable profiling and tracing toolkit for performance analysis of parallel programs written in Fortran, C, C++, Java, and Python. It supports application use of MPI and/or OpenMP, and also supports GPUs. Portions of the TAU toolkit are used to instrument code at compile time. Environment variables control a number of things at runtime. A number of controls exist, permitting users to:

  • specify which routines to instrument or to exclude

  • specify loop level instrumentation

  • instrument MPI and/or OpenMP usage

  • throttle controls to limit overhead impact of small, high frequency called routines

  • generate event traces

  • perform memory usage monitoring

The toolkit includes the Paraprof visualizer (a Java app), permitting use on most desktop and laptop systems (Linux, MacOS, Windows) to view instrumentation data. The 3D display can be very useful. Paraprof supports the creation of user defined metrics based on the metrics directly collected (e.g., FLOPS/CYCLE).

The event traces can be displayed with the Vampir, Paraver, or JumpShot tools.

Quick-start Guide for TAU

The Quick-start Guide for TAU only addresses basic usage. Please keep in mind that this is an evolving document!

Find the Quick Start TBD

Tutorial slides for TAU

A set of slides presenting a recipe approach to beginning with Tau is available TBD

MPI and OpenMP support

TAU build supports profiling of both MPI and OpenMP applications.

The Quick-start Guide mentions using Makefile.tau-icpc-papi-mpi-pdt, which supports profiling of MPI applications. For OpenMP profiling you must use Makefile.tau-icpc-papi-mpi-pdt-openmp-opari, which can be used for MPI, OpenMP, or both.
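A hedged sketch of pointing the TAU compiler wrappers at the OpenMP-capable configuration via the TAU_MAKEFILE environment variable (the module name and installation path are assumptions; check the TAU installation on Hera for the actual location):

$ module load tau      # module name assumed
$ export TAU_MAKEFILE=<tau_install_dir>/lib/Makefile.tau-icpc-papi-mpi-pdt-openmp-opari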

Managing Contrib Projects

A /contrib package is one that is maintained by a user on the system. The system staff are not responsible for the use or maintenance of these packages. See Contrib for details.

Fine Grain Architecture (FGA) System

The Fine Grain Architecture (FGA) system has been installed as an addition to Hera to facilitate experimentation with emerging architectures. In addition to the traditional processors, each compute node on the FGA system has multiple GPUs on each node.

The part of the system that doesn’t include the GPUs is generally referred to as the Traditional Computing Architecture (TCA), and these two abbreviations, TCA and FGA, will be used in this document to refer to these two systems.

System Information

  • The FGA system consists of a total of 100 nodes (named tg001 through tg100)

  • Each node has two 10 core Haswell processors (20 cores per node, referred to as Socket0 and Socket1)

  • Each node has 256 GB of memory

  • Each node has 8 Tesla P100 (Pascal) GPUs.

  • GPUs 0-3 are connected to Socket0, and

  • GPUs 4-7 are connected to Socket1

  • The interconnect fabric is a fat tree network; each node has one Mellanox ConnectX-3 IB card, connected to Socket1

  • The FGA system has access to all the same file systems that TCA has

Please note that the network fabric on the FGA system uses Mellanox IB cards, which are different from the Intel TrueScale IB cards on the regular Hera (TCA) nodes; this distinction becomes important because the kernel running on the FGA nodes is different from the TCA kernel.

As an example of how this may impact users: depending on the application, it may be necessary to compile your application on an FGA compute node, by getting access to an interactive compute node in the fge partition.
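For example, an interactive session on an FGA node might be requested with something like the following (the QOS and account depend on your allocation; see the partition table above):

$ salloc -p fge -q gpu -N 1 -A <project> -t 60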

Getting an allocation for FGA resources

All projects with an allocation on Hera have windfall access to FGA resources. All FGA projects (RDARCH portfolio) have windfall access to Hera TCA resources. We are soliciting project requests for compute allocations on the FGA system.

Users interested in an allocation on the fine-grain augmentation may request an FGA allocation by sending a couple of paragraphs (through their PIs if they are not a PI) to the help system.

The paragraphs should contain the following information:

  • The number of node-hours requested.

  • Disk space (in terabytes) requested.

  • A brief description of the project in terms of science objectives and computer science objectives.

  • Planned way to exploit (or learning to exploit) the GPUs.

Note that there are approximately 64,000 node-hours (1,270,000 core-hours) available. Since the intent is to use an entire node (including the GPUs) only full nodes will be available for allocation (although the bookkeeping will be done in core-hours).

Using FGA resources without an allocation

Users that do not have allocations on the FGA system will get access to the FGA system at windfall priority. This means users will be able to submit jobs to the system, but they will only run when the resources are not being used by projects that do have an FGA allocation. This is helpful for users who are interested in exploring the GPU resources for their applications. To use the system in this mode, submit jobs to the fgewf partition and windfall QOS by including the following:

sbatch -p fgewf -q windfall ...

User Environment

Since the FGA is part of Hera, there are no separate login nodes for using the FGA. When you log in to Hera you will be connected to one of the front end nodes on Hera.

There are however some additional software packages and their associated modules that are useful only on the FGA. A couple of examples of this are cuda and mvapich2-gdr libraries.

Compiling and Running Codes on the FGA

Please keep in mind that the software stacks on the FGA machines are slightly different from regular Hera TCA nodes (including the FE nodes) as mentioned above. This is because the TCA and FGA nodes have different network cards, which necessitates that we have different images for these two systems.

Note

We recommend that compilation be done for FGA applications only on a compute node after obtaining a shell on one of the FGA compute nodes by submitting an interactive batch job to the fge or the fgews QoS.

Compiling and Running Codes Using CUDA

Compilation of non-MPI applications may be done either on the front ends or on compute nodes, but generally we recommend compiling on an FGA compute node.

The following module will have to be loaded before compiling and executing cuda programs:

$ module load cuda

Generally you should use the latest cuda module available.

Note

We have limited experience with cuda.

The following flags have been used in sample codes when compiling for the Pascal GPUs:

$ nvcc -gencode arch=compute_60,code=sm_60 mycode.cu

Compiling and Running Codes Using Intel MPI

If you’re using Intel MPI (with or without CUDA; see the note above if you’re using CUDA), compilation may be done on the front ends or on the compute nodes in an interactive batch job. We still recommend compiling on an FGA compute node by submitting an interactive batch job to the fge partition.

Please load the following modules before compilation and also load these modules in the batch job before execution:

$ module load intel impi
$ mpiicc -o mycexe mycode.c
$ mpiifort -o myfxex mycode.f90

Note

Specific versions are listed only as examples; you can load any of the available versions

In addition, the following environment variable will have to be set in the job file before execution (using the syntax appropriate for the shell you are using):

# bash/sh syntax
$ module load intel impi
$ export I_MPI_FABRICS=shm:ofa
$ srun ./myexe

# csh/tcsh syntax
$ module load intel impi
$ setenv I_MPI_FABRICS shm:ofa
$ srun ./myexe

This is necessary because the FGA nodes have Mellanox IB cards, as opposed to the Intel IB cards in the regular Hera nodes. Because of this hardware difference, the software is also different on the FGA nodes; the FGA nodes do not support the TMI fabric setting, which is the default on the regular Hera nodes.

Compiling and Building Codes Using mvapich2-gdr Library

The MVAPICH2-GDR (GDR stands for GPU Direct RDMA) from Ohio State University is available for experimentation and testing on the FGA nodes.

Note

We recommend that compilation be done for FGA applications only on a compute node after obtaining a shell on one of the FGA compute nodes by submitting an interactive batch job to the fge or the fgedebug queue.

Since the wait times for the fge queue are fairly short it should be fine to use just the regular “fge” queue. You need to load the following modules:

$ module load intel cuda mvapich2-gdr    # Please consider using the latest versions of these
$ mpif90 -o myfort.exe myfortcode.f90 -L$CUDALIBDIR -lcuda -lcudart
$ mpicc -o myc.exe    myccode.c

In addition to loading these modules, at execution time you need to set the following environment variables in your job file:

$ module load intel cuda mvapich2-gdr
$ export LD_PRELOAD=$MPIROOT/lib64/libmpi.so
$ mpirun -np $PBS_NP ./myexe

Note

By default the MVAPICH2-GDR library will use GDRCOPY. If for some reason you don’t want to use it, set the environment variable MV2_USE_GPUDIRECT_GDRCOPY=0.

Compiling and Building Codes Using OpenMPI

The OpenMPI implementation of MPI is available for experimentation and testing on the FGA nodes. The currently installed version is the one that came with the PGI compiler, so PGI examples are shown below.

Load the following modules:

$ module load pgi cuda openmpi     # Please consider loading the latest versions of these
$ mpif90 -o myfort.exe myfortcode.f90 -L$CUDALIBDIR -lcuda -lcudart
$ mpicc  -o myc.exe myccode.c

In addition to loading these modules at compile time, load the same modules in your job file at execution time:

$ module load pgi cuda openmpi # Please consider loading the latest versions of these
$ mpirun -np $PBS_NP -hostfile $PBS_NODEFILE ./myexe

The following link has additional information on using OpenMPI, particularly for CUDA enabled applications

Compiling codes with OpenACC directives on Hera

OpenACC directive based programming is available with the PGI compilers. It is best to load the most recent PGI compiler available for this. The example below shows how to compile a serial program that has OpenACC directives:

$ module load pgi cuda        # Please consider loading the latest versions of these
$ pgf90 -acc -ta=nvidia,cc60,nofma -Minfo=accel -Msafeptr myprog.f90

Compiling MPI codes with OpenACC directives on Hera

We have limited experience using these new technologies, so the best we can do at this point is point you to NVIDIA’s web resources. NVIDIA has an advanced OpenACC course that may be useful.

Submitting Batch Jobs to the FGA System

Users who have an FGA-specific allocation can submit jobs to the fge partition. Other users can submit jobs to the fgewf partition, and those jobs will run with windfall priority.

One thing to keep in mind is that, unlike the TCA, the FGA nodes have a maximum of 20 cores per node (Hera TCA has 40 cores per node).

Hints on Rank Placement/Performance Tuning

Note

This section is included below just as a suggestion and is being updated as we learn more. The following information seems to be applicable only to Intel MPI.

Please keep in mind that there are 4 GPUs connected to the first socket and 4 GPUs connected to the second socket. For best performance it will be necessary to pin the MPI processes such that they’re not moving from core to core on the node during the run.

First a simple script for pinning in a straightforward way is shown below, followed by modified examples that were used in the benchmarking run:

#!/bin/bash
#set -x
#
# Assumptions for this script:
#    1) The arguments are: exe and args to the executable
#    2) Local rank 0 is using GPU0, etc.
#    3) If "offset" environment variable is set, that is added to
#   to lrank.  Generally avoid core 0;
#      * Use an offset of  1 to place on first  socket
#      * Use an offset of 11 to place on second socket
#   Note:
#First 4 GPUs connected to the first socket (cores 0-9)
#Last  4 GPUs connected to the second socket (cores 10-19)

let lrank=$PMI_RANK%$PBS_NUM_PPN
let offset=${offset:-0} # set offset to 10 to place on second socket

let pos=$(( lrank + offset ))

numactl -a -l --physcpubind=$pos $*

The job can be launched by using:

mpirun -np ${nranks} ./place.sh $exe

From the experience from the Cray benchmarking team, a couple of examples that achieve the desired pinning are shown below. In the first example, there are 4 MPI ranks on each node, the goal is to pin the 4 ranks to the first socket and specific cores; Also in this example each rank used 2 threads, and hence 2 cores are specified for each rank:

#!/bin/bash
#location of HPL
HPL_DIR=`pwd`
# set -x
# Number of CPU cores
CPU_CORES_PER_RANK=1

export I_MPI_FABRICS=shm:OFA
export I_MPI_PIN=disable
export OMP_NUM_THREADS=$CPU_CORES_PER_RANK
export MKL_NUM_THREADS=$CPU_CORES_PER_RANK

#export CUDA_DEVICE_MAX_CONNECTIONS=12
export CUDA_DEVICE_MAX_CONNECTIONS=12
export CUDA_COPY_SPLIT_THRESHOLD_MB=1

#APP=./xhpl.intel
APP=$exe

#lrank=$OMPI_COMM_WORLD_LOCAL_RANK
let lrank=$PMI_RANK%4

case ${lrank} in
[0])
  export DEV_ID=0
  numactl -a -l --physcpubind=2,6 $APP $*
  ;;
[1])
  export DEV_ID=1
  numactl -a -l --physcpubind=3,7 $APP $*
  ;;
[2])
  export DEV_ID=2
  numactl -a -l --physcpubind=4,8 $APP $*
  ;;
[3])
  export DEV_ID=3
  numactl -a -l --physcpubind=5,9 $APP $*
  ;;
esac

This script is used in the mpirun command. In the example above, the name of the executable is passed in the environment variable “exe”.

As a second example a similar script for pinning to the specific cores on the second socket is shown below:

#!/bin/bash
#location of HPL
HPL_DIR=`pwd`
# set -x
# Number of CPU cores
CPU_CORES_PER_RANK=1

export I_MPI_FABRICS=shm:OFA
export I_MPI_PIN=disable
export OMP_NUM_THREADS=$CPU_CORES_PER_RANK
export MKL_NUM_THREADS=$CPU_CORES_PER_RANK

#export CUDA_DEVICE_MAX_CONNECTIONS=12
export CUDA_DEVICE_MAX_CONNECTIONS=12
export CUDA_COPY_SPLIT_THRESHOLD_MB=1

#APP=./xhpl.intel
APP=$exe

#lrank=$OMPI_COMM_WORLD_LOCAL_RANK
let lrank=$PMI_RANK%4

case ${lrank} in
[0])
  export DEV_ID=4
  numactl -a -l --physcpubind=12,16 $APP $*
  ;;
[1])
  export DEV_ID=5
  numactl -a -l --physcpubind=13,17 $APP $*
  ;;
[2])
  export DEV_ID=6
  numactl -a -l --physcpubind=14,18 $APP $*
  ;;
[3])
  export DEV_ID=7
  numactl -a -l --physcpubind=15,19 $APP $*
  ;;
esac

Rank placement when using mvapich2

For MVAPICH2 the following seems to work to place all the ranks on the second socket. In this example, we are using two nodes and trying to run eight tasks, placing them only on the second socket of each node:

$ setenv MV2_USE_GPUDIRECT_GDRCOPY 1
$ setenv MV2_ENABLE_AFFINITY 1
$ mpirun -np 8 -env MV2_CPU_MAPPING=16:17:18:19 ./$exe | sort -k 4
Hello from rank 00 out of 8; procname = tg001, cpuid = 16
Hello from rank 01 out of 8; procname = tg001, cpuid = 17
Hello from rank 02 out of 8; procname = tg001, cpuid = 18
Hello from rank 03 out of 8; procname = tg001, cpuid = 19
Hello from rank 04 out of 8; procname = tg002, cpuid = 16
Hello from rank 05 out of 8; procname = tg002, cpuid = 17
Hello from rank 06 out of 8; procname = tg002, cpuid = 18
Hello from rank 07 out of 8; procname = tg002, cpuid = 19

Note that the two environment variables shown above are currently not set by default. This is subject to change, and the module may be modified in the future to set them by default.

For more details, see the MVAPICH2 user guide.

Using Nvidia Multi-Process Service

What is MPS

Multi-Process Service (MPS) allows multiple tasks on a node to share a GPU.

On Hera, for example, the FGA nodes have 20 cores and only 8 GPUs per node. Under normal circumstances one could run just 8 MPI tasks on each node and have each of those tasks exclusively use one GPU.

Sometimes there may not be enough work from one task to keep the GPU busy, in which case it may be beneficial to share the GPU and have more MPI tasks on each node.

The performance benefits of taking this approach are very much application dependent.

How do I use MPS?

The example below describes the simplest use case. (We will update the documentation as we gather more experience.) We consider running an MPI application on just one node, after getting access to an FGA compute node by submitting an interactive batch job to the fge partition.

Assuming you have obtained an interactive compute node as mentioned above:

  • Load the necessary modules. The MPS service is available after the cuda module is loaded:

    $ module load intel/16.1.150 cuda/8.0 mvapich2-gdr/2.2-3-cuda-8.0-intel
    
  • Start the MPS daemon:

    $ setenv CUDA_MPS_LOG_DIRECTORY /tmp/nvidia-log
    $ setenv CUDA_MPS_PIPE_DIRECTORY /tmp/nvidia-pipe
    $ nvidia-cuda-mps-control -d
    
  • Confirm that MPS daemon is running

    $ ps -elf | grep nvidia-cuda-mps-control | grep -v grep
    1 S User.Id  47724      1  0  80   0 -  2666 poll_s 16:56 ?        00:00:00 nvidia-cuda-mps-control -d
    
  • Run some of the MPS commands.

    Please keep in mind that MPS does not have a command prompt, so typically you run the MPS commands as shown below:

    $ echo get_server_list | nvidia-cuda-mps-control
    Server 0 not found
    

    Then, run your application as you normally would. At the end of your session, terminate the daemon by running the command:

    $ echo quit | nvidia-cuda-mps-control
    

Documentation for MPS

For additional details see the Overview

Compiling and Building Codes With The Cray Programming Environment

A custom built version of mvapich2 must be used when compiling and running with the Cray Programming Environment (CrayPE). To run an MPI program using the CrayPE, you must first set up the proper environment. This has been rolled into a single module load command that brings in all required modules:

Note

Because of a compatibility issue between regular Modules and Lmod (which Hera uses), the CrayPE modules don’t work with tcsh. Hence all of these examples are shown with bash.

$ bash -l
$ module purge
$ module load craype-hera
$ module list

Currently Loaded Modules:
  1) craype-haswell                    7) cray-libsci/17.11.1
  2) craype-network-infiniband         8) PrgEnv-cray/1.0.2
  3) craype/2.5.13                     9) cray-libsci_acc/17.03.1
  4) cce/8.6.4                        10) craype-accel-nvidia60
  5) cudatoolkit/8.0.44               11) perftools-base/6.5.2
  6) mvapich2_cce/2.2rc1.0.3_noslurm  12) craype-hera/8.6.4

Then compile the program. The compiler drivers are:

cc:

C code

ftn:

Fortran

CC:

C++ code

Note

Do not use the “mpi” drivers associated with the mvapich2 library.

Note

The sample programs and scripts used in the examples below can be found in the directory on Hera: /apps/local/examples/craype/XTHI_SIMPLE

$ cc -homp -o xthi xthi.c  # (-homp is default, so not explicitly needed)

To run the executable, secure the appropriate compute node(s) and set the environment:

$ module load craype-hera
$ export LD_LIBRARY_PATH=${CRAY_LD_LIBRARY_PATH}:${LD_LIBRARY_PATH}
$ cc -homp -o xthi xthi.c
$ mpirun -env OMP_NUM_THREADS 1 -n 4 -machinefile $PBS_NODEFILE ./xthi
Warning: Process to core binding is enabled and OMP_NUM_THREADS is set to non-zero (1) value
If your program has OpenMP sections, this can cause over-subscription of cores and consequently poor performance
To avoid this, please re-run your application after setting MV2_ENABLE_AFFINITY=0
Use MV2_USE_THREAD_WARNING=0 to suppress this message
Hello from rank 0, thread 0, on sg001. (core affinity = 20)
Hello from rank 1, thread 0, on sg001. (core affinity = 21)
Hello from rank 2, thread 0, on sg002. (core affinity = 20)
Hello from rank 3, thread 0, on sg002. (core affinity = 21)

All MPI ranks are running on unique cores in the fge queue. Alternatively, if you want to place ranks on specific cores, you can use the MV2_CPU_MAPPING environment variable:

$ mpirun -env OMP_NUM_THREADS 1 -env MV2_CPU_MAPPING=0:10 -n 2 -machinefile $PBS_NODEFILE ./xthi
Warning: Process to core binding is enabled and OMP_NUM_THREADS is set to non-zero (1) value
If your program has OpenMP sections, this can cause over-subscription of cores and consequently poor performance
To avoid this, please re-run your application after setting MV2_ENABLE_AFFINITY=0
Use MV2_USE_THREAD_WARNING=0 to suppress this message
Hello from rank 1, thread 0, on sg001. (core affinity = 10)
Hello from rank 0, thread 0, on sg001. (core affinity = 0)

Here, each rank is running on its own socket. If this strategy is used with OpenMP threaded codes, all threads will be placed on the same core as the master thread, leading to contention and reduced performance.

$ mpirun -env OMP_NUM_THREADS 4 -n 1 -machinefile $PBS_NODEFILE ./xthi
Warning: Process to core binding is enabled and OMP_NUM_THREADS is set to non-zero (4) value
If your program has OpenMP sections, this can cause over-subscription of cores and consequently poor performance
To avoid this, please re-run your application after setting MV2_ENABLE_AFFINITY=0
Use MV2_USE_THREAD_WARNING=0 to suppress this message
WARNING: Requested total thread count and/or thread affinity may result in
oversubscription of available CPU resources!  Performance may be degraded.
Set OMP_WAIT_POLICY=PASSIVE to reduce resource consumption of idle threads.
Set CRAY_OMP_CHECK_AFFINITY=TRUE to print detailed thread-affinity messages.
Hello from rank 0, thread 2, on sg001. (core affinity = 0)
Hello from rank 0, thread 0, on sg001. (core affinity = 0)
Hello from rank 0, thread 3, on sg001. (core affinity = 0)
Hello from rank 0, thread 1, on sg001. (core affinity = 0)

Each thread is placed on core 0 with the master thread. To avoid contention, the application must be launched with numactl, as in this script (r4.sh in the example below):

#!/bin/bash
HPL_DIR=`pwd`
CPU_CORES_PER_RANK=4
export OMP_NUM_THREADS=$CPU_CORES_PER_RANK
export MV2_ENABLE_AFFINITY=0
export OMP_WAIT_POLICY=PASSIVE
APP=./xthi #-craype-silene #./xthi_test
let lrank=$PMI_RANK%8
echo "PMI_RANK: $PMI_RANK"
echo "lrank:    $lrank"
export I_MPI_EAGER_THRESHOLD=524288
export OMP_WAIT_POLICY=active
export OMP_SCHEDULE=dynamic,1
export RANKS_PER_SOCKET=1
export CUDA_COPY_SPLIT_THRESHOLD_MB=1
export ICHUNK_SIZE=768
export CHUNK_SIZE=2688
export TRSM_CUTOFF=9990000
export TEST_SYSTEM_PARAMS=1
case ${lrank} in
[0])
#  export CUDA_VISIBLE_DEVICES=0
#  numactl -a -l --physcpubind=2,6 $APP
  numactl -a -l --physcpubind=0,1,2,3 $APP
  ;;
[1])
#  export CUDA_VISIBLE_DEVICES=1
#  numactl -a -l --physcpubind=3,7 $APP
  numactl -a -l --physcpubind=10,11,12,13 $APP
  ;;
[2])
#  export CUDA_VISIBLE_DEVICES=2
#  numactl -a -l --physcpubind=4,8 $APP
  numactl -a -l --physcpubind=2 $APP
  ;;
[3])
#  export CUDA_VISIBLE_DEVICES=3
#  numactl -a -l --physcpubind=5,9 $APP
  numactl -a -l --physcpubind=3 $APP
  ;;
[4])
#  export CUDA_VISIBLE_DEVICES=4
#  numactl -a -l --physcpubind=12,16 $APP
  numactl -a -l --physcpubind=4 $APP
  ;;
[5])
#  export CUDA_VISIBLE_DEVICES=5
#  numactl -a -l --physcpubind=13,17 $APP
  numactl -a -l --physcpubind=5 $APP
  ;;
[6])
#  export CUDA_VISIBLE_DEVICES=6
#  numactl -a -l --physcpubind=14,18 $APP
  numactl -a -l --physcpubind=6 $APP
  ;;
[7])
#  export CUDA_VISIBLE_DEVICES=7
#  numactl -a -l --physcpubind=15,19 $APP
  numactl -a -l --physcpubind=7 $APP
  ;;
esac

In this case, we have a single node with two MPI ranks running, each spawning 4 OpenMP threads. The threads are placed such that each set is running on its own socket.

$ mpirun -env OMP_NUM_THREADS 4 -n 2 -machinefile $PBS_NODEFILE ./r4.sh
PMI_RANK: 1
lrank:    1
PMI_RANK: 0
lrank:    0
Hello from rank 0, thread 0, on sg001. (core affinity = 0-3)
Hello from rank 0, thread 3, on sg001. (core affinity = 0-3)
Hello from rank 0, thread 2, on sg001. (core affinity = 0-3)
Hello from rank 0, thread 1, on sg001. (core affinity = 0-3)
Hello from rank 1, thread 0, on sg001. (core affinity = 10-13)
Hello from rank 1, thread 1, on sg001. (core affinity = 10-13)
Hello from rank 1, thread 2, on sg001. (core affinity = 10-13)
Hello from rank 1, thread 3, on sg001. (core affinity = 10-13)

Using this as a template, it is easy to place ranks and threads in many different ways. This example only uses the lrank=0,1 case branches, but the user is encouraged to experiment with other placement strategies.

Some helpful web resources

Getting Help

As with any Hera issue, open a help request.