Sapphire Rapids Nodes with High Bandwidth Memory

These new nodes entered general service in October 2024.

This set of Sapphire Rapids nodes has High Bandwidth Memory (HBM) built in and is optimised for applications requiring high memory bandwidth.

These nodes are for the exclusive use of the UKAEA (UK Atomic Energy Authority).

Slurm partitions

  • These nodes are named according to the scheme cpu-r-hbm-[1-84].
  • These nodes are split equally between two Slurm partitions, ukaea-spr-hbm (nodes 1-42) and ukaea-spr-hbm-flat (nodes 43-84), corresponding to two memory configuration modes: HBM-only for the former and Flat (1LM) for the latter. See the Intel configuration manual for more details about these memory setups. Existing UKAEA-CPU projects can submit jobs to both of these partitions (see the sinfo example after this list).
  • These nodes have 112 cpus (1 cpu = 1 core) and 9.1 GiB RAM per cpu, for a total of 1 TiB RAM per node. The nodes in the -flat partition have even more memory: 10.3 GiB RAM per cpu, for a total of 1.15 TiB RAM per node.
  • These nodes are interconnected by Mellanox NDR200 Infiniband.
  • These nodes are running Rocky Linux 9, which is a rebuild of Red Hat Enterprise Linux 9 (RHEL9).
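
For reference, the current layout of these two partitions can be inspected directly from Slurm; a minimal example (the columns shown here are chosen via the -o format string, and the exact output depends on the live configuration):

sinfo -p ukaea-spr-hbm,ukaea-spr-hbm-flat -o "%P %D %c %m"

This prints, for each partition, the node count, the cpus per node and the memory per node as configured in Slurm.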

Recommendations for running on Sapphire Rapids HBM

Since the cpu-r-hbm nodes are running Rocky Linux 9, you will want to recompile your applications if they were built on a different operating system or a different OS version.

A login node with a compatible OS and software stack is available at login-sapphire.hpc.cam.ac.uk; logging in there will land you on one of the login-s-* nodes.
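
For example (replace <username> with your own cluster username):

ssh <username>@login-sapphire.hpc.cam.ac.uk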

Alternatively, request an interactive node using sintr:

sintr -t 1:0:0 -N1 -n38 -A YOURPROJECT-CPU -p ukaea-spr-hbm(-flat) -q INTR

to rebuild and re-optimise your applications to get the best from these specialist nodes. The -q INTR option used above requests a higher-priority interactive job, but does not allow more than 1 hour of wall time. If you need a longer interactive session, remove the -q INTR option from your sintr command line and expect to queue for longer.
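
For instance, a longer four-hour interactive session on the HBM-only partition could be requested as follows (the wall time and core count here are just illustrative values):

sintr -t 4:0:0 -N1 -n38 -A YOURPROJECT-CPU -p ukaea-spr-hbm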

The per-job wallclock time limits are 36 hours and 12 hours for SL1/2 and SL3 respectively.

The per-job, per-user cpu limits are now 4256 and 448 cpus for SL1/2 and SL3 respectively.

These limits should be regarded as provisional and may be revised.
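
For example, an SL1/SL2 job requesting the current maximum wallclock time would include the directive:

#SBATCH --time=36:00:00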

Default submission script for Sapphire Rapids

There is no default job submission template dedicated to Sapphire Rapids nodes, but you might be able to make one for yourself by tweaking an Icelake template.

In your home directory you should find a symbolic link to the default job submission script for the Icelake nodes, called:

slurm_submit.peta4-icelake

This is set up for non-MPI jobs using Icelake, but can be modified for other types of job. If you prefer to modify your existing job scripts, please see the following sections for guidance.

Jobs not using MPI and requiring less than 9200 MiB per cpu

In this case you should be able to simply specify the ukaea-spr-hbm partition via the -p sbatch directive, e.g.:

#SBATCH -p ukaea-spr-hbm

will submit a job able to run on the first nodes available in the ukaea-spr-hbm partition.
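
A complete minimal submission script along these lines might look as follows (the job name, project account, run time and application are placeholders to adapt):

#!/bin/bash
#SBATCH -J myjob
#SBATCH -A YOURPROJECT-CPU
#SBATCH -p ukaea-spr-hbm
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --time=01:00:00

./my_application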

Jobs not using MPI and requiring more than 9200 MiB per cpu

In the case of larger memory requirements, it is most efficient to submit instead to the ukaea-spr-hbm-flat partition, which will allocate 10300 MiB per cpu:

#SBATCH -p ukaea-spr-hbm-flat

If this amount of memory per cpu is insufficient you will need to specify either the --mem= or --cpus-per-task= directive, in addition to the -p directive, in order to make sure you have enough memory at run time. Note that in this case, Slurm will satisfy the memory requirement by allocating (and charging for) more cpus if necessary. E.g.:

#SBATCH -p ukaea-spr-hbm-flat
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --mem=15000

In the above example we are requesting 15000 MiB of memory per node, but only one task. Slurm will usually allocate one cpu per task, but here, because it enforces 10300 MiB per cpu for the partition ukaea-spr-hbm-flat, it will allocate 2 cpus to the single task in order to satisfy the job requirements of 15000 MiB. Note that this increases the number of cpu core hours consumed by the job and hence the charge. Also note that since each cpu receives 10300 MiB by default anyway, the user would lose nothing by requesting 20000 MiB instead of 15000 MiB here.
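
If you want to check how many cpus Slurm actually allocated to such a job after it has run, one way is via sacct (a hedged example; <jobid> is the job's numeric ID, and the available fields depend on your sacct version):

sacct -j <jobid> --format=JobID,AllocCPUS,ReqMem,Elapsed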

At this point, the memory setup becomes relevant. While the HBM-only mode in the ukaea-spr-hbm partition uses only HBM memory and does not require specific optimisation, the “Flat(1LM)” mode uses DDR memory to supplement the HBM memory (hence the higher memory available per node). This heterogeneous memory layout requires explicit memory mapping when running an application to achieve the best memory bandwidth and use the nodes optimally.

The Intel configuration manual suggests using the Linux command numactl to perform the mapping, e.g.:

numactl --preferred <hbm_node> <application>

where <hbm_node> can be identified from the output of the command numactl -H.
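
A minimal sketch, assuming (purely for illustration) that the HBM is exposed as NUMA node 2 on the node you are running on:

numactl -H
numactl --preferred=2 ./my_application

The first command lists the NUMA nodes and their memory sizes so you can identify the HBM node(s); the second asks for allocations to be satisfied preferentially from the assumed HBM node 2.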

Jobs requiring MPI

We currently recommend using Intel MPI 2021.12.1 on the Sapphire Rapids nodes. There are other, related changes to the default environment seen by jobs running on the Sapphire Rapids nodes. If you wish to recompile or test against the Sapphire Rapids HBM environment, the best option is to request an interactive session and work directly on a node of the relevant partition.

For reference, the default environment on the Sapphire Rapids HBM nodes is provided by loading a module as follows:

module purge
module load rhel9/default-sar
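
As a quick sanity check after loading this module, you can confirm which MPI is active (a hedged example; the exact version string will depend on the installed release):

module list
mpirun --version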

If you are running your MPI job on the ukaea-spr-hbm-flat partition, then the mapping described in the previous section becomes:

mpirun -genv I_MPI_HBW_POLICY <mode> <other mpirun options> <application>

where <mode> is one of: hbw_bind, hbw_preferred, hbw_interleave. See the Intel configuration manual or the Lenovo press page on HBM implementation for more details.
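
For example, a run preferring HBM allocations might look like this (the rank count and application name are placeholders):

mpirun -genv I_MPI_HBW_POLICY hbw_preferred -np 112 ./my_application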