Sapphire Rapids Nodes with High Bandwidth Memory
==================================================

*These new nodes entered general service in October 2024.*

These Sapphire Rapids nodes have `High Bandwidth Memory`_ built in and are
optimised for applications requiring high memory bandwidth.

**These nodes are for the exclusive use of the UKAEA (UK Atomic Energy Authority).**

.. _`High Bandwidth Memory`: https://en.wikipedia.org/wiki/High_Bandwidth_Memory

Slurm partitions
----------------------------

* These nodes are named according to the scheme *cpu-r-hbm-[1-84]*.

* These nodes are split equally between two Slurm partitions, **ukaea-spr-hbm**
  (nodes 1-42) and **ukaea-spr-hbm-flat** (nodes 43-84), corresponding to two
  memory configuration modes, *HBM-only* for the former and *Flat(1LM)* for the
  latter. See the `Intel configuration manual`_ for more details about these
  memory setups.

  .. _`Intel configuration manual`: https://cdrdv2-public.intel.com/787743/354227-intel-xeon-cpu-max-series-configuration-and-tuning-guide-rev3.pdf

  Existing UKAEA-CPU projects will be able to submit jobs to both of these partitions.

* These nodes have **112 cpus** (1 cpu = 1 core) and **9.1 GiB RAM per cpu**,
  for a grand total of 1 TiB RAM per node. The nodes in the **-flat** partition
  have even more memory: **10.3 GiB RAM per cpu**, for a grand total of
  1.15 TiB RAM per node.

* These nodes are interconnected by Mellanox NDR200 Infiniband.

* These nodes are running `Rocky Linux 9`_, which is a rebuild of Red Hat
  Enterprise Linux 9 (RHEL9).

.. _`Rocky Linux 9`: https://rockylinux.org/

Recommendations for running on Sapphire Rapids HBM
----------------------------------------------------

Since the cpu-r-hbm nodes are running Rocky 9, you will want to recompile your
applications if they were built on a different operating system or a different
OS version. You can log into a login node with a compatible OS and software
stack via *login-sapphire.hpc.cam.ac.uk*, which will land you on one of the
*login-s-\** nodes. Alternatively, request an interactive node using *sintr*::

  sintr -t 1:0:0 -N1 -n38 -A YOURPROJECT-CPU -p ukaea-spr-hbm(-flat) -q INTR

to rebuild and re-optimise your applications to get the best from these
specialist nodes. The ``-q INTR`` option used above requests a higher priority
interactive job, but does not allow more than 1 hour of wall time. If you need
a longer interactive session, you will need to remove the ``-q INTR`` option
from your ``sintr`` command line and queue for longer.

The per-job wallclock time limits are 36 hours and 12 hours for SL1/2 and SL3
respectively. The per-job, per-user cpu limits are currently 4256 and 448 cpus
for SL1/2 and SL3 respectively. These limits should be regarded as provisional
and may be revised.

Default submission script for Sapphire Rapids
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

There is no default job submission template dedicated to the Sapphire Rapids
nodes, but you can make one for yourself by tweaking an Icelake template. You
should find a symbolic link to the default Icelake job submission script in
your home directory, called::

  slurm_submit.peta4-icelake

This is set up for non-MPI jobs on Icelake, but can be modified for other
types of job. If you prefer to modify your existing job scripts, please see
the following sections for guidance.
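As an illustration of such a tweak, the sketch below shows a minimal non-MPI
script for the **ukaea-spr-hbm** partition. The job name, time limit and
application command are placeholders to replace with your own; only the
partition name, the account format and the environment module are taken from
this page::

  #!/bin/bash
  # Minimal non-MPI sketch for the ukaea-spr-hbm partition (placeholders throughout)
  #SBATCH -J myjob
  #SBATCH -A YOURPROJECT-CPU
  #SBATCH -p ukaea-spr-hbm
  #SBATCH --nodes=1
  #SBATCH --ntasks=1
  #SBATCH --time=01:00:00

  # Default environment for the Sapphire Rapids HBM nodes
  module purge
  module load rhel9/default-sar

  # Replace with your own application and arguments
  ./your_application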
Jobs not using MPI and using less than 9200 MiB per cpu
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

In this case you should be able to simply specify the relevant Sapphire Rapids
HBM partition via the *-p* sbatch directive, e.g.::

  #SBATCH -p ukaea-spr-hbm

will submit a job able to run on the first nodes available in the
ukaea-spr-hbm partition.

Jobs not using MPI and requiring more than 9200 MiB per cpu
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

In the case of larger memory requirements, it is most efficient to submit
instead to the ukaea-spr-hbm-flat partition, which will allocate 10300 MiB
per cpu::

  #SBATCH -p ukaea-spr-hbm-flat

If this amount of memory per cpu is insufficient, you will need to specify
either the *--mem=* or the *--cpus-per-task=* directive, in addition to the
*-p* directive, in order to make sure you have enough memory at run time. Note
that in this case Slurm will satisfy the memory requirement by allocating (and
charging for) more cpus if necessary. E.g.::

  #SBATCH -p ukaea-spr-hbm-flat
  #SBATCH --nodes=1
  #SBATCH --ntasks=1
  #SBATCH --mem=15000

In the above example we are requesting 15000 MiB of memory per node, but only
one task. Slurm will usually allocate one cpu per task, but here, because it
enforces 10300 MiB per cpu for the ukaea-spr-hbm-flat partition, it will
allocate 2 cpus to the single task in order to satisfy the job requirement of
15000 MiB. Note that this increases the number of cpu core hours consumed by
the job, and hence the charge. Also note that since each cpu receives
10300 MiB by default anyway, the user would lose nothing by requesting
20000 MiB instead of 15000 MiB here.

At this point, the memory setup becomes relevant. While the HBM-only mode in
the **ukaea-spr-hbm** partition uses only HBM memory and does not require
specific optimisation, the *Flat(1LM)* mode uses DDR memory to supplement the
HBM memory (hence the higher memory available per node). This heterogeneous
memory layout requires explicit memory mapping when running an application in
order to achieve the best memory rates and use the nodes optimally. The
`Intel configuration manual`_ suggests using the Linux command ``numactl`` to
perform the mapping, e.g.::

  numactl --preferred <node>

where ``<node>`` can be obtained through the command ``numactl -H``.

Jobs requiring MPI
^^^^^^^^^^^^^^^^^^^^

We currently recommend using *Intel MPI 2021.12.1* on the Sapphire Rapids
nodes. There are other, related changes to the default environment seen by
jobs running on the sapphire nodes. If you wish to recompile or test against
the Sapphire Rapids HBM environment, the best option is to request an
`interactive node`__ and work directly on a node in the relevant partition.

For reference, the default environment on the sapphire HBM nodes is provided
by loading a module as follows::

  module purge
  module load rhel9/default-sar

If you are running your MPI job on the **ukaea-spr-hbm-flat** partition, then
the mapping described in the previous section becomes::

  mpirun -genv I_MPI_HBW_POLICY <policy>

where ``<policy>`` is one of *hbw_bind*, *hbw_preferred* or *hbw_interleave*.
See the `Intel configuration manual`_ or the
`Lenovo press page on HBM implementation`_ for more details.

.. __: `Recommendations for running on Sapphire Rapids HBM`_

.. _`Lenovo press page on HBM implementation`: https://lenovopress.lenovo.com/lp1738-implementing-intel-high-bandwidth-memory#figure-numactl-flat-snc4
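Putting these pieces together, a minimal MPI submission script for the
**ukaea-spr-hbm-flat** partition might look like the sketch below. The job
name, node and task counts (two full 112-cpu nodes here), time limit, choice
of the *hbw_preferred* policy and the application name are illustrative
placeholders; only the partition, the ``rhel9/default-sar`` module and the
``I_MPI_HBW_POLICY`` setting come from the guidance above::

  #!/bin/bash
  # Minimal MPI sketch for the ukaea-spr-hbm-flat partition (placeholders throughout)
  #SBATCH -J my_mpi_job
  #SBATCH -A YOURPROJECT-CPU
  #SBATCH -p ukaea-spr-hbm-flat
  #SBATCH --nodes=2
  #SBATCH --ntasks=224
  #SBATCH --time=12:00:00

  # Default environment for the Sapphire Rapids HBM nodes
  module purge
  module load rhel9/default-sar

  # Prefer HBM for memory allocations on the Flat(1LM) nodes
  mpirun -genv I_MPI_HBW_POLICY hbw_preferred -np $SLURM_NTASKS ./your_mpi_application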