Running jobs on DAWN
====================

Software
--------

At the time of writing, we recommend logging in initially to the CSD3 login-dawn nodes (login-dawn.hpc.cam.ac.uk). To ensure your environment is clean and set up correctly for Dawn, please purge your modules and load the base Dawn environment:

.. code-block:: bash

    module purge
    module load rhel9/default-dawn

The PVC nodes run `Rocky Linux 9`_, which is a rebuild of Red Hat Enterprise Linux 9 (RHEL9). The Sapphire Rapids CPUs on these nodes are also more modern and support newer instructions than those on most other CSD3 partitions.

We provide a separate set of modules specifically for Dawn nodes and, in general, do not support running software built for other CSD3 partitions on them. You are therefore strongly recommended to rebuild your software on the Dawn nodes rather than try to run binaries previously compiled on CSD3. Be aware that the software environment for Dawn is optimised for its hardware, so the resulting binaries may fail to run on other CSD3 compute nodes and on all login nodes.

If you wish to recompile or test against this new environment, we recommend requesting an interactive node with::

    sintr -t 01:00:00 -A YOURPROJECT-DAWN-GPU -p pvc9 -N 1 -c 24 --gres=gpu:1

The nodes are named according to the scheme *pvc-s-[1-256]*.

.. _`Rocky Linux 9`: https://rockylinux.org/

Slurm partition
---------------

The PVC (pvc-s) nodes are in a new **pvc9** Slurm partition. Dawn Slurm projects follow the CSD3 naming convention for GPU projects and contain units of GPU hours; Dawn project names follow the pattern NAME-DAWN-GPU.

The pvc-s nodes have **96 cpus** (1 cpu = 1 core) and 1024 GiB of RAM. This means that Slurm will allocate **24 cpus per GPU**.

Recommendations for running on Dawn
-----------------------------------

The resource limits are currently set to a maximum of 64 GPUs per user, with a maximum wallclock time of 36 hours per job.
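
To make these caps concrete, the following is a minimal sketch of a pre-submission sanity check. It is a hypothetical helper written for this page, not part of the Dawn tooling, and it simplifies the wallclock limit to whole hours:

.. code-block:: bash

    #!/bin/bash
    # Hypothetical pre-submission check against the current Dawn limits:
    # at most 64 GPUs per user, at most 36 hours wallclock per job.
    # (Real jobs specify walltime as HH:MM:SS; hours are used here for brevity.)
    MAX_GPUS=64
    MAX_HOURS=36

    check_request() {
        local gpus=$1 hours=$2
        if (( gpus > MAX_GPUS )); then
            echo "error: $gpus GPUs exceeds the ${MAX_GPUS}-GPU limit" >&2
            return 1
        fi
        if (( hours > MAX_HOURS )); then
            echo "error: ${hours}h exceeds the ${MAX_HOURS}-hour limit" >&2
            return 1
        fi
        echo "ok: $gpus GPUs for ${hours}h"
    }

    check_request 16 12    # within limits

A wrapper like this can be run before calling sbatch, so that requests above the caps just described fail fast on the login node rather than at submission time.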
These limits should be regarded as provisional and may be revised.

Default submission script for Dawn
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

A template submission script will be provided soon. To submit a job to the Dawn PVC partition, your batch script should look similar to the following example:

.. code-block:: bash

    #!/bin/bash -l
    #SBATCH --job-name=my-batch-job
    #SBATCH --account=YOURPROJECT-DAWN-GPU
    #SBATCH --partition=pvc9   # Dawn PVC partition
    #SBATCH -n 4               # Number of tasks (usually number of MPI ranks)
    #SBATCH -c 24              # Number of cores per task
    #SBATCH --gres=gpu:1       # Number of requested GPUs per node

    . /etc/profile.d/modules.sh
    module purge
    module load rhel9/default-dawn

    # Set up environment below, for example by loading more modules

    srun

Jobs requiring N GPUs where N < 4
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Although there are 4 GPUs in each node, it is possible to request fewer than this. For example, to request 3 GPUs use::

    #SBATCH --nodes=1
    #SBATCH --gres=gpu:3
    #SBATCH -p pvc9

Slurm will enforce allocation of a proportional number of CPUs (24) per GPU. Note that if you either do not specify a number of GPUs per node with *--gres*, or request more than one node with fewer than 4 GPUs per node, you will receive an error on submission.

Jobs requiring multiple nodes
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Multi-node jobs need to request 4 GPUs per node, i.e.::

    #SBATCH --gres=gpu:4

Jobs requiring MPI
^^^^^^^^^^^^^^^^^^

We currently recommend using the `Intel MPI Library `_ provided by the oneAPI toolkit:

.. code-block:: bash

    module av intel-oneapi-mpi

To use GPU-aware MPI and allow passing device buffers to MPI calls, set the ``I_MPI_OFFLOAD`` environment variable to ``1`` in your submission script:

.. code-block:: bash

    export I_MPI_OFFLOAD=1

If you are sure that your code only involves buffers of the same type (e.g. only GPU buffers, or only host buffers) in a single MPI operation, you can further optimize MPI communication between GPUs by setting:
.. code-block:: bash

    export I_MPI_OFFLOAD_SYMMETRIC=1

This will disable handling of MPI communication between GPU buffers and host buffers.

Performance considerations for MPI jobs
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

On systems with multiple GPUs and multiple NICs, such as Dawn with 4x HDR NICs and 4x PVC GPUs per node, care should be taken to ensure that each GPU communicates with its closest NIC, to achieve maximum GPU-NIC throughput. Furthermore, each GPU should be assigned to its closest set of CPU cores (NUMA domain). This can be achieved by querying the topology of the machine you are running on, and then instrumenting your MPI and/or run script to ensure correct placement. On Wilkes3, for example, each pair of GPUs shares a NIC, so the NIC local to each pair must be used for all non-peer-to-peer communication.

TODO: Binding script for Dawn

Given the above binding script (assuming it is named run.sh), the corresponding MPI launch command can be modified to::

    mpirun -npernode $mpi_tasks_per_node -np $np --bind-to none ./run.sh $application $options

Note that this approach requires exclusive access to a node.

Multithreading jobs
^^^^^^^^^^^^^^^^^^^

If your code uses multithreading (e.g. host-based OpenMP), you will need to specify the number of threads per process in your Slurm batch script using the ``--cpus-per-task`` (``-c``) parameter. For example, to run a hybrid MPI-OpenMP application using 24 processes and 4 threads per task:

.. code-block:: bash

    #SBATCH -n 24   # or --ntasks
    #SBATCH -c 4    # or --cpus-per-task

If you do *not* specify the ``--cpus-per-task`` parameter, Slurm will pin the threads of each process to the same core, reducing performance.

Recommended Compilers
^^^^^^^^^^^^^^^^^^^^^

We recommend using the `Intel oneAPI `_ compilers for C, C++ and Fortran:

.. code-block:: bash

    module avail intel-oneapi-compilers

These compilers support standard, host-based code as well as SYCL for C++ codes, and OpenMP offload in C, C++ and Fortran.
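
To make the per-language driver choice concrete, here is a small bash helper, hypothetical and written for this page only, that prints a plausible oneAPI compile command for a given source file (SYCL for C++, OpenMP offload for C and Fortran):

.. code-block:: bash

    #!/bin/bash
    # Hypothetical helper: print a oneAPI compile command for a source file,
    # using the 'new' drivers only (icx, icpx, ifx).
    offload_cmd() {
        local src=$1
        case $src in
            *.c)         echo "icx -fiopenmp -fopenmp-targets=spir64 $src" ;;
            *.cpp|*.cxx) echo "icpx -fsycl $src" ;;
            *.f90|*.F90) echo "ifx -fiopenmp -fopenmp-targets=spir64 $src" ;;
            *)           echo "unsupported source file: $src" >&2; return 1 ;;
        esac
    }

    offload_cmd main.c       # prints an icx OpenMP-offload command
    offload_cmd solver.f90   # prints an ifx OpenMP-offload command

This is only a sketch of the convention; the individual flags are explained in the rest of this section.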
Please note that the 'classic' Intel compilers (icc, icpc and ifort) have been deprecated or removed; only the 'new' compilers (icx, icpx and ifx) are supported, and these are the only ones that support the GPUs.

To enable SYCL support:

.. code-block:: bash

    icpx -fsycl

For OpenMP offload (note ``-fiopenmp``, not ``-fopenmp``):

.. code-block:: bash

    # C
    icx -fiopenmp -fopenmp-targets=spir64

    # Fortran
    ifx -fiopenmp -fopenmp-targets=spir64

Both Intel MPI and the oneMKL performance libraries support both the CPUs and the PVC GPUs, and can be found as follows:

.. code-block:: bash

    module av intel-oneapi-mpi
    module av intel-oneapi-mkl

Other recommendations
^^^^^^^^^^^^^^^^^^^^^

Further useful information about running on Intel GPUs can be found in Intel's `oneAPI GPU Optimization Guide `_.