Performance Tips
Compiler Information and Options
The manual pages for the different compiler suites are available:
- GCC
  - Fortran
    man gfortran
  - C/C++
    man gcc
- Intel
  - Fortran
    man ifort
  - C/C++
    man icc
Useful compiler options
Whilst different codes will benefit from compiler optimisations in different ways, we suggest the following compiler options as a starting point for reasonable performance:
- Intel
-O2
- GNU
-O2 -ftree-vectorize -funroll-loops -ffast-math
To target the specific hardware on CSD3 use the following options:
Partition | Intel | GCC |
---|---|---|
cclake | -xCASCADELAKE | -march=cascadelake |
icelake | -xICELAKE-SERVER | -march=icelake-server |
sapphire | -xSAPPHIRERAPIDS | -march=sapphirerapids |
ampere | -march=znver3 |
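
As a rough illustration of how the table above might be used in a build script, the sketch below maps a partition name to its architecture flag and appends it to the suggested baseline options. The helper name `arch_flag` and the source file `my_app.c` are hypothetical and are not provided by CSD3. The ampere row is left out of the sketch; add it to match the compiler you use.

```shell
# Hypothetical helper: print the architecture flag for a partition.
# Usage: arch_flag <partition> [gcc|intel]   (compiler defaults to gcc)
arch_flag() {
    local partition="$1" compiler="${2:-gcc}"
    case "${compiler}:${partition}" in
        intel:cclake)   echo "-xCASCADELAKE" ;;
        intel:icelake)  echo "-xICELAKE-SERVER" ;;
        intel:sapphire) echo "-xSAPPHIRERAPIDS" ;;
        gcc:cclake)     echo "-march=cascadelake" ;;
        gcc:icelake)    echo "-march=icelake-server" ;;
        gcc:sapphire)   echo "-march=sapphirerapids" ;;
        *)
            echo "arch_flag: no entry for ${compiler} on ${partition}" >&2
            return 1
            ;;
    esac
}

# Combine the suggested GNU baseline options with the icelake target flag.
CFLAGS="-O2 -ftree-vectorize -funroll-loops -ffast-math $(arch_flag icelake gcc)"

# Print the compile command rather than running it (my_app.c is a placeholder).
echo "gcc ${CFLAGS} -o my_app my_app.c"
```

Keeping the mapping in one function means a build script only needs the partition name to change when moving between node types.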
Alternatively, log in to a machine with the same architecture as the one you will be running on and use:
- Intel
-xHost
- GNU
-march=native
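
Because `-march=native` depends on the machine where the compiler runs, it can be worth checking what it resolves to before building. The sketch below uses GCC's `-Q --help=target` query for this; the function name `show_native_march` is a hypothetical helper, and the exact output format may vary between GCC versions.

```shell
# Hypothetical helper: show which -march value "native" resolves to on this host.
# Usage: show_native_march [compiler]   (defaults to gcc)
show_native_march() {
    local cc="${1:-gcc}"
    if ! command -v "$cc" >/dev/null 2>&1; then
        echo "compiler not found: $cc" >&2
        return 1
    fi
    # -Q --help=target lists the target options in effect for the given flags.
    "$cc" -march=native -Q --help=target | grep -- '-march='
}

# Run on the node type you intend to use; ignore failure if gcc is missing.
show_native_march gcc || true
```

If the reported value differs from the partition you plan to run on, use the explicit flag for that partition instead of `-march=native`.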
When you have an application that you are confident works correctly and performs reasonably, you may wish to investigate some more aggressive compiler optimisations. Below is a list of further optimisations that you can try on your application (note: these optimisations may result in incorrect output for programs that depend on an exact implementation of IEEE or ISO rules/specifications for math functions):
- Intel
-fast
- GNU
-Ofast -funroll-loops
Vectorisation, which is one of the important compiler optimisations for any modern Intel hardware, is enabled by default as follows:
- Intel
  - At -O2 and above
- GNU
  - At -O3 and above, or when using -ftree-vectorize
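
To check whether a given loop was actually vectorised, GCC can print a report with `-fopt-info-vec-optimized`. The sketch below writes a small test kernel to a temporary directory and compiles it with that option; the file name `saxpy.c` and the kernel itself are illustrative assumptions, and the wording of the report depends on the GCC version.

```shell
# Create a scratch directory holding a small, easily vectorised kernel.
workdir="$(mktemp -d)"
cat > "$workdir/saxpy.c" <<'EOF'
/* restrict tells the compiler that x and y do not overlap. */
void saxpy(int n, float a, const float *restrict x, float *restrict y)
{
    for (int i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}
EOF

# Compile only (-c) and ask GCC to report the loops it vectorised.
if command -v gcc >/dev/null 2>&1; then
    gcc -O3 -fopt-info-vec-optimized -c "$workdir/saxpy.c" -o "$workdir/saxpy.o"
else
    echo "gcc not found; skipping compile" >&2
fi
```

The same option can be added to the flags of a real build to see which of the application's loops are reported.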
To promote integer and real variables from four-byte to eight-byte precision in Fortran codes, the following compiler flags can be used:
- Intel
-real-size 64 -integer-size 64 -xAVX
(Sometimes the Intel compiler incorrectly generates AVX2 instructions if the -real-size 64 or -r8 options are set. Using the -xAVX option prevents this.)
- GNU
-freal-4-real-8 -finteger-4-integer-8
GPU Direct (GDR)
GDR is one of the key technologies for getting the best performance out of the GPU system. It allows GPUs to communicate via MPI without waiting for the host CPU. It is implemented in both OpenMPI (the default) and MVAPICH2.
The functionality and performance of GDR can be tested using the OSU micro-benchmarks suite. For example, using OpenMPI:
module purge
module load rhel7/default-gpu
OSU_HOME=$HOME/osu-micro-benchmarks-5.4.2
unset OMP_NUM_THREADS
echo WITH GDR
mpirun -np 2 --map-by ppr:1:node \
--mca mtl ^mxm --mca pml ^yalla --mca btl self,openib --mca btl_openib_want_cuda_gdr 1 \
$OSU_HOME/get_local_rank $OSU_HOME/mpi/pt2pt/osu_latency -f D D
and using MVAPICH2:
module purge
module load rhel7/default-gpu
module unload gcc-5.4.0-gcc-4.8.5-fis24gg openmpi-1.10.7-gcc-5.4.0-jdc7f4f
module load mvapich2-GDR/gnu/2.3a_cuda-8.0
OSU_HOME="${MPI_HOME}/libexec/osu-micro-benchmarks"
unset OMP_NUM_THREADS
export MV2_ENABLE_AFFINITY=1
export MV2_USE_CUDA=1
export MV2_USE_GPUDIRECT=1
mpirun -np 2 -ppn 1 -genvall $OSU_HOME/get_local_rank $OSU_HOME/mpi/pt2pt/osu_latency D D
MVAPICH2 includes a copy of the benchmark suite in its distribution, whereas for OpenMPI a custom version has been downloaded and built, which is why OSU_HOME points to different locations in the two examples.