CPU-Centric HPC Benchmarking with miniFE and GROMACS
- Rahul Bapat
- Jun 16
- 5 min read
Benchmarks are vital for evaluating High-Performance Computing (HPC) system performance, guiding hardware choices, and optimizing software. This whitepaper focuses on understanding and overcoming bottlenecks in HPC benchmarks for CPU environments, with particular attention to ARM/AArch64 architectures, using miniFE and GROMACS as examples.

1. Introduction to miniFE and GROMACS Benchmarks
1.1. miniFE: A Finite Element Mini-Application
miniFE, part of the Mantevo suite, simulates implicit finite element applications. It solves sparse linear systems, with its core kernels focused on element-operator computation, assembly, sparse matrix-vector products (SpMV), and basic vector operations. It's excellent for benchmarking systems handling sparse linear algebra and iterative solvers.
To run miniFE, you typically compile it with an MPI-enabled compiler. Execution involves specifying problem dimensions and MPI processes.
# Example for a single node with 16 MPI tasks
srun -N 1 -n 16 miniFE.x -nx 264 -ny 256 -nz 256
# Example for a multi-node run (adjust N and n)
srun -N 4 -n 64 miniFE.x -nx 528 -ny 512 -nz 512
Note: srun is the launcher for SLURM-managed clusters; use mpirun or an equivalent launcher on other systems.
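On clusters without SLURM, a roughly equivalent single-node launch might look like the sketch below (the mpirun syntax and binary path depend on your MPI stack and build; the process count and problem size are placeholders):
# Sketch: single-node run without SLURM (counts are illustrative)
mpirun -np 16 ./miniFE.x -nx 256 -ny 256 -nz 256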
1.2. GROMACS: Molecular Dynamics Simulation Software
GROMACS (GROningen MAchine for Chemical Simulations) is a highly optimized open-source software for molecular dynamics (MD) simulations. It models atomic and molecular movements, particularly for biochemical systems, and is efficient in calculating non-bonded interactions.
A typical GROMACS workflow prepares input files, then runs the simulation.
# Step 1: Prepare the run input file (.tpr)
gmx grompp -f pme.mdp -c conf.gro -p topol.top -o topol.tpr
# Step 2: Run the molecular dynamics simulation
mpirun -np 4 gmx_mpi mdrun -s topol.tpr -ntomp 4
# To run a specific benchmark system (e.g., 'benchPEP-h')
mpirun -np <num_MPI_ranks> gmx_mpi mdrun -s benchPEP-h.tpr -ntomp <num_OMP_threads>
Note: Tune the number of MPI ranks (-np) and OpenMP threads per rank (-ntomp) to your hardware.
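As one illustration of this MPI/OpenMP balance, the sketch below assumes a 64-core node split into 8 ranks of 8 threads each; the rank and thread counts, the benchPEP-h input, and the step count are placeholders to adapt to your system:
# Sketch: hybrid layout of 8 MPI ranks x 8 OpenMP threads on a 64-core node
export OMP_NUM_THREADS=8
mpirun -np 8 gmx_mpi mdrun -s benchPEP-h.tpr -ntomp 8 -pin on -resethway -nsteps 20000
Here -pin on lets mdrun pin its own threads, and -resethway resets the performance counters halfway through the run so that startup and initial load-balancing effects are excluded from the reported performance.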
2. Interpreting Performance Output (Benchmarking POV)
Understanding benchmark output is crucial for evaluating HPC system throughput and efficiency.
2.1. miniFE Performance Metrics
miniFE outputs performance data, primarily focused on:
Total CG Mflops (millions of floating-point operations per second for the Conjugate Gradient solve): the main Figure of Merit (FOM). Higher values indicate better performance and reflect the system's efficiency in sparse linear algebra, which is often limited by memory bandwidth and floating-point throughput.
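miniFE writes its run summary to a YAML file, typically named after the problem size and process count; a quick way to pull out the FOM is sketched below (treat the filename glob as a placeholder and adjust it to the file your run produced):
# Sketch: extract the figure of merit from the miniFE YAML summary
grep "Total CG Mflops" miniFE.*.yaml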
2.2. GROMACS Performance Metrics
GROMACS provides detailed output, with the key metric being:
ns/day (nanoseconds per day): The standard performance metric for GROMACS. It shows how many nanoseconds of simulated time can be computed per real-world day. A higher ns/day means faster simulation. This metric is ideal for comparing different CPU architectures or configurations.
Other useful outputs include Total Wall Time and a breakdown of time spent in different force calculations, which helps pinpoint specific bottlenecks.
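Both figures appear near the end of the mdrun log; a minimal way to grab the ns/day (and hour/ns) values is sketched below (md.log is the default log name, changeable with mdrun's -g option):
# Sketch: pull the ns/day and hour/ns figures from the mdrun log
grep "Performance:" md.log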
3. Bottlenecks in Running HPC Benchmarks
Achieving peak HPC performance requires identifying and mitigating bottlenecks that limit system throughput.
3.1. miniFE Specific Bottlenecks
miniFE is particularly sensitive to:
Memory Bandwidth: The sparse matrix-vector product (SpMV) is highly memory-bandwidth bound due to irregular memory access patterns.
Cache Misses: Irregular accesses lead to frequent cache misses, increasing data retrieval latency (a perf sketch for checking both effects follows this list).
Inter-node Communication (for large problems): For distributed problems, communication during assembly and the Conjugate Gradient solver can be limited by network latency and bandwidth.
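A quick way to check the memory-bandwidth and cache-miss behaviour called out above is to wrap a single-node run with perf, as in the sketch below (the events shown are the generic perf aliases; the counters actually available differ per CPU, so consult perf list on your system):
# Sketch: count cache traffic for a single-node miniFE run
# (perf stat also counts child processes, so the local MPI ranks are included)
perf stat -e cache-references,cache-misses \
    mpirun -np 16 ./miniFE.x -nx 264 -ny 256 -nz 256
A high miss ratio alongside low Mflops is consistent with the SpMV kernel being starved by memory rather than compute.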
3.2. GROMACS Specific Bottlenecks
For GROMACS, key bottlenecks include:
CPU Core Performance & Threading: The number of cores and their individual performance (Instructions Per Cycle (IPC), clock speed) directly impact ns/day. Optimal balance between MPI ranks and OpenMP threads per rank is crucial.
Memory Bandwidth: The CPU needs to access large datasets frequently for force calculations.
SIMD Vectorization: GROMACS heavily relies on CPU SIMD instructions (e.g., NEON). If the CPU architecture or compiler doesn't fully exploit these, performance will suffer.
Cache Utilization: Efficient cache usage is critical for the main simulation loop.
Inter-node Communication: For large systems simulated across multiple nodes, MPI communication for domain decomposition and force summation can be a significant bottleneck, even with fast interconnects.
NUMA Effects: Proper process and memory binding is crucial on multi-socket systems to minimize cross-socket memory access latency (a binding sketch follows this list).
Load Imbalance: Uneven workload distribution between particle-particle (PP) and particle-mesh Ewald (PME) ranks leaves compute units idle.
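For the NUMA point above, a binding sketch for a dual-socket node is shown below (Open MPI mapping options; the rank/thread counts and the benchPEP-h input are placeholders):
# Sketch: one rank per NUMA domain on a 2-socket, 32-core node
export OMP_NUM_THREADS=16
mpirun -np 2 --map-by numa --bind-to numa \
    gmx_mpi mdrun -s benchPEP-h.tpr -ntomp 16 -pin on
Keeping each rank's memory local to its socket avoids the cross-socket latency penalty; numactl --hardware can be used to confirm the domain layout.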

3.3. Dynamic Monitoring for Bottleneck Analysis (Frequency, Power, Temperature)
Beyond static analysis, dynamic monitoring of CPU frequency, power consumption, and temperature during benchmark execution provides invaluable insights for root-causing performance bottlenecks. This data, when mapped over the run duration, can reveal transient issues that logs alone might miss.
Application-Specific Context:
For miniFE, if memory bandwidth is the primary bottleneck, the CPU might not be fully utilized, leading to lower-than-expected power consumption and temperatures, even if the frequency remains high. Conversely, if the SpMV operations push the CPU's compute capabilities, sustained high power and temperature could be observed. Any sudden dips in Mflops alongside frequency drops would directly point to thermal or power throttling.

For GROMACS, which can be highly compute-intensive, sustained high power consumption and temperatures are common. Analyzing frequency, power, and temperature trends can reveal if the ns/day performance is being limited by the CPU's ability to maintain its turbo frequencies due to thermal constraints or if it's hitting a configured power envelope. Discrepancies between expected maximum performance and observed ns/day can often be explained by these dynamic system responses.

Tools for Monitoring: Various tools can collect this data, including vendor-specific utilities (e.g., Intel's pcm, AMD's uProf), Linux tools (perf, turbostat, sensors), or IPMI/BMC interfaces for server-level metrics. Correlating these dynamic metrics with the benchmark's reported performance can significantly aid in precise bottleneck identification and system optimization.
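Where vendor tools are not available, even a simple sysfs-based sampling loop gives a usable timeline; the sketch below logs one frequency and one temperature reading per second (the sysfs path and the sensors output vary between platforms, so treat both as placeholders):
# Sketch: background telemetry loop to run alongside the benchmark
while true; do
    ts=$(date +%s)
    freq=$(cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq)
    temp=$(sensors 2>/dev/null | grep -m1 -i temp)
    echo "$ts $freq $temp"
    sleep 1
done >> telemetry.log &
Start the loop, launch the benchmark, then stop the loop and correlate the timestamps in telemetry.log with the benchmark's reported metrics.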
Conclusion
Effective HPC benchmarking goes beyond simply running an application and reporting a single performance number. As demonstrated with miniFE and GROMACS in a CPU-centric environment, a deep understanding of the benchmark's computational characteristics is essential. Identifying whether a workload is memory-bound, compute-bound, or communication-bound is the first step toward optimizing performance. Furthermore, leveraging dynamic monitoring of CPU frequency, power consumption, and temperature provides invaluable diagnostic data. By integrating performance metrics with detailed system telemetry, HPC administrators and researchers can precisely pinpoint bottlenecks, fine-tune system configurations, and ultimately extract the highest possible performance.