SPDK AIO bdevperf Performance Report: Analyzing Workload on AWS Graviton4
- Rahul Bapat
- Aug 11, 2025 (4 min read)
- Updated: Dec 3, 2025

We conducted SPDK bdevperf tests on an AWS EC2 r8gd.metal-24xl instance,
focusing on single CPU core performance under high I/O load. Our objective was to
demonstrate a CPU-bound workload. Results show low I/O wait and high CPU
utilization, confirming the CPU is the limiting factor. The 2-disk configuration achieved the highest throughput, indicating a CPU saturation point.
1. Performance Results Summary (100-second duration)
Below is a consolidated view of our 100-second bdevperf runs across 1, 2, and 3 local NVMe disks. These figures include throughput, latency, CPU utilization (from mpstat), and Instructions Per Cycle (IPC, from perf stat) for the dedicated CPU core.
2. Test Setup and Environment
Tests were conducted on an AWS EC2 r8gd.metal-24xl instance, a bare-metal
machine.
Processor: AWS Graviton4 (ARM-based).
Local Storage: Three 1900 GiB NVMe SSDs (Instance Store).
bdevperf parameters chosen to keep the dedicated CPU core busy include (a sketch of the resulting invocation follows this list):
SPDK Driver: AIO (Asynchronous I/O), which uses the Linux kernel's native AIO interfaces.
Queue Depth (QD): 384, set high to keep storage busy.
I/O Size (IO_SIZE): 4096 bytes (4 KiB), a typical block size for transactional workloads.
Workload Type: randrw (Mixed Random Read/Write) with 70% Reads / 30% Writes.
Test Duration: 100 seconds per run.
CPU Core Dedication: bdevperf was affinity-set to a single CPU core (core 0) to measure that core's I/O processing capacity.
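For concreteness, the invocation looks roughly like the sketch below. The bdevperf path and the exact flag spellings should be checked against your SPDK build; the values are the parameters listed above.

```python
# Hedged reconstruction of the bdevperf invocation described above.
# The binary path and flag spellings should be verified against your SPDK build;
# the parameter values are the ones listed in this section.

import subprocess

BDEVPERF = "./build/examples/bdevperf"   # path inside an SPDK source tree (assumption)
CONFIG_JSON = "aio_bdevs.json"           # generated AIO bdev config (see Section 3)

cmd = [
    BDEVPERF,
    "--json", CONFIG_JSON,  # SPDK JSON config describing the AIO bdevs
    "-q", "384",            # queue depth per bdev
    "-o", "4096",           # I/O size in bytes (4 KiB)
    "-w", "randrw",         # mixed random read/write workload
    "-M", "70",             # read percentage (70% reads / 30% writes)
    "-t", "100",            # run time in seconds
    "-m", "0x1",            # core mask: pin the reactor to CPU core 0
]

subprocess.run(cmd, check=True)
```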
3. Our Script's Test Methodology
Our custom automation script executes and monitors bdevperf tests as follows:
1. Device Identification & Selection: The script identifies available, unmounted
NVMe block devices, excluding system partitions. We then select devices for testing.
2. SPDK AIO bdev Configuration: For each test (incrementally adding the selected disks), a JSON configuration file is generated that tells SPDK to create AIO block devices backed by the physical NVMe drives (a condensed sketch of this and the following step appears after this list).
3. Performance Execution with Monitoring:
bdevperf runs with the generated configuration.
mpstat Monitoring: mpstat concurrently monitors the dedicated CPU core(s), capturing CPU utilization percentages (User, System, Idle, I/O Wait).
perf stat Monitoring: perf stat wraps bdevperf, targeting the dedicated CPU core(s). It collects hardware performance counter data (instructions, cycles) and directly extracts Instructions Per Cycle (IPC), a measure of CPU efficiency.
All raw outputs are saved to a unique, timestamped directory.
4. Results Aggregation & Summary: After each test, the script parses bdevperf
(IOPS, throughput, latency) and CPU metrics. A summary table is presented,
highlighting the configuration with the highest total CPU utilization.
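The sketch below condenses steps 2 and 3 into a minimal Python form. It is not our production script: device paths, the output layout, and the structure are illustrative, and the SPDK RPC method and tool flags should be verified against the versions installed on the instance.

```python
# Condensed sketch of steps 2-3: generate the AIO bdev config, then run
# bdevperf under perf stat while mpstat samples the dedicated core.
# Device paths, filenames, and the output layout are illustrative.

import json
import subprocess
import time
from pathlib import Path

CORE = "0"
DEVICES = ["/dev/nvme1n1", "/dev/nvme2n1"]   # selected, unmounted instance-store disks (example)
OUTDIR = Path(time.strftime("bdevperf_run_%Y%m%d_%H%M%S"))
OUTDIR.mkdir()

# Step 2: one bdev_aio_create entry per physical NVMe drive.
config = {
    "subsystems": [{
        "subsystem": "bdev",
        "config": [
            {"method": "bdev_aio_create",
             "params": {"name": f"aio{i}", "filename": dev, "block_size": 4096}}
            for i, dev in enumerate(DEVICES)
        ],
    }]
}
config_path = OUTDIR / "aio_bdevs.json"
config_path.write_text(json.dumps(config, indent=2))

# Step 3: mpstat samples the dedicated core once per second in the background...
mpstat_log = open(OUTDIR / "mpstat.log", "w")
mpstat = subprocess.Popen(["mpstat", "-P", CORE, "1"], stdout=mpstat_log)

# ...while perf stat wraps bdevperf and counts instructions/cycles on that core.
perf_cmd = [
    "perf", "stat", "-C", CORE, "-e", "instructions,cycles",
    "-o", str(OUTDIR / "perf.log"), "--",
    "./build/examples/bdevperf", "--json", str(config_path),
    "-q", "384", "-o", "4096", "-w", "randrw", "-M", "70", "-t", "100", "-m", "0x1",
]
with open(OUTDIR / "bdevperf.log", "w") as log:
    subprocess.run(perf_cmd, stdout=log, check=True)

mpstat.terminate()
mpstat_log.close()
```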
4. Key Findings: CPU-Bound Workload Confirmation
Our tests confirm that the workload is CPU-bound on the single dedicated Graviton4 core, not bottlenecked by NVMe storage.

Low I/O Wait: Multi-disk configurations show I/O Wait of only 0.02% to 0.06%, indicating the NVMe storage supplies data faster than the CPU can process it; the single-disk configuration's I/O Wait is 5.49%.
High CPU Utilization: Total CPU Utilization on the dedicated core remained high (84.87% for 1-disk, nearly 99% for 2-disk and 3-disk), confirming the single core as the performance bottleneck.
Dominant System CPU: High System CPU (76-87%) is expected with SPDK AIO bdevs under heavy load, reflecting kernel overhead in processing numerous asynchronous I/O requests.
IPC Values: IPC values (2.21 to 2.53) indicate the Graviton4 core's efficiency. The slight IPC increase with more disks suggests improved pipeline utilization as CPU saturation increases.
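For reference, IPC is simply the ratio of the two counters perf stat collects. A minimal sketch of extracting it from a saved perf stat log, assuming perf's default human-readable output format:

```python
# Minimal sketch: derive IPC from a saved `perf stat` log by dividing the
# instructions counter by the cycles counter. Assumes perf's default
# human-readable output, e.g. "12,345,678,901      instructions".

import re

def parse_ipc(perf_log_text: str) -> float:
    counters = {}
    for event in ("instructions", "cycles"):
        m = re.search(r"([\d,]+)\s+" + event, perf_log_text)
        if m:
            counters[event] = int(m.group(1).replace(",", ""))
    # Raises KeyError if either counter is missing from the log.
    return counters["instructions"] / counters["cycles"]
```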
5. Performance Dynamics: 2 Disks vs. 3 Disks - Optimal Point Identification
The comparison between the 2-disk and 3-disk scenarios shows where the CPU saturates:
2 Disks: Optimal Throughput: This configuration achieved the highest throughput (658.59 kIOPS / 20.09 Gbps), with nearly 99% total CPU utilization and minimal I/O Wait. Two NVMe devices supply as much I/O as a single Graviton4 core can handle efficiently, maximizing throughput without excessive contention.
3 Disks: Beyond Saturation: Adding a third disk kept CPU utilization high (98.93%) and I/O Wait low (0.02%), yet total throughput decreased slightly (638.92 kIOPS / 19.50 Gbps) and average latency rose sharply to 1802.86 μs (from 1166.00 μs). Beyond CPU saturation, additional I/O sources only add contention and queueing, raising latency without any throughput gain.
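This behavior is consistent with Little's Law: assuming the queue depth of 384 is maintained per bdev, mean latency is just total outstanding I/Os divided by IOPS. A quick check against the reported figures:

```python
# Sanity check with Little's Law: mean latency ≈ outstanding I/Os / IOPS.
# Assumes the queue depth of 384 applies per bdev, so total
# outstanding I/Os = 384 * number_of_disks.

for disks, kiops, reported_us in [(2, 658.59, 1166.00), (3, 638.92, 1802.86)]:
    outstanding = 384 * disks
    predicted_us = outstanding / (kiops * 1000) * 1e6
    print(f"{disks} disks: predicted {predicted_us:.0f} us vs reported {reported_us:.2f} us")

# Output:
# 2 disks: predicted 1166 us vs reported 1166.00 us
# 3 disks: predicted 1803 us vs reported 1802.86 us
```

The near-exact match suggests the third disk's extra requests mostly sit in queue: the core completes roughly the same number of I/Os per second either way, so the additional outstanding I/Os translate directly into added latency.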
6. Limitations and Future Work: The Role of SPDK Drivers (VFIO/UIO)
Our current methodology utilizes the SPDK AIO bdev driver, which passes all I/O
through the Linux kernel's I/O stack. This incurs kernel overhead, contributing to our
observed high System CPU utilization.
SPDK also offers VFIO (Virtual Function I/O) and UIO (Userspace I/O) based drivers that map NVMe devices directly into user space for zero-copy access, bypassing the kernel I/O stack entirely. These drivers typically deliver higher IOPS and lower latency.
We were unable to utilize VFIO or UIO drivers in this test series due to setup
constraints. Using these drivers could yield higher performance (more User CPU, less
System CPU), further pushing the single Graviton4 core's capabilities.
Future Work: Investigating SPDK performance with VFIO or UIO drivers to fully assess the r8gd.metal-24xl instance's potential by minimizing kernel involvement.
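For reference, moving from AIO to SPDK's userspace NVMe driver is largely a configuration change once the devices are rebound to vfio-pci or uio_pci_generic (typically via SPDK's scripts/setup.sh). A hypothetical sketch of what the bdev section might look like, with a placeholder PCI address:

```python
# Hypothetical bdev config for SPDK's userspace NVMe driver (future work).
# The PCI address is a placeholder; the device must first be unbound from
# the kernel nvme driver and bound to vfio-pci or uio_pci_generic.

nvme_config = {
    "subsystems": [{
        "subsystem": "bdev",
        "config": [{
            "method": "bdev_nvme_attach_controller",
            "params": {
                "name": "Nvme0",
                "trtype": "PCIe",
                "traddr": "0000:00:1e.0",   # example PCI address, replace with the real one
            },
        }],
    }]
}
```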
Conclusion
Our experiments confirm the CPU-bound nature of the SPDK AIO bdevperf workload
on a single Graviton4 core. The r8gd.metal-24xl instance's local NVMe storage is
sufficient to saturate a single CPU core with high-volume, small-block random I/O. The
2-disk configuration represents the optimal point for throughput before latency
increases. Future tests with user-space drivers like VFIO or UIO could demonstrate
even higher performance.




