ResNet50 Performance Study
- Archana Barve
- Jun 1
using PyTorch on ARM and x86 CPUs
CPU Inference Benchmarking · ARM vs x86 · W8A8 Quantization
Overview
ResNet50 is a convolutional neural network built using bottleneck residual blocks of the form:
1×1 Conv → 3×3 Conv → 1×1 Conv + Skip Connection
Among these layers, the 3×3 convolution layers dominate execution time, making Conv2d the primary hotspot during inference.
Although ResNet50 performs ~4 GFLOPs per inference, it is not compute-heavy enough to fully utilize modern CPUs. Instead, the workload is highly memory-traffic intensive, where cache efficiency, memory bandwidth, and tensor layout become critical performance factors.
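To make the hotspot claim concrete, here is a rough multiply-accumulate (MAC) count for a single bottleneck block. It is only a sketch: the dimensions (56×56 feature map, 256 → 64 → 64 → 256 channels) are the standard first-stage ResNet50 sizes, assumed here rather than taken from the measurements in this study, and MAC counts are a proxy for execution time, not a measurement of it.

```python
def conv_macs(h, w, c_in, c_out, k):
    """MACs for a k×k convolution producing an h×w output map (stride 1)."""
    return h * w * c_in * c_out * k * k

# Assumed first-stage bottleneck dimensions (illustrative only).
H, W = 56, 56
macs = {
    "conv1x1_reduce": conv_macs(H, W, 256, 64, 1),
    "conv3x3": conv_macs(H, W, 64, 64, 3),
    "conv1x1_expand": conv_macs(H, W, 64, 256, 1),
}

total = sum(macs.values())
share_3x3 = macs["conv3x3"] / total

for name, value in macs.items():
    print(f"{name:>15}: {value / 1e6:7.1f} M MACs")
print(f"3x3 share of block MACs: {share_3x3:.1%}")
```

Under these assumptions the 3×3 convolution accounts for a little over half of the block's arithmetic, even though it has the fewest channels.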
Key Study Axes
The following dimensions were studied while analyzing ResNet50 inference performance on ARM and x86 systems using PyTorch:

Latency vs Batch Size
Precision Study (FP32 / FP16 / INT8)
Thread Scaling
Process Scaling
Memory Format Study (channels_last)
1. Latency vs Batch Size
Increasing batch size improves throughput initially, but after a threshold (typically batch size 32–64), throughput gains flatten while latency increases significantly.
This behavior indicates:
cache overflow
increased DRAM traffic
memory bandwidth saturation
Hence, ResNet50 behaves primarily as a memory-bound workload on CPUs.
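A batch-size sweep of this kind can be sketched with a small timing harness. The harness below is an illustration, not the script used for this study: run_batch is a hypothetical callable, and fake_model is a pure-Python stand-in so the example runs without a model. In a real sweep it would be replaced by a ResNet50 forward pass on a tensor of shape [batch_size, 3, 224, 224].

```python
import time

def benchmark(run_batch, batch_size, warmup=2, iters=5):
    """Return (latency in ms per batch, throughput in items/s)."""
    for _ in range(warmup):          # warm caches before timing
        run_batch(batch_size)
    start = time.perf_counter()
    for _ in range(iters):
        run_batch(batch_size)
    latency_ms = (time.perf_counter() - start) / iters * 1e3
    throughput = batch_size / (latency_ms / 1e3)
    return latency_ms, throughput

def fake_model(batch_size):
    # Hypothetical stand-in whose cost grows with batch size.
    sum(i * i for i in range(2000 * batch_size))

results = {bs: benchmark(fake_model, bs) for bs in (1, 2, 4, 8)}
for bs, (lat, thr) in results.items():
    print(f"batch={bs:<3} latency={lat:8.3f} ms  throughput={thr:10.1f} items/s")
```

Plotting throughput against batch size from such a sweep is what exposes the point where gains flatten while latency keeps rising.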
2. Precision Study
FP16 inference reduces tensor size and memory bandwidth requirements. However, on CPUs—especially ARM CPUs—the gains from FP16 are generally modest compared to GPUs.
Observed behavior:

FP16 gives limited latency improvement
Benefits come mainly from reduced memory traffic
INT8/W8A8 inference (8-bit weights and 8-bit activations) provides larger gains
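The memory-traffic argument can be illustrated with simple arithmetic on weight storage. This sketch assumes a round figure of 25.6 million parameters for ResNet50 and ignores activations, quantization scales, and zero points, so the numbers are indicative only.

```python
# Approximate ResNet50 parameter count (assumption: ~25.6 million).
NUM_PARAMS = 25_600_000

BYTES_PER_ELEMENT = {"FP32": 4, "FP16": 2, "INT8": 1}

# Weight footprint in megabytes (1 MB = 1e6 bytes) for each precision.
footprint_mb = {
    name: NUM_PARAMS * size / 1e6 for name, size in BYTES_PER_ELEMENT.items()
}

for name, mb in footprint_mb.items():
    print(f"{name}: {mb:6.1f} MB of weights")
```

Halving or quartering the bytes moved per inference is where most of the benefit comes from on a bandwidth-limited CPU.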
3. Thread Scaling
Increasing thread count improves performance only up to a point. Beyond that:
cache contention increases
memory bandwidth becomes saturated
scaling efficiency drops
This is evident in the W8A8 results later in this study: on the unfused ARM build, parallel efficiency falls from 72.2% at 4 cores to 54.7% at 8 cores.
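A thread sweep can be sketched in PyTorch with torch.set_num_threads. The example below times a single 3×3 convolution rather than the full network to stay small; the layer shape, thread counts, and iteration count are illustrative assumptions, not the configuration measured in this study.

```python
import time
import torch
import torch.nn as nn

def time_conv(num_threads, iters=5):
    """Average latency (ms) of one 3x3 conv using the given intra-op thread count."""
    torch.set_num_threads(num_threads)
    conv = nn.Conv2d(64, 64, kernel_size=3, padding=1).eval()
    x = torch.randn(1, 64, 56, 56)
    with torch.inference_mode():
        conv(x)  # warm-up run, excluded from timing
        start = time.perf_counter()
        for _ in range(iters):
            conv(x)
    return (time.perf_counter() - start) / iters * 1e3

latencies = {n: time_conv(n) for n in (1, 2, 4)}
for n, ms in latencies.items():
    print(f"threads={n}: {ms:.3f} ms")
```

On a real system the sweep would extend to the full core count, with each point repeated enough times to average out noise.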
4. Process Scaling
Running multiple inference processes increases resource contention:
cache thrashing
NUMA pressure
memory bandwidth contention
Typically, a moderate number of processes provides the best latency/throughput tradeoff.
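One common way to limit this contention is to give each process its own disjoint set of cores. The helper below is a hypothetical sketch of that partitioning step only; partition_cores is not part of PyTorch, and the pinning itself is left as a comment because it is platform-specific.

```python
def partition_cores(total_cores, num_processes):
    """Split core ids 0..total_cores-1 into contiguous, near-equal groups."""
    base, extra = divmod(total_cores, num_processes)
    plan, start = [], 0
    for p in range(num_processes):
        size = base + (1 if p < extra else 0)  # spread any remainder
        plan.append(list(range(start, start + size)))
        start += size
    return plan

plan = partition_cores(8, 2)
for worker, cores in enumerate(plan):
    # On Linux, each worker could pin itself with os.sched_setaffinity(0, cores)
    # and call torch.set_num_threads(len(cores)) before running inference.
    print(f"worker {worker}: cores {cores}")
```

With 8 cores and 2 processes this yields two groups of four cores, so the workers do not compete for the same cores.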
5. Memory Format Study (channels_last) — Critical Observation
This area still requires deeper study but already shows strong potential.
PyTorch tensors by default use the NCHW (channels-first) layout:
[N, C, H, W]
In this format:

each channel's H×W plane is stored contiguously
the C values of a single pixel sit H×W elements apart, so the cross-channel reductions in convolution kernels become strided accesses
CPUs frequently "jump around memory"
This leads to:
poor spatial locality
higher cache misses
lower SIMD/vector efficiency
channels_last (NHWC)
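Calling .to(memory_format=torch.channels_last) on the model and its input tensors keeps the logical [N, C, H, W] shape but changes the physical layout to:
[N, H, W, C]
In this layout:
all C channel values of a pixel are contiguous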
memory access becomes more sequential
cache prefetching improves
SIMD/vector units are utilized more efficiently
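The difference between the two layouts comes down to how a logical index (n, c, h, w) maps to a flat memory offset. The sketch below computes that offset for both layouts; the tensor size (64 channels, 56×56) is an illustrative assumption.

```python
def offset_nchw(n, c, h, w, C, H, W):
    """Flat element offset of (n, c, h, w) in a contiguous NCHW tensor."""
    return ((n * C + c) * H + h) * W + w

def offset_nhwc(n, c, h, w, C, H, W):
    """Flat element offset of (n, c, h, w) in a contiguous NHWC tensor."""
    return ((n * H + h) * W + w) * C + c

C, H, W = 64, 56, 56  # assumed feature-map size

# Step between channel 0 and channel 1 of the same pixel.
channel_step_nchw = offset_nchw(0, 1, 0, 0, C, H, W) - offset_nchw(0, 0, 0, 0, C, H, W)
channel_step_nhwc = offset_nhwc(0, 1, 0, 0, C, H, W) - offset_nhwc(0, 0, 0, 0, C, H, W)

print("NCHW channel step:", channel_step_nchw)  # H * W elements
print("NHWC channel step:", channel_step_nhwc)  # adjacent elements
```

A convolution sums over all input channels for every output pixel, so a channel step of 1 element instead of H×W elements lets that reduction read contiguous memory.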
Intuition
NCHW → "jump around memory to compute"
NHWC → "stream through memory smoothly"
CPUs generally prefer streaming memory access patterns.
channels_last is often the single easiest optimization to apply, and it can unlock a 20–40% performance improvement for convolution-heavy workloads like ResNet50.
This is particularly important on ARM systems where workloads are strongly memory-bandwidth bound.
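In PyTorch the switch is a one-line conversion of the model and of each input tensor. The example below uses a tiny Conv + BatchNorm + ReLU stack as a stand-in for ResNet50 so it runs without downloading weights; the layer sizes are arbitrary assumptions.

```python
import torch
import torch.nn as nn

# Small stand-in model (assumption: not the ResNet50 used in this study).
model = nn.Sequential(
    nn.Conv2d(3, 8, kernel_size=3, padding=1),
    nn.BatchNorm2d(8),
    nn.ReLU(),
).eval()
x = torch.randn(2, 3, 32, 32)  # default NCHW layout

# Convert both the weights and the input to channels_last (NHWC in memory).
model = model.to(memory_format=torch.channels_last)
x_cl = x.to(memory_format=torch.channels_last)

with torch.inference_mode():
    y = model(x_cl)

print("shape stays logical NCHW:", tuple(x_cl.shape))
print("NCHW strides:", x.stride())
print("channels_last strides:", x_cl.stride())
```

The logical shape is unchanged; only the strides differ, with the channel dimension becoming the fastest-varying one.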
ResNet50 W8A8 — Latency & Throughput
Workload: Imagenette validation split (imagenette2-320/val), a 10-class subset of ImageNet
Batch Size: 1
Platform | Variant | Cores | Latency (ms) | Throughput (img/s) | Speedup vs 1-core | Parallel Efficiency |
ARM | Unfused | 1 | 584.8 | 1.7 | 1.00× | 100.0% |
ARM | Unfused | 4 | 202.4 | 4.9 | 2.89× | 72.2% |
ARM | Unfused | 8 | 133.7 | 7.5 | 4.37× | 54.7% |
ARM | Fused | 1 | 422.6 | 2.4 | 1.00× | 100.0% |
ARM | Fused | 4 | 115.7 | 8.6 | 3.65× | 91.3% |
ARM | Fused | 8 | 62.7 | 15.9 | 6.74× | 84.3% |
AMD | Unfused | 1 | 19.6 | 51.0 | 1.00× | 100.0% |
AMD | Unfused | 4 | 7.1 | 140.8 | 2.76× | 69.0% |
AMD | Unfused | 8 | 5.9 | 169.5 | 3.32× | 41.5% |
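The last two columns follow from the latencies: speedup is the 1-core latency divided by the n-core latency, and parallel efficiency is that speedup divided by n. The short sketch below recomputes them for the ARM rows of the table; the helper name scaling is introduced here for illustration.

```python
def scaling(latency_1core_ms, latency_ncore_ms, cores):
    """Return (speedup vs 1 core, parallel efficiency in percent)."""
    speedup = latency_1core_ms / latency_ncore_ms
    efficiency = speedup / cores * 100.0
    return speedup, efficiency

# Latencies (ms) copied from the ARM rows of the table above.
arm_unfused = {1: 584.8, 4: 202.4, 8: 133.7}
arm_fused = {1: 422.6, 4: 115.7, 8: 62.7}

for name, rows in (("Unfused", arm_unfused), ("Fused", arm_fused)):
    for cores, latency in rows.items():
        speedup, eff = scaling(rows[1], latency, cores)
        print(f"ARM {name:<8} {cores} cores: {speedup:.2f}x, {eff:.1f}%")
```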
Fusion Benefit — ARM Neoverse-N1
Conv + BatchNorm + ReLU Fusion
Cores | Unfused (ms) | Fused (ms) | Speedup | Ops Tracked |
1 | 584.8 | 422.6 | 1.38× | 160 → 41 |
4 | 202.4 | 115.7 | 1.75× | 160 → 41 |
8 | 133.7 | 62.7 | 2.13× | 160 → 41 |
Fusion significantly reduces:
operator dispatch overhead
intermediate memory movement
synchronization points
The benefits become more pronounced at higher core counts: the fused speedup grows from 1.38× on 1 core to 2.13× on 8 cores.
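The core of Conv + BatchNorm fusion is folding the BatchNorm statistics into the convolution's weights and bias, so that one operator replaces two at inference time. The sketch below does this folding by hand to show the arithmetic; it is not necessarily the fusion pass used in this study, and fold_bn_into_conv is a name introduced here for illustration.

```python
import torch
import torch.nn as nn

def fold_bn_into_conv(conv: nn.Conv2d, bn: nn.BatchNorm2d) -> nn.Conv2d:
    """Return a single Conv2d equivalent to bn(conv(x)) in eval mode."""
    fused = nn.Conv2d(conv.in_channels, conv.out_channels, conv.kernel_size,
                      stride=conv.stride, padding=conv.padding,
                      dilation=conv.dilation, groups=conv.groups, bias=True)
    with torch.no_grad():
        # Per-output-channel scale: gamma / sqrt(running_var + eps)
        scale = bn.weight / torch.sqrt(bn.running_var + bn.eps)
        bias = conv.bias if conv.bias is not None else torch.zeros(conv.out_channels)
        fused.weight.copy_(conv.weight * scale.reshape(-1, 1, 1, 1))
        fused.bias.copy_((bias - bn.running_mean) * scale + bn.bias)
    return fused.eval()

torch.manual_seed(0)
conv = nn.Conv2d(3, 8, kernel_size=3, padding=1, bias=False).eval()
bn = nn.BatchNorm2d(8).eval()
with torch.no_grad():  # give BatchNorm non-trivial statistics for the check
    bn.running_mean.uniform_(-1.0, 1.0)
    bn.running_var.uniform_(0.5, 2.0)
    bn.weight.uniform_(0.5, 1.5)
    bn.bias.uniform_(-0.5, 0.5)

fused = fold_bn_into_conv(conv, bn)
x = torch.randn(2, 3, 16, 16)
with torch.inference_mode():
    reference = torch.relu(bn(conv(x)))
    folded = torch.relu(fused(x))
max_diff = (reference - folded).abs().max().item()
print(f"max abs difference: {max_diff:.2e}")
```

The folded convolution matches the unfused path up to floating-point rounding while removing the separate BatchNorm pass over the activation tensor.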
Cross-Platform Comparison — AMD vs ARM
Cores | AMD (ms) | ARM Unfused (ms) | ARM Fused (ms) | AMD Speedup vs ARM Unfused | AMD Speedup vs ARM Fused |
1 | 19.6 | 584.8 | 422.6 | 29.8× | 21.6× |
4 | 7.1 | 202.4 | 115.7 | 28.5× | 16.3× |
8 | 5.9 | 133.7 | 62.7 | 22.7× | 10.6× |
Even after fusion, the AMD x86 platform remains 10.6× to 21.6× faster than ARM for this workload. However, fusion substantially improves ARM scaling efficiency (from 54.7% to 84.3% at 8 cores), the gap narrows as core count rises, and memory-layout optimizations offer further headroom.
Conclusion
ResNet50 inference on CPUs is largely memory-bound rather than compute-bound. Performance is strongly influenced by:
tensor memory layout
cache locality
operator fusion
memory bandwidth utilization
Among all optimizations studied, operator fusion and channels_last memory format stand out as the most impactful CPU-side improvements, especially on ARM systems where memory behavior dominates overall performance.
