ResNet50 Performance Study

Archana Barve
Jun 1

using PyTorch on ARM and x86 CPUs

CPU Inference Benchmarking · ARM vs x86 · W8A8 Quantization

Overview

ResNet50 is a convolutional neural network built using bottleneck residual blocks of the form:

1×1 Conv → 3×3 Conv → 1×1 Conv + Skip Connection

Among these layers, the 3×3 convolution layers dominate execution time, making Conv2d the primary hotspot during inference.
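The block structure above can be sketched in PyTorch as follows. This is a simplified illustration, not the torchvision implementation: the class name `Bottleneck`, its attribute names, and the channel sizes are assumptions chosen for readability, and the downsampling/projection path of the real network is omitted.

```python
import torch
import torch.nn as nn


class Bottleneck(nn.Module):
    """Simplified bottleneck: 1x1 conv -> 3x3 conv -> 1x1 conv, plus identity skip."""

    def __init__(self, channels: int, mid_channels: int):
        super().__init__()
        # 1x1 conv reduces the channel count before the expensive 3x3 conv.
        self.reduce = nn.Conv2d(channels, mid_channels, kernel_size=1, bias=False)
        self.bn1 = nn.BatchNorm2d(mid_channels)
        # 3x3 conv: the layer type the article identifies as the hotspot.
        self.conv3x3 = nn.Conv2d(mid_channels, mid_channels, kernel_size=3,
                                 padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(mid_channels)
        # 1x1 conv restores the original channel count.
        self.expand = nn.Conv2d(mid_channels, channels, kernel_size=1, bias=False)
        self.bn3 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = self.relu(self.bn1(self.reduce(x)))
        out = self.relu(self.bn2(self.conv3x3(out)))
        out = self.bn3(self.expand(out))
        return self.relu(out + x)  # skip connection adds the block input back


block = Bottleneck(channels=256, mid_channels=64).eval()
with torch.no_grad():
    y = block(torch.randn(1, 256, 56, 56))
print(y.shape)  # torch.Size([1, 256, 56, 56])
```

Because the skip connection is an element-wise add, the block's output shape matches its input shape.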

Although ResNet50 performs ~4 GFLOPs per inference, it is not compute-heavy enough to saturate the arithmetic units of a modern CPU. The workload is instead dominated by memory traffic, so cache efficiency, memory bandwidth, and tensor layout become the critical performance factors.

Key Study Axes

The following dimensions were studied while analyzing ResNet50 inference performance on ARM and x86 systems using PyTorch:

Latency vs Batch Size
Precision Study (FP32 / FP16 / INT8)
Thread Scaling
Process Scaling
Memory Format Study (channels_last)

1. Latency vs Batch Size

Increasing batch size improves throughput initially, but after a threshold (typically batch size 32–64), throughput gains flatten while latency increases significantly.

This behavior indicates:

cache overflow
increased DRAM traffic
memory bandwidth saturation

Hence, ResNet50 behaves primarily as a memory-bound workload on CPUs.
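A batch-size sweep of this kind could be run with a loop like the one below. It is a minimal sketch, not the harness used for this study: the helper `measure_latency_ms`, the tiny stand-in model, the 64×64 input size, and the batch sizes are all assumptions made so the example runs quickly. To reproduce the study, a real ResNet50 and larger batch sizes would be substituted.

```python
import time

import torch
import torch.nn as nn


def measure_latency_ms(model, batch_size, image_size=64, warmup=2, iters=5):
    """Average latency in milliseconds of one forward pass at the given batch size."""
    x = torch.randn(batch_size, 3, image_size, image_size)
    with torch.inference_mode():
        for _ in range(warmup):  # warm-up runs are excluded from timing
            model(x)
        start = time.perf_counter()
        for _ in range(iters):
            model(x)
        elapsed = time.perf_counter() - start
    return elapsed / iters * 1000.0


# Small stand-in network (hypothetical); replace with ResNet50 for real measurements.
model = nn.Sequential(
    nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
    nn.Conv2d(16, 16, 3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, 10),
).eval()

results = {}
for batch_size in (1, 2, 4, 8):
    latency = measure_latency_ms(model, batch_size)
    throughput = batch_size / latency * 1000.0  # images per second
    results[batch_size] = (latency, throughput)
    print(f"batch={batch_size:3d}  latency={latency:8.2f} ms  throughput={throughput:8.1f} img/s")
```

Plotting throughput against batch size from `results` shows where the curve flattens on a given machine.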

2. Precision Study

FP16 inference reduces tensor size and memory bandwidth requirements. However, on CPUs, and especially on ARM CPUs, the gains from FP16 are generally modest compared to the gains seen on GPUs.

Observed behavior:

FP16 gives limited latency improvement
Benefits come mainly from reduced memory traffic
INT8/W8A8 inference provides larger gains
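The memory-traffic argument can be made concrete by comparing the size of one activation tensor at each precision. The shape below is an illustrative assumption (64 channels at 56×56), not a measurement from this study.

```python
import torch

# Illustrative activation shape in [N, C, H, W] order (an assumption for this sketch).
shape = (1, 64, 56, 56)

footprint_kib = {}
for dtype in (torch.float32, torch.float16, torch.int8):
    t = torch.empty(shape, dtype=dtype)
    # numel() * element_size() gives the number of bytes the tensor data occupies.
    footprint_kib[dtype] = t.numel() * t.element_size() / 1024
    print(f"{str(dtype):15s} {footprint_kib[dtype]:7.1f} KiB")
```

FP16 halves the bytes moved per activation and INT8 quarters them, which is consistent with INT8/W8A8 showing the larger gain on a memory-bound workload.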

3. Thread Scaling

Increasing thread count improves performance only up to a point. Beyond that:

cache contention increases
memory bandwidth becomes saturated
scaling efficiency drops

This is evident from the reduced parallel efficiency at higher core counts in the results table (for example, 54.7% at 8 cores for unfused ARM).
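A thread-scaling sweep can be sketched with `torch.set_num_threads`, which controls PyTorch's intra-op parallelism. The helper `latency_ms`, the stand-in model, and the thread counts are assumptions for illustration. The actual study may have controlled core counts differently (for example, via CPU affinity).

```python
import time

import torch
import torch.nn as nn


def latency_ms(model, x, num_threads, iters=5):
    """Average forward latency in milliseconds using the given intra-op thread count."""
    torch.set_num_threads(num_threads)
    with torch.inference_mode():
        model(x)  # warm-up
        start = time.perf_counter()
        for _ in range(iters):
            model(x)
        elapsed = time.perf_counter() - start
    return elapsed / iters * 1000.0


# Small stand-in network (hypothetical); replace with ResNet50 for real measurements.
model = nn.Sequential(
    nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
    nn.Conv2d(32, 32, 3, padding=1),
).eval()
x = torch.randn(1, 3, 128, 128)

original_threads = torch.get_num_threads()
thread_latency = {n: latency_ms(model, x, n) for n in (1, 2, 4)}
torch.set_num_threads(original_threads)  # restore the previous setting

base = thread_latency[1]
for n, lat in thread_latency.items():
    speedup = base / lat
    print(f"threads={n}  latency={lat:.2f} ms  speedup={speedup:.2f}x  "
          f"efficiency={100 * speedup / n:.1f}%")
```

On a tiny model like this stand-in, threading overhead may dominate. The scaling curve only becomes meaningful with a workload the size of ResNet50.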

4. Process Scaling

Running multiple inference processes increases resource contention:

cache thrashing
NUMA pressure
memory bandwidth contention

Typically, a moderate number of processes provides the best latency/throughput tradeoff.
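One common way to limit contention between processes is to give each one a disjoint set of cores. The helper below is a hypothetical sketch of that partitioning step only. The name `partition_cores` and the 8-core, 3-process example are assumptions, and the source does not state how processes were placed in this study.

```python
def partition_cores(total_cores: int, num_processes: int) -> list[list[int]]:
    """Split core IDs 0..total_cores-1 into contiguous, near-equal groups."""
    if not 1 <= num_processes <= total_cores:
        raise ValueError("need between 1 and total_cores processes")
    base, extra = divmod(total_cores, num_processes)
    groups, start = [], 0
    for i in range(num_processes):
        size = base + (1 if i < extra else 0)  # first `extra` groups get one more core
        groups.append(list(range(start, start + size)))
        start += size
    return groups


groups = partition_cores(total_cores=8, num_processes=3)
print(groups)  # [[0, 1, 2], [3, 4, 5], [6, 7]]

# In each worker process (Linux only), the group could then be applied with:
#   os.sched_setaffinity(0, cores)        # pin this process to its cores
#   torch.set_num_threads(len(cores))     # match intra-op threads to the core count
```

Keeping each group contiguous tends to keep a process within one cache domain or NUMA node, though the actual core-to-node mapping is machine-specific and should be checked.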

5. Memory Format Study (channels_last) — Critical Observation

This area still requires deeper study but already shows strong potential.

PyTorch tensors by default use the NCHW (channels-first) layout:

[N, C, H, W]

In this format:

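each channel's H×W plane is stored contiguously, one channel after another
the values of different channels at the same pixel are H×W elements apart, which is inefficient for convolution kernels that sum across channels
CPUs frequently "jump around memory"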

This leads to:

poor spatial locality
higher cache misses
lower SIMD/vector efficiency

channels_last (NHWC)

Using:

model = model.to(memory_format=torch.channels_last)
inp = inp.to(memory_format=torch.channels_last)

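changes the physical memory order to the following (the logical shape reported by PyTorch remains [N, C, H, W]; only the strides change):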

[N, H, W, C]

In this layout:

all channel values of a given pixel are contiguous
memory access becomes more sequential
cache prefetching improves
SIMD/vector units are utilized more efficiently
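The difference between the two layouts is visible in the tensor strides. The small tensor shape below is an arbitrary choice for illustration.

```python
import torch

x = torch.randn(2, 3, 4, 5)                       # logical shape [N, C, H, W]
x_cl = x.to(memory_format=torch.channels_last)    # same values, NHWC memory order

# The logical shape is unchanged; only the strides differ.
print(tuple(x.shape), x.stride())        # (2, 3, 4, 5) (60, 20, 5, 1)
print(tuple(x_cl.shape), x_cl.stride())  # (2, 3, 4, 5) (60, 1, 15, 3)

# Stride 1 on the channel dimension means the channels of one pixel are adjacent.
print(x_cl.is_contiguous(memory_format=torch.channels_last))  # True
print(torch.equal(x, x_cl))                                   # True
```

In the default layout the channel stride is H×W (20 here), so reading all channels of one pixel touches widely separated addresses. In channels_last the channel stride is 1.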

Intuition

NCHW → "jump around memory to compute"
NHWC → "stream through memory smoothly"

CPUs generally prefer streaming memory access patterns.

channels_last is often the easiest single optimization to apply, and it can unlock a 20–40% performance improvement for convolution-heavy workloads like ResNet50.

This is particularly important on ARM systems where workloads are strongly memory-bandwidth bound.

ResNet50 W8A8 — Latency & Throughput

Workload: Imagenette validation set (imagenette2-320/val), a 10-class subset of ImageNet

Batch Size: 1

Platform	Variant	Cores	Latency (ms)	Throughput (img/s)	Speedup vs 1-core	Parallel Efficiency
ARM	Unfused	1	584.8	1.7	1.00×	100.0%
ARM	Unfused	4	202.4	4.9	2.89×	72.2%
ARM	Unfused	8	133.7	7.5	4.37×	54.7%
ARM	Fused	1	422.6	2.4	1.00×	100.0%
ARM	Fused	4	115.7	8.6	3.65×	91.3%
ARM	Fused	8	62.7	15.9	6.74×	84.3%
AMD	Unfused	1	19.6	51.0	1.00×	100.0%
AMD	Unfused	4	7.1	140.8	2.76×	69.0%
AMD	Unfused	8	5.9	169.5	3.32×	41.5%
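The speedup and parallel-efficiency columns follow directly from the latencies: speedup is the 1-core latency divided by the n-core latency, and efficiency is speedup divided by n. The helper name `scaling_metrics` is an assumption, while the latencies are taken from the ARM fused rows of the table.

```python
def scaling_metrics(latency_ms_by_cores: dict[int, float]) -> dict[int, tuple[float, float]]:
    """Return {cores: (speedup vs 1 core, parallel efficiency in %)}."""
    base = latency_ms_by_cores[1]
    metrics = {}
    for cores, latency in sorted(latency_ms_by_cores.items()):
        speedup = base / latency
        metrics[cores] = (round(speedup, 2), round(100.0 * speedup / cores, 1))
    return metrics


arm_fused = {1: 422.6, 4: 115.7, 8: 62.7}  # latencies (ms) from the table above
for cores, (speedup, efficiency) in scaling_metrics(arm_fused).items():
    print(f"{cores} cores: {speedup:.2f}x speedup, {efficiency:.1f}% efficiency")
```

Running the same function on the unfused and AMD rows reproduces the remaining entries of the two columns.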

Fusion Benefit — ARM Neoverse-N1

Conv + BatchNorm + ReLU Fusion

Cores	Unfused (ms)	Fused (ms)	Speedup	Ops Tracked
1	584.8	422.6	1.38×	160 → 41
4	202.4	115.7	1.75×	160 → 41
8	133.7	62.7	2.13×	160 → 41

Fusion significantly reduces:

operator dispatch overhead
intermediate memory movement
synchronization points

The benefits become more pronounced at higher core counts: the speedup from fusion grows from 1.38× on 1 core to 2.13× on 8 cores.
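To illustrate why fusion removes operators, the sketch below folds an inference-mode BatchNorm into the preceding convolution by rescaling its weights and bias. This is a hand-written illustration of the general technique. The source does not say which fusion mechanism was used in the study, and the helper name `fold_bn_into_conv` is an assumption.

```python
import torch
import torch.nn as nn


def fold_bn_into_conv(conv: nn.Conv2d, bn: nn.BatchNorm2d) -> nn.Conv2d:
    """Return one Conv2d equivalent to bn(conv(x)) when bn is in eval mode."""
    fused = nn.Conv2d(conv.in_channels, conv.out_channels, conv.kernel_size,
                      conv.stride, conv.padding, conv.dilation, conv.groups, bias=True)
    with torch.no_grad():
        # BN in eval mode is y = (x - mean) * gamma / sqrt(var + eps) + beta.
        scale = bn.weight / torch.sqrt(bn.running_var + bn.eps)
        fused.weight.copy_(conv.weight * scale.reshape(-1, 1, 1, 1))
        conv_bias = conv.bias if conv.bias is not None else torch.zeros_like(bn.running_mean)
        fused.bias.copy_((conv_bias - bn.running_mean) * scale + bn.bias)
    return fused


conv = nn.Conv2d(8, 16, kernel_size=3, padding=1, bias=False)
bn = nn.BatchNorm2d(16)
with torch.no_grad():  # give BN non-trivial statistics so the check is meaningful
    bn.running_mean.uniform_(-1.0, 1.0)
    bn.running_var.uniform_(0.5, 2.0)
    bn.weight.uniform_(0.5, 1.5)
    bn.bias.uniform_(-0.5, 0.5)
conv.eval()
bn.eval()

x = torch.randn(2, 8, 14, 14)
with torch.no_grad():
    reference = torch.relu(bn(conv(x)))               # three separate operators
    fused_out = torch.relu(fold_bn_into_conv(conv, bn)(x))  # BN folded away

max_diff = (reference - fused_out).abs().max().item()
print(f"max abs difference: {max_diff:.2e}")
```

The folded convolution produces the same result up to floating-point rounding while skipping a full read and write of the intermediate activation. Fused backends typically also apply the ReLU inside the same kernel.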

Cross-Platform Comparison — AMD vs ARM

Cores	AMD (ms)	ARM Unfused (ms)	ARM Fused (ms)	AMD Faster than Unfused	AMD Faster than Fused
1	19.6	584.8	422.6	29.8×	21.6×
4	7.1	202.4	115.7	28.5×	16.3×
8	5.9	133.7	62.7	22.7×	10.6×

Even after fusion, the AMD x86 platform remains 10.6× to 21.6× faster than ARM for this workload. However, fusion and memory-layout optimizations substantially improve ARM scaling efficiency; with fusion alone, parallel efficiency at 8 cores rises from 54.7% to 84.3%.

Conclusion

ResNet50 inference on CPUs is largely memory-bound rather than compute-bound. Performance is strongly influenced by:

tensor memory layout
cache locality
operator fusion
memory bandwidth utilization

Among all optimizations studied, operator fusion and channels_last memory format stand out as the most impactful CPU-side improvements, especially on ARM systems where memory behavior dominates overall performance.