
ResNet50 Performance Study

  • Writer: Archana Barve
  • Jun 1

using PyTorch on ARM and x86 CPUs

CPU Inference Benchmarking  ·  ARM vs x86  ·  W8A8 Quantization

Overview

ResNet50 is a convolutional neural network built using bottleneck residual blocks of the form:

1×1 Conv → 3×3 Conv → 1×1 Conv + Skip Connection

Among these layers, the 3×3 convolution layers dominate execution time, making Conv2d the primary hotspot during inference.
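
A rough count of multiply-accumulates (MACs) per bottleneck block suggests why the 3×3 layer is the natural hotspot. The sketch below assumes the standard ResNet50 bottleneck shape (outer width 4× the middle width, stride 1, no bias) and uses a 56×56 feature map with 64 middle channels as the example; the function names are illustrative, and MAC counts are only a proxy for measured time.

```python
def conv_macs(c_in, c_out, k, h, w):
    """MACs for a k x k convolution producing a c_out x h x w output (no bias)."""
    return c_in * c_out * k * k * h * w


def bottleneck_macs(mid, h, w, expansion=4):
    """MACs of the three convolutions in one bottleneck block (stride 1 assumed)."""
    wide = mid * expansion
    return {
        "1x1 reduce": conv_macs(wide, mid, 1, h, w),
        "3x3": conv_macs(mid, mid, 3, h, w),
        "1x1 expand": conv_macs(mid, wide, 1, h, w),
    }


macs = bottleneck_macs(mid=64, h=56, w=56)
total = sum(macs.values())
for name, m in macs.items():
    print(f"{name:11s} {m / 1e6:7.1f} MMACs  {100 * m / total:5.1f}%")
```

Under these assumptions the 3×3 convolution accounts for 9/17 of the block's MACs, a little over half, with each 1×1 convolution contributing 4/17.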

Although ResNet50 performs ~4 GFLOPs per inference, it is not compute-heavy enough to fully utilize modern CPUs. Instead, the workload is highly memory-traffic intensive, where cache efficiency, memory bandwidth, and tensor layout become critical performance factors.

Key Study Axes

The following dimensions were studied while analyzing ResNet50 inference performance on ARM and x86 systems using PyTorch:

  • Latency vs Batch Size

  • Precision Study (FP32 / FP16 / INT8)

  • Thread Scaling

  • Process Scaling

  • Memory Format Study (channels_last)

1. Latency vs Batch Size

Increasing batch size improves throughput initially, but after a threshold (typically batch size 32–64), throughput gains flatten while latency increases significantly.

This behavior indicates:

  • cache overflow

  • increased DRAM traffic

  • memory bandwidth saturation

Hence, ResNet50 behaves primarily as a memory-bound workload on CPUs.
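
A batch-size sweep of this kind can be sketched with a small timing harness. The version below is an assumption about how such a measurement might be set up, not the harness used in this study: `sweep_batch_sizes` and `placeholder_workload` are hypothetical names, and the placeholder stands in for a real `model(inp)` call so the sketch runs without a model.

```python
import statistics
import time


def sweep_batch_sizes(run_batch, batch_sizes, warmup=2, repeats=5):
    """Return {batch_size: (median latency in ms, throughput in img/s)}."""
    results = {}
    for bs in batch_sizes:
        for _ in range(warmup):  # warm caches and lazy initialisation before timing
            run_batch(bs)
        samples = []
        for _ in range(repeats):
            start = time.perf_counter()
            run_batch(bs)
            samples.append(time.perf_counter() - start)
        latency_s = statistics.median(samples)
        results[bs] = (latency_s * 1e3, bs / latency_s)
    return results


def placeholder_workload(batch_size):
    # Hypothetical stand-in so the sketch is runnable on its own;
    # a real study would call model(inp) on a batch of this size.
    return sum(i * i for i in range(20_000 * batch_size))


for bs, (lat_ms, ips) in sweep_batch_sizes(placeholder_workload, [1, 2, 4]).items():
    print(f"batch={bs:3d}  latency={lat_ms:8.2f} ms  throughput={ips:8.1f} img/s")
```

Plotting throughput against batch size from such a sweep is what exposes the plateau: once throughput stops rising while latency keeps growing, larger batches only add waiting time.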

2. Precision Study

FP16 inference halves tensor size relative to FP32 and therefore reduces memory bandwidth requirements. However, on CPUs, and on ARM CPUs in particular, the gains from FP16 are generally modest compared to GPUs.

Observed behavior:

  • FP16 gives limited latency improvement

  • Benefits come mainly from reduced memory traffic

  • INT8/W8A8 inference provides larger gains
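
W8A8 means both weights and activations are stored as 8-bit integers, so each value occupies 1 byte instead of 4 for FP32. The post does not state which quantization scheme was used; the sketch below assumes symmetric per-tensor scaling, one common choice, and the function names and sample values are illustrative.

```python
def quantize_symmetric(values, num_bits=8):
    """Map floats to signed integers using one shared scale (symmetric, per-tensor)."""
    qmax = 2 ** (num_bits - 1) - 1  # 127 for INT8
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    q = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return q, scale


def dequantize(q, scale):
    """Recover approximate float values from the integers and their scale."""
    return [qi * scale for qi in q]


weights = [0.42, -1.27, 0.05, 0.90]  # made-up sample values
q, scale = quantize_symmetric(weights)
restored = dequantize(q, scale)
worst_error = max(abs(a - b) for a, b in zip(weights, restored))
print("int8 values:", q)
print("scale:", scale, "worst round-trip error:", worst_error)
```

The round-trip error is bounded by half the scale, which is the price paid for moving a quarter of the bytes through the memory hierarchy.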

3. Thread Scaling

Increasing thread count improves performance only up to a point. Beyond that:

  • cache contention increases

  • memory bandwidth becomes saturated

  • scaling efficiency drops

This is evident from reduced parallel efficiency at higher core counts.

4. Process Scaling

Running multiple inference processes increases resource contention:

  • cache thrashing

  • NUMA pressure

  • memory bandwidth contention

Typically, a moderate number of processes provides the best latency/throughput tradeoff.


5. Memory Format Study (channels_last) — Critical Observation

This area still requires deeper study but already shows strong potential.

PyTorch tensors by default use the NCHW (channels-first) layout:

[N, C, H, W]

In this format:


  • each channel's H×W plane is stored contiguously, so the C values belonging to one pixel lie H×W elements apart

  • spatial access patterns become inefficient for convolution kernels

  • CPUs frequently "jump around memory"

This leads to:

  • poor spatial locality

  • higher cache misses

  • lower SIMD/vector efficiency


channels_last (NHWC)

Using:

    import torch

    model = model.to(memory_format=torch.channels_last)
    inp = inp.to(memory_format=torch.channels_last)

changes the physical memory order (the logical shape reported by PyTorch remains [N, C, H, W]) to:

[N, H, W, C]

In this layout:

  • all C channel values of each pixel are contiguous

  • memory access becomes more sequential

  • cache prefetching improves

  • SIMD/vector units are utilized more efficiently

Intuition

  • NCHW → "jump around memory to compute"

  • NHWC → "stream through memory smoothly"

CPUs generally prefer streaming memory access patterns.
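
The difference between the two layouts comes down to element strides. The sketch below computes them by hand for a logical [N, C, H, W] shape, in plain Python with illustrative function names. A convolution sums over input channels at each output pixel, so the distance in memory between consecutive channels of the same pixel is the number to watch.

```python
def strides_nchw(n, c, h, w):
    """Element strides per (N, C, H, W) index for a contiguous NCHW buffer."""
    return (c * h * w, h * w, w, 1)


def strides_nhwc(n, c, h, w):
    """Element strides per (N, C, H, W) index when memory is ordered N, H, W, C."""
    return (h * w * c, 1, w * c, c)


def offset(index, strides):
    """Flat element offset of a logical (n, c, h, w) index."""
    return sum(i * s for i, s in zip(index, strides))


shape = (1, 64, 56, 56)  # logical N, C, H, W
for name, strides in [("NCHW", strides_nchw(*shape)), ("NHWC", strides_nhwc(*shape))]:
    # Distance between channel 0 and channel 1 of the pixel at (h=0, w=0).
    step = offset((0, 1, 0, 0), strides) - offset((0, 0, 0, 0), strides)
    print(f"{name} strides={strides} channel step={step} elements")
```

For this shape the channel step is 3136 elements in NCHW and 1 element in NHWC, which is the "jump" versus "stream" contrast in concrete terms.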

channels_last is often the single easiest optimization to apply, and it can unlock a 20–40% performance improvement for convolution-heavy workloads like ResNet50.

This is particularly important on ARM systems where workloads are strongly memory-bandwidth bound.


ResNet50 W8A8 — Latency & Throughput

Workload: Imagenette validation set (imagenette2-320/val), a 10-class subset of ImageNet

Batch Size: 1

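| Platform | Variant | Cores | Latency (ms) | Throughput (img/s) | Speedup vs 1-core | Parallel Efficiency |
|----------|---------|-------|--------------|--------------------|-------------------|---------------------|
| ARM      | Unfused | 1     | 584.8        | 1.7                | 1.00×             | 100.0%              |
| ARM      | Unfused | 4     | 202.4        | 4.9                | 2.89×             | 72.2%               |
| ARM      | Unfused | 8     | 133.7        | 7.5                | 4.37×             | 54.7%               |
| ARM      | Fused   | 1     | 422.6        | 2.4                | 1.00×             | 100.0%              |
| ARM      | Fused   | 4     | 115.7        | 8.6                | 3.65×             | 91.3%               |
| ARM      | Fused   | 8     | 62.7         | 15.9               | 6.74×             | 84.3%               |
| AMD      | Unfused | 1     | 19.6         | 51.0               | 1.00×             | 100.0%              |
| AMD      | Unfused | 4     | 7.1          | 140.8              | 2.76×             | 69.0%               |
| AMD      | Unfused | 8     | 5.9          | 169.5              | 3.32×             | 41.5%               |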
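
The derived columns follow directly from the measured latencies: at batch size 1, throughput is 1000 divided by latency in ms, speedup is the 1-core latency divided by the n-core latency, and parallel efficiency is speedup divided by core count. The sketch below recomputes them from the latencies in the table above; the dictionary layout and function name are illustrative.

```python
# Batch-1 latencies in ms, copied from the table above: {cores: latency}
latency_ms = {
    "ARM unfused": {1: 584.8, 4: 202.4, 8: 133.7},
    "ARM fused": {1: 422.6, 4: 115.7, 8: 62.7},
    "AMD unfused": {1: 19.6, 4: 7.1, 8: 5.9},
}


def scaling_metrics(latencies):
    """Return (cores, throughput, speedup, efficiency) rows for one configuration."""
    base = latencies[1]
    rows = []
    for cores, ms in sorted(latencies.items()):
        throughput = 1000.0 / ms      # img/s at batch size 1
        speedup = base / ms           # relative to the 1-core run
        efficiency = speedup / cores  # fraction of ideal linear scaling
        rows.append((cores, throughput, speedup, efficiency))
    return rows


for name, lats in latency_ms.items():
    for cores, tput, speedup, eff in scaling_metrics(lats):
        print(f"{name:12s} {cores} cores  {tput:6.1f} img/s  {speedup:4.2f}x  {eff:6.1%}")
```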

Fusion Benefit — ARM Neoverse-N1

Conv + BatchNorm + ReLU Fusion

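| Cores | Unfused (ms) | Fused (ms) | Speedup | Ops Tracked |
|-------|--------------|------------|---------|-------------|
| 1     | 584.8        | 422.6      | 1.38×   | 160 → 41    |
| 4     | 202.4        | 115.7      | 1.75×   | 160 → 41    |
| 8     | 133.7        | 62.7       | 2.13×   | 160 → 41    |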

Fusion significantly reduces:

  • operator dispatch overhead

  • intermediate memory movement

  • synchronization points

The benefits become more pronounced at higher core counts: the fusion speedup grows from 1.38× on 1 core to 2.13× on 8 cores.
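
The post does not describe how the fusion was implemented. One standard inference-time technique, assumed here, is to fold the BatchNorm scale and shift into the preceding convolution's weights and bias, so that Conv + BatchNorm + ReLU becomes a single pass with no intermediate tensors. The sketch uses a single scalar weight per channel to keep the arithmetic visible; all names and parameter values are illustrative.

```python
import math


def fold_batchnorm(weight, bias, gamma, beta, mean, var, eps=1e-5):
    """Fold y = gamma * (conv(x) - mean) / sqrt(var + eps) + beta into the conv."""
    k = gamma / math.sqrt(var + eps)
    return weight * k, (bias - mean) * k + beta


def unfused(x, weight, bias, gamma, beta, mean, var, eps=1e-5):
    y = weight * x + bias                                 # conv (one scalar weight)
    y = gamma * (y - mean) / math.sqrt(var + eps) + beta  # batch norm (inference mode)
    return max(y, 0.0)                                    # ReLU


def fused(x, folded_weight, folded_bias):
    return max(folded_weight * x + folded_bias, 0.0)      # single pass


params = dict(weight=0.8, bias=0.1, gamma=1.5, beta=-0.2, mean=0.3, var=4.0)
fw, fb = fold_batchnorm(**params)
for x in (-2.0, 0.5, 3.0):
    print(x, unfused(x, **params), fused(x, fw, fb))
```

Both paths produce the same values up to floating-point rounding, but the fused path touches memory once instead of three times, which is consistent with the reduced intermediate memory movement listed above.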


Cross-Platform Comparison — AMD vs ARM

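| Cores | AMD (ms) | ARM Unfused (ms) | ARM Fused (ms) | AMD Faster than Unfused | AMD Faster than Fused |
|-------|----------|------------------|----------------|-------------------------|-----------------------|
| 1     | 19.6     | 584.8            | 422.6          | 29.8×                   | 21.6×                 |
| 4     | 7.1      | 202.4            | 115.7          | 28.5×                   | 16.3×                 |
| 8     | 5.9      | 133.7            | 62.7           | 22.7×                   | 10.6×                 |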

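Even after fusion optimizations, x86/AMD platforms continue to outperform ARM significantly for this workload: AMD is still 10.6× faster than fused ARM at 8 cores. However, fusion substantially improves ARM scaling efficiency, raising 8-core parallel efficiency from 54.7% to 84.3%, and memory-layout optimizations show potential for further gains.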


Conclusion

ResNet50 inference on CPUs is largely memory-bound rather than compute-bound. Performance is strongly influenced by:

  • tensor memory layout

  • cache locality

  • operator fusion

  • memory bandwidth utilization

Among all optimizations studied, operator fusion and channels_last memory format stand out as the most impactful CPU-side improvements, especially on ARM systems where memory behavior dominates overall performance.
