To get maximum tokens generated for target CPU

Archana Barve
Jun 9
1 min read

Updated: Jun 13

LLMs are Getting Better and Smaller

Let’s look at Llama as an example. The rapid evolution of these models highlights a key trend in AI: prioritizing efficiency and performance.

When Llama 2 70B launched in August 2023, it was considered a top-tier foundational model. However, its massive size demanded powerful hardware like the NVIDIA H100 accelerator. Less than nine months later, Meta introduced Llama 3 8B, shrinking the model by almost 9x. This enabled it to run on smaller AI accelerators and even optimized CPUs, drastically reducing the required hardware costs and power usage. Impressively, Llama 3 8B surpassed its larger predecessor in accuracy benchmarks.

Setup details

Tested with llama.cpp on

Machine: Gv4 r8g.24xlarge
OS: ubuntu 2204
kernel: 6.8.AWS
Model: Meta-Llama-3.1-8B-Instruct- Q8_0.gguf

Test sweep

nproc x nthreads x bs [1-32]

Graphs with observations highlighting benefits

Token generation is done in an auto-regressive manner and is highly sensitive to the length of output needed to be generated. Arm optimizations help here with larger batch sizes, increasing the throughput by more than 2x.