
Getting Maximum Token Generation on a Target CPU

  • Writer: Archana Barve
  • Jun 9
  • 1 min read


LLMs are Getting Better and Smaller

Let’s look at Meta’s Llama family as an example. The rapid evolution of these models highlights a key trend in AI: prioritizing efficiency alongside performance.

When Llama 2 70B launched in August 2023, it was considered a top-tier foundational model. However, its massive size demanded powerful hardware like the NVIDIA H100 accelerator. Less than nine months later, Meta introduced Llama 3 8B, shrinking the model by almost 9x. This enabled it to run on smaller AI accelerators and even optimized CPUs, drastically reducing hardware cost and power usage. Impressively, Llama 3 8B surpassed its larger predecessor on accuracy benchmarks.


Setup details


Tested with llama.cpp on:

  • Machine: AWS Graviton4 (r8g.24xlarge, 96 vCPUs)

  • OS: Ubuntu 22.04

  • Kernel: 6.8 (AWS)

  • Model: Meta-Llama-3.1-8B-Instruct-Q8_0.gguf
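
The post does not include the build commands, but a minimal sketch for building llama.cpp on Graviton4 looks like this (the default CMake build auto-detects Arm CPU features such as NEON, i8mm, and SVE, so no special flags are assumed):

    # Minimal build sketch (assumed; the exact build flags were not published).
    # llama.cpp's default CMake build auto-detects Arm features on Graviton4.
    git clone https://github.com/ggerganov/llama.cpp
    cd llama.cpp
    cmake -B build -DCMAKE_BUILD_TYPE=Release
    cmake --build build --config Release -j"$(nproc)"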

Test sweep

  • nproc × nthreads × batch size (bs 1–32); see the sketch below
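
The exact benchmark invocation was not published; the sketch below assumes llama.cpp's llama-batched-bench tool, which reports generation throughput at several parallel-sequence counts (the bs 1–32 axis of the sweep) in a single run:

    #!/usr/bin/env bash
    # Sketch of the sweep (assumed invocation; the post lists the sweep
    # dimensions but not the exact commands).
    MODEL=Meta-Llama-3.1-8B-Instruct-Q8_0.gguf

    # Thread counts up to nproc (96 vCPUs on r8g.24xlarge).
    for T in 16 32 48 96; do
      # -npp/-ntg: 128 prompt + 128 generated tokens per sequence (assumed lengths)
      # -npl: parallel sequence counts, i.e. the batch-size 1-32 sweep
      # -c 8192 fits 32 sequences x 256 tokens of KV cache
      ./build/bin/llama-batched-bench -m "$MODEL" \
        -c 8192 -b 2048 -ub 512 -npp 128 -ntg 128 \
        -npl 1,2,4,8,16,32 -t "$T"
    done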

Graphs and observations highlighting the benefits


Token generation is auto-regressive and therefore highly sensitive to the number of output tokens to be generated. The Arm optimizations help most at larger batch sizes, increasing throughput by more than 2x.





Conclusion


For Meta-Llama-3.1-8B-Instruct-Q8_0.gguf, Graviton4 can generate 161 tokens per second, which translates to 102,486 tokens per dollar.
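
As a quick sanity check on the dollar figure: 161 tokens/s is about 161 × 3,600 ≈ 579,600 tokens per hour, so 102,486 tokens per dollar implies an instance cost of roughly $5.66 per hour, in line with r8g.24xlarge on-demand pricing (exact rates vary by region).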
