Benchmarking and Validation of Workloads on Emulators
- Sayali Tamane
- Jun 30
In this case study, we describe our systematic approach to benchmarking and validating workloads on FPGA platforms using HAPS (High-performance ASIC Prototyping System) models. The workflow involves building a diverse set of workloads, both natively under QEMU emulation and via an open-source cross-toolchain, executing them on FPGA hardware, and capturing detailed performance metrics such as instruction counts and cycle counts.
1. Benchmark Preparation and Build Process
We classify our benchmarks into the following categories:
High-Performance Computing (HPC) Benchmarks: Includes matrix multiplication, FFT, and other numerical kernels.
Synthetic Benchmarks: Includes whetstone, dhrystone, and other CPU stress tests.
Algorithmic Benchmarks: Includes sorting algorithms, graph traversal, and numerical integration.
Cryptography and Security Benchmarks: AES, RSA, and SHA-based microbenchmarks (planned for a future pipeline).
Memory and I/O Benchmarks: Includes stream, memcpy stressors, and file read/write tests.
Industry-standard Benchmarks: SPEC CPU2017 for INT and FP tracks.
All benchmarks are first built or cross-compiled depending on their compatibility:
Native Build: Performed in a QEMU-based emulation environment where toolchain compatibility allows.
Cross Compilation: Done with a cross-toolchain targeting the architecture when a native build fails or is time-prohibitive.

2. Deployment and Execution on FPGA
The compiled binaries are deployed to the FPGA via HAPS models configured with a soft-core. Execution is controlled using a lightweight shell interface or boot script. We utilize a custom performance monitoring utility (akc_counter_capture) to gather the following metrics:
Total instruction count
Cycle count
These values are stored for each benchmark run and are used in performance comparisons.
3. Workload Examples
1: DGEMM (Double-Precision General Matrix Multiply)
DGEMM is a key linear algebra kernel from the BLAS library. We compiled and executed the DGEMM kernel using double-precision arithmetic with matrix size N×N, where N = 256. Performance was evaluated using instruction count, cycle count, and IPC (Instructions Per Cycle).
2: N-Queens Problem
The N-Queens benchmark is a classic example of combinatorial search used to evaluate control-flow-heavy algorithm performance. It computes all valid arrangements of N queens on an N×N chessboard such that no two queens attack each other.
We verified correctness by comparing the total number of valid solutions for standard board sizes (e.g., N=12 and N=14), which matched precisely across architectures. The benchmark’s output was deterministic, and no deviations were observed across multiple FPGA runs.
3: Red-Black Tree (RBTree) Manipulation
Red-Black Tree (RBTree) manipulation represents a memory-bound and pointer-intensive workload that tests dynamic memory access patterns and data structure balancing algorithms. This benchmark was compiled both with the embedded cross-toolchain and natively under QEMU for consistency.
Validation involved verifying the in-order traversal of the tree after bulk insertions and deletions. RBTree serves as a robust test of both instruction scheduling and memory subsystem behavior.
Conclusion
Our approach demonstrates that workloads can be effectively compiled, executed, and validated on FPGA platforms using HAPS models.