- AWS Graviton4 vs. GCP Axion
This blog post dives into a head-to-head performance comparison of two leading contenders: AWS Graviton4 (powering AWS r8g instances) and Google Axion (powering GCP Axion instances), both built on the Arm Neoverse-V2 architecture. We'll examine their performance with Valkey 8.0.1, a popular in-memory data store.

The Contenders: AWS Graviton4 and Google Axion
AWS Graviton4 and Google Axion represent the latest generation of Arm-based server processors from Amazon and Google. Both leverage the Arm Neoverse-V2 CPU architecture, which is designed for cloud computing, machine learning, and high-performance computing (HPC). These custom chips aim to provide superior performance and energy efficiency compared to traditional x86-based alternatives.

The Benchmark: Valkey 8.0.1
To conduct a meaningful comparison, we chose Valkey 8.0.1, a high-performance, open-source in-memory data structure store. Valkey is a fork of Redis and is widely used for caching, session management, and real-time analytics, making it an excellent workload for testing the raw processing and memory capabilities of these instances.

Our benchmark setup was configured to ensure a fair comparison:
- Valkey server cores: The Valkey server was pinned to cores 2 through 7.
- Request parameters: Each experiment used 100 million requests, 256 parallel clients, and a payload size of 1024 KiB.
- Performance metrics: We focused on two key metrics: requests per second (RPS) for throughput and P99 latency (the 99th percentile) for responsiveness.

Experiment 1: Network Performance
The first experiment tested performance with the Valkey client and server running on separate hosts within the same cluster network. This scenario highlights the efficiency of the underlying network virtualization and interconnects, which are critical for many distributed workloads.

IRQ pinning: For this test, we pinned IRQs to cores 0 and 1. Dedicating specific CPU cores to handling network interrupts prevents them from interfering with the Valkey server's workload and ensures a more stable and accurate network performance measurement.

Distributed application results:

Metric            AWS r8g     GCP Axion
SET RPS           925,860     790,020
SET P99 latency   0.431 ms    0.655 ms
GET RPS           941,802     870,920
GET P99 latency   0.415 ms    0.543 ms

In this network-bound test, the AWS r8g instances consistently outperformed GCP Axion in both SET and GET operations, with higher throughput and lower P99 latency. This suggests that the AWS Nitro System's networking capabilities, which are tightly integrated with the Graviton4 processor, provide a notable advantage for distributed, network-sensitive applications.

Experiment 2: Same-Host Performance
The second experiment evaluated raw processing power by running both the Valkey client and server on the same host. This test minimizes network overhead and focuses on CPU and memory performance.

Same-host application results:

Metric            AWS r8g     GCP Axion
SET RPS           1,024,894   894,262
SET P99 latency   0.407 ms    0.367 ms
GET RPS           1,060,186   942,720
GET P99 latency   0.359 ms    0.303 ms

Here, the results reveal a more nuanced picture. While AWS r8g instances again delivered higher overall throughput (RPS), the GCP Axion instances demonstrated lower P99 latency for both SET and GET operations.
This indicates that while AWS's architecture may be optimized for achieving maximum throughput, Google's design seems to prioritize low-latency performance, which is a key characteristic of Valkey's command execution model.

Conclusion and Analysis
The benchmark results paint a clear picture: for this specific workload, AWS Graviton4-based r8g instances lead in raw throughput, while Google Axion instances excel in latency.
- AWS r8g (Graviton4): The higher RPS in both experiments suggests that AWS's implementation is highly optimized for parallel and high-throughput workloads, likely due to a tight integration with the AWS Nitro System.
- GCP Axion: The lower P99 latency on the same-host test is a significant indicator. It suggests that Google's Axion processor might have a more efficient core design or cache structure that benefits workloads where low-latency performance is paramount.
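To make the setup above concrete, here is a hedged sketch of how such a run could be reproduced with core pinning on the server and a separate load-generator host. It assumes the valkey-server and valkey-benchmark binaries shipped with Valkey, a hypothetical server address, and NIC interrupt names that match your instance type; the payload flag is in bytes and, like the other parameters, should be adjusted to mirror the configuration described in the post rather than taken as our exact command line.

```bash
#!/usr/bin/env bash
# Minimal sketch of a pinned Valkey benchmark run (assumptions: valkey-server and
# valkey-benchmark binaries, a two-host setup, and root access for IRQ affinity).

SERVER_IP=10.0.0.10   # hypothetical server address

# On the server host: pin network IRQs to cores 0-1 (adjust the grep pattern to
# your NIC driver, e.g. ena on AWS or virtio on GCP), then pin the server to 2-7.
for irq in $(grep -iE 'ena|virtio|mlx' /proc/interrupts | awk '{sub(":","",$1); print $1}'); do
    echo 0-1 | sudo tee /proc/irq/${irq}/smp_affinity_list > /dev/null
done
taskset -c 2-7 valkey-server --save '' --appendonly no &

# On the client host: drive SET/GET traffic with 256 parallel clients.
valkey-benchmark -h "$SERVER_IP" -p 6379 \
    -c 256 -n 100000000 -d 1024 -t set,get
```

The same pattern works for the same-host experiment by running both commands on one machine with disjoint core sets.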
- RISC-V Fuzzer for GCC and LLVM
Fuzzing RISC-V compilers like GCC and LLVM is a crucial practice for ensuring the correctness and security of the entire software ecosystem built on this architecture. It's not about finding vulnerabilities in the final compiled code, but rather about discovering bugs within the compiler itself that could lead to incorrect code generation, unexpected behavior, or even exploitable flaws.

Why Compiler Fuzzing is a Unique Challenge
Fuzzing compilers is different from fuzzing a typical application. Instead of feeding random data to a program, you're generating random, yet syntactically valid, source code to feed to the compiler. A dumb fuzzer that just mutates bytes will quickly generate code that can't even be parsed, missing deeper bugs. The primary goal of compiler fuzzing is to detect two main types of bugs:
- Crashes and panics: The fuzzer generates code that causes the compiler to crash, hang, or throw a fatal error during compilation. This indicates a compiler bug that needs to be fixed.
- Miscompilations: This is the most dangerous type of bug. The compiler successfully compiles the fuzzed code, but the generated machine code (the RISC-V assembly) is incorrect. This can lead to silent data corruption, security vulnerabilities, or unpredictable program behavior. Finding these requires a technique called differential fuzzing.

The Power of Differential Fuzzing for RISC-V Compilers
Differential fuzzing is an exceptionally powerful technique for finding miscompilations in RISC-V compilers. Here's how it works (a minimal sketch of such a loop appears at the end of this post):
1. A fuzzer, often one that generates valid C or C++ code (like csmith), creates a unique program.
2. This program is compiled by at least two different compilers (e.g., GCC and LLVM) or with different optimization flags (e.g., -O0 and -O3).
3. The compiled binaries are then executed, and their outputs are compared.
4. If the outputs don't match, at least one of the compilers has a miscompilation bug. The fuzzer then saves this specific source code as a test case for a developer to analyze.
This method effectively uses a "test oracle" to automatically identify bugs without needing to know the correct output beforehand. It's a key reason why so many compiler bugs have been found in both GCC and LLVM.

Key Tools and Repositories for RISC-V Compiler Fuzzing
While many general-purpose fuzzers (like AFL++) can be used to fuzz a compiler's source code, specialized tools are often needed to effectively generate and test valid RISC-V-specific code.
- csmith: A well-known, randomized test case generator for C programs. It creates complex, valid C code that is a perfect input for differential testing of C compilers like GCC and LLVM. While not RISC-V-specific, it's an essential part of the workflow for fuzzing any C compiler's RISC-V backend. GitHub repo: https://github.com/csmith-project/csmith
- RISCV-DV: Maintained by the RISC-V community, this tool is primarily for design verification of RISC-V processors, but it can be used to generate complex instruction sequences for testing compiler backends. It's highly configurable and can target specific ISA extensions. GitHub repo: https://github.com/google/riscv-dv
- IRFuzzer: A specialized fuzzer for the LLVM backend. Instead of generating C/C++ source code, it generates LLVM's Intermediate Representation (IR), allowing it to directly test backend code generation without worrying about frontend bugs. This is a very targeted approach for finding issues in LLVM's RISC-V code generator.
  GitHub: As a research tool, resources are often found on arXiv and university websites; searching for "IRFuzzer" on GitHub will lead to related projects.
- RISCV-Vector-Intrinsic-Fuzzing (RIF): A fuzzer designed to generate random code using the RISC-V Vector Extension (RVV) intrinsics. This is crucial for verifying that compilers like GCC and LLVM correctly implement this complex and performance-critical part of the RISC-V ISA. GitHub repo: https://github.com/sifive/riscv-vector-intrinsic-fuzzing
- patrick-rivos/compiler-fuzz-ci: This GitHub repository provides a great example of a Continuous Integration (CI) setup for fuzzing RISC-V compilers. It demonstrates how to combine tools like csmith with a CI pipeline to automatically fuzz GCC and LLVM and report bugs. GitHub repo: https://github.com/patrick-rivos/compiler-fuzz-ci

Fuzzing RISC-V compilers is an ongoing and critical effort. It ensures that the software developers are building on top of is reliable, secure, and correctly translated to the underlying hardware, strengthening the entire RISC-V ecosystem.
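As promised above, here is a minimal sketch of a csmith-based differential fuzzing loop. It assumes csmith, a riscv64 GCC cross toolchain, Clang, and qemu-riscv64 user-mode emulation are installed; file names and the iteration count are arbitrary, and the Clang invocation may need extra --sysroot/--gcc-toolchain options depending on how your cross environment is set up.

```bash
#!/usr/bin/env bash
# Differential-fuzzing loop (sketch): generate a csmith program, build it with two
# RISC-V compilers, run both under QEMU, and keep the source if the outputs differ.
set -u
CSMITH_INC=/usr/include/csmith   # assumption: csmith runtime headers live here

for i in $(seq 1 1000); do
    csmith > test.c

    riscv64-linux-gnu-gcc -O2 -I"$CSMITH_INC" test.c -o test_gcc || continue
    clang --target=riscv64-linux-gnu -O2 -I"$CSMITH_INC" test.c -o test_clang || continue

    out_gcc=$(timeout 10 qemu-riscv64 -L /usr/riscv64-linux-gnu ./test_gcc)
    out_clang=$(timeout 10 qemu-riscv64 -L /usr/riscv64-linux-gnu ./test_clang)

    if [ "$out_gcc" != "$out_clang" ]; then
        echo "Mismatch on iteration $i - saving reproducer"
        cp test.c "mismatch_$i.c"
    fi
done
```

A real CI harness would add crash/hang detection and automatic test-case reduction (e.g., with creduce) before filing a report.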
- From Classroom to Code: Our Transformative Journey as Interns at WhileOne
The Leap into the Unknown
Stepping out of the academic bubble and into the professional world is often painted as a daunting transition. For us, it was less a leap of faith and more an excited dive into the deep end, specifically into the innovative waters of WhileOne. Our motivation to join was simple yet profound: we sought a place where curiosity was celebrated, challenges were seen as growth opportunities, and real-world impact was a daily pursuit. Little did we know that this internship would not just introduce us to company life but fundamentally reshape our technical acumen and career outlook.

Interns Interning

Unpacking the Technical Toolkit: What We've Learned

Tanaya Ajgar: Diagnostic Configuration Dashboard for BeagleBone Black
I developed a diagnostic configuration page for the BeagleBone Black board, which gave me valuable hands-on experience in full-stack development. I worked on building a responsive React + HTML frontend and integrated it with a Python Flask backend to enable seamless communication with the board. Through this, I learned how to design and implement interactive dashboards that allow users to configure system parameters such as IP addresses and ensure that the updates persist at the system level. I also gained practical knowledge of storing configuration results in both SQLite databases and JSON files for reliability and easy retrieval. This project helped me strengthen my understanding of REST APIs, data flow between frontend and backend, and the importance of efficient database integration in embedded system applications. I also improved my debugging skills while resolving real-time hardware-software interaction issues. Additionally, I learned how to apply UI/UX practices to make technical dashboards more intuitive and user-friendly. Overall, the project enhanced my skills in embedded system integration, web technologies, and problem-solving in a real-world scenario.

Soham Gargote:
During my internship, I had the valuable opportunity to contribute to two diverse and impactful projects. I delved into low-level systems programming by extending the open-source GDB debugger to enable support for RISC-V vector instructions. In parallel, I was instrumental in creating a new internal tool for benchmark management, where I developed the backend for its UI/UX visualization capabilities. This dual exposure to both open-source contributions and internal tool development made for an incredibly fun and enriching learning experience, significantly strengthening my software engineering skills.

Saee Gade: RISC-V Toolchain Validation & Compiler Fuzzing
As an intern, my work on RISC-V toolchain validation taught me the immense value of compiler fuzzing and its role in software reliability. I gained hands-on experience using tools like Csmith to automatically generate complex test cases and uncover hidden bugs. Beyond just bug hunting, the project's most significant takeaway for me was the process of creating high-quality, actionable bug reports. I learned the critical skill of creating minimal, reproducible test cases and effectively communicating findings to developers on platforms like Bugzilla. Contributing to major open-source projects like GCC and LLVM showed me the real-world dynamics of collaborative development and the tangible impact my work could have on improving the stability of a key toolchain for an entire ecosystem.
Ruchi Joshi: Technical Takeaways from the Benchmarking Project
Through hands-on experience with benchmarking, I learned to evaluate system performance using industry-standard HPC benchmarks like MiniFE and HPCG. I gained practical skills in writing automated shell scripts to test CPU and GPU performance across different architectures. This project also provided me with the opportunity to work with low-level hardware performance counters, learning to collect data on retired instructions, cache misses, and branch mispredictions. I now understand how to translate this raw data into higher-level microarchitecture insights using Intel's Top-Down Microarchitecture Analysis Methodology (TMAM) to identify critical bottlenecks. Furthermore, I explored Arm's Performance Monitoring Unit (PMU), which gave me insight into the distinct tooling and counter availability between the Intel and Arm ecosystems. This holistic experience has provided me with a comprehensive understanding of performance analysis, from high-level application benchmarks down to low-level hardware counters.

The WhileOne Way: A Glimpse into Company Life
Our general experience as interns has been overwhelmingly positive. The atmosphere at WhileOne is one of collaborative energy, where questions are encouraged and mentorship is readily available. It's a far cry from the sometimes solitary nature of academic projects. Joining WhileOne feels like becoming part of a forward-thinking family. There's a palpable sense of innovation and a shared drive to create impactful solutions.

The differences between college and company life are stark but refreshing. In college, deadlines can feel somewhat arbitrary, and projects often exist in a vacuum. Here, every task has a purpose, directly contributing to a product or service. The pace is faster and the stakes are higher, but the support system is robust. Learning is continuous, driven by real-world problems rather than theoretical exercises.

Navigating Opportunities and Challenges
Our internship presented a wealth of opportunities:
- Direct contribution to live projects: Seeing our code go into production was incredibly motivating.
- Mentorship from experienced engineers: Their guidance has been instrumental in our growth.
- Exposure to diverse technologies and methodologies: Expanding our technical horizons significantly.

Challenges were equally present and equally valuable:
- Steep learning curve: Rapidly adapting to new tools and complex systems.
- Problem-solving under pressure: Learning to debug efficiently and think critically when faced with unexpected issues.
- Balancing multiple tasks: Juggling different responsibilities and prioritizing effectively.

A New Beginning
As we reflect on our journeys from curious students to contributing members of the WhileOne team, we are filled with gratitude and excitement. This internship has been more than just a stepping stone; it's been a foundational experience that has shaped our technical abilities, professional outlook, and career aspirations. The transition from classroom concepts to production code has been challenging yet incredibly rewarding. If you're considering an internship, especially one where real impact is made, we wholeheartedly recommend diving in. The future, for us, is bright and brimming with code, collaboration, and continuous learning, all thanks to our transformative time at WhileOne.
- Neoverse-V2 Support for Intel PerfSpect
We recently worked on extending Intel PerfSpect ( https://github.com/Whileone-Techsoft/PerfSpect/tree/Neoverse-native-support ), a robust command-line performance analysis tool that implements the Top-Down Microarchitecture Analysis Method (TMAM), so that it fully supports the Arm Neoverse-V2 architecture. This project required mapping the Performance Monitoring Unit (PMU) events on the Arm cores to the metrics of the TMAM methodology. We can now get the Level 1 breakdown (Frontend Bound, Backend Bound, Retiring, Bad Speculation) to pinpoint bottlenecks on these systems, which was previously not possible with this tool. Through debugging the code, it also became possible to generate continuous time-series graphs to see how the bottleneck evolves over the course of a run. This extension of PerfSpect for Arm (our code allows native compilation on Arm) additionally enables the CPU utilization heat map in the generated telemetry reports, which shows the distribution of work across all cores over time.

The challenge was mapping the Arm events to the TMAM formulae, and then validating the values captured by the modified PerfSpect tool against values calculated manually from those formulae. The key learnings from this project were quickly adapting to a new programming language, Go (Golang), and making significant changes in the code to get the appropriate results. The project also deepened our knowledge of the TMAM methodology and of the specific challenges of cross-architecture analysis, particularly translating PMU events from the Intel ecosystem to Arm's Neoverse-V2 core.
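To make the event-to-metric mapping concrete, here is a small, hedged sketch of how a top-down Level 1 breakdown can be computed from Arm PMU events with standard Linux perf. The event names are the generic Arm PMU names exposed by perf on Neoverse cores (availability depends on kernel and PMU support), the 8 issue slots per cycle is an assumption for Neoverse V2, and the formulas follow the simplified form of Arm's published top-down guidance; this is an illustration, not PerfSpect's exact implementation.

```bash
#!/usr/bin/env bash
# Sketch: collect Neoverse PMU events with perf and compute a simplified
# top-down Level 1 breakdown. SLOTS=8 is assumed for Neoverse V2.

CMD=${1:-sleep 10}   # workload to profile (placeholder default)

perf stat -x, -o pmu.csv -e \
cpu_cycles,op_spec,op_retired,stall_slot,stall_slot_frontend,stall_slot_backend \
-- $CMD

awk -F, '
  { v[$3] = $1 }                       # perf -x, rows: value,unit,event,...
  END {
    slots   = 8 * v["cpu_cycles"]
    used    = 1 - v["stall_slot"] / slots
    retire  = 100 * (v["op_retired"] / v["op_spec"]) * used
    badspec = 100 * (1 - v["op_retired"] / v["op_spec"]) * used
    fe      = 100 * v["stall_slot_frontend"] / slots
    be      = 100 * v["stall_slot_backend"]  / slots
    printf "Retiring %.1f%%  Bad Speculation %.1f%%  Frontend %.1f%%  Backend %.1f%%\n",
           retire, badspec, fe, be
  }' pmu.csv
```

PerfSpect performs this mapping internally and additionally samples the counters over time, which is what drives the time-series graphs and heat maps mentioned above.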
- Debugging the Debugger: A Deep Dive into GDB and RISC-V
In the world of software development, the GNU Debugger (GDB) is an essential tool for programmers. It allows us to peer inside a running program, find bugs, and understand complex code. As new hardware architectures emerge, it's crucial that our tools keep pace. One such rising star is RISC-V, an open-source instruction set architecture that is rapidly gaining popularity, particularly with its new vector extensions for high-performance computing.

The Challenge: An Unknown Instruction
Recently, our team took on a task to get a few bug fixes into GDB. The challenge: GDB was unable to recognize or debug vector instructions for the RISC-V architecture. This was a significant gap, hindering developers who were working with advanced RISC-V features. Without this support, debugging modern, high-performance RISC-V applications was a major challenge. My task was to dive into the GDB source code and enable this missing capability.

Navigating a Sea of Code
The first and most significant hurdle was the sheer scale of the GDB codebase. As a newcomer to such a vast and mature open-source project, understanding the intricate flow of control and finding the right place to intervene was a daunting task. The initial phase involved a lot of learning and exploration, and I'm grateful for the guidance of my colleagues who helped me navigate the complexities and build a mental map of the system.

Through a careful process of debugging the debugger itself, we were able to trace the execution path for instruction processing. The breakthrough came when we identified the root cause of the issue: a missing function call responsible for reading and interpreting the new vector instructions. The logic was there, but it was never being invoked for this specific case.

With the problem identified, we implemented an initial solution. Our contribution was a hardcoded fix that proved the concept and successfully enabled GDB to recognize the vector instructions. This initial patch paved the way for a more robust and integrated solution that was later refined by other contributors in the open-source community.

The result is a direct enhancement to a critical developer tool. Programmers working with RISC-V can now debug vector-based code more effectively, accelerating development and improving software quality within the ecosystem.
https://www.sourceware.org/pipermail/gdb-patches/2025-May/217880.html
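For readers who have not debugged RVV code before, the sketch below shows roughly what this capability enables, using a cross GDB against a binary running under QEMU's user-mode emulator with a GDB stub. The toolchain triple, port, and file name are placeholders, and inspecting vector state depends on the target exposing it; treat this as an illustration of the workflow rather than the exact setup used for the patch.

```bash
# Build a vector-enabled test program (placeholder source and toolchain names).
riscv64-linux-gnu-gcc -g -O0 -march=rv64gcv vec_test.c -o vec_test

# Start it paused under QEMU user-mode emulation with a GDB stub on port 1234,
# enabling the V extension on the emulated CPU.
qemu-riscv64 -cpu rv64,v=true -g 1234 -L /usr/riscv64-linux-gnu ./vec_test &

# Attach a RISC-V-aware GDB and inspect vector state once it is supported.
gdb-multiarch ./vec_test \
    -ex 'target remote :1234' \
    -ex 'break main' -ex 'continue' \
    -ex 'info registers vl vtype' \
    -ex 'info registers v0 v1'
```

Before the fix described above, stepping through such code could leave GDB confused about the vector instructions it encountered; with the support in place, vector registers and CSRs can be examined like any other machine state.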
- Top CPU Performance Benchmarking Toolkits You Should Know
Modern compute platforms - from cloud hyperscale CPUs to edge processors - deliver unprecedented parallelism and instruction-set capabilities. But to truly understand performance, you need the right benchmarking tools. Whether you're comparing cloud instances, evaluating Arm-based servers like Ampere, or validating x86, RISC-V, or AI-accelerated hardware, the ecosystem offers several battle-tested frameworks. In this blog, we explore the most widely used CPU benchmarking toolkits today - what they do, where they shine, and when to use each.

1. Ampere Performance Toolkit (APT)
Ampere's servers built on the Arm architecture are optimized for cloud-native performance and power efficiency. The Ampere Performance Toolkit provides a set of scripts, automation, and recommended benchmarks to evaluate real-world workloads.
Best for:
✔ Evaluating Arm server performance
✔ Cloud benchmarking on Ampere instances
✔ Developers migrating workloads from x86 to Arm

2. PerfKit Benchmarker (Google)
Originally built by Google, PerfKit Benchmarker (PKB) is the gold standard for cloud performance benchmarking across providers.
Best for:
✔ Comparing cloud VM types
✔ Reproducible benchmark automation
✔ Cloud procurement and architectural evaluations
Fun fact: PKB has become the foundation for multiple forks and extensions across companies and academia for transparent benchmarking.

3. Phoronix Test Suite (PTS)
The Phoronix Test Suite is one of the largest open-source benchmarking ecosystems - great for developers and hardware reviewers.
Best for:
✔ Broad CPU and system benchmarking
✔ Linux performance testing
✔ Reviewers, researchers, and enthusiasts

4. SPEC CPU Suite
The Standard Performance Evaluation Corporation (SPEC) CPU suites are industry-trusted benchmarks for vendors and OEMs.
Best for:
✔ Enterprise-grade server benchmarking
✔ Official vendor comparisons
✔ Performance engineering and compiler tuning
Note: Requires a paid license.

5. Microbenchmark Suites (Core Latency, Memory, IPC)
Sometimes, detailed architectural behavior matters more than high-level scores. Popular tools include sysbench, lmbench, and perf.
Best for:
✔ Low-level CPU behavior
✔ Memory latency & bandwidth analysis
✔ Performance debugging

ML & AI-Centric Benchmarks (Emerging)
Even CPU evaluations increasingly involve AI workloads, with suites such as MLPerf, HPL, and HPCG.
Best for:
✔ AI inference on CPUs
✔ Edge compute & acceleration evaluations

Bonus: Build-Your-Own Benchmark Harness
Cloud providers and silicon vendors often implement custom harnesses around:
- Docker-ized workloads
- Kubernetes load-generation frameworks
- Real-app benchmarking (Redis, NGINX, PostgreSQL, Spark)
For engineering teams, custom workload pipelines often reveal more than synthetic scores.

Summary Table

Toolkit                      Scope                              Best Use Case
Ampere Performance Toolkit   Server-class Arm systems           Cloud-native Arm benchmarking
PerfKit Benchmarker          Multi-cloud benchmarking           Cloud instance comparisons
Phoronix Test Suite          Broad system benchmark suite       Linux and multi-OS testing
SPEC CPU                     Industry-standard CPU benchmarks   Formal server performance publication
sysbench / lmbench / perf    Microbenchmarks & counters         CPU profiling & tuning
MLPerf / HPL / HPCG          AI & HPC performance               Compute-heavy + scientific workloads
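As a taste of how lightweight these toolkits are to drive, the commands below show typical invocations of the Phoronix Test Suite and PerfKit Benchmarker. The test profile, cloud, machine type, and benchmark names are illustrative; check each project's documentation for what is available on your platform.

```bash
# Phoronix Test Suite: run a CPU-heavy test profile in unattended batch mode.
phoronix-test-suite batch-benchmark pts/compress-7zip

# PerfKit Benchmarker: provision VMs on a cloud provider, run a benchmark, tear down.
git clone https://github.com/GoogleCloudPlatform/PerfKitBenchmarker.git
cd PerfKitBenchmarker && pip install -r requirements.txt
./pkb.py --cloud=GCP --machine_type=c4a-standard-8 --benchmarks=coremark
```

Both tools record results in machine-readable form, which makes it straightforward to compare runs across instance types or over time.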
- Network Latency Study in OCI Cloud
Network testing tools such as netperf can perform latency tests plus throughput tests and more. In netperf, the TCP_RR and UDP_RR (RR = request/response) tests report round-trip latency. With the -o flag, the output metrics can be customized to display exactly the information you want; an example using this test-specific flag to output several latency statistics is shown below.

Google has lots of practical experience in latency benchmarking, and following its blog post "using-netperf-and-ping-to-measure-network-latency", we set up our own latency benchmarks before and after migrating workloads to the OCI cloud.

Which tools and why
All the tools in this area do roughly the same thing: measure the round-trip time (RTT) of transactions. Ping does this using ICMP packets.

ping -c 100 <remote-host>

This ping command sends one ICMP packet per second to the specified IP address until it has sent 100 packets.

netperf -H <remote-host> -t TCP_RR -- -o min_latency,max_latency,mean_latency

Here -H specifies the remote host and -t the test name, with the test-specific option -o selecting the output metrics.

Google's tool of choice for latency tests in a cloud environment is PerfKit Benchmarker (PKB). This open-source tool allows you to run benchmarks on various cloud providers while automatically setting up and tearing down the virtual infrastructure required for those benchmarks. After setting up PerfKit Benchmarker, it's simple to run the ping and netperf benchmarks:

./pkb.py --benchmarks=ping --cloud=OCI --zone=us-ashburn-1
./pkb.py --benchmarks=netperf --cloud=OCI --zone=us-ashburn-1 --netperf_benchmarks=TCP_RR

These commands run intra-zone latency benchmarks between two machines in a single zone in a single region. Intra-zone benchmarks like this are useful for showing very low latencies, in microseconds, between machines that work together closely.

Latency discrepancies
We set up two VM.Standard.E4.Flex machines running Ubuntu 22.04 in zone us-ashburn-1, using private IP addresses to get the best results. If we run a ping test with default settings and set the packet count to 100, we get the following results:

ping -c 100 <remote-host>

PING 172.16.60.168 (172.16.60.168) 56(84) bytes of data.
64 bytes from 172.16.60.168: icmp_seq=1 ttl=64 time=0.202 ms
64 bytes from 172.16.60.168: icmp_seq=2 ttl=64 time=0.205 ms
…
64 bytes from 172.16.60.168: icmp_seq=99 ttl=64 time=0.329 ms
64 bytes from 172.16.60.168: icmp_seq=100 ttl=64 time=0.365 ms
--- 172.16.92.253 ping statistics ---
100 packets transmitted, 100 received, 0% packet loss, time 101353ms
rtt min/avg/max/mdev = 0.371/0.450/0.691/0.040 ms

By default, ping sends out one request each second. After 100 packets, the summary reports an average latency of 0.450 milliseconds, or 450 microseconds. For comparison, let's run netperf TCP_RR with default settings for the same number of packets.

netperf-2.7.0/src/netperf -p {command_port} -j -v2 -t TCP_RR -H 132.145.132.29 -l 60 -- -P ,{data_port} -o THROUGHPUT,THROUGHPUT_UNITS,P50_LATENCY,P90_LATENCY,P99_LATENCY,STDDEV_LATENCY,MIN_LATENCY,MEAN_LATENCY,MAX_LATENCY --num_streams=1 --port_start=20000 --timeout 360

Netperf results:

{'Throughput': '4245.34', 'Throughput Units': 'Trans/s', '50th Percentile Latency Microseconds': '228', '90th Percentile Latency Microseconds': '239', '99th Percentile Latency Microseconds': '372', 'Stddev Latency Microseconds': '92.06', 'Minimum Latency Microseconds': '215', 'Mean Latency Microseconds': '235.08', 'Maximum Latency Microseconds': '21059'}

Which test can we trust?
This is largely an artefact of the different intervals the two tools use by default. Ping uses an interval of one transaction per second, while netperf issues the next transaction immediately when the previous transaction completes. Fortunately, both of these tools allow you to set the interval time between transactions manually.

For ping, use the -i flag to set the interval, given in seconds or fractions of a second. On Linux systems, this has a granularity of 1 ms and rounds down.

$ ping -c 100 -i 0.010 <remote-host>

For netperf TCP_RR, a few build and runtime options are involved:
- the --enable-spin configure flag, to compile with support for fine-grained intervals;
- the -w flag, to set the interval time; and
- the -b flag, to set the number of transactions sent per interval.
This approach allows intervals to be set with much finer granularity, by spinning in a tight loop until the next interval instead of waiting for a timer; this keeps the CPU fully awake. Of course, this precision comes at the cost of much higher CPU utilization, since the CPU is spinning while waiting.

Note: Alternatively, less fine-grained intervals can be used by compiling with the --enable-intervals flag. Use of the -w and -b options requires building netperf with either the --enable-intervals or --enable-spin flag set. The tests here were performed with --enable-spin.

Run netperf with an interval of 10 milliseconds using:

$ netperf -H <remote-host> -t TCP_RR -w 10ms -b 1 -- -o min_latency,max_latency,mean_latency

Now, after aligning the interval time for both ping and netperf to 10 milliseconds, the effects are apparent.

Ping result:

--- 172.16.92.253 ping statistics ---
1000 packets transmitted, 1000 received, 0% packet loss, time 15981ms
rtt min/avg/max/mdev = 0.252/0.306/0.577/0.025 ms

Netperf result:

Minimum Latency Microseconds, Mean Latency Microseconds, Maximum Latency Microseconds
215, 235.08, 21059

We have integrated OCI as a provider in PerfKit Benchmarker, which we are using to carry out this testing. Here are the results of the inter-region ping benchmark for A1.Flex2, E4.Flex.1, and S1.Flex VMs. We also tested netperf intra-region, using the us-ashburn-1 region.

Generally, netperf is recommended over ping for latency tests. This isn't due to any lower reported latency at default settings, though. As a whole, netperf allows greater flexibility with its options, and we prefer using TCP over ICMP: TCP is the more common use case and thus tends to be more representative of real-world applications. That being said, the difference between similarly configured runs with these tools is much smaller across longer path lengths. Also, remember that the interval time and other tool settings should be recorded and reported when performing latency tests, especially at lower latencies, because these intervals make a material difference.
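To make the interval-matching methodology concrete, here is a small wrapper that runs both tools back to back against the same peer with a 10 ms interval and prints the comparable latency summaries. It assumes a netperf binary built with --enable-spin, a netserver already running on the peer, and a placeholder peer address.

```bash
#!/usr/bin/env bash
# Run ping and netperf TCP_RR with matched 10 ms intervals and print the latency
# summaries side by side. PEER is a placeholder; netperf must be built with
# --enable-spin and a netserver must be listening on the peer.
PEER=${1:?usage: $0 <peer-ip>}

echo "== ping (100 probes, 10 ms interval) =="
ping -c 100 -i 0.010 "$PEER" | tail -n 2

echo "== netperf TCP_RR (10 ms interval) =="
netperf -H "$PEER" -t TCP_RR -w 10ms -b 1 -- \
    -o min_latency,mean_latency,max_latency
```

Recording the exact interval and flags alongside the results, as recommended above, makes runs reproducible and comparable later.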
- Investigating Performance Discrepancy in HPL Test on ARM64 Machines
Introduction
High-Performance Linpack (HPL) is a widely used benchmark for testing the computational performance of computing systems. In this blog post, we explore an intriguing scenario in which we conducted HPL tests on two ARM64 machines. Surprisingly, the Host-2 machine exhibited 20% lower performance than the Host-1 machine in the HPL test. Intrigued by this result, we embarked on a journey to comprehensively diagnose the underlying cause of this performance discrepancy.

Why HPL for Performance Testing?
People use the High-Performance Linpack (HPL) benchmark for performance testing because it provides a standardised and demanding workload that measures the peak processing power of computer systems, particularly in terms of floating-point calculations. It helps assess and compare the computational capabilities of different hardware configurations. This benchmark helps in comparing and ranking supercomputers' performance and is used as the metric for the TOP500 list of the world's most powerful supercomputers. For more information, you can refer to the TOP500 article here: TOP500 List

Objective
The primary objective of this investigation was to identify the reason behind the 20% performance difference observed in the HPL test between the Host-1 and Host-2 machines. To comprehensively diagnose the performance discrepancy, we conducted additional benchmark tests, including Stream, Lmbench, and bandwidth tests.

1. System Details
We conducted a fair and controlled experiment using two ARM64 machines, referred to as Host-1 and Host-2.

1.1 Machine Specifications (Host-1 and Host-2):
- CPU(s): 96
- Architecture: aarch64
- Total memory: 96 GB
- Memory speed: 3200 MHz

2. Running the HPL Benchmark
To run the HPL benchmark on an ARM64 machine, you can refer to the GitHub repository https://github.com/AmpereComputing/HPL-on-Ampere-Altra. This repository contains instructions, scripts, and configurations specific to running HPL on Ampere Altra ARM64-based machines. It's important to follow the guidelines provided in the repository to ensure accurate and meaningful benchmarking results.

2.1 HPL Scores
Upon completing the HPL benchmark on both machines, we computed and compared the achieved HPL scores. The Host-1 machine garnered a higher HPL score, signifying better computational performance.

Machine    Time (sec)    Score
Host-1     619.91        1245
Host-2     784           985

This result raised a critical question: why was there such a substantial performance gap? To delve into the root causes behind this discrepancy, we decided to conduct a series of additional tests to comprehensively investigate the issue.

3. Exploring Additional Tests
We conducted several other benchmark tests to comprehensively investigate the performance discrepancy between the Host-1 and Host-2 ARM64 machines. These tests aimed to shed light on various aspects of the systems' hardware and memory subsystems, providing a holistic understanding of the observed difference. Below, we detail the tests and their findings.

3.1 Stream Benchmark
The Stream benchmark assesses memory bandwidth and measures the system's capability to read from and write to memory. The benchmark consists of four fundamental tests:
- Copy: Measures the speed of copying one array to another.
- Scale: Evaluates the performance of multiplying an array by a constant.
- Add: Tests the speed of adding two arrays together.
- Triad: Measures the performance of a combination of operations involving three arrays.
The Stream benchmark helps uncover memory bandwidth limitations and assess memory subsystem efficiency (example commands for running STREAM and the Lmbench latency test appear at the end of this post).

Host-1 machine results:

Function    Best Rate MB/s    Avg time     Min time     Max time
Copy        103837.8          0.367897     0.36192      0.373494
Scale       102739.4          0.369191     0.365789     0.372439
Add         106782.7          0.536131     0.527908     0.542759
Triad       106559.1          0.533549     0.529016     0.537881

Host-2 machine results:

Function    Best Rate MB/s    Avg time     Min time     Max time
Copy        66071.3           0.572721     0.568794     0.575953
Scale       65708.8           0.575758     0.571932     0.580686
Add         67215.5           0.843995     0.838667     0.848371
Triad       67668.1           0.837109     0.833058     0.84079

(Chart: Best Rate MB/s vs Function)

In the Stream benchmark results, Host-1 outperformed Host-2 across all functions (Copy, Scale, Add, Triad). Host-1 demonstrated higher memory bandwidth in each function, achieving significantly faster data transfer rates. This suggests stronger memory subsystem performance in Host-1 compared to Host-2.

3.2 Lmbench for Memory Latency
Lmbench is a suite of micro-benchmarks designed to provide insights into various aspects of system performance. The suite includes latency tests for system calls, memory accesses, and various operations to quantify the system's responsiveness. Memory access tests include random read/write latency and bandwidth, helping to identify memory subsystem performance. File I/O tests evaluate file system performance, providing insights into storage subsystem capabilities.

Result: Memory Latency
Memory latency refers to the time it takes for the CPU to access a specific memory location. Lower latency values indicate better performance, as data can be fetched more quickly.

size (MB)    latency (ns) - Host-1    latency (ns) - Host-2
0.00049      1.43                     1.429
...          ...                      ...
2            32.355                   32.786
3            34.503                   36.012
4            37.403                   37.932
6            39.982                   52.922
8            41.007                   54.001
12           44.315                   55.466
16           65.52                    73.016
24           95.131                   117.278
32           115.081                  138.945
48           126.796                  151.945
64           129.558                  159.225
96           134.413                  166.359
128          136.239                  167.788
192          136.245                  168.689
256          136.366                  170.464
384          137.732                  170.461
...          ...                      ...
2048         135.61                   149.809

4. Analysis and Findings
After conducting these benchmark tests, we observed that the Host-2 machine consistently exhibited lower performance across different tests compared to the Host-1 machine. The most significant finding came from the Lmbench test, which revealed that the Host-2 machine's RAM had notably higher latency compared to the Host-1 machine. Notably, an additional factor was identified: the RAM rank. The Host-1 machine is equipped with Dual-Rank RAM, while the Host-2 machine has Single-Rank RAM. This RAM rank difference could contribute to the performance discrepancy.

The observation is in line with findings from various other studies that have examined the influence of RAM rank on system performance. To gain a more comprehensive understanding of this subject, the following articles could be of interest:
- Single vs. Dual-Rank RAM: Which Memory Type Will Boost Performance? - This article provides a thorough comparison between single- and dual-rank RAM, aiding in comprehending the disparities between these two RAM types, methods to distinguish them, and guidance on selecting the most suitable option for your needs. (LINK)
- Single Rank vs Dual Rank RAM: Differences & Performance Impact - This article delves into the differences between Single Rank and Dual Rank RAM modules, investigating their structural dissimilarities and assessing the respective impacts on performance. (LINK)
5. Conclusion
After conducting an extensive series of benchmark tests, we have pinpointed certain factors that contribute to the performance disparity observed in the HPL test between the two ARM64 machines. In the Stream benchmark results, Host-1 outperformed Host-2 across all functions (Copy, Scale, Add, Triad), demonstrating higher memory bandwidth and significantly faster data transfer rates in each function. Additionally, the higher memory latency of the Host-2 machine's RAM was identified as a key contributor to the performance gap; this latency impacted the efficiency of memory operations and had a cascading effect on overall performance. Another significant factor was the difference in RAM rank configurations: Host-1 had Dual-Rank RAM, while Host-2 had Single-Rank RAM. This divergence likely contributed to the varying memory access speeds between the two machines.

6. Future Scope
In the context of further exploration, it is recommended to extend the investigation with additional benchmark tests, specifically the Lmbench memory bandwidth test. This would provide deeper insights into the memory subsystem's performance on both the Host-1 and Host-2 machines. Additionally, an interesting avenue for investigation could involve modifying the RAM configuration in one of the machines and assessing its impact on performance. This would provide valuable information about the role of memory specifications in influencing overall system performance.
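As referenced in the Stream section, here is one common way to build and run STREAM and lmbench's memory-latency test on an ARM64 host, for anyone who wants to reproduce the memory-subsystem comparison. The array size, thread count, and lat_mem_rd range/stride are illustrative and should be sized to comfortably exceed the last-level cache on your system; the STREAM source URL points at the canonical upstream copy.

```bash
# STREAM: build with OpenMP and an array large enough to defeat the caches,
# then run with one thread per core (96 cores on these hosts).
wget https://www.cs.virginia.edu/stream/FTP/Code/stream.c
gcc -O3 -fopenmp -DSTREAM_ARRAY_SIZE=200000000 -DNTIMES=20 stream.c -o stream
OMP_NUM_THREADS=96 ./stream

# lmbench memory latency: pointer chasing over arrays up to 2048 MB with a
# 128-byte stride, matching the range of the table above.
lat_mem_rd 2048 128
```

Running both on each host with identical parameters is what makes the per-function bandwidth and per-size latency numbers directly comparable.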
- Root causing a memory corruption on Arm64 VMs
We recently migrated one of our websites to Azure Arm64 VMs. However, as soon as we pushed the infrastructure change to production, we started to observe our server process being restarted intermittently. These restarts would sometimes happen within a few seconds, while at other times not occurring for hours. While the redundancy in our setup ensured minimal end-user impact, we wanted to quickly address the issue at hand.

Looking at the logs
A quick look at the logs showed the following error before process restarts:

malloc(): corrupted top size
Aborted (core dumped)

This is a Node.js based Next.js website with nothing memory intensive being performed, so we were surprised to see a memory-related issue. A quick look at top also suggested we had adequate memory available for our running processes. So, this definitely looked like a memory corruption. Our next challenge was to identify what caused it. On analyzing the logs further, it did not appear that a single website URL was causing the issue.

Reproducing the issue
With this information at hand, we went back to our test environment (which was also running on an Azure Arm64 VM) and set up more detailed logging. We then visited a large number of our website URLs to see if we could reproduce the restart. Eventually, we did find a couple of URLs where the Node.js process would exit with the corrupted-memory error message.

Identifying the root cause
Once we could reproduce the issue, we narrowed it down to the images loading on these pages. Our images were being served by the Next.js next/image library, which internally leverages the sharp package to optimize the images being served. So, it appeared that for some images (not all), the sharp image optimization logic was resulting in memory corruption, causing our Node.js process to exit. Looking at the current and past issues for lovell/sharp on GitHub took us to this issue, which summarized our experience.

Issue details & fix
On probing further, we understood that the libspng library used by lovell/sharp had a memory corruption issue when decoding a paletted PNG on Arm64. libspng addressed this issue in v0.7.2, which was picked up by lovell/sharp in v0.31.0. By pinning our sharp dependency in package.json to v0.31.0, we were able to force next/image to pick up this version of the sharp library (instead of the older one) for image optimization. With this change, the specific images that were previously causing the Node.js process to exit were now being optimized as expected. Once the change went into production, we watched our production Node.js processes for any restarts. With no restarts observed for a couple of days, we were able to mark the issue as resolved.
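As an illustration of the fix, the commands below show one way to check which sharp version a project currently resolves and pin it to a release that bundles the patched libspng. The build and start commands assume the standard Next.js scripts; the version to pin should follow the upstream advisory rather than these placeholder commands.

```bash
# Inspect which sharp version the dependency tree currently resolves to.
npm ls sharp

# Pin sharp to a release containing the fixed libspng (>= 0.31.0) and record the
# exact version in package.json.
npm install --save-exact sharp@0.31.0

# Rebuild and verify the previously failing pages no longer crash the process.
npm run build && npm run start
```

Pinning with --save-exact avoids a semver range silently pulling an older or newer build than the one that was validated.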
- Mastering the 5 Essential Performance Engineering Skills for Software Engineers: A Professional Guide
Performance engineering is a vital area in software development that ensures applications function efficiently and effectively. As modern software systems grow more complex, the need for engineers who understand performance becomes increasingly important. This guide covers five essential performance engineering skills every software engineer should develop to thrive in their careers.

Grasping Performance Requirements
To start, software engineers must excel at understanding performance requirements. This means knowing how the system behaves under different loads and the specific performance targets the application must meet. Involved discussions with stakeholders are crucial for defining clear performance metrics early in the development process. Important key performance indicators (KPIs) include:
- Response time: The time taken for a system to respond to a user request. According to a report, 47% of consumers expect a page to load in two seconds or less.
- Throughput: The amount of work completed in a given timeframe, often measured in transactions per second (TPS).
- Resource utilization: How effectively system resources such as CPU, memory, and bandwidth are being used.
By setting these performance requirements early on, engineers can make better design decisions, leading to more efficient applications right from the beginning.

Expertise in Performance Testing Tools
A strong command of performance testing tools is essential. Knowledge of both open-source and proprietary tools enables engineers to simulate user traffic, evaluate system performance, and pinpoint potential problems. Some popular performance testing tools include Apache JMeter, LoadRunner, and Gatling. These tools help engineers create test scenarios that reflect real-world load conditions. For instance, a team using JMeter might simulate 10,000 concurrent users on their e-commerce site to ensure it can handle peak shopping times, like Black Friday (a minimal example command appears at the end of this post). Effectively using performance testing tools helps reveal issues and provides actionable insights for optimization. In fact, organizations that conduct regular performance testing see a 30% improvement in application speed and responsiveness.

Capacity Planning and Scalability
A third essential skill is capacity planning and scalability. Software engineers must be able to forecast the resources needed to accommodate user growth without compromising performance. This involves analyzing historical usage data and anticipating future demands. For example, if a SaaS application reports a 20% monthly increase in active users, engineers must plan to scale infrastructure accordingly. This scaling can happen in two ways:
- Vertical scaling: Adding more resources (like CPU or memory) to a single server.
- Horizontal scaling: Adding more servers to distribute the load as user demand increases.
Team members should consistently monitor performance against these plans to refine forecasts and implement necessary adjustments. Mastering this skill enables teams to prevent performance issues and support seamless scaling as user needs change.

Appreciating System Architecture
A solid understanding of system architecture is crucial for performance engineering. Engineers need to be familiar with various architectural patterns such as microservices, serverless, and monolithic designs. Each architecture has its implications for performance. For example, a microservices architecture can enhance scalability but may lead to communication delays between services.
In contrast, a monolithic architecture is easier to manage but might struggle under high loads due to its rigid structure. Understanding how different architectures influence performance helps engineers make informed design choices. For instance, a recent study showed that companies implementing microservices correctly reduced deployment times by 75%.

Ongoing Performance Monitoring
Lastly, ongoing performance monitoring is a critical skill that software engineers should cultivate. After an application is live, continuous monitoring allows teams to spot performance issues that may arise in real-world settings. Using tools like New Relic, Dynatrace, or Grafana helps engineers monitor application performance consistently. For instance, real-time monitoring can quickly alert teams when server response times exceed predefined limits, preventing user dissatisfaction. By integrating ongoing monitoring into their workflow, engineers foster a culture of performance awareness. Companies that prioritize performance monitoring often see conversion rates improve by up to 20% due to enhanced user experiences.

Time to Enhance Performance Engineering Skills
Mastering performance engineering skills is a necessity for software engineers, not just an option. With the increasing complexity of software systems, it is essential for engineers to possess the knowledge and tools required to ensure that applications meet crucial performance metrics. Focusing on understanding performance requirements, mastering performance testing tools, capacity planning and scalability, system architecture knowledge, and continuous performance monitoring can significantly boost engineers' effectiveness in this important field. As the demand for high-performance applications continues to rise, developing these skills will enhance individual careers while contributing to the success of software projects. Now is the time for aspiring engineers to invest in their own development and polish these performance engineering skills. Success is found in mastering these elements and effectively applying them to real-world challenges.
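As referenced in the performance-testing section, here is a minimal sketch of driving a JMeter load test from the command line in non-GUI mode. The test plan file name, the threads property, and the report directory are placeholders; the plan itself must be written to read the property and define the scenario.

```bash
# Run an existing JMeter test plan headless: -n = non-GUI mode, -t = test plan
# (hypothetical file), -Jthreads passes the desired concurrency as a property the
# plan's Thread Group reads, -l writes raw results, -e -o builds an HTML report.
jmeter -n -t ecommerce_load_test.jmx -Jthreads=10000 -l results.jtl -e -o report/
```

Keeping load parameters as properties rather than hard-coding them in the plan makes it easy to reuse the same scenario for smoke tests and full-scale peak simulations.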
- Uncovering the Best: 5 Top Tools for Cutting-Edge Chip Benchmarking
In the fast-paced world of technology, chip benchmarking is vital. It helps engineers and developers measure the performance of semiconductor devices to keep up with advancements. This post dives into the top five tools for chip benchmarking, highlighting their features, benefits, and real-world applications.

1. Geekbench
Geekbench stands out as a cross-platform benchmarking tool for assessing CPU and GPU performance. Its versatility allows it to work seamlessly across different operating systems, making it a favorite among developers. With a massive database of devices, Geekbench offers detailed scores that let users compare their hardware easily. It measures both single-core and multi-core performance, crucial for modern chips that handle multiple tasks simultaneously. For instance, Geekbench allows you to see how a chip like the Apple M1 stacks up against Intel's latest processors. Setting up Geekbench is quick and user-friendly. It provides insights into memory and compute performance, making it an essential tool for hardware professionals. In fact, many developers report improvements of up to 30% in their designs after optimizing based on Geekbench results.

2. SPEC CPU Benchmark
The SPEC CPU benchmark suite is trusted in the industry for evaluating CPU performance. Created by the Standard Performance Evaluation Corporation, it includes a set of diverse workloads assessing integer and floating-point calculations. SPEC offers reliable reports that reveal both efficiency and speed, enabling engineers to make data-driven decisions. For example, analysis from SPEC has helped companies like AMD refine their latest Ryzen processors, enhancing performance by approximately 25%. SPEC's rigorous validation ensures that results are credible for manufacturers and users alike. Its broad application makes it perfect for systems that demand high performance, such as servers running complex applications.

3. 3DMark
3DMark is essential for gamers and graphics professionals. This graphical benchmarking tool primarily evaluates GPU performance in rendering graphics but can also provide key insights into chip performance concerning integrated graphics. 3DMark includes various tests reflecting real-world gaming scenarios. Users can examine frame rates and rendering speeds, helping them understand how their hardware performs under strain. For instance, the "Fire Strike" test can assess how well a system handles intensive gaming tasks, highlighting up to a 15% difference in performance between competing GPUs. Additionally, the "Time Spy" test evaluates DirectX 12 performance. These visual benchmarks not only present performance data engagingly but also help users spot design flaws in their chips.

4. LLaMA Benchmarks
LLaMA (Large Language Model Meta AI) benchmarks are designed to evaluate the performance of various language models across multiple tasks. These benchmarks provide a standardized way to measure the capabilities of models in understanding and generating human-like text. The benchmarks include a wide range of tasks, such as text completion, question answering, and summarization, allowing researchers to assess the models' effectiveness in real-world applications. For instance, recent evaluations have shown that LLaMA models outperform previous iterations in generating coherent and contextually relevant text. One of the key features of LLaMA benchmarks is their focus on zero-shot and few-shot learning capabilities.
This aspect enables models to perform well on tasks they have not been explicitly trained for, showcasing their adaptability and generalization abilities.

5. GPT-3 Benchmarks
GPT-3 benchmarks provide a comprehensive framework for assessing the performance of the GPT-3 language model across various linguistic tasks. These benchmarks measure aspects such as fluency, coherence, and relevance in generated text. The evaluation includes a variety of tasks, including language translation, text generation, and creative writing, allowing for a holistic view of the model's capabilities. For example, companies utilizing GPT-3 for content creation have reported significant improvements in engagement and quality due to the insights gained from these benchmarks. The user-friendly interface of the benchmarking tools associated with GPT-3 ensures that both novice and experienced users can easily interpret the results. This accessibility has led to its widespread adoption in industries seeking to leverage advanced natural language processing technologies.

Making the Right Choice
Selecting the appropriate tool for chip benchmarking is crucial. Whether it's the adaptable Geekbench, the trusted SPEC CPU benchmark, the graphics-focused 3DMark, or the AI-oriented LLaMA and GPT-3 benchmarks, each tool provides unique insights that can foster innovation in chip development. By effectively utilizing these tools, professionals can enhance the performance of semiconductor devices, keeping pace with rapid technological changes. Investing in the right benchmarking tools is not just beneficial; it is vital for success in chip development.
- CPU-Centric HPC Benchmarking with miniFE and GROMACS
Benchmarks are vital for evaluating High-Performance Computing (HPC) system performance, guiding hardware choices, and optimizing software. This whitepaper focuses on understanding and overcoming bottlenecks in HPC benchmarks for CPU environments, specifically considering ARM/AARCH64 architectures, using miniFE and GROMACS as examples.

1. Introduction to miniFE and GROMACS Benchmarks

1.1. miniFE: A Finite Element Mini-Application
miniFE, part of the Mantevo suite, simulates implicit finite element applications. It solves sparse linear systems, with its core kernels focused on element-operator computation, assembly, sparse matrix-vector products (SpMV), and basic vector operations. It's excellent for benchmarking systems handling sparse linear algebra and iterative solvers. To run miniFE, you typically compile it with an MPI-enabled compiler. Execution involves specifying problem dimensions and MPI processes.

# Example for a single node with 16 MPI tasks
srun -N 1 -n 16 miniFE.x -nx 264 -ny 256 -nz 256
# Example for a multi-node run (adjust N and n)
srun -N 4 -n 64 miniFE.x -nx 528 -ny 512 -nz 512

Note: srun is for SLURM; use mpirun or similar on other systems.

1.2. GROMACS: Molecular Dynamics Simulation Software
GROMACS (GROningen MAchine for Chemical Simulations) is a highly optimized open-source software package for molecular dynamics (MD) simulations. It models atomic and molecular movements, particularly for biochemical systems, and is efficient in calculating non-bonded interactions. A typical GROMACS workflow prepares input files, then runs the simulation.

# Step 1: Prepare the run input file (.tpr)
gmx grompp -f pme.mdp -c conf.gro -p topol.top -o topol.tpr
# Step 2: Run the molecular dynamics simulation
mpirun -np 4 gmx_mpi mdrun -s topol.tpr -ntomp 4
# To run a specific benchmark system (e.g., 'benchPEP-h')
mpirun -np <ranks> gmx_mpi mdrun -s benchPEP-h.tpr -ntomp <threads>

Note: Tune MPI processes (-np) and OpenMP threads (-ntomp) to your hardware.

2. Interpreting Performance Output (Benchmarking POV)
Understanding benchmark output is crucial for evaluating HPC system throughput and efficiency.

2.1. miniFE Performance Metrics
miniFE outputs performance data, primarily focused on:
- Total CG Mflops (mega floating-point operations per second for the Conjugate Gradient solve): The main Figure of Merit (FOM). Higher values indicate better performance, reflecting the system's efficiency in sparse linear algebra, often limited by memory bandwidth and FPU throughput.

2.2. GROMACS Performance Metrics
GROMACS provides detailed output, with the key metric being:
- ns/day (nanoseconds per day): The standard performance metric for GROMACS. It shows how many nanoseconds of simulated time can be computed per real-world day.
A higher ns/day means faster simulation. This metric is ideal for comparing different CPU architectures or configurations. Other useful outputs include total wall time and a breakdown of time spent in different force calculations, which helps pinpoint specific bottlenecks.

3. Bottlenecks in Running HPC Benchmarks
Achieving peak HPC performance requires identifying and mitigating bottlenecks that limit system throughput.

3.1. miniFE-Specific Bottlenecks
miniFE is particularly sensitive to:
- Memory bandwidth: The sparse matrix-vector product (SpMV) is highly memory-bandwidth bound due to irregular memory access patterns.
- Cache misses: Irregular accesses lead to frequent cache misses, increasing data retrieval latency.
- Inter-node communication (for large problems): For distributed problems, communication during assembly and the Conjugate Gradient solver can be limited by network latency and bandwidth.

3.2. GROMACS-Specific Bottlenecks
For GROMACS, key bottlenecks include:
- CPU core performance and threading: The number of cores and their individual performance (instructions per cycle (IPC), clock speed) directly impact ns/day. An optimal balance between MPI ranks and OpenMP threads per rank is crucial.
- Memory bandwidth: The CPU needs to access large datasets frequently for force calculations.
- SIMD vectorization: GROMACS heavily relies on CPU SIMD instructions (e.g., NEON). If the CPU architecture or compiler doesn't fully exploit these, performance will suffer.
- Cache utilization: Efficient cache usage is critical for the main simulation loop.
- Inter-node communication: For large systems simulated across multiple nodes, MPI communication for domain decomposition and force summation can be a significant bottleneck, even with fast interconnects.
- NUMA effects: Proper process and memory binding is crucial on multi-socket systems to minimize cross-socket memory access latency.
- Load imbalance: Uneven workload distribution across PP and PME ranks leads to idle compute units.

3.3. Dynamic Monitoring for Bottleneck Analysis (Frequency, Power, Temperature)
Beyond static analysis, dynamic monitoring of CPU frequency, power consumption, and temperature during benchmark execution provides invaluable insights for root-causing performance bottlenecks. This data, when mapped over the run duration, can reveal transient issues that logs alone might miss.

Application-specific context:
- For miniFE, if memory bandwidth is the primary bottleneck, the CPU might not be fully utilized, leading to lower-than-expected power consumption and temperatures, even if the frequency remains high. Conversely, if the SpMV operations push the CPU's compute capabilities, sustained high power and temperature could be observed. Any sudden dips in Mflops alongside frequency drops would directly point to thermal or power throttling.
- For GROMACS, which can be highly compute-intensive, sustained high power consumption and temperatures are common. Analyzing frequency, power, and temperature trends can reveal whether ns/day performance is being limited by the CPU's ability to maintain its turbo frequencies under thermal constraints, or whether it is hitting a configured power envelope. Discrepancies between expected maximum performance and observed ns/day can often be explained by these dynamic system responses.

Tools for monitoring: Various tools can collect this data, including vendor-specific utilities (e.g., Intel's pcm, AMD's uProf), Linux tools (perf, turbostat, sensors), or IPMI/BMC interfaces for server-level metrics. A simple sampling loop is sketched below.
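To illustrate the kind of dynamic monitoring described above, here is a minimal sampling loop built on standard Linux interfaces. Paths and tool availability vary by platform (power counters, for example, are not exposed the same way on all Arm servers), so treat this as a sketch to adapt rather than a portable utility.

```bash
#!/usr/bin/env bash
# Sample average CPU frequency and temperature sensors every few seconds while a
# benchmark runs, writing a CSV that can later be correlated with Mflops or
# ns/day over time. Assumes sysfs cpufreq and lm-sensors are available; power can
# be added via turbostat or IPMI where the platform exposes it.
# Run alongside the benchmark and stop with Ctrl-C.
INTERVAL=5
OUT=telemetry.csv
echo "timestamp,avg_freq_khz,temps" > "$OUT"

while true; do
    ts=$(date +%s)
    # Average current frequency across all cores (each sysfs file holds one value).
    freq=$(awk '{ sum += $1; n++ } END { if (n) printf "%d", sum/n }' \
           /sys/devices/system/cpu/cpu*/cpufreq/scaling_cur_freq)
    # All temperature readings on one line (output format depends on the platform).
    temps=$(sensors 2>/dev/null | grep -oE '[+-]?[0-9]+\.[0-9]+°C' | paste -sd' ' -)
    echo "${ts},${freq},${temps}" >> "$OUT"
    sleep "$INTERVAL"
done
```

Plotting this CSV against the benchmark's time-stamped output makes throttling episodes or power-cap limits visible as correlated dips in frequency and performance.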
Correlating these dynamic metrics with the benchmark's reported performance can significantly aid in precise bottleneck identification and system optimization.

Conclusion
Effective HPC benchmarking goes beyond simply running an application and reporting a single performance number. As demonstrated with miniFE and GROMACS in a CPU-centric environment, a deep understanding of the benchmark's computational characteristics is essential. Identifying whether a workload is memory-bound, compute-bound, or communication-bound is the first step toward optimizing performance. Furthermore, leveraging dynamic monitoring of CPU frequency, power consumption, and temperature provides invaluable diagnostic data. By integrating performance metrics with detailed system telemetry, HPC administrators and researchers can precisely pinpoint bottlenecks, fine-tune system configurations, and ultimately extract the highest possible performance.












