- Neoverse-V2 Support for Intel PerfSpect
We recently worked on extending Intel PerfSpect ( https://github.com/Whileone-Techsoft/PerfSpect/tree/Neoverse-native-support ), a robust command-line performance analysis tool that implements the Top-Down Microarchitecture Analysis Method (TMAM), so that it fully supports the Arm Neoverse-V2 architecture. This project required mapping the Performance Monitoring Unit (PMU) events on the Arm cores to the metrics of the TMAM methodology. We can now get the Level 1 breakdown (Frontend Bound, Backend Bound, Retiring, Bad Speculation) to pinpoint bottlenecks on these systems, which was previously not possible with this tool. Through debugging the code, it became possible to generate continuous time-series graphs to understand whether a bottleneck persists over time. This extension of PerfSpect for Arm (our code allows native compilation on Arm) also enabled capture of the CPU utilization heatmap generated in telemetry reports, which shows the distribution of work across all cores over time. The challenge was mapping the Arm events to the TMAM formulae, and then validating the values captured by the modified PerfSpect tool against values calculated manually from those formulae. The key learnings from this project were quickly adapting to a new programming language, Go (Golang), and making significant changes to the code to get the appropriate results; we also deepened our knowledge of the TMAM methodology and of the specific challenges of cross-architecture analysis, particularly in translating PMU events from the Intel ecosystem to Arm's Neoverse-V2 core.
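To make the event-to-metric mapping concrete, here is a minimal sketch of a Level 1 top-down computation from Neoverse-style PMU counts. The event names (STALL_SLOT_FRONTEND, STALL_SLOT_BACKEND, OP_RETIRED, OP_SPEC, CPU_CYCLES) are real Arm PMU events, but the simplified formulas, the slots-per-cycle width, and the sample counter values are assumptions for illustration, not the exact formulae used in our PerfSpect changes.

```python
# Hedged sketch: a simplified Level-1 top-down breakdown from Arm
# Neoverse-style PMU event counts. Formulas and slot width are
# illustrative assumptions, not the tool's exact implementation.

def topdown_level1(events: dict, slots_per_cycle: int = 8) -> dict:
    """Return the Level-1 breakdown as percentages of total issue slots."""
    total_slots = events["CPU_CYCLES"] * slots_per_cycle
    frontend = events["STALL_SLOT_FRONTEND"] / total_slots
    backend = events["STALL_SLOT_BACKEND"] / total_slots
    retiring = events["OP_RETIRED"] / total_slots
    # Slots spent on speculatively issued ops that never retired.
    bad_spec = (events["OP_SPEC"] - events["OP_RETIRED"]) / total_slots
    return {name: round(100 * v, 2) for name, v in {
        "Frontend Bound": frontend,
        "Backend Bound": backend,
        "Retiring": retiring,
        "Bad Speculation": bad_spec,
    }.items()}

if __name__ == "__main__":
    sample = {  # made-up counter values for illustration
        "CPU_CYCLES": 1_000_000,
        "STALL_SLOT_FRONTEND": 2_400_000,
        "STALL_SLOT_BACKEND": 3_200_000,
        "OP_RETIRED": 2_000_000,
        "OP_SPEC": 2_400_000,
    }
    print(topdown_level1(sample))
```

With these sample counts the four categories sum to 100%, which is the sanity check we applied when comparing tool output against hand calculations.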
- Debugging the Debugger: A Deep Dive into GDB and RISC-V
In the world of software development, the GNU Debugger (GDB) is an essential tool for programmers. It allows us to peer inside a running program, find bugs, and understand complex code. As new hardware architectures emerge, it's crucial that our tools keep pace. One such rising star is RISC-V, an open-source instruction set architecture that is rapidly gaining popularity, particularly with its new vector extensions for high-performance computing.

The Challenge: An Unknown Instruction

Recently, our team took on a task to get a few bug fixes into GDB. The challenge: GDB was unable to recognize or debug vector instructions for the RISC-V architecture. This was a significant gap, hindering developers who were working with advanced RISC-V features. Without this support, debugging modern, high-performance RISC-V applications was a major challenge. My task was to dive into the GDB source code and enable this missing capability.

Navigating a Sea of Code

The first and most significant hurdle was the sheer scale of the GDB codebase. As a newcomer to such a vast and mature open-source project, understanding the intricate flow of control and finding the right place to intervene was a daunting task. The initial phase involved a lot of learning and exploration, and I'm grateful for the guidance of my colleagues who helped me navigate the complexities and build a mental map of the system. Through a careful process of debugging the debugger itself, we traced the execution path for instruction processing. The breakthrough came when we identified the root cause of the issue: a missing function call responsible for reading and interpreting the new vector instructions. The logic was there, but it was never being invoked for this specific case. With the problem identified, we implemented an initial solution. Our contribution was a hardcoded fix that proved the concept and successfully enabled GDB to recognize the vector instructions.
This initial patch paved the way for a more robust and integrated solution that was later refined by other contributors in the open-source community. The result is a direct enhancement to a critical developer tool. Programmers working with RISC-V can now debug vector-based code more effectively, accelerating development and improving software quality within the ecosystem. https://www.sourceware.org/pipermail/gdb-patches/2025-May/217880.html
- Top CPU Performance Benchmarking Toolkits You Should Know
Modern compute platforms - from cloud hyperscale CPUs to edge processors - deliver unprecedented parallelism and instruction-set capabilities. But to truly understand performance, you need the right benchmarking tools. Whether you're comparing cloud instances, evaluating Arm-based servers like Ampere, or validating x86, RISC-V, or AI-accelerated hardware, the ecosystem offers several battle-tested frameworks. In this blog, we explore the most widely used CPU benchmarking toolkits today - what they do, where they shine, and when to use each.

1. Ampere Performance Toolkit (APT)

Ampere's servers built on Arm architecture are optimized for cloud-native performance and power efficiency. The Ampere Performance Toolkit provides a set of scripts, automation, and recommended benchmarks to evaluate real-world workloads.

Best for:
✔ Evaluating Arm server performance
✔ Cloud benchmarking on Ampere instances
✔ Developers migrating workloads from x86 to Arm

2. PerfKit Benchmarker (Google)

Originally built by Google, PerfKit Benchmarker (PKB) is the gold standard for cloud performance benchmarking across providers.

Best for:
✔ Comparing cloud VM types
✔ Reproducible benchmark automation
✔ Cloud procurement and architectural evaluations

Fun fact: PKB has become the foundation for multiple forks and extensions across companies and academia for transparent benchmarking.

3. Phoronix Test Suite (PTS)

The Phoronix Test Suite is one of the largest open-source benchmarking ecosystems - great for developers and hardware reviewers.

Best for:
✔ Broad CPU and system benchmarking
✔ Linux performance testing
✔ Reviewers, researchers, and enthusiasts

4. SPEC CPU Suite

The Standard Performance Evaluation Corporation (SPEC) CPU suites are industry-trusted benchmarks for vendors and OEMs.

Best for:
✔ Enterprise-grade server benchmarking
✔ Official vendor comparisons
✔ Performance engineering and compiler tuning

Note: Requires a paid license.

5. Microbenchmark Suites (Core Latency, Memory, IPC)

Sometimes, detailed architectural behavior matters more than high-level scores. Popular tools include sysbench, lmbench, and perf.

Best for:
✔ Low-level CPU behavior
✔ Memory latency & bandwidth analysis
✔ Performance debugging

ML & AI-Centric Benchmarks (Emerging)

Even CPU evaluations increasingly involve AI workloads.

Best for:
✔ AI inference on CPUs
✔ Edge compute & acceleration evaluations

Bonus: Build-Your-Own Benchmark Harness

Cloud providers and silicon vendors often implement custom harnesses around:
- Docker-ized workloads
- Kubernetes load-generation frameworks
- Real-app benchmarking (Redis, NGINX, PostgreSQL, Spark)

For engineering teams, custom workload pipelines often reveal more than synthetic scores.

Summary Table

Toolkit | Scope | Best Use Case
Ampere Performance Toolkit | Server-class Arm systems | Cloud-native Arm benchmarking
PerfKit Benchmarker | Multi-cloud benchmarking | Cloud instance comparisons
Phoronix Test Suite | Broad system benchmark suite | Linux and multi-OS testing
SPEC CPU | Industry standard CPU benchmarks | Formal server performance publication
sysbench / lmbench / perf | Microbenchmarks & counters | CPU profiling & tuning
MLPerf / HPL / HPCG | AI & HPC performance | Compute-heavy + scientific workloads
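The build-your-own-harness idea mentioned above can be sketched in a few lines: run a workload repeatedly, collect per-iteration latencies, and report percentiles rather than a single average. The workload and iteration count below are placeholders; a real harness would drive an actual service such as Redis or NGINX.

```python
# Hedged sketch of a minimal benchmark harness: time a workload over
# many iterations and report p50/p99/mean latency in microseconds.
import statistics
import time

def run_benchmark(workload, iterations=1000):
    latencies = []
    for _ in range(iterations):
        start = time.perf_counter()
        workload()
        latencies.append((time.perf_counter() - start) * 1e6)  # to microseconds
    latencies.sort()
    return {
        "p50_us": latencies[len(latencies) // 2],
        "p99_us": latencies[int(len(latencies) * 0.99) - 1],
        "mean_us": statistics.fmean(latencies),
    }

if __name__ == "__main__":
    # Placeholder workload: a small CPU-bound loop.
    print(run_benchmark(lambda: sum(range(10_000))))
```

Reporting percentiles instead of only a mean is what makes a custom pipeline more revealing than many synthetic scores: tail latency is visible, not averaged away.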
- Network Latency Study in OCI Cloud
Network testing tools such as netperf can perform latency tests as well as throughput tests and more. In netperf, the TCP_RR and UDP_RR (RR = request-response) tests report round-trip latency, and with the test-specific -o flag the output metrics can be customized to display exactly the statistics you need. Google has a lot of practical experience in latency benchmarking, and following its blog post using-netperf-and-ping-to-measure-network-latency , we created our own latency benchmarks before and after migrating workloads to the OCI cloud.

Which tools and why

All the tools in this area do roughly the same thing: measure the round-trip time (RTT) of transactions. Ping does this using ICMP packets:

ping -c 100

This command sends one ICMP packet per second to the specified IP address until it has sent 100 packets.

netperf -H -t TCP_RR -- -o min_latency,max_latency,mean_latency

Here -H specifies the remote host and -t the test name, with the test-specific -o option selecting the output metrics.

For latency tests in a cloud environment, Google's tool of choice is PerfKit Benchmarker (PKB), and we used it here as well. This open-source tool allows us to run benchmarks on various cloud providers while automatically setting up and tearing down the virtual infrastructure required for those benchmarks. After setting up PerfKit Benchmarker, it's simple to run ping and netperf benchmarks:

./pkb.py --benchmarks=ping --cloud=OCI --zone=us-ashburn-1
./pkb.py --benchmarks=netperf --cloud=OCI --zone=us-ashburn-1 --netperf_benchmarks=TCP_RR

These commands run intra-zone latency benchmarks between two machines in a single zone in a single region. Intra-zone benchmarks like this are useful for showing very low latencies, in microseconds, between machines that work together closely.

Latency discrepancies

We've set up two VM.Standard.E4.Flex machines running Ubuntu 22.04 in zone us-ashburn-1, and we'll use private IP addresses to get the best results.
If we run a ping test with default settings and set the packet count to 100, we get the following results:

PING 172.16.60.168 (172.16.60.168) 56(84) bytes of data.
64 bytes from 172.16.60.168: icmp_seq=1 ttl=64 time=0.202 ms
64 bytes from 172.16.60.168: icmp_seq=2 ttl=64 time=0.205 ms
…
64 bytes from 172.16.60.168: icmp_seq=99 ttl=64 time=0.329 ms
64 bytes from 172.16.60.168: icmp_seq=100 ttl=64 time=0.365 ms
--- 172.16.92.253 ping statistics ---
100 packets transmitted, 100 received, 0% packet loss, time 101353ms
rtt min/avg/max/mdev = 0.371/0.450/0.691/0.040 ms

By default, ping sends out one request each second. After 100 packets, the summary reports an average latency of 0.450 milliseconds, or 450 microseconds. For comparison, let's run netperf TCP_RR with default settings for the same number of transactions:

netperf-2.7.0/src/netperf -p {command_port} -j -v2 -t TCP_RR -H 132.145.132.29 -l 60 -- -P ,{data_port} -o THROUGHPUT,THROUGHPUT_UNITS,P50_LATENCY,P90_LATENCY,P99_LATENCY,STDDEV_LATENCY,MIN_LATENCY,MEAN_LATENCY,MAX_LATENCY --num_streams=1 --port_start=20000 --timeout 360

Netperf results:

{'Throughput': '4245.34', 'Throughput Units': 'Trans/s', '50th Percentile Latency Microseconds': '228', '90th Percentile Latency Microseconds': '239', '99th Percentile Latency Microseconds': '372', 'Stddev Latency Microseconds': '92.06', 'Minimum Latency Microseconds': '215', 'Mean Latency Microseconds': '235.08', 'Maximum Latency Microseconds': '21059'}

Which test can we trust? The difference is largely an artefact of the different intervals the two tools use by default. Ping issues one transaction per second, while netperf issues the next transaction immediately after the previous one completes. Fortunately, both tools allow us to set the interval between transactions manually. For ping, the -i flag sets the interval, given in seconds or fractions of a second.
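The netperf output above also shows why percentiles matter: the mean (235 µs) sits close to p50 (228 µs), but the maximum (21,059 µs) is two orders of magnitude larger. A quick illustration with made-up samples shows how a single outlier skews the mean and max while leaving the median untouched:

```python
# Illustration with fabricated latency samples (microseconds): mostly
# ~230 µs, a few at 370 µs, and one 21 ms outlier, loosely mimicking
# the shape of the netperf run above.
import statistics

samples = [230] * 985 + [370] * 14 + [21000]
samples.sort()

p50 = samples[len(samples) // 2]
p99 = samples[int(len(samples) * 0.99) - 1]
print(f"p50={p50}  p99={p99}  mean={statistics.fmean(samples):.2f}  max={samples[-1]}")
```

The single outlier pulls the mean roughly 10% above the median and dominates the max, which is why latency reports should always include percentiles, not just an average.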
On Linux systems, this has a granularity of 1 ms and rounds down:

$ ping -c 100 -i 0.010

For netperf TCP_RR, we can compile with the --enable-spin flag to get fine-grained intervals, then use the -w flag to set the interval time and the -b flag to set the number of transactions sent per interval. This approach allows us to set intervals with much finer granularity: instead of waiting on a timer, netperf spins in a tight loop until the next interval, which keeps the CPU fully awake. Of course, this precision comes at the cost of much higher CPU utilization while the CPU spins.

Note: less fine-grained intervals are also available by compiling with the --enable-intervals flag. Use of the -w and -b options requires building netperf with either --enable-intervals or --enable-spin set. The tests here were performed with --enable-spin.

We can run netperf with an interval of 10 milliseconds using:

$ netperf -H -t TCP_RR -w 10ms -b 1 -- -o min_latency,max_latency,mean_latency

Now, after aligning the interval time for both ping and netperf to 10 milliseconds, the effects are apparent. The ping result is:

--- 172.16.92.253 ping statistics ---
1000 packets transmitted, 1000 received, 0% packet loss, time 15981ms
rtt min/avg/max/mdev = 0.252/0.306/0.577/0.025 ms

The netperf results are:

Minimum Latency Microseconds, Mean Latency Microseconds, Maximum Latency Microseconds
215, 235.08, 21059

We have integrated OCI as a provider in PerfKit Benchmarker, which we are using to carry out this testing. Here are the results of the inter-region ping benchmark for A1.Flex2, E4.Flex.1, and S1.Flex VMs. We tested netperf intra-region, using the us-ashburn-1 region. Generally, netperf is recommended over ping for latency tests. This isn't due to any lower reported latency at default settings, though. As a whole, netperf allows greater flexibility with its options, and we prefer using TCP over ICMP.
TCP is a more common use case and thus tends to be more representative of real-world applications. That said, the difference between similarly configured runs with these tools is much smaller over longer path lengths. Also, remember that the interval time and other tool settings should be recorded and reported when performing latency tests, especially at lower latencies, because these intervals make a material difference.
- Investigating Performance Discrepancy in HPL Test on ARM64 Machines
Introduction:

High-Performance Linpack (HPL) is a widely used benchmark for testing the computational performance of computing systems. In this blog post, we explore an intriguing scenario where we conducted HPL tests on two ARM64 machines. Surprisingly, the Host-2 machine exhibited 20% lower performance than the Host-1 machine in the HPL test. Intrigued by this result, we embarked on a journey to comprehensively diagnose the underlying cause of this performance discrepancy.

Why HPL for Performance Testing?

People use the High-Performance Linpack (HPL) benchmark for performance testing because it provides a standardised and demanding workload that measures the peak processing power of computer systems, particularly in terms of floating-point calculations. It helps assess and compare the computational capabilities of different hardware configurations. This benchmark helps in comparing and ranking supercomputers' performance and is often used as a metric for the TOP500 list of the world's most powerful supercomputers. For more information, you can refer to the TOP500 article here: TOP500 List

Objective:

The primary objective of this investigation was to identify the reason behind the 20% performance difference observed in the HPL test between the Host-1 and Host-2 machines. To comprehensively diagnose the performance discrepancy, we conducted additional benchmark tests, including Stream, Lmbench, and bandwidth tests.

1. System Details:

We conducted a fair and controlled experiment using two ARM64 machines, referred to as Host-1 and Host-2.

1.1 Machine Specifications (Host-1 and Host-2):
CPU(s): 96
Architecture: aarch64
Total memory: 96 GB
Memory speed: 3200 MHz

2. Running HPL Benchmark:

To run the HPL benchmark on an arm64 machine, you can refer to the GitHub repository provided: https://github.com/AmpereComputing/HPL-on-Ampere-Altra .
This repository contains instructions, scripts, and configurations specific to running HPL on Ampere Altra ARM64-based machines. It's important to follow the guidelines provided in the repository to ensure accurate and meaningful benchmarking results.

2.1 HPL Scores:

Upon completing the HPL benchmark on both machines, we computed and compared the achieved HPL scores. The Host-1 machine garnered a higher HPL score, signifying better computational performance.

Machine | Time (sec) | Score
Host-1 | 619.91 | 1245
Host-2 | 784 | 985

This result raised a critical question: why was there such a substantial performance gap? To delve into the root causes behind this discrepancy, we decided to conduct a series of additional tests to comprehensively investigate the issue.

3. Exploring Additional Tests:

We conducted several other benchmark tests to comprehensively investigate the performance discrepancy between the Host-1 and Host-2 ARM64 machines. These tests aimed to shed light on various aspects of the systems' hardware and memory subsystems, providing a holistic understanding of the observed difference. Below, we detail the tests and their findings:

3.1 Stream Benchmark:

The Stream benchmark assesses memory bandwidth and measures the system's capability to read from and write to memory. The benchmark consists of four fundamental tests: Copy, Scale, Add, and Triad.
Copy: Measures the speed of copying one array to another.
Scale: Evaluates the performance of multiplying an array by a constant.
Add: Tests the speed of adding two arrays together.
Triad: Measures the performance of a combination of operations involving three arrays.
The Stream benchmark helps uncover memory bandwidth limitations and assess memory subsystem efficiency.
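The four kernels described above can be sketched in a few lines. This is a hedged, pure-Python illustration of what each kernel computes; real STREAM is a tuned C/Fortran benchmark, and the MB/s figures printed here are not comparable to its results.

```python
# Hedged sketch of the four STREAM kernels (Copy, Scale, Add, Triad).
# Pure Python, for illustrating the operations only; the bandwidth
# numbers it prints are not meaningful STREAM results.
import time

def copy_k(a):         return a[:]                                 # c = a
def scale_k(b, s):     return [s * x for x in b]                   # c = s * b
def add_k(a, b):       return [x + y for x, y in zip(a, b)]        # c = a + b
def triad_k(b, c, s):  return [x + s * y for x, y in zip(b, c)]    # a = b + s*c

def measure(name, kernel, bytes_moved):
    start = time.perf_counter()
    kernel()
    print(f"{name}: {bytes_moved / (time.perf_counter() - start) / 1e6:.1f} MB/s (illustrative)")

N = 1_000_000
a, b, c, s = [1.0] * N, [2.0] * N, [0.0] * N, 3.0
measure("Copy",  lambda: copy_k(a),        2 * N * 8)  # 8 bytes per double
measure("Scale", lambda: scale_k(b, s),    2 * N * 8)
measure("Add",   lambda: add_k(a, b),      3 * N * 8)
measure("Triad", lambda: triad_k(b, c, s), 3 * N * 8)
```

Note the byte counts: Copy and Scale move two arrays per element, while Add and Triad move three, which is why STREAM reports them separately.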
Host-1 machine result:

Function | Best Rate MB/s | Avg time | Min time | Max time
Copy | 103837.8 | 0.367897 | 0.36192 | 0.373494
Scale | 102739.4 | 0.369191 | 0.365789 | 0.372439
Add | 106782.7 | 0.536131 | 0.527908 | 0.542759
Triad | 106559.1 | 0.533549 | 0.529016 | 0.537881

Host-2 machine result:

Function | Best Rate MB/s | Avg time | Min time | Max time
Copy | 66071.3 | 0.572721 | 0.568794 | 0.575953
Scale | 65708.8 | 0.575758 | 0.571932 | 0.580686
Add | 67215.5 | 0.843995 | 0.838667 | 0.848371
Triad | 67668.1 | 0.837109 | 0.833058 | 0.84079

(Graph: Best Rate MB/s vs Function)

In the Stream benchmark results, Host-1 outperformed Host-2 across all functions (Copy, Scale, Add, Triad). Host-1 demonstrated higher memory bandwidth in each function, achieving significantly faster data transfer rates. This suggests a stronger memory subsystem performance in Host-1 compared to Host-2.

3.2 Lmbench for Memory Latency:

Lmbench is a suite of micro-benchmarks designed to provide insights into various aspects of system performance. The suite includes latency tests for system calls, memory accesses, and various operations to quantify the system's responsiveness. Memory access tests include random read/write latency and bandwidth, helping to identify memory subsystem performance. File I/O tests evaluate file system performance, providing insights into storage subsystem capabilities.

Result: Memory Latency

Memory latency refers to the time it takes for the CPU to access a specific memory location. Lower latency values indicate better performance, as data can be fetched more quickly.

size (MB) | latency (ns) Host-1 | latency (ns) Host-2
0.00049 | 1.43 | 1.429
… | … | …
2 | 32.355 | 32.786
3 | 34.503 | 36.012
4 | 37.403 | 37.932
6 | 39.982 | 52.922
8 | 41.007 | 54.001
12 | 44.315 | 55.466
16 | 65.52 | 73.016
24 | 95.131 | 117.278
32 | 115.081 | 138.945
48 | 126.796 | 151.945
64 | 129.558 | 159.225
96 | 134.413 | 166.359
128 | 136.239 | 167.788
192 | 136.245 | 168.689
256 | 136.366 | 170.464
384 | 137.732 | 170.461
… | … | …
2048 | 135.61 | 149.809

4.
Analysis and Findings:

After conducting these benchmark tests, we observed that the Host-2 machine consistently exhibited lower performance across different tests compared to the Host-1 machine. The most significant finding came from the Lmbench test, which revealed that the Host-2 machine's RAM had notably higher latency compared to the Host-1 machine. Notably, an additional factor was identified: the RAM rank. The Host-1 machine is equipped with Dual-Rank RAM, while the Host-2 machine has Single-Rank RAM. This RAM rank difference could contribute to the performance discrepancy. The observation is in line with findings from various other studies that have examined the influence of RAM rank on system performance. To gain a more comprehensive understanding of this subject, the following articles could be of interest:

Single vs. Dual-Rank RAM: Which Memory Type Will Boost Performance? - This article provides a thorough comparison between single and dual-rank RAM, aiding in comprehending the disparities between these two RAM types, methods to distinguish them, and guidance on selecting the most suitable option for your needs. ( LINK )

Single Rank vs Dual Rank RAM: Differences & Performance Impact - This article delves into the differences between Single Rank and Dual Rank RAM modules, investigating their structural dissimilarities and assessing the respective impacts on performance. ( LINK )

5. Conclusion:

After conducting an extensive series of benchmark tests, we have pinpointed certain factors that contribute to the performance disparity observed in the HPL test between the two ARM64 machines. In the Stream benchmark results, Host-1 outperformed Host-2 across all functions (Copy, Scale, Add, Triad), demonstrating higher memory bandwidth and significantly faster data transfer rates in each function. Additionally, the higher memory latency in the Host-2 machine's RAM was identified as a key contributor to the performance gap.
This latency impacted the efficiency of memory operations and had a cascading effect on overall performance. Another significant factor was the difference in RAM rank configurations — Host-1 had Dual-Rank RAM, while Host-2 had Single-Rank RAM. This divergence likely contributed to the varying memory access speeds between the two machines. 6. Future Scope: In the context of further exploration, it is recommended to extend the investigation by including additional benchmark tests, specifically focusing on the Lmbench memory bandwidth test. This test would provide deeper insights into the memory subsystem's performance on both the Host-1 and Host-2 machines. Additionally, an interesting avenue for investigation could involve modifying the RAM configuration in one of the machines and assessing its impact on performance. This would provide valuable information about the role of memory specifications in influencing the overall system performance.
- Root causing a memory corruption on Arm64 VMs
We recently migrated one of our websites to Azure Arm64 VMs. However, as soon as we pushed the infrastructure change to production, we started to observe our server process being restarted intermittently. Sometimes a restart happened within a few seconds; at other times, none occurred for hours. While the redundancy in our setup ensured minimal end-user impact, we wanted to address the issue quickly.

Looking at the logs

A quick look at the logs showed the following error before process restarts:

malloc(): corrupted top size
Aborted (core dumped)

This is a Node.js based Next.js website with nothing memory-intensive being performed, so we were surprised to see a memory-related issue. A quick look at top also suggested we had adequate memory available for our running processes. This definitely looked like a memory corruption. Our next challenge was to identify what caused it. On analyzing the logs further, it did not appear that any single website URL was causing the issue.

Reproducing the issue

With this information at hand, we went back to our test environment (which was also running on an Azure Arm64 VM) and set up more detailed logging. We then visited a large number of our website URLs to see if we could reproduce the restart. Eventually, we did find a couple of URLs where the Node.js process would exit with the corrupted-memory error message.

Identifying the root cause

Once we could reproduce the issue, we narrowed it down to the images loading on these pages. Our images were being served by the Next.js next/image library. This library internally leverages the `sharp` package to optimize the images being served. So, it appeared that for some images (not all), the sharp image optimization logic was resulting in memory corruption, causing our Node.js process to exit. Looking at the current and past issues for lovell/sharp on GitHub took us to this issue, which summarized our experience.
Issue details & Fix

On probing further, we understood that the libspng library used by lovell/sharp had a memory corruption issue when trying to decode a paletted PNG on Arm64. libspng addressed this issue in v0.7.2, which was picked up by lovell/sharp in v0.31.0. By pinning our sharp dependency in package.json to v0.31.0, we were able to force next/image to pick up this version of the sharp library (instead of the older one) for image optimization. With this change, the specific images that were previously causing the Node.js process to exit were now being optimized as expected. Once the change went into production, we watched our production Node.js processes for any restarts. With no restarts observed for a couple of days, we were able to mark the issue as addressed.
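The pin looked roughly like the fragment below. Only the sharp entry reflects our actual change; the next version shown is a placeholder, and depending on your package manager you may also need to reinstall so the lockfile picks up the new version.

```json
{
  "dependencies": {
    "next": "13.x",
    "sharp": "0.31.0"
  }
}
```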
- Mastering the 5 Essential Performance Engineering Skills for Software Engineers: A Professional Guide
Performance engineering is a vital area in software development that guarantees applications function efficiently and effectively. As modern software systems grow more complex, the need for skilled engineers who understand performance becomes increasingly important. This guide will cover five essential performance engineering skills every software engineer should develop to thrive in their careers.

Grasping Performance Requirements

To start, software engineers must excel at understanding performance requirements. This means knowing how the system behaves under different loads and the specific performance targets the application must meet. Involved discussions with stakeholders are crucial for defining clear performance metrics early in the development process. Key performance indicators (KPIs) include:

Response Time: The time taken for a system to respond to a user request. According to a report, 47% of consumers expect a page to load in two seconds or less.
Throughput: The amount of work completed in a given timeframe, often measured in transactions per second (TPS).
Resource Utilization: Understanding how effectively system resources are being used, such as CPU, memory, and bandwidth.

By setting these performance requirements early on, engineers can make better design decisions, leading to more efficient applications right from the beginning.

Expertise in Performance Testing Tools

A strong command of performance testing tools is essential. Knowledge of both open-source and proprietary tools enables engineers to simulate user traffic, evaluate system performance, and pinpoint potential problems. Some popular performance testing tools include Apache JMeter, LoadRunner, and Gatling. These tools help engineers create test scenarios that reflect real-world load conditions. For instance, a team using JMeter might simulate 10,000 concurrent users on their e-commerce site to ensure it can handle peak shopping times, like Black Friday.
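Two of the KPIs listed earlier (response time and throughput) can be computed directly from request logs. The record format, the 2 s response-time target, and the sample data below are illustrative assumptions:

```python
# Hedged sketch: deriving p95 response time and throughput (TPS) from
# a list of (timestamp_seconds, duration_seconds) request records.
# The data and the 2 s target are made up for illustration.
requests = [(t, 0.4) for t in range(100)] + [(100, 2.5)]  # 101 requests over ~100 s

durations = sorted(d for _, d in requests)
p95 = durations[int(len(durations) * 0.95) - 1]
window = requests[-1][0] - requests[0][0] or 1  # avoid division by zero
throughput_tps = len(requests) / window

print(f"p95 response time: {p95:.2f}s (target: <= 2s)")
print(f"throughput: {throughput_tps:.2f} TPS")
```

Tracking a percentile such as p95 rather than the mean keeps the occasional slow request visible, which matches how response-time targets are usually stated.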
Effectively using performance testing tools helps reveal issues and provides actionable insights for optimization. In fact, organizations that conduct regular performance testing see a 30% improvement in application speed and responsiveness.

Capacity Planning and Scalability

A third essential skill is capacity planning and scalability. Software engineers must be able to forecast the resources needed to accommodate user growth without compromising performance. This involves analyzing historical usage data and anticipating future demands. For example, if a SaaS application reports a 20% monthly increase in active users, engineers must plan to scale infrastructure accordingly. This scaling can happen in two ways:

Vertical Scaling: Adding more resources (like CPU or memory) to a single server.
Horizontal Scaling: Adding more servers to distribute the load when the user demand increases.

Team members should consistently monitor performance against these plans to refine forecasts and implement necessary adjustments. Mastering this skill enables teams to prevent performance issues and support seamless scaling as user needs change.

Appreciating System Architecture

A solid understanding of system architecture is crucial for performance engineering. Engineers need to be familiar with various architectural patterns such as microservices, serverless, and monolithic designs. Each architecture has its implications for performance. For example, a microservices architecture can enhance scalability but may lead to communication delays between services. In contrast, a monolithic architecture is easier to manage but might struggle under high loads due to its rigid structure. Understanding how different architectures influence performance helps engineers make informed design choices. For instance, a recent study showed that companies implementing microservices correctly reduced deployment times by 75%.
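The capacity-planning arithmetic above (a steady 20% monthly growth in active users) compounds, so a short sketch makes the forecast concrete. The starting user count and users-per-server capacity below are illustrative assumptions:

```python
# Hedged sketch of compounding capacity forecasts: with a fixed monthly
# growth rate, project users and the servers needed under horizontal
# scaling. Inputs are made-up illustrative numbers.
import math

def forecast(current_users, monthly_growth, months, users_per_server):
    projected = current_users * (1 + monthly_growth) ** months
    return math.ceil(projected), math.ceil(projected / users_per_server)

users, servers = forecast(10_000, 0.20, 6, 2_500)
print(f"in 6 months: ~{users} users -> {servers} servers")
```

Note how quickly compounding bites: 20% monthly growth roughly triples the user base in six months, which is why forecasts should be revisited against observed data rather than set once.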
Ongoing Performance Monitoring

Lastly, ongoing performance monitoring is a critical skill that software engineers should cultivate. After an application is live, continuous monitoring allows teams to spot performance issues that may arise in real-world settings. Using tools like New Relic, Dynatrace, or Grafana helps engineers monitor application performance consistently. For instance, real-time monitoring can quickly alert teams when server response times exceed predefined limits, preventing user dissatisfaction. By integrating ongoing monitoring into their workflow, engineers foster a culture of performance awareness. Companies that prioritize performance monitoring often see conversion rates improve by up to 20% due to enhanced user experiences.

Time to Enhance Performance Engineering Skills

Mastering performance engineering skills is a necessity for software engineers, not just an option. With the increasing complexity of software systems, it is essential for engineers to possess the knowledge and tools required to ensure that applications meet crucial performance metrics. Focusing on understanding performance requirements, mastering performance testing tools, capacity planning and scalability, system architecture knowledge, and continuous performance monitoring can significantly boost engineers' effectiveness in this important field. As the demand for high-performance applications continues to rise, developing these skills will enhance individual careers while contributing to the success of software projects. Now is the time for aspiring engineers to invest in their own development and polish these performance engineering skills. Success is found in mastering these elements and effectively applying them to real-world challenges.
- Uncovering the Best: 5 Top Tools for Cutting-Edge Chip Benchmarking
In the fast-paced world of technology, chip benchmarking is vital. It helps engineers and developers measure the performance of semiconductor devices to keep up with advancements. This post dives into the top five tools for chip benchmarking, highlighting their features, benefits, and real-world applications.

1. Geekbench

Geekbench stands out as a cross-platform benchmarking tool for assessing CPU and GPU performance. Its versatility allows it to work seamlessly across different operating systems, making it a favorite among developers. With a massive database of devices, Geekbench offers detailed scores that let users compare their hardware easily. It measures both single-core and multi-core performance, crucial for modern chips that handle multiple tasks simultaneously. For instance, Geekbench allows you to see how a chip like the Apple M1 stacks up against Intel's latest processors. Setting up Geekbench is quick and user-friendly. It provides insights into memory and compute performance, making it an essential tool for hardware professionals. In fact, many developers report improvements of up to 30% in their designs after optimizing based on Geekbench results.

2. SPEC CPU Benchmark

The SPEC CPU Benchmark suite is trusted in the industry for evaluating CPU performance. Created by the Standard Performance Evaluation Corporation, it includes a set of diverse workloads assessing integer and floating-point calculations. SPEC offers reliable reports that reveal both efficiency and speed, enabling engineers to make data-driven decisions. For example, analysis from SPEC has helped companies like AMD refine their latest Ryzen processors, enhancing performance by approximately 25%. SPEC's rigorous validation ensures that results are credible for manufacturers and users alike. Its broad application makes it perfect for systems that demand high performance, such as servers running complex applications.

3. 3DMark

3DMark is essential for gamers and graphics professionals.
This graphical benchmarking tool primarily evaluates GPU performance in rendering graphics but can also provide key insights into chip performance concerning integrated graphics. 3DMark includes various tests reflecting real-world gaming scenarios. Users can examine frame rates and rendering speeds, helping them understand how their hardware performs under strain. For instance, the "Fire Strike" test can assess how well a system handles intensive gaming tasks, highlighting up to a 15% difference in performance between competing GPUs. Additionally, the "Time Spy" test evaluates DirectX 12 performance. These visual benchmarks not only present performance data engagingly but also help users spot design flaws in their chips.

4. LLaMA Benchmarks

LLaMA (Large Language Model Meta AI) benchmarks are designed to evaluate the performance of various language models across multiple tasks. These benchmarks provide a standardized way to measure the capabilities of models in understanding and generating human-like text. They include a wide range of tasks, such as text completion, question answering, and summarization, allowing researchers to assess the models' effectiveness in real-world applications. For instance, recent evaluations have shown that LLaMA models outperform previous iterations in generating coherent and contextually relevant text. One of the key features of LLaMA benchmarks is their focus on zero-shot and few-shot learning capabilities. This enables models to perform well on tasks they have not been explicitly trained for, showcasing their adaptability and generalization abilities.

5. GPT-3 Benchmarks

GPT-3 benchmarks provide a comprehensive framework for assessing the performance of the GPT-3 language model across various linguistic tasks. These benchmarks measure aspects such as fluency, coherence, and relevance in generated text.
The evaluation covers a variety of tasks, including language translation, text generation, and creative writing, allowing for a holistic view of the model's capabilities. For example, companies utilizing GPT-3 for content creation have reported significant improvements in engagement and quality due to the insights gained from these benchmarks. The user-friendly interfaces of the benchmarking tools associated with GPT-3 ensure that both novice and experienced users can easily interpret the results. This accessibility has led to widespread adoption in industries seeking to leverage advanced natural language processing technologies.

Making the Right Choice

Selecting the appropriate tool for chip benchmarking is crucial. Whether it's the adaptable Geekbench, the trusted SPEC CPU benchmark, the detailed 3DMark, or the model-focused LLaMA and GPT-3 benchmark suites, each tool provides unique insights that can foster innovation in chip development. By using these tools effectively, professionals can enhance the performance of semiconductor devices, keeping pace with rapid technological change. Investing in the right benchmarking tools is not just beneficial; it is vital for success in chip development.
- CPU-Centric HPC Benchmarking with miniFE and GROMACS
Benchmarks are vital for evaluating High-Performance Computing (HPC) system performance, guiding hardware choices, and optimizing software. This whitepaper focuses on understanding and overcoming bottlenecks in HPC benchmarks for CPU environments, specifically considering ARM/AARCH64 architectures, using miniFE and GROMACS as examples.

1. Introduction to miniFE and GROMACS Benchmarks

1.1. miniFE: A Finite Element Mini-Application

miniFE, part of the Mantevo suite, simulates implicit finite element applications. It solves sparse linear systems, with its core kernels focused on element-operator computation, assembly, sparse matrix-vector products (SpMV), and basic vector operations. It is an excellent benchmark for systems handling sparse linear algebra and iterative solvers. To run miniFE, you typically compile it with an MPI-enabled compiler; execution involves specifying the problem dimensions and the number of MPI processes.

# Example for a single node with 16 MPI tasks
srun -N 1 -n 16 miniFE.x -nx 264 -ny 256 -nz 256
# Example for a multi-node run (adjust N and n)
srun -N 4 -n 64 miniFE.x -nx 528 -ny 512 -nz 512

Note: srun is for SLURM; use mpirun or similar on other systems.

1.2. GROMACS: Molecular Dynamics Simulation Software

GROMACS (GROningen MAchine for Chemical Simulations) is a highly optimized open-source software package for molecular dynamics (MD) simulations. It models atomic and molecular movements, particularly for biochemical systems, and is efficient in calculating non-bonded interactions. A typical GROMACS workflow prepares the input files, then runs the simulation.

# Step 1: Prepare the run input file (.tpr)
gmx grompp -f pme.mdp -c conf.gro -p topol.top -o topol.tpr
# Step 2: Run the molecular dynamics simulation
mpirun -np 4 gmx_mpi mdrun -s topol.tpr -ntomp 4
# To run a specific benchmark system (e.g., 'benchPEP-h')
mpirun -np 4 gmx_mpi mdrun -s benchPEP-h.tpr -ntomp 4

Note: Tune the MPI processes (-np) and OpenMP threads (-ntomp) to your hardware.

2. Interpreting Performance Output (Benchmarking POV)

Understanding benchmark output is crucial for evaluating HPC system throughput and efficiency.

2.1. miniFE Performance Metrics

miniFE's output centers on Total CG Mflops (mega floating-point operations per second for the Conjugate Gradient solve), the main Figure of Merit (FOM). Higher values indicate better performance, reflecting the system's efficiency in sparse linear algebra, which is often limited by memory bandwidth and FPU throughput.

2.2. GROMACS Performance Metrics

GROMACS provides detailed output, with the key metric being ns/day (nanoseconds per day), the standard performance metric for GROMACS. It shows how many nanoseconds of simulated time can be computed per real-world day; a higher ns/day means a faster simulation. This metric is ideal for comparing different CPU architectures or configurations. Other useful outputs include total wall time and a breakdown of time spent in different force calculations, which helps pinpoint specific bottlenecks.

3. Bottlenecks in Running HPC Benchmarks

Achieving peak HPC performance requires identifying and mitigating the bottlenecks that limit system throughput.

3.1.
miniFE Specific Bottlenecks

miniFE is particularly sensitive to:
- Memory Bandwidth: the sparse matrix-vector product (SpMV) is highly memory-bandwidth bound due to irregular memory access patterns.
- Cache Misses: irregular accesses lead to frequent cache misses, increasing data-retrieval latency.
- Inter-node Communication (for large problems): for distributed problems, communication during assembly and the Conjugate Gradient solver can be limited by network latency and bandwidth.

3.2. GROMACS Specific Bottlenecks

For GROMACS, key bottlenecks include:
- CPU Core Performance & Threading: the number of cores and their individual performance (instructions per cycle (IPC), clock speed) directly impact ns/day. An optimal balance between MPI ranks and OpenMP threads per rank is crucial.
- Memory Bandwidth: the CPU needs to access large datasets frequently for force calculations.
- SIMD Vectorization: GROMACS relies heavily on CPU SIMD instructions (e.g., NEON). If the CPU architecture or compiler does not fully exploit these, performance suffers.
- Cache Utilization: efficient cache usage is critical for the main simulation loop.
- Inter-node Communication: for large systems simulated across multiple nodes, MPI communication for domain decomposition and force summation can be a significant bottleneck, even with fast interconnects.
- NUMA Effects: proper process and memory binding is crucial on multi-socket systems to minimize cross-socket memory access latency.
- Load Imbalance: uneven workload distribution across PP and PME ranks leads to idle compute units.

3.3. Dynamic Monitoring for Bottleneck Analysis (Frequency, Power, Temperature)

Beyond static analysis, dynamic monitoring of CPU frequency, power consumption, and temperature during benchmark execution provides invaluable insight for root-causing performance bottlenecks. This data, when mapped over the run duration, can reveal transient issues that logs alone might miss.
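The mapping idea above can be automated: given time-aligned samples of CPU frequency and the benchmark's figure of merit (miniFE Mflops or GROMACS ns/day), correlated dips can be flagged as likely throttling. A minimal sketch in Python; the 90% thresholds and the tuple format are illustrative assumptions, not part of either benchmark:

```python
def flag_throttling(samples, freq_ratio=0.9, fom_ratio=0.9):
    """Flag timestamps where CPU frequency and the benchmark's
    figure of merit (FOM) dip together, hinting at thermal or
    power throttling.

    samples: list of (timestamp, freq_mhz, fom) tuples.
    A sample is flagged when both its frequency and its FOM fall
    below the given fraction of their respective maxima.
    """
    max_freq = max(f for _, f, _ in samples)
    max_fom = max(m for _, _, m in samples)
    return [t for t, f, m in samples
            if f < freq_ratio * max_freq and m < fom_ratio * max_fom]

# Synthetic trace: frequency and Mflops dip together at t=30
trace = [(0, 3400, 5200), (10, 3400, 5150), (20, 3380, 5180),
         (30, 2600, 3900), (40, 3390, 5170)]
print(flag_throttling(trace))  # [30]
```

A dip in the FOM without a matching frequency drop would instead point at memory or communication effects, which is exactly the distinction section 3.3 draws.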
Application-Specific Context: For miniFE, if memory bandwidth is the primary bottleneck, the CPU may not be fully utilized, leading to lower-than-expected power consumption and temperatures even if the frequency remains high. Conversely, if the SpMV operations push the CPU's compute capabilities, sustained high power and temperature may be observed. Any sudden dip in Mflops alongside a frequency drop points directly to thermal or power throttling.

For GROMACS, which can be highly compute-intensive, sustained high power consumption and temperatures are common. Analyzing frequency, power, and temperature trends can reveal whether ns/day performance is limited by the CPU's ability to maintain its turbo frequencies under thermal constraints, or whether it is hitting a configured power envelope. Discrepancies between expected maximum performance and observed ns/day can often be explained by these dynamic system responses.

Tools for Monitoring: various tools can collect this data, including vendor-specific utilities (e.g., Intel's pcm, AMD's uProf), Linux tools (perf, turbostat, sensors), or IPMI/BMC interfaces for server-level metrics. Correlating these dynamic metrics with the benchmark's reported performance significantly aids precise bottleneck identification and system optimization.

Conclusion

Effective HPC benchmarking goes beyond simply running an application and reporting a single performance number. As demonstrated with miniFE and GROMACS in a CPU-centric environment, a deep understanding of the benchmark's computational characteristics is essential. Identifying whether a workload is memory-bound, compute-bound, or communication-bound is the first step toward optimizing performance. Furthermore, dynamic monitoring of CPU frequency, power consumption, and temperature provides invaluable diagnostic data.
By integrating performance metrics with detailed system telemetry, HPC administrators and researchers can precisely pinpoint bottlenecks, fine-tune system configurations, and ultimately extract the highest possible performance.
- Benchmarking and Validation of Workloads on Emulators
In this case study, we describe our systematic approach to benchmarking and validating workloads on FPGA platforms using HAPS (High-performance ASIC Prototyping System) models. The workflow involves compiling and cross-compiling a diverse set of workloads using both native QEMU and the open-source toolchain, executing them on FPGA hardware, and capturing detailed performance metrics such as instructions executed and cycle counts.

1. Benchmark Preparation and Build Process

We classify our benchmarks into the following categories:
- High-Performance Computing (HPC) Benchmarks: matrix multiplication, FFT, and other numerical kernels.
- Synthetic Benchmarks: Whetstone, Dhrystone, and other CPU stress tests.
- Algorithmic Benchmarks: sorting algorithms, graph traversal, and numerical integration.
- Cryptography and Security Benchmarks: AES, RSA, and SHA-based microbenchmarks (in the future pipeline).
- Memory and I/O Benchmarks: STREAM, memcpy stressors, and file read/write tests.
- Industry-standard Benchmarks: SPEC CPU2017 for the INT and FP tracks.

All benchmarks are first built or cross-compiled depending on their compatibility:
- Native Build: performed in a QEMU-based emulation environment where toolchain compatibility allows.
- Cross Compilation: done using a toolchain targeting the architecture for cases where a native build fails or is time-prohibitive.

[Chart: Application Categories Distribution]

2. Deployment and Execution on FPGA

The compiled binaries are deployed to the FPGA via HAPS models configured with a soft core. Execution is controlled using a lightweight shell interface or boot script. We use a custom performance-monitoring utility (akc_counter_capture) to gather the total instruction count and the cycle count. These values are stored for each benchmark run and used in performance comparisons.

3. Workload Examples

Example 1: DGEMM (Double-Precision General Matrix Multiply)

DGEMM is a key linear algebra kernel from the BLAS library.
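For reference, the operation DGEMM performs is C ← αAB + βC over double-precision matrices. A naive pure-Python sketch of the kernel semantics (optimized BLAS implementations block for cache reuse and vectorize, which this triple loop deliberately omits):

```python
def dgemm(alpha, A, B, beta, C):
    """Naive DGEMM: C <- alpha*A*B + beta*C for square N x N matrices
    stored as lists of lists. Real BLAS kernels tile for cache and use
    SIMD; this loop nest only shows the arithmetic being benchmarked."""
    n = len(A)
    for i in range(n):
        for j in range(n):
            acc = 0.0
            for k in range(n):
                acc += A[i][k] * B[k][j]
            C[i][j] = alpha * acc + beta * C[i][j]
    return C

# 2x2 example: alpha=1, beta=0 reduces to a plain matrix multiply
A = [[1.0, 2.0], [3.0, 4.0]]
B = [[5.0, 6.0], [7.0, 8.0]]
C = [[0.0, 0.0], [0.0, 0.0]]
print(dgemm(1.0, A, B, 0.0, C))  # [[19.0, 22.0], [43.0, 50.0]]
```

The 2N³ floating-point operations in this loop nest are what the captured instruction and cycle counts are ultimately measuring.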
We compiled and executed the DGEMM kernel using double-precision arithmetic with an N×N matrix size, where N = 256. Performance was evaluated using instruction count, cycle count, and IPC (instructions per cycle).

Example 2: N-Queens Problem

The N-Queens benchmark is a classic combinatorial search used to evaluate control-flow-heavy algorithm performance. It computes all valid arrangements of N queens on an N×N chessboard such that no two queens attack each other. We verified correctness by comparing the total number of valid solutions for standard board sizes (e.g., N = 12 and N = 14), which matched precisely across architectures. The benchmark's output was deterministic, and no deviations were observed across multiple FPGA runs.

Example 3: Red-Black Tree (RBTree) Manipulation

Red-Black Tree manipulation represents a memory-bound, pointer-intensive workload that tests dynamic memory access patterns and data-structure balancing algorithms. This benchmark was compiled both with the embedded toolchain and natively on QEMU for consistency. Validation involved verifying the in-order traversal of the tree after bulk insertions and deletions. RBTree serves as a robust test of both instruction scheduling and memory-subsystem behavior.

Conclusion

Our approach demonstrates that workloads can be effectively compiled, executed, and validated on FPGA platforms using HAPS models.
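The known N-Queens solution counts used for the validation above can be reproduced with a compact backtracking counter; this is a generic cross-checking sketch, not the benchmark's own implementation:

```python
def count_queens(n):
    """Count all placements of n non-attacking queens by backtracking
    row by row, tracking attacked columns and diagonals in sets."""
    def place(row, cols, diag1, diag2):
        if row == n:
            return 1
        total = 0
        for col in range(n):
            if col in cols or (row - col) in diag1 or (row + col) in diag2:
                continue  # square is attacked by an earlier queen
            total += place(row + 1, cols | {col},
                           diag1 | {row - col}, diag2 | {row + col})
        return total
    return place(0, frozenset(), frozenset(), frozenset())

print(count_queens(8))  # 92 known solutions for the 8x8 board
```

Because the count is deterministic, any deviation between the FPGA run and a reference count like this immediately signals a functional problem rather than a performance one.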
- Open-Source Benchmarking Tools with Ad-Hoc Extension
Automation is essential for performance benchmarking because it ensures that results are reliable, repeatable, scalable, and comparable, and tools bring standardization, accuracy, and efficiency to performance evaluation. Several open-source benchmarking tools support ad-hoc extensibility, meaning they can be customized or extended without rebuilding or heavily modifying the core codebase. These tools provide flexibility in creating custom test scenarios, simulating various workloads, and adapting to new APIs or environments.

The tools we used for benchmarking:
- Phoronix Test Suite
- PerfKit Benchmarker

Phoronix Test Suite: The Phoronix Test Suite is the most comprehensive open-source benchmarking platform available for Linux, macOS, and Windows systems. It is widely used for automated testing, performance analysis, and software comparisons.

What is a PTS Extension? A PTS extension is a plugin or add-on for the Phoronix Test Suite (PTS) that extends its functionality. It allows users to add custom behaviors before, during, or after benchmark runs, which is ideal for automation, integration, or custom logging. We use PTS extensions to:
- Add full-socket runs
- Add open-source Docker tests
- Integrate with other systems
- Perform system and hardware benchmarking

Why Shift from PTS to PerfKit Benchmarker? The Phoronix Test Suite is primarily a single-node benchmarking tool that runs on a single machine. To overcome this limitation we use PerfKit Benchmarker (PKB), which is specifically built for cloud platforms. PKB handles provisioning, benchmarking, monitoring, and cleanup automatically, whereas PTS requires manual test setup, especially for cloud VMs. PKB can push benchmark data to:
- InfluxDB
- Stackdriver
- Grafana
- JSON logs for CI/CD systems

PTS does offer HTML/JSON/CSV output but lacks native telemetry integrations.
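As an illustration of the kind of telemetry integration discussed above, results can be aggregated from a machine-readable export before being pushed to a dashboard. The sketch below assumes newline-delimited JSON records with metric, value, and unit fields; this is a simplified stand-in for a tool's JSON output, not PKB's exact schema:

```python
import json

def summarize_results(lines):
    """Group benchmark samples by (metric, unit) and report the mean.
    Each input line is a JSON record such as
    {"metric": "throughput", "value": 940.0, "unit": "Mbits/sec"}."""
    by_metric = {}
    for line in lines:
        rec = json.loads(line)
        key = (rec["metric"], rec["unit"])
        by_metric.setdefault(key, []).append(rec["value"])
    return {f"{m} ({u})": sum(vals) / len(vals)
            for (m, u), vals in by_metric.items()}

raw = [
    '{"metric": "throughput", "value": 940.0, "unit": "Mbits/sec"}',
    '{"metric": "throughput", "value": 960.0, "unit": "Mbits/sec"}',
    '{"metric": "latency", "value": 0.4, "unit": "ms"}',
]
print(summarize_results(raw))
# {'throughput (Mbits/sec)': 950.0, 'latency (ms)': 0.4}
```

A summary dictionary like this maps directly onto the time-series points that InfluxDB or Grafana ingest.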
PerfKit Benchmarker (PKB): PerfKit Benchmarker is an open-source tool developed by Google that automates the process of benchmarking cloud infrastructure across different cloud providers.

Main Stages of a PerfKit Benchmarker Run:

What Is a PerfKit Benchmarker Extension? Extensions allow users to define:
- Custom benchmarks
- Flags
- Providers
- Workloads

Top Benefits of a PerfKit Benchmarker Extension:
- PKB can run distributed benchmarks involving multiple VMs across one or more cloud zones or providers.
- It automatically handles VM provisioning, software installation, test execution, and teardown.
- It integrates easily with dashboards, analytics pipelines, or cost/performance reports.
- It is useful in capacity planning, performance regression testing, and SLI validation.

In addition, the PKB extension supports turbostat (useful for analyzing power and frequency behavior during benchmarks), lm-sensors (a Linux utility for monitoring hardware sensors), and sysstat (for analyzing CPU, memory, disk I/O, networking, and other system-level performance metrics). The PKB extension also supports report generation, producing a report with all results and peripheral data in formats such as TXT, CSV, and HTML.

Here is a set of workload charts for PerfKit Benchmarker (PKB), organized by category. These charts summarize the common benchmark workloads available in PKB, helping you choose the right tests for CPU, memory, disk, network, and database performance analysis across cloud platforms.

Cloud Comparison Using PerfKit Benchmarker: here is a comprehensive comparison of cloud providers (GCP, Azure, OCI) using PerfKit Benchmarker (PKB) as a common benchmarking framework.

Conclusion

PTS is excellent for deep technical benchmarking of a single system. PKB is a robust choice for cloud performance comparisons, cost evaluation, and infrastructure benchmarking at scale.
- Understanding DLRM with PyTorch
DLRM stands for Deep Learning Recommendation Model. It is a neural-network architecture developed by Facebook AI (Meta) for large-scale personalized recommendation systems, and it is widely used in real-world applications where personalized recommendations or ranking predictions are needed. DLRM is designed for click-through rate (CTR) prediction and ranking tasks; examples include online advertising, e-commerce recommendations, social media feed ranking, streaming services, and online marketplaces and classifieds.

DLRM features:

DLRM Installation Options:
- Install the original Facebook DLRM (PyTorch) using git and Python
- Install DLRM using TorchRec
- Install NVIDIA DLRM
- Install DLRM in Docker (CPU-only or GPU)

What Is the Relationship Between DLRM and PyTorch? DLRM is built using PyTorch; PyTorch serves as the foundational deep-learning framework that powers every component inside DLRM. PyTorch is the framework and DLRM is the model: DLRM is not a framework but a specific neural-network architecture designed by Meta (Facebook) for large-scale recommendation systems. PyTorch provides the core building blocks (tensors, automatic differentiation, and neural-network modules), and DLRM uses these tools to construct its dense MLPs, embedding tables, and feature-interaction layers.

PyTorch Installation Options: PyTorch can be installed in several ways depending on your environment, hardware, and workflow:
- Install via pip (most common and easiest)
- Install via Conda (best for GPU environments)
- Install via Docker (isolated and production-friendly)
- Install from source (for developers and custom builds)
- Cloud-based PyTorch installation
- Install via package managers (limited OS support)

PyTorch Installation via Docker: installing PyTorch through Docker is one of the most reliable and hassle-free ways to set up a deep-learning environment. Instead of manually managing Python versions, CUDA toolkits, cuDNN libraries, and system dependencies, Docker provides a pre-configured container where everything already works out of the box.
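The feature-interaction step mentioned above, where DLRM takes pairwise dot products between the dense (bottom-MLP) output and the sparse embedding vectors, can be sketched in plain Python. This is a conceptual illustration only; the real model performs it as a batched tensor operation in PyTorch:

```python
def dot_interactions(vectors):
    """Pairwise dot-product feature interaction, as in DLRM:
    given the dense feature vector plus one embedding vector per
    sparse feature (all the same length), return the dot product
    of every unordered pair. DLRM concatenates these interaction
    terms with the dense vector and feeds them to the top MLP."""
    out = []
    for i in range(len(vectors)):
        for j in range(i + 1, len(vectors)):
            out.append(sum(a * b for a, b in zip(vectors[i], vectors[j])))
    return out

# Dense output plus two embeddings of dimension 3 -> 3 pairwise terms
dense = [1.0, 0.0, 2.0]
emb_user = [0.5, 1.0, 0.0]
emb_item = [1.0, 1.0, 1.0]
print(dot_interactions([dense, emb_user, emb_item]))  # [0.5, 3.0, 1.5]
```

This is why the `--arch-sparse-feature-size` and bottom-MLP output dimensions must agree: every vector entering the interaction must have the same length.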
By pulling an official PyTorch image, either CPU-only or with CUDA support, you get an isolated and reproducible environment that runs identically on any machine.

Quick steps:
1. Pull an image
CPU-only: docker pull pytorch/pytorch:latest
GPU (CUDA 11.8 example): docker pull pytorch/pytorch:latest-cuda11.8-cudnn8-runtime
2. Run the container
CPU: docker run -it pytorch/pytorch:latest bash
GPU (with the NVIDIA container toolkit): docker run -it --gpus all pytorch/pytorch:latest-cuda11.8-cudnn8-runtime bash
3. Verify inside the container
python3 -c "import torch; print(torch.__version__); print('cuda:', torch.cuda.is_available())"

How to Run DLRM Inside a PyTorch Docker Container?
1. Pull a PyTorch Docker image
2. Start the container
3. Install dependencies (inside the container)
4. Clone the DLRM repository
5. Run DLRM

DLRM Commands: running DLRM effectively requires understanding the key command-line options that control data loading, model architecture, training configuration, and performance tuning. DLRM accepts a rich set of flags that allow you to configure everything from batch sizes to embedding dimensions. These options fall into four major categories: data options, training options, model-architecture options, and system/performance options.

A frequently used DLRM command:
python dlrm_s_pytorch.py \
  --data-generation=synthetic \
  --mini-batch-size=2048 \
  --learning-rate=0.01 \
  --arch-sparse-feature-size=16 \
  --arch-mlp-bot="13-512-256-64-16" \
  --arch-mlp-top="512-256-1" \
  --print-freq=10

Conclusion

Using PyTorch Docker containers to run DLRM (Deep Learning Recommendation Model) provides a streamlined, consistent, and reproducible environment across different hardware platforms. Docker eliminates dependency conflicts, simplifies setup, and ensures that the exact software stack (PyTorch version, libraries, and optimizations) can be deployed seamlessly.
In short, PyTorch Docker + DLRM offers a reliable, flexible, and efficient path to train, evaluate, and deploy recommendation models with minimal friction.












