- AWS Graviton4 vs. GCP Axion
This blog post dives into a head-to-head performance comparison of two leading contenders: AWS Graviton4 (powering AWS r8g instances) and Google Axion (powering GCP Axion instances), both built on the Arm Neoverse-V2 architecture. We'll examine their performance with Valkey 8.0.1, a popular in-memory data store.

The Contenders: AWS Graviton4 and Google Axion
AWS Graviton4 and Google Axion represent the latest generation of Arm-based server processors from Amazon and Google. Both leverage the Arm Neoverse-V2 CPU architecture, which is designed for cloud computing, machine learning, and high-performance computing (HPC). These custom chips aim to provide superior performance and energy efficiency compared to traditional x86-based alternatives.

The Benchmark: Valkey 8.0.1
To conduct a meaningful comparison, we chose Valkey 8.0.1, a high-performance, open-source in-memory data structure store. Valkey is a fork of Redis and is widely used for caching, session management, and real-time analytics, making it an excellent workload for testing the raw processing and memory capabilities of these instances.

Our benchmark setup was configured to ensure a fair comparison:
- Valkey server cores: The Valkey server was pinned to cores 2 through 7.
- Request parameters: Each experiment used 100 million requests, 256 parallel clients, and a payload size of 1024 KiB.
- Performance metrics: We focused on two key metrics: requests per second (RPS) for throughput and P99 latency (the 99th percentile) for responsiveness.

Experiment 1: Network Performance
The first experiment tested performance with the Valkey client and server running on separate hosts within the same cluster network. This scenario highlights the efficiency of the underlying network virtualization and interconnects, which are critical for many distributed workloads.

IRQ pinning: For this test, we pinned IRQs to cores 0 and 1. Dedicating specific CPU cores to handling network interrupts prevents them from interfering with the Valkey server's workload and ensures a more stable and accurate network performance measurement.

Distributed application results:

Metric            AWS r8g     GCP Axion
SET RPS           925,860     790,020
SET P99 latency   0.431 ms    0.655 ms
GET RPS           941,802     870,920
GET P99 latency   0.415 ms    0.543 ms

In this network-bound test, the AWS r8g instances consistently outperformed GCP Axion in both SET and GET operations, with higher throughput and lower P99 latency. This suggests that the AWS Nitro System's networking capabilities, which are tightly integrated with the Graviton4 processor, provide a notable advantage for distributed, network-sensitive applications.

Experiment 2: Same-Host Performance
The second experiment evaluated raw processing power by running both the Valkey client and server on the same host. This test minimizes network overhead and focuses on CPU and memory performance.

Same-host application results:

Metric            AWS r8g     GCP Axion
SET RPS           1,024,894   894,262
SET P99 latency   0.407 ms    0.367 ms
GET RPS           1,060,186   942,720
GET P99 latency   0.359 ms    0.303 ms

Here, the results reveal a more nuanced picture. While AWS r8g instances again delivered higher overall throughput (RPS), the GCP Axion instances demonstrated lower P99 latency for both SET and GET operations.
This indicates that while AWS's architecture may be optimized for achieving maximum throughput, Google's design seems to prioritize low-latency performance, which is a key characteristic of Valkey's command execution model.

Conclusion and Analysis
The benchmark results paint a clear picture: for this specific workload, AWS Graviton4-based r8g instances lead in raw throughput, while Google Axion instances excel in latency.
- AWS r8g (Graviton4): The higher RPS in both experiments suggests that AWS's implementation is highly optimized for parallel and high-throughput workloads, likely due to a tight integration with the AWS Nitro System.
- GCP Axion: The lower P99 latency on the same-host test is a significant indicator. It suggests that Google's Axion processor might have a more efficient core design or cache structure that benefits workloads where low-latency performance is paramount.
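To make the setup above concrete, here is a hedged sketch of how such a run could be reproduced with core pinning on the server and a separate load-generator host. It assumes the valkey-server and valkey-benchmark binaries shipped with Valkey, a hypothetical server address, and NIC interrupt names that match your instance type; the payload flag is in bytes and, like the other parameters, should be adjusted to mirror the configuration described in the post rather than taken as our exact command line.

```bash
#!/usr/bin/env bash
# Minimal sketch of a pinned Valkey benchmark run (assumptions: valkey-server and
# valkey-benchmark binaries, a two-host setup, and root access for IRQ affinity).

SERVER_IP=10.0.0.10   # hypothetical server address

# On the server host: pin network IRQs to cores 0-1 (adjust the grep pattern to
# your NIC driver, e.g. ena on AWS or virtio on GCP), then pin the server to 2-7.
for irq in $(grep -iE 'ena|virtio|mlx' /proc/interrupts | awk '{sub(":","",$1); print $1}'); do
    echo 0-1 | sudo tee /proc/irq/${irq}/smp_affinity_list > /dev/null
done
taskset -c 2-7 valkey-server --save '' --appendonly no &

# On the client host: drive SET/GET traffic with 256 parallel clients.
valkey-benchmark -h "$SERVER_IP" -p 6379 \
    -c 256 -n 100000000 -d 1024 -t set,get
```

The same pattern works for the same-host experiment by running both commands on one machine with disjoint core sets.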
- RISC-V Fuzzer for GCC and LLVM
Fuzzing RISC-V compilers like GCC and LLVM is a crucial practice for ensuring the correctness and security of the entire software ecosystem built on this architecture. It's not about finding vulnerabilities in the final compiled code, but rather about discovering bugs within the compiler itself that could lead to incorrect code generation, unexpected behavior, or even exploitable flaws.

Why Compiler Fuzzing is a Unique Challenge
Fuzzing compilers is different from fuzzing a typical application. Instead of feeding random data to a program, you're generating random, yet syntactically valid, source code to feed to the compiler. A dumb fuzzer that just mutates bytes will quickly generate code that can't even be parsed, missing deeper bugs. The primary goal of compiler fuzzing is to detect two main types of bugs:
- Crashes and panics: The fuzzer generates code that causes the compiler to crash, hang, or throw a fatal error during compilation. This indicates a compiler bug that needs to be fixed.
- Miscompilations: This is the most dangerous type of bug. The compiler successfully compiles the fuzzed code, but the generated machine code (the RISC-V assembly) is incorrect. This can lead to silent data corruption, security vulnerabilities, or unpredictable program behavior. Finding these requires a technique called differential fuzzing.

The Power of Differential Fuzzing for RISC-V Compilers
Differential fuzzing is an exceptionally powerful technique for finding miscompilations in RISC-V compilers. Here's how it works (a minimal sketch of such a loop appears at the end of this post):
1. A fuzzer, often one that generates valid C or C++ code (like csmith), creates a unique program.
2. This program is compiled by at least two different compilers (e.g., GCC and LLVM) or with different optimization flags (e.g., -O0 and -O3).
3. The compiled binaries are then executed, and their outputs are compared.
4. If the outputs don't match, at least one of the compilers has a miscompilation bug. The fuzzer then saves this specific source code as a test case for a developer to analyze.
This method effectively uses a "test oracle" to automatically identify bugs without needing to know the correct output beforehand. It's a key reason why so many compiler bugs have been found in both GCC and LLVM.

Key Tools and Repositories for RISC-V Compiler Fuzzing
While many general-purpose fuzzers (like AFL++) can be used to fuzz a compiler's source code, specialized tools are often needed to effectively generate and test valid RISC-V-specific code.
- csmith: A well-known, randomized test case generator for C programs. It creates complex, valid C code that is a perfect input for differential testing of C compilers like GCC and LLVM. While not RISC-V-specific, it's an essential part of the workflow for fuzzing any C compiler's RISC-V backend. GitHub repo: https://github.com/csmith-project/csmith
- RISCV-DV: Maintained by the RISC-V community, this tool is primarily for design verification of RISC-V processors, but it can be used to generate complex instruction sequences for testing compiler backends. It's highly configurable and can target specific ISA extensions. GitHub repo: https://github.com/google/riscv-dv
- IRFuzzer: A specialized fuzzer for the LLVM backend. Instead of generating C/C++ source code, it generates LLVM's Intermediate Representation (IR), allowing it to directly test backend code generation without worrying about frontend bugs. This is a very targeted approach for finding issues in LLVM's RISC-V code generator.
  GitHub: As a research tool, resources are often found on arXiv and university websites; searching for "IRFuzzer" on GitHub will lead to related projects.
- RISCV-Vector-Intrinsic-Fuzzing (RIF): A fuzzer designed to generate random code using the RISC-V Vector Extension (RVV) intrinsics. This is crucial for verifying that compilers like GCC and LLVM correctly implement this complex and performance-critical part of the RISC-V ISA. GitHub repo: https://github.com/sifive/riscv-vector-intrinsic-fuzzing
- patrick-rivos/compiler-fuzz-ci: This GitHub repository provides a great example of a Continuous Integration (CI) setup for fuzzing RISC-V compilers. It demonstrates how to combine tools like csmith with a CI pipeline to automatically fuzz GCC and LLVM and report bugs. GitHub repo: https://github.com/patrick-rivos/compiler-fuzz-ci

Fuzzing RISC-V compilers is an ongoing and critical effort. It ensures that the software developers are building on top of is reliable, secure, and correctly translated to the underlying hardware, strengthening the entire RISC-V ecosystem.
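As promised above, here is a minimal sketch of a csmith-based differential fuzzing loop. It assumes csmith, a riscv64 GCC cross toolchain, Clang, and qemu-riscv64 user-mode emulation are installed; file names and the iteration count are arbitrary, and the Clang invocation may need extra --sysroot/--gcc-toolchain options depending on how your cross environment is set up.

```bash
#!/usr/bin/env bash
# Differential-fuzzing loop (sketch): generate a csmith program, build it with two
# RISC-V compilers, run both under QEMU, and keep the source if the outputs differ.
set -u
CSMITH_INC=/usr/include/csmith   # assumption: csmith runtime headers live here

for i in $(seq 1 1000); do
    csmith > test.c

    riscv64-linux-gnu-gcc -O2 -I"$CSMITH_INC" test.c -o test_gcc || continue
    clang --target=riscv64-linux-gnu -O2 -I"$CSMITH_INC" test.c -o test_clang || continue

    out_gcc=$(timeout 10 qemu-riscv64 -L /usr/riscv64-linux-gnu ./test_gcc)
    out_clang=$(timeout 10 qemu-riscv64 -L /usr/riscv64-linux-gnu ./test_clang)

    if [ "$out_gcc" != "$out_clang" ]; then
        echo "Mismatch on iteration $i - saving reproducer"
        cp test.c "mismatch_$i.c"
    fi
done
```

A real CI harness would add crash/hang detection and automatic test-case reduction (e.g., with creduce) before filing a report.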
- From Classroom to Code: Our Transformative Journey as Interns at WhileOne
The Leap into the Unknown
Stepping out of the academic bubble and into the professional world is often painted as a daunting transition. For us, it was less a leap of faith and more an excited dive into the deep end, specifically into the innovative waters of WhileOne. Our motivation to join was simple yet profound: we sought a place where curiosity was celebrated, challenges were seen as growth opportunities, and real-world impact was a daily pursuit. Little did we know that this internship would not just introduce us to company life but fundamentally reshape our technical acumen and career outlook.

Interns Interning

Unpacking the Technical Toolkit: What We've Learned

Tanaya Ajgar: Diagnostic Configuration Dashboard for BeagleBone Black
I developed a diagnostic configuration page for the BeagleBone Black board, which gave me valuable hands-on experience in full-stack development. I worked on building a responsive React + HTML frontend and integrated it with a Python Flask backend to enable seamless communication with the board. Through this, I learned how to design and implement interactive dashboards that allow users to configure system parameters such as IP addresses and ensure that the updates persist at the system level. I also gained practical knowledge of storing configuration results in both SQLite databases and JSON files for reliability and easy retrieval. This project helped me strengthen my understanding of REST APIs, data flow between frontend and backend, and the importance of efficient database integration in embedded system applications. I also improved my debugging skills while resolving real-time hardware-software interaction issues. Additionally, I learned how to apply UI/UX practices to make technical dashboards more intuitive and user-friendly. Overall, the project enhanced my skills in embedded system integration, web technologies, and problem-solving in a real-world scenario.

Soham Gargote:
During my internship, I had the valuable opportunity to contribute to two diverse and impactful projects. I delved into low-level systems programming by extending the open-source GDB debugger to enable support for RISC-V vector instructions. In parallel, I was instrumental in creating a new internal tool for benchmark management, where I developed the backend for its UI/UX visualization capabilities. This dual exposure to both open-source contributions and internal tool development made for an incredibly fun and enriching learning experience, significantly strengthening my software engineering skills.

Saee Gade: RISC-V Toolchain Validation & Compiler Fuzzing
As an intern, my work on RISC-V toolchain validation taught me the immense value of compiler fuzzing and its role in software reliability. I gained hands-on experience using tools like Csmith to automatically generate complex test cases and uncover hidden bugs. Beyond just bug hunting, the project's most significant takeaway for me was the process of creating high-quality, actionable bug reports. I learned the critical skill of creating minimal, reproducible test cases and effectively communicating findings to developers on platforms like Bugzilla. Contributing to major open-source projects like GCC and LLVM showed me the real-world dynamics of collaborative development and the tangible impact my work could have on improving the stability of a key toolchain for an entire ecosystem.
Ruchi Joshi: Technical Takeaways from the Benchmarking Project
Through hands-on experience with benchmarking, I learned to evaluate system performance using industry-standard HPC benchmarks like MiniFE and HPCG. I gained practical skills in writing automated shell scripts to test CPU and GPU performance across different architectures. This project also provided me with the opportunity to work with low-level hardware performance counters, learning to collect data on retired instructions, cache misses, and branch mispredictions. I now understand how to translate this raw data into higher-level microarchitecture insights using Intel's Top-Down Microarchitecture Analysis Methodology (TMAM) to identify critical bottlenecks. Furthermore, I explored Arm's Performance Monitoring Unit (PMU), which gave me insight into the distinct tooling and counter availability between the Intel and Arm ecosystems. This holistic experience has provided me with a comprehensive understanding of performance analysis, from high-level application benchmarks down to low-level hardware counters.

The WhileOne Way: A Glimpse into Company Life
Our general experience as interns has been overwhelmingly positive. The atmosphere at WhileOne is one of collaborative energy, where questions are encouraged and mentorship is readily available. It's a far cry from the sometimes solitary nature of academic projects. Joining WhileOne feels like becoming part of a forward-thinking family. There's a palpable sense of innovation and a shared drive to create impactful solutions.

The differences between college and company life are stark but refreshing. In college, deadlines can feel somewhat arbitrary, and projects often exist in a vacuum. Here, every task has a purpose, directly contributing to a product or service. The pace is faster and the stakes are higher, but the support system is robust. Learning is continuous, driven by real-world problems rather than theoretical exercises.

Navigating Opportunities and Challenges
Our internship presented a wealth of opportunities:
- Direct contribution to live projects: Seeing our code go into production was incredibly motivating.
- Mentorship from experienced engineers: Their guidance has been instrumental in our growth.
- Exposure to diverse technologies and methodologies: Expanding our technical horizons significantly.

Challenges were equally present and equally valuable:
- Steep learning curve: Rapidly adapting to new tools and complex systems.
- Problem-solving under pressure: Learning to debug efficiently and think critically when faced with unexpected issues.
- Balancing multiple tasks: Juggling different responsibilities and prioritizing effectively.

A New Beginning
As we reflect on our journeys from curious students to contributing members of the WhileOne team, we are filled with gratitude and excitement. This internship has been more than just a stepping stone; it's been a foundational experience that has shaped our technical abilities, professional outlook, and career aspirations. The transition from classroom concepts to production code has been challenging yet incredibly rewarding. If you're considering an internship, especially one where real impact is made, we wholeheartedly recommend diving in. The future, for us, is bright and brimming with code, collaboration, and continuous learning, all thanks to our transformative time at WhileOne.
- Neoverse-V2 Support for Intel PerfSpect
We recently worked on extending Intel PerfSpect ( https://github.com/Whileone-Techsoft/PerfSpect/tree/Neoverse-native-support ), a robust command-line performance analysis tool that implements the Top-Down Microarchitecture Analysis Method (TMAM), so that it fully supports the Arm Neoverse-V2 architecture. This project required mapping the Performance Monitoring Unit (PMU) events on the Arm cores to the metrics of the TMAM methodology. We can now get the Level 1 breakdown (Frontend Bound, Backend Bound, Retiring, Bad Speculation) to pinpoint bottlenecks on these systems, which was previously not possible with this tool. Through debugging the code, it also became possible to generate continuous time-series graphs to see how the bottleneck evolves over the course of a run. This extension of PerfSpect for Arm (our code allows native compilation on Arm) additionally enables the CPU utilization heat map in the generated telemetry reports, which shows the distribution of work across all cores over time.

The challenge was mapping the Arm events to the TMAM formulae, and then validating the values captured by the modified PerfSpect tool against values calculated manually from those formulae. The key learnings from this project were quickly adapting to a new programming language, Go (Golang), and making significant changes in the code to get the appropriate results. The project also deepened our knowledge of the TMAM methodology and of the specific challenges of cross-architecture analysis, particularly translating PMU events from the Intel ecosystem to Arm's Neoverse-V2 core.
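To make the event-to-metric mapping concrete, here is a small, hedged sketch of how a top-down Level 1 breakdown can be computed from Arm PMU events with standard Linux perf. The event names are the generic Arm PMU names exposed by perf on Neoverse cores (availability depends on kernel and PMU support), the 8 issue slots per cycle is an assumption for Neoverse V2, and the formulas follow the simplified form of Arm's published top-down guidance; this is an illustration, not PerfSpect's exact implementation.

```bash
#!/usr/bin/env bash
# Sketch: collect Neoverse PMU events with perf and compute a simplified
# top-down Level 1 breakdown. SLOTS=8 is assumed for Neoverse V2.

CMD=${1:-sleep 10}   # workload to profile (placeholder default)

perf stat -x, -o pmu.csv -e \
cpu_cycles,op_spec,op_retired,stall_slot,stall_slot_frontend,stall_slot_backend \
-- $CMD

awk -F, '
  { v[$3] = $1 }                       # perf -x, rows: value,unit,event,...
  END {
    slots   = 8 * v["cpu_cycles"]
    used    = 1 - v["stall_slot"] / slots
    retire  = 100 * (v["op_retired"] / v["op_spec"]) * used
    badspec = 100 * (1 - v["op_retired"] / v["op_spec"]) * used
    fe      = 100 * v["stall_slot_frontend"] / slots
    be      = 100 * v["stall_slot_backend"]  / slots
    printf "Retiring %.1f%%  Bad Speculation %.1f%%  Frontend %.1f%%  Backend %.1f%%\n",
           retire, badspec, fe, be
  }' pmu.csv
```

PerfSpect performs this mapping internally and additionally samples the counters over time, which is what drives the time-series graphs and heat maps mentioned above.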
- Debugging the Debugger: A Deep Dive into GDB and RISC-V
In the world of software development, the GNU Debugger (GDB) is an essential tool for programmers. It allows us to peer inside a running program, find bugs, and understand complex code. As new hardware architectures emerge, it's crucial that our tools keep pace. One such rising star is RISC-V, an open-source instruction set architecture that is rapidly gaining popularity, particularly with its new vector extensions for high-performance computing.

The Challenge: An Unknown Instruction
Recently, our team took on a task to get a few bug fixes into GDB. The challenge: GDB was unable to recognize or debug vector instructions for the RISC-V architecture. This was a significant gap, hindering developers who were working with advanced RISC-V features. Without this support, debugging modern, high-performance RISC-V applications was a major challenge. My task was to dive into the GDB source code and enable this missing capability.

Navigating a Sea of Code
The first and most significant hurdle was the sheer scale of the GDB codebase. As a newcomer to such a vast and mature open-source project, understanding the intricate flow of control and finding the right place to intervene was a daunting task. The initial phase involved a lot of learning and exploration, and I'm grateful for the guidance of my colleagues who helped me navigate the complexities and build a mental map of the system.

Through a careful process of debugging the debugger itself, we were able to trace the execution path for instruction processing. The breakthrough came when we identified the root cause of the issue: a missing function call responsible for reading and interpreting the new vector instructions. The logic was there, but it was never being invoked for this specific case.

With the problem identified, we implemented an initial solution. Our contribution was a hardcoded fix that proved the concept and successfully enabled GDB to recognize the vector instructions. This initial patch paved the way for a more robust and integrated solution that was later refined by other contributors in the open-source community.

The result is a direct enhancement to a critical developer tool. Programmers working with RISC-V can now debug vector-based code more effectively, accelerating development and improving software quality within the ecosystem.
https://www.sourceware.org/pipermail/gdb-patches/2025-May/217880.html
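For readers who have not debugged RVV code before, the sketch below shows roughly what this capability enables, using a cross GDB against a binary running under QEMU's user-mode emulator with a GDB stub. The toolchain triple, port, and file name are placeholders, and inspecting vector state depends on the target exposing it; treat this as an illustration of the workflow rather than the exact setup used for the patch.

```bash
# Build a vector-enabled test program (placeholder source and toolchain names).
riscv64-linux-gnu-gcc -g -O0 -march=rv64gcv vec_test.c -o vec_test

# Start it paused under QEMU user-mode emulation with a GDB stub on port 1234,
# enabling the V extension on the emulated CPU.
qemu-riscv64 -cpu rv64,v=true -g 1234 -L /usr/riscv64-linux-gnu ./vec_test &

# Attach a RISC-V-aware GDB and inspect vector state once it is supported.
gdb-multiarch ./vec_test \
    -ex 'target remote :1234' \
    -ex 'break main' -ex 'continue' \
    -ex 'info registers vl vtype' \
    -ex 'info registers v0 v1'
```

Before the fix described above, stepping through such code could leave GDB confused about the vector instructions it encountered; with the support in place, vector registers and CSRs can be examined like any other machine state.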
- Top CPU Performance Benchmarking Toolkits You Should Know
Modern compute platforms - from cloud hyperscale CPUs to edge processors - deliver unprecedented parallelism and instruction-set capabilities. But to truly understand performance, you need the right benchmarking tools. Whether you're comparing cloud instances, evaluating Arm-based servers like Ampere, or validating x86, RISC-V, or AI-accelerated hardware, the ecosystem offers several battle-tested frameworks. In this blog, we explore the most widely used CPU benchmarking toolkits today - what they do, where they shine, and when to use each.

1. Ampere Performance Toolkit (APT)
Ampere's servers built on the Arm architecture are optimized for cloud-native performance and power efficiency. The Ampere Performance Toolkit provides a set of scripts, automation, and recommended benchmarks to evaluate real-world workloads.
Best for:
✔ Evaluating Arm server performance
✔ Cloud benchmarking on Ampere instances
✔ Developers migrating workloads from x86 to Arm

2. PerfKit Benchmarker (Google)
Originally built by Google, PerfKit Benchmarker (PKB) is the gold standard for cloud performance benchmarking across providers.
Best for:
✔ Comparing cloud VM types
✔ Reproducible benchmark automation
✔ Cloud procurement and architectural evaluations
Fun fact: PKB has become the foundation for multiple forks and extensions across companies and academia for transparent benchmarking.

3. Phoronix Test Suite (PTS)
The Phoronix Test Suite is one of the largest open-source benchmarking ecosystems - great for developers and hardware reviewers.
Best for:
✔ Broad CPU and system benchmarking
✔ Linux performance testing
✔ Reviewers, researchers, and enthusiasts

4. SPEC CPU Suite
The Standard Performance Evaluation Corporation (SPEC) CPU suites are industry-trusted benchmarks for vendors and OEMs.
Best for:
✔ Enterprise-grade server benchmarking
✔ Official vendor comparisons
✔ Performance engineering and compiler tuning
Note: Requires a paid license.

5. Microbenchmark Suites (Core Latency, Memory, IPC)
Sometimes, detailed architectural behavior matters more than high-level scores. Popular tools include sysbench, lmbench, and perf.
Best for:
✔ Low-level CPU behavior
✔ Memory latency & bandwidth analysis
✔ Performance debugging

ML & AI-Centric Benchmarks (Emerging)
Even CPU evaluations increasingly involve AI workloads, with suites such as MLPerf, HPL, and HPCG.
Best for:
✔ AI inference on CPUs
✔ Edge compute & acceleration evaluations

Bonus: Build-Your-Own Benchmark Harness
Cloud providers and silicon vendors often implement custom harnesses around:
- Docker-ized workloads
- Kubernetes load-generation frameworks
- Real-app benchmarking (Redis, NGINX, PostgreSQL, Spark)
For engineering teams, custom workload pipelines often reveal more than synthetic scores.

Summary Table

Toolkit                      Scope                              Best Use Case
Ampere Performance Toolkit   Server-class Arm systems           Cloud-native Arm benchmarking
PerfKit Benchmarker          Multi-cloud benchmarking           Cloud instance comparisons
Phoronix Test Suite          Broad system benchmark suite       Linux and multi-OS testing
SPEC CPU                     Industry-standard CPU benchmarks   Formal server performance publication
sysbench / lmbench / perf    Microbenchmarks & counters         CPU profiling & tuning
MLPerf / HPL / HPCG          AI & HPC performance               Compute-heavy + scientific workloads
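As a taste of how lightweight these toolkits are to drive, the commands below show typical invocations of the Phoronix Test Suite and PerfKit Benchmarker. The test profile, cloud, machine type, and benchmark names are illustrative; check each project's documentation for what is available on your platform.

```bash
# Phoronix Test Suite: run a CPU-heavy test profile in unattended batch mode.
phoronix-test-suite batch-benchmark pts/compress-7zip

# PerfKit Benchmarker: provision VMs on a cloud provider, run a benchmark, tear down.
git clone https://github.com/GoogleCloudPlatform/PerfKitBenchmarker.git
cd PerfKitBenchmarker && pip install -r requirements.txt
./pkb.py --cloud=GCP --machine_type=c4a-standard-8 --benchmarks=coremark
```

Both tools record results in machine-readable form, which makes it straightforward to compare runs across instance types or over time.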
- Network Latency Study in OCI Cloud
Network testing tools such as netperf can perform latency tests plus throughput tests and more. In netperf, the TCP_RR and UDP_RR (RR = request/response) tests report round-trip latency. With the -o flag, the output metrics can be customized to display exactly the information you want; an example using this test-specific flag to output several latency statistics is shown below.

Google has lots of practical experience in latency benchmarking, and following its blog post "using-netperf-and-ping-to-measure-network-latency", we set up our own latency benchmarks before and after migrating workloads to the OCI cloud.

Which tools and why
All the tools in this area do roughly the same thing: measure the round-trip time (RTT) of transactions. Ping does this using ICMP packets.

ping -c 100 <remote-host>

This ping command sends one ICMP packet per second to the specified IP address until it has sent 100 packets.

netperf -H <remote-host> -t TCP_RR -- -o min_latency,max_latency,mean_latency

Here -H specifies the remote host and -t the test name, with the test-specific option -o selecting the output metrics.

Google's tool of choice for latency tests in a cloud environment is PerfKit Benchmarker (PKB). This open-source tool allows you to run benchmarks on various cloud providers while automatically setting up and tearing down the virtual infrastructure required for those benchmarks. After setting up PerfKit Benchmarker, it's simple to run the ping and netperf benchmarks:

./pkb.py --benchmarks=ping --cloud=OCI --zone=us-ashburn-1
./pkb.py --benchmarks=netperf --cloud=OCI --zone=us-ashburn-1 --netperf_benchmarks=TCP_RR

These commands run intra-zone latency benchmarks between two machines in a single zone in a single region. Intra-zone benchmarks like this are useful for showing very low latencies, in microseconds, between machines that work together closely.

Latency discrepancies
We set up two VM.Standard.E4.Flex machines running Ubuntu 22.04 in zone us-ashburn-1, using private IP addresses to get the best results. If we run a ping test with default settings and set the packet count to 100, we get the following results:

ping -c 100 <remote-host>

PING 172.16.60.168 (172.16.60.168) 56(84) bytes of data.
64 bytes from 172.16.60.168: icmp_seq=1 ttl=64 time=0.202 ms
64 bytes from 172.16.60.168: icmp_seq=2 ttl=64 time=0.205 ms
…
64 bytes from 172.16.60.168: icmp_seq=99 ttl=64 time=0.329 ms
64 bytes from 172.16.60.168: icmp_seq=100 ttl=64 time=0.365 ms
--- 172.16.92.253 ping statistics ---
100 packets transmitted, 100 received, 0% packet loss, time 101353ms
rtt min/avg/max/mdev = 0.371/0.450/0.691/0.040 ms

By default, ping sends out one request each second. After 100 packets, the summary reports an average latency of 0.450 milliseconds, or 450 microseconds. For comparison, let's run netperf TCP_RR with default settings for the same number of packets.

netperf-2.7.0/src/netperf -p {command_port} -j -v2 -t TCP_RR -H 132.145.132.29 -l 60 -- -P ,{data_port} -o THROUGHPUT,THROUGHPUT_UNITS,P50_LATENCY,P90_LATENCY,P99_LATENCY,STDDEV_LATENCY,MIN_LATENCY,MEAN_LATENCY,MAX_LATENCY --num_streams=1 --port_start=20000 --timeout 360

Netperf results:

{'Throughput': '4245.34', 'Throughput Units': 'Trans/s', '50th Percentile Latency Microseconds': '228', '90th Percentile Latency Microseconds': '239', '99th Percentile Latency Microseconds': '372', 'Stddev Latency Microseconds': '92.06', 'Minimum Latency Microseconds': '215', 'Mean Latency Microseconds': '235.08', 'Maximum Latency Microseconds': '21059'}

Which test can we trust?
This is largely an artefact of the different intervals the two tools use by default. Ping uses an interval of one transaction per second, while netperf issues the next transaction immediately when the previous transaction completes. Fortunately, both of these tools allow you to set the interval time between transactions manually.

For ping, use the -i flag to set the interval, given in seconds or fractions of a second. On Linux systems, this has a granularity of 1 ms and rounds down.

$ ping -c 100 -i 0.010 <remote-host>

For netperf TCP_RR, a few build and runtime options are involved:
- the --enable-spin configure flag, to compile with support for fine-grained intervals;
- the -w flag, to set the interval time; and
- the -b flag, to set the number of transactions sent per interval.
This approach allows intervals to be set with much finer granularity, by spinning in a tight loop until the next interval instead of waiting for a timer; this keeps the CPU fully awake. Of course, this precision comes at the cost of much higher CPU utilization, since the CPU is spinning while waiting.

Note: Alternatively, less fine-grained intervals can be used by compiling with the --enable-intervals flag. Use of the -w and -b options requires building netperf with either the --enable-intervals or --enable-spin flag set. The tests here were performed with --enable-spin.

Run netperf with an interval of 10 milliseconds using:

$ netperf -H <remote-host> -t TCP_RR -w 10ms -b 1 -- -o min_latency,max_latency,mean_latency

Now, after aligning the interval time for both ping and netperf to 10 milliseconds, the effects are apparent.

Ping result:

--- 172.16.92.253 ping statistics ---
1000 packets transmitted, 1000 received, 0% packet loss, time 15981ms
rtt min/avg/max/mdev = 0.252/0.306/0.577/0.025 ms

Netperf result:

Minimum Latency Microseconds, Mean Latency Microseconds, Maximum Latency Microseconds
215, 235.08, 21059

We have integrated OCI as a provider in PerfKit Benchmarker, which we are using to carry out this testing. Here are the results of the inter-region ping benchmark for A1.Flex2, E4.Flex.1, and S1.Flex VMs. We also tested netperf intra-region, using the us-ashburn-1 region.

Generally, netperf is recommended over ping for latency tests. This isn't due to any lower reported latency at default settings, though. As a whole, netperf allows greater flexibility with its options, and we prefer using TCP over ICMP: TCP is the more common use case and thus tends to be more representative of real-world applications. That being said, the difference between similarly configured runs with these tools is much smaller across longer path lengths. Also, remember that the interval time and other tool settings should be recorded and reported when performing latency tests, especially at lower latencies, because these intervals make a material difference.
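To make the interval-matching methodology concrete, here is a small wrapper that runs both tools back to back against the same peer with a 10 ms interval and prints the comparable latency summaries. It assumes a netperf binary built with --enable-spin, a netserver already running on the peer, and a placeholder peer address.

```bash
#!/usr/bin/env bash
# Run ping and netperf TCP_RR with matched 10 ms intervals and print the latency
# summaries side by side. PEER is a placeholder; netperf must be built with
# --enable-spin and a netserver must be listening on the peer.
PEER=${1:?usage: $0 <peer-ip>}

echo "== ping (100 probes, 10 ms interval) =="
ping -c 100 -i 0.010 "$PEER" | tail -n 2

echo "== netperf TCP_RR (10 ms interval) =="
netperf -H "$PEER" -t TCP_RR -w 10ms -b 1 -- \
    -o min_latency,mean_latency,max_latency
```

Recording the exact interval and flags alongside the results, as recommended above, makes runs reproducible and comparable later.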
- Investigating Performance Discrepancy in HPL Test on ARM64 Machines
Introduction
High-Performance Linpack (HPL) is a widely used benchmark for testing the computational performance of computing systems. In this blog post, we explore an intriguing scenario in which we conducted HPL tests on two ARM64 machines. Surprisingly, the Host-2 machine exhibited 20% lower performance than the Host-1 machine in the HPL test. Intrigued by this result, we embarked on a journey to comprehensively diagnose the underlying cause of this performance discrepancy.

Why HPL for Performance Testing?
People use the High-Performance Linpack (HPL) benchmark for performance testing because it provides a standardised and demanding workload that measures the peak processing power of computer systems, particularly in terms of floating-point calculations. It helps assess and compare the computational capabilities of different hardware configurations. This benchmark helps in comparing and ranking supercomputers' performance and is used as the metric for the TOP500 list of the world's most powerful supercomputers. For more information, you can refer to the TOP500 article here: TOP500 List

Objective
The primary objective of this investigation was to identify the reason behind the 20% performance difference observed in the HPL test between the Host-1 and Host-2 machines. To comprehensively diagnose the performance discrepancy, we conducted additional benchmark tests, including Stream, Lmbench, and bandwidth tests.

1. System Details
We conducted a fair and controlled experiment using two ARM64 machines, referred to as Host-1 and Host-2.

1.1 Machine Specifications (Host-1 and Host-2):
- CPU(s): 96
- Architecture: aarch64
- Total memory: 96 GB
- Memory speed: 3200 MHz

2. Running the HPL Benchmark
To run the HPL benchmark on an ARM64 machine, you can refer to the GitHub repository https://github.com/AmpereComputing/HPL-on-Ampere-Altra. This repository contains instructions, scripts, and configurations specific to running HPL on Ampere Altra ARM64-based machines. It's important to follow the guidelines provided in the repository to ensure accurate and meaningful benchmarking results.

2.1 HPL Scores
Upon completing the HPL benchmark on both machines, we computed and compared the achieved HPL scores. The Host-1 machine garnered a higher HPL score, signifying better computational performance.

Machine    Time (sec)    Score
Host-1     619.91        1245
Host-2     784           985

This result raised a critical question: why was there such a substantial performance gap? To delve into the root causes behind this discrepancy, we decided to conduct a series of additional tests to comprehensively investigate the issue.

3. Exploring Additional Tests
We conducted several other benchmark tests to comprehensively investigate the performance discrepancy between the Host-1 and Host-2 ARM64 machines. These tests aimed to shed light on various aspects of the systems' hardware and memory subsystems, providing a holistic understanding of the observed difference. Below, we detail the tests and their findings.

3.1 Stream Benchmark
The Stream benchmark assesses memory bandwidth and measures the system's capability to read from and write to memory. The benchmark consists of four fundamental tests:
- Copy: Measures the speed of copying one array to another.
- Scale: Evaluates the performance of multiplying an array by a constant.
- Add: Tests the speed of adding two arrays together.
- Triad: Measures the performance of a combination of operations involving three arrays.
The Stream benchmark helps uncover memory bandwidth limitations and assess memory subsystem efficiency (example commands for running STREAM and the Lmbench latency test appear at the end of this post).

Host-1 machine results:

Function    Best Rate MB/s    Avg time     Min time     Max time
Copy        103837.8          0.367897     0.36192      0.373494
Scale       102739.4          0.369191     0.365789     0.372439
Add         106782.7          0.536131     0.527908     0.542759
Triad       106559.1          0.533549     0.529016     0.537881

Host-2 machine results:

Function    Best Rate MB/s    Avg time     Min time     Max time
Copy        66071.3           0.572721     0.568794     0.575953
Scale       65708.8           0.575758     0.571932     0.580686
Add         67215.5           0.843995     0.838667     0.848371
Triad       67668.1           0.837109     0.833058     0.84079

(Chart: Best Rate MB/s vs Function)

In the Stream benchmark results, Host-1 outperformed Host-2 across all functions (Copy, Scale, Add, Triad). Host-1 demonstrated higher memory bandwidth in each function, achieving significantly faster data transfer rates. This suggests stronger memory subsystem performance in Host-1 compared to Host-2.

3.2 Lmbench for Memory Latency
Lmbench is a suite of micro-benchmarks designed to provide insights into various aspects of system performance. The suite includes latency tests for system calls, memory accesses, and various operations to quantify the system's responsiveness. Memory access tests include random read/write latency and bandwidth, helping to identify memory subsystem performance. File I/O tests evaluate file system performance, providing insights into storage subsystem capabilities.

Result: Memory Latency
Memory latency refers to the time it takes for the CPU to access a specific memory location. Lower latency values indicate better performance, as data can be fetched more quickly.

size (MB)    latency (ns) - Host-1    latency (ns) - Host-2
0.00049      1.43                     1.429
...          ...                      ...
2            32.355                   32.786
3            34.503                   36.012
4            37.403                   37.932
6            39.982                   52.922
8            41.007                   54.001
12           44.315                   55.466
16           65.52                    73.016
24           95.131                   117.278
32           115.081                  138.945
48           126.796                  151.945
64           129.558                  159.225
96           134.413                  166.359
128          136.239                  167.788
192          136.245                  168.689
256          136.366                  170.464
384          137.732                  170.461
...          ...                      ...
2048         135.61                   149.809

4. Analysis and Findings
After conducting these benchmark tests, we observed that the Host-2 machine consistently exhibited lower performance across different tests compared to the Host-1 machine. The most significant finding came from the Lmbench test, which revealed that the Host-2 machine's RAM had notably higher latency compared to the Host-1 machine. Notably, an additional factor was identified: the RAM rank. The Host-1 machine is equipped with Dual-Rank RAM, while the Host-2 machine has Single-Rank RAM. This RAM rank difference could contribute to the performance discrepancy.

The observation is in line with findings from various other studies that have examined the influence of RAM rank on system performance. To gain a more comprehensive understanding of this subject, the following articles could be of interest:
- Single vs. Dual-Rank RAM: Which Memory Type Will Boost Performance? - This article provides a thorough comparison between single- and dual-rank RAM, aiding in comprehending the disparities between these two RAM types, methods to distinguish them, and guidance on selecting the most suitable option for your needs. (LINK)
- Single Rank vs Dual Rank RAM: Differences & Performance Impact - This article delves into the differences between Single Rank and Dual Rank RAM modules, investigating their structural dissimilarities and assessing the respective impacts on performance. (LINK)
5. Conclusion
After conducting an extensive series of benchmark tests, we have pinpointed certain factors that contribute to the performance disparity observed in the HPL test between the two ARM64 machines. In the Stream benchmark results, Host-1 outperformed Host-2 across all functions (Copy, Scale, Add, Triad), demonstrating higher memory bandwidth and significantly faster data transfer rates in each function. Additionally, the higher memory latency of the Host-2 machine's RAM was identified as a key contributor to the performance gap; this latency impacted the efficiency of memory operations and had a cascading effect on overall performance. Another significant factor was the difference in RAM rank configurations: Host-1 had Dual-Rank RAM, while Host-2 had Single-Rank RAM. This divergence likely contributed to the varying memory access speeds between the two machines.

6. Future Scope
In the context of further exploration, it is recommended to extend the investigation with additional benchmark tests, specifically the Lmbench memory bandwidth test. This would provide deeper insights into the memory subsystem's performance on both the Host-1 and Host-2 machines. Additionally, an interesting avenue for investigation could involve modifying the RAM configuration in one of the machines and assessing its impact on performance. This would provide valuable information about the role of memory specifications in influencing overall system performance.
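As referenced in the Stream section, here is one common way to build and run STREAM and lmbench's memory-latency test on an ARM64 host, for anyone who wants to reproduce the memory-subsystem comparison. The array size, thread count, and lat_mem_rd range/stride are illustrative and should be sized to comfortably exceed the last-level cache on your system; the STREAM source URL points at the canonical upstream copy.

```bash
# STREAM: build with OpenMP and an array large enough to defeat the caches,
# then run with one thread per core (96 cores on these hosts).
wget https://www.cs.virginia.edu/stream/FTP/Code/stream.c
gcc -O3 -fopenmp -DSTREAM_ARRAY_SIZE=200000000 -DNTIMES=20 stream.c -o stream
OMP_NUM_THREADS=96 ./stream

# lmbench memory latency: pointer chasing over arrays up to 2048 MB with a
# 128-byte stride, matching the range of the table above.
lat_mem_rd 2048 128
```

Running both on each host with identical parameters is what makes the per-function bandwidth and per-size latency numbers directly comparable.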
- Root causing a memory corruption on Arm64 VMs
We recently migrated one of our websites to Azure Arm64 VMs. However, as soon as we pushed the infrastructure change to production, we started to observe our server process being restarted intermittently. These restarts would sometimes happen within a few seconds, while at other times not occurring for hours. While the redundancy in our setup ensured minimal end-user impact, we wanted to quickly address the issue at hand.

Looking at the logs
A quick look at the logs showed the following error before process restarts:

malloc(): corrupted top size
Aborted (core dumped)

This is a Node.js based Next.js website with nothing memory intensive being performed, so we were surprised to see a memory-related issue. A quick look at top also suggested we had adequate memory available for our running processes. So, this definitely looked like a memory corruption. Our next challenge was to identify what caused it. On analyzing the logs further, it did not appear that a single website URL was causing the issue.

Reproducing the issue
With this information at hand, we went back to our test environment (which was also running on an Azure Arm64 VM) and set up more detailed logging. We then visited a large number of our website URLs to see if we could reproduce the restart. Eventually, we did find a couple of URLs where the Node.js process would exit with the corrupted-memory error message.

Identifying the root cause
Once we could reproduce the issue, we narrowed it down to the images loading on these pages. Our images were being served by the Next.js next/image library, which internally leverages the sharp package to optimize the images being served. So, it appeared that for some images (not all), the sharp image optimization logic was resulting in memory corruption, causing our Node.js process to exit. Looking at the current and past issues for lovell/sharp on GitHub took us to this issue, which summarized our experience.

Issue details & fix
On probing further, we understood that the libspng library used by lovell/sharp had a memory corruption issue when decoding a paletted PNG on Arm64. libspng addressed this issue in v0.7.2, which was picked up by lovell/sharp in v0.31.0. By pinning our sharp dependency in package.json to v0.31.0, we were able to force next/image to pick up this version of the sharp library (instead of the older one) for image optimization. With this change, the specific images that were previously causing the Node.js process to exit were now being optimized as expected. Once the change went into production, we watched our production Node.js processes for any restarts. With no restarts observed for a couple of days, we were able to mark the issue as resolved.
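As an illustration of the fix, the commands below show one way to check which sharp version a project currently resolves and pin it to a release that bundles the patched libspng. The build and start commands assume the standard Next.js scripts; the version to pin should follow the upstream advisory rather than these placeholder commands.

```bash
# Inspect which sharp version the dependency tree currently resolves to.
npm ls sharp

# Pin sharp to a release containing the fixed libspng (>= 0.31.0) and record the
# exact version in package.json.
npm install --save-exact sharp@0.31.0

# Rebuild and verify the previously failing pages no longer crash the process.
npm run build && npm run start
```

Pinning with --save-exact avoids a semver range silently pulling an older or newer build than the one that was validated.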
- Mastering the 5 Essential Performance Engineering Skills for Software Engineers: A Professional Guide
Performance engineering is a vital area in software development that ensures applications function efficiently and effectively. As modern software systems grow more complex, the need for engineers who understand performance becomes increasingly important. This guide covers five essential performance engineering skills every software engineer should develop to thrive in their careers.

Grasping Performance Requirements
To start, software engineers must excel at understanding performance requirements. This means knowing how the system behaves under different loads and the specific performance targets the application must meet. Involved discussions with stakeholders are crucial for defining clear performance metrics early in the development process. Important key performance indicators (KPIs) include:
- Response time: The time taken for a system to respond to a user request. According to a report, 47% of consumers expect a page to load in two seconds or less.
- Throughput: The amount of work completed in a given timeframe, often measured in transactions per second (TPS).
- Resource utilization: How effectively system resources such as CPU, memory, and bandwidth are being used.
By setting these performance requirements early on, engineers can make better design decisions, leading to more efficient applications right from the beginning.

Expertise in Performance Testing Tools
A strong command of performance testing tools is essential. Knowledge of both open-source and proprietary tools enables engineers to simulate user traffic, evaluate system performance, and pinpoint potential problems. Some popular performance testing tools include Apache JMeter, LoadRunner, and Gatling. These tools help engineers create test scenarios that reflect real-world load conditions. For instance, a team using JMeter might simulate 10,000 concurrent users on their e-commerce site to ensure it can handle peak shopping times, like Black Friday (a minimal example command appears at the end of this post). Effectively using performance testing tools helps reveal issues and provides actionable insights for optimization. In fact, organizations that conduct regular performance testing see a 30% improvement in application speed and responsiveness.

Capacity Planning and Scalability
A third essential skill is capacity planning and scalability. Software engineers must be able to forecast the resources needed to accommodate user growth without compromising performance. This involves analyzing historical usage data and anticipating future demands. For example, if a SaaS application reports a 20% monthly increase in active users, engineers must plan to scale infrastructure accordingly. This scaling can happen in two ways:
- Vertical scaling: Adding more resources (like CPU or memory) to a single server.
- Horizontal scaling: Adding more servers to distribute the load as user demand increases.
Team members should consistently monitor performance against these plans to refine forecasts and implement necessary adjustments. Mastering this skill enables teams to prevent performance issues and support seamless scaling as user needs change.

Appreciating System Architecture
A solid understanding of system architecture is crucial for performance engineering. Engineers need to be familiar with various architectural patterns such as microservices, serverless, and monolithic designs. Each architecture has its implications for performance. For example, a microservices architecture can enhance scalability but may lead to communication delays between services.
In contrast, a monolithic architecture is easier to manage but might struggle under high loads due to its rigid structure. Understanding how different architectures influence performance helps engineers make informed design choices. For instance, a recent study showed that companies implementing microservices correctly reduced deployment times by 75%.

Ongoing Performance Monitoring
Lastly, ongoing performance monitoring is a critical skill that software engineers should cultivate. After an application is live, continuous monitoring allows teams to spot performance issues that may arise in real-world settings. Using tools like New Relic, Dynatrace, or Grafana helps engineers monitor application performance consistently. For instance, real-time monitoring can quickly alert teams when server response times exceed predefined limits, preventing user dissatisfaction. By integrating ongoing monitoring into their workflow, engineers foster a culture of performance awareness. Companies that prioritize performance monitoring often see conversion rates improve by up to 20% due to enhanced user experiences.

Time to Enhance Performance Engineering Skills
Mastering performance engineering skills is a necessity for software engineers, not just an option. With the increasing complexity of software systems, it is essential for engineers to possess the knowledge and tools required to ensure that applications meet crucial performance metrics. Focusing on understanding performance requirements, mastering performance testing tools, capacity planning and scalability, system architecture knowledge, and continuous performance monitoring can significantly boost engineers' effectiveness in this important field. As the demand for high-performance applications continues to rise, developing these skills will enhance individual careers while contributing to the success of software projects. Now is the time for aspiring engineers to invest in their own development and polish these performance engineering skills. Success is found in mastering these elements and effectively applying them to real-world challenges.
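As referenced in the performance-testing section, here is a minimal sketch of driving a JMeter load test from the command line in non-GUI mode. The test plan file name, the threads property, and the report directory are placeholders; the plan itself must be written to read the property and define the scenario.

```bash
# Run an existing JMeter test plan headless: -n = non-GUI mode, -t = test plan
# (hypothetical file), -Jthreads passes the desired concurrency as a property the
# plan's Thread Group reads, -l writes raw results, -e -o builds an HTML report.
jmeter -n -t ecommerce_load_test.jmx -Jthreads=10000 -l results.jtl -e -o report/
```

Keeping load parameters as properties rather than hard-coding them in the plan makes it easy to reuse the same scenario for smoke tests and full-scale peak simulations.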
- Uncovering the Best: 5 Top Tools for Cutting-Edge Chip Benchmarking
In the fast-paced world of technology, chip benchmarking is vital. It helps engineers and developers measure the performance of semiconductor devices to keep up with advancements. This post dives into the top five tools for chip benchmarking, highlighting their features, benefits, and real-world applications.

1. Geekbench
Geekbench stands out as a cross-platform benchmarking tool for assessing CPU and GPU performance. Its versatility allows it to work seamlessly across different operating systems, making it a favorite among developers. With a massive database of devices, Geekbench offers detailed scores that let users compare their hardware easily. It measures both single-core and multi-core performance, crucial for modern chips that handle multiple tasks simultaneously. For instance, Geekbench allows you to see how a chip like the Apple M1 stacks up against Intel's latest processors. Setting up Geekbench is quick and user-friendly. It provides insights into memory and compute performance, making it an essential tool for hardware professionals. In fact, many developers report improvements of up to 30% in their designs after optimizing based on Geekbench results.

2. SPEC CPU Benchmark
The SPEC CPU benchmark suite is trusted in the industry for evaluating CPU performance. Created by the Standard Performance Evaluation Corporation, it includes a set of diverse workloads assessing integer and floating-point calculations. SPEC offers reliable reports that reveal both efficiency and speed, enabling engineers to make data-driven decisions. For example, analysis from SPEC has helped companies like AMD refine their latest Ryzen processors, enhancing performance by approximately 25%. SPEC's rigorous validation ensures that results are credible for manufacturers and users alike. Its broad application makes it perfect for systems that demand high performance, such as servers running complex applications.

3. 3DMark
3DMark is essential for gamers and graphics professionals. This graphical benchmarking tool primarily evaluates GPU performance in rendering graphics but can also provide key insights into chip performance concerning integrated graphics. 3DMark includes various tests reflecting real-world gaming scenarios. Users can examine frame rates and rendering speeds, helping them understand how their hardware performs under strain. For instance, the "Fire Strike" test can assess how well a system handles intensive gaming tasks, highlighting up to a 15% difference in performance between competing GPUs. Additionally, the "Time Spy" test evaluates DirectX 12 performance. These visual benchmarks not only present performance data engagingly but also help users spot design flaws in their chips.

4. LLaMA Benchmarks
LLaMA (Large Language Model Meta AI) benchmarks are designed to evaluate the performance of various language models across multiple tasks. These benchmarks provide a standardized way to measure the capabilities of models in understanding and generating human-like text. The benchmarks include a wide range of tasks, such as text completion, question answering, and summarization, allowing researchers to assess the models' effectiveness in real-world applications. For instance, recent evaluations have shown that LLaMA models outperform previous iterations in generating coherent and contextually relevant text. One of the key features of LLaMA benchmarks is their focus on zero-shot and few-shot learning capabilities.
This aspect enables models to perform well on tasks they have not been explicitly trained for, showcasing their adaptability and generalization abilities.

5. GPT-3 Benchmarks
GPT-3 benchmarks provide a comprehensive framework for assessing the performance of the GPT-3 language model across various linguistic tasks. These benchmarks measure aspects such as fluency, coherence, and relevance in generated text. The evaluation includes a variety of tasks, including language translation, text generation, and creative writing, allowing for a holistic view of the model's capabilities. For example, companies utilizing GPT-3 for content creation have reported significant improvements in engagement and quality due to the insights gained from these benchmarks. The user-friendly interface of the benchmarking tools associated with GPT-3 ensures that both novice and experienced users can easily interpret the results. This accessibility has led to its widespread adoption in industries seeking to leverage advanced natural language processing technologies.

Making the Right Choice
Selecting the appropriate tool for chip benchmarking is crucial. Whether it's the adaptable Geekbench, the trusted SPEC CPU benchmark, the graphics-focused 3DMark, or the AI-oriented LLaMA and GPT-3 benchmarks, each tool provides unique insights that can foster innovation in chip development. By effectively utilizing these tools, professionals can enhance the performance of semiconductor devices, keeping pace with rapid technological changes. Investing in the right benchmarking tools is not just beneficial; it is vital for success in chip development.
- CPU-Centric HPC Benchmarking with miniFE and GROMACS
Benchmarks are vital for evaluating High-Performance Computing (HPC) system performance, guiding hardware choices, and optimizing software. This whitepaper focuses on understanding and overcoming bottlenecks in HPC benchmarks for CPU environments, specifically considering ARM/AARCH64 architectures, using miniFE and GROMACS as examples.

1. Introduction to miniFE and GROMACS Benchmarks

1.1. miniFE: A Finite Element Mini-Application
miniFE, part of the Mantevo suite, simulates implicit finite element applications. It solves sparse linear systems, with its core kernels focused on element-operator computation, assembly, sparse matrix-vector products (SpMV), and basic vector operations. It's excellent for benchmarking systems handling sparse linear algebra and iterative solvers. To run miniFE, you typically compile it with an MPI-enabled compiler. Execution involves specifying problem dimensions and MPI processes.

# Example for a single node with 16 MPI tasks
srun -N 1 -n 16 miniFE.x -nx 264 -ny 256 -nz 256
# Example for a multi-node run (adjust N and n)
srun -N 4 -n 64 miniFE.x -nx 528 -ny 512 -nz 512

Note: srun is for SLURM; use mpirun or similar on other systems.

1.2. GROMACS: Molecular Dynamics Simulation Software
GROMACS (GROningen MAchine for Chemical Simulations) is a highly optimized open-source software package for molecular dynamics (MD) simulations. It models atomic and molecular movements, particularly for biochemical systems, and is efficient in calculating non-bonded interactions. A typical GROMACS workflow prepares input files, then runs the simulation.

# Step 1: Prepare the run input file (.tpr)
gmx grompp -f pme.mdp -c conf.gro -p topol.top -o topol.tpr
# Step 2: Run the molecular dynamics simulation
mpirun -np 4 gmx_mpi mdrun -s topol.tpr -ntomp 4
# To run a specific benchmark system (e.g., 'benchPEP-h')
mpirun -np <ranks> gmx_mpi mdrun -s benchPEP-h.tpr -ntomp <threads>

Note: Tune MPI processes (-np) and OpenMP threads (-ntomp) to your hardware.

2. Interpreting Performance Output (Benchmarking POV)
Understanding benchmark output is crucial for evaluating HPC system throughput and efficiency.

2.1. miniFE Performance Metrics
miniFE outputs performance data, primarily focused on:
- Total CG Mflops (mega floating-point operations per second for the Conjugate Gradient solve): The main Figure of Merit (FOM). Higher values indicate better performance, reflecting the system's efficiency in sparse linear algebra, often limited by memory bandwidth and FPU throughput.

2.2. GROMACS Performance Metrics
GROMACS provides detailed output, with the key metric being:
- ns/day (nanoseconds per day): The standard performance metric for GROMACS. It shows how many nanoseconds of simulated time can be computed per real-world day.
A higher ns/day means faster simulation. This metric is ideal for comparing different CPU architectures or configurations. Other useful outputs include total wall time and a breakdown of time spent in different force calculations, which helps pinpoint specific bottlenecks.

3. Bottlenecks in Running HPC Benchmarks
Achieving peak HPC performance requires identifying and mitigating bottlenecks that limit system throughput.

3.1. miniFE-Specific Bottlenecks
miniFE is particularly sensitive to:
- Memory bandwidth: The sparse matrix-vector product (SpMV) is highly memory-bandwidth bound due to irregular memory access patterns.
- Cache misses: Irregular accesses lead to frequent cache misses, increasing data retrieval latency.
- Inter-node communication (for large problems): For distributed problems, communication during assembly and the Conjugate Gradient solver can be limited by network latency and bandwidth.

3.2. GROMACS-Specific Bottlenecks
For GROMACS, key bottlenecks include:
- CPU core performance and threading: The number of cores and their individual performance (instructions per cycle (IPC), clock speed) directly impact ns/day. An optimal balance between MPI ranks and OpenMP threads per rank is crucial.
- Memory bandwidth: The CPU needs to access large datasets frequently for force calculations.
- SIMD vectorization: GROMACS heavily relies on CPU SIMD instructions (e.g., NEON). If the CPU architecture or compiler doesn't fully exploit these, performance will suffer.
- Cache utilization: Efficient cache usage is critical for the main simulation loop.
- Inter-node communication: For large systems simulated across multiple nodes, MPI communication for domain decomposition and force summation can be a significant bottleneck, even with fast interconnects.
- NUMA effects: Proper process and memory binding is crucial on multi-socket systems to minimize cross-socket memory access latency.
- Load imbalance: Uneven workload distribution across PP and PME ranks leads to idle compute units.

3.3. Dynamic Monitoring for Bottleneck Analysis (Frequency, Power, Temperature)
Beyond static analysis, dynamic monitoring of CPU frequency, power consumption, and temperature during benchmark execution provides invaluable insights for root-causing performance bottlenecks. This data, when mapped over the run duration, can reveal transient issues that logs alone might miss.

Application-specific context:
- For miniFE, if memory bandwidth is the primary bottleneck, the CPU might not be fully utilized, leading to lower-than-expected power consumption and temperatures, even if the frequency remains high. Conversely, if the SpMV operations push the CPU's compute capabilities, sustained high power and temperature could be observed. Any sudden dips in Mflops alongside frequency drops would directly point to thermal or power throttling.
- For GROMACS, which can be highly compute-intensive, sustained high power consumption and temperatures are common. Analyzing frequency, power, and temperature trends can reveal whether ns/day performance is being limited by the CPU's ability to maintain its turbo frequencies under thermal constraints, or whether it is hitting a configured power envelope. Discrepancies between expected maximum performance and observed ns/day can often be explained by these dynamic system responses.

Tools for monitoring: Various tools can collect this data, including vendor-specific utilities (e.g., Intel's pcm, AMD's uProf), Linux tools (perf, turbostat, sensors), or IPMI/BMC interfaces for server-level metrics. A simple sampling loop is sketched below.
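To illustrate the kind of dynamic monitoring described above, here is a minimal sampling loop built on standard Linux interfaces. Paths and tool availability vary by platform (power counters, for example, are not exposed the same way on all Arm servers), so treat this as a sketch to adapt rather than a portable utility.

```bash
#!/usr/bin/env bash
# Sample average CPU frequency and temperature sensors every few seconds while a
# benchmark runs, writing a CSV that can later be correlated with Mflops or
# ns/day over time. Assumes sysfs cpufreq and lm-sensors are available; power can
# be added via turbostat or IPMI where the platform exposes it.
# Run alongside the benchmark and stop with Ctrl-C.
INTERVAL=5
OUT=telemetry.csv
echo "timestamp,avg_freq_khz,temps" > "$OUT"

while true; do
    ts=$(date +%s)
    # Average current frequency across all cores (each sysfs file holds one value).
    freq=$(awk '{ sum += $1; n++ } END { if (n) printf "%d", sum/n }' \
           /sys/devices/system/cpu/cpu*/cpufreq/scaling_cur_freq)
    # All temperature readings on one line (output format depends on the platform).
    temps=$(sensors 2>/dev/null | grep -oE '[+-]?[0-9]+\.[0-9]+°C' | paste -sd' ' -)
    echo "${ts},${freq},${temps}" >> "$OUT"
    sleep "$INTERVAL"
done
```

Plotting this CSV against the benchmark's time-stamped output makes throttling episodes or power-cap limits visible as correlated dips in frequency and performance.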
Correlating these dynamic metrics with the benchmark's reported performance can significantly aid in precise bottleneck identification and system optimization.

Conclusion
Effective HPC benchmarking goes beyond simply running an application and reporting a single performance number. As demonstrated with miniFE and GROMACS in a CPU-centric environment, a deep understanding of the benchmark's computational characteristics is essential. Identifying whether a workload is memory-bound, compute-bound, or communication-bound is the first step toward optimizing performance. Furthermore, leveraging dynamic monitoring of CPU frequency, power consumption, and temperature provides invaluable diagnostic data. By integrating performance metrics with detailed system telemetry, HPC administrators and researchers can precisely pinpoint bottlenecks, fine-tune system configurations, and ultimately extract the highest possible performance.












