- Benchmarking and Validation of Workloads on Emulators
In this case study, we describe our systematic approach to benchmarking and validating workloads on FPGA platforms using HAPS (High-performance ASIC Prototyping System) models. The workflow involves compiling and cross-compiling a diverse set of workloads using both native QEMU and the open-source toolchain, executing them on FPGA hardware, and capturing detailed performance metrics such as instructions executed and cycle counts.

1. Benchmark Preparation and Build Process

We classify our benchmarks into the following categories:
- High-Performance Computing (HPC) Benchmarks: matrix multiplication, FFT, and other numerical kernels.
- Synthetic Benchmarks: Whetstone, Dhrystone, and other CPU stress tests.
- Algorithmic Benchmarks: sorting algorithms, graph traversal, and numerical integration.
- Cryptography and Security Benchmarks: AES, RSA, and SHA-based microbenchmarks (in the future pipeline).
- Memory and I/O Benchmarks: STREAM, memcpy stressors, and file read/write tests.
- Industry-Standard Benchmarks: SPEC CPU2017 for the INT and FP tracks.

All benchmarks are first built or cross-compiled depending on their compatibility:
- Native Build: performed in the QEMU-based emulation environment where toolchain compatibility allows.
- Cross Compilation: done with a toolchain targeting the architecture for cases where a native build fails or is time-prohibitive.

(Figure: Application Categories Distribution)

2. Deployment and Execution on FPGA

The compiled binaries are deployed to the FPGA via HAPS models configured with a soft core. Execution is controlled using a lightweight shell interface or boot script. We use a custom performance-monitoring utility (akc_counter_capture) to gather the following metrics:
- Total instruction count
- Cycle count

These values are stored for each benchmark run and are used in performance comparisons.

3. Workload Examples

Example 1: DGEMM (Double-Precision General Matrix Multiply)
DGEMM is a key linear algebra kernel from the BLAS library. We compiled and executed the DGEMM kernel using double-precision arithmetic with matrix size N×N, where N=256. Performance was evaluated using instruction count, cycle count, and IPC (Instructions Per Cycle).

Example 2: N-Queens Problem
The N-Queens benchmark is a classic example of combinatorial search used to evaluate control-flow-heavy algorithm performance. It computes all valid arrangements of N queens on an N×N chessboard such that no two queens attack each other. We verified correctness by comparing the total number of valid solutions for standard board sizes (e.g., N=12 and N=14), which matched precisely across architectures (a minimal solution-counting sketch follows this case study). The benchmark's output was deterministic, and no deviations were observed across multiple FPGA runs.

Example 3: Red-Black Tree (RBTree) Manipulation
Red-Black Tree (RBTree) manipulation represents a memory-bound and pointer-intensive workload that tests dynamic memory access patterns and data-structure balancing algorithms. This benchmark was compiled both with the embedded toolchain and natively on QEMU for consistency. Validation involved verifying the in-order traversal of the tree after bulk insertions and deletions. RBTree serves as a robust test of both instruction scheduling and memory-subsystem behavior.

Conclusion

Our approach demonstrates that workloads can be effectively compiled, executed, and validated on FPGA platforms using HAPS models.
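To make the N-Queens validation concrete, here is a minimal, illustrative Python solution counter (a sketch, not the actual benchmark binary we ran); the counts it prints are the kind of values compared across architectures and FPGA runs.

```python
def count_nqueens(n: int) -> int:
    """Count all valid placements of n queens on an n x n board (bitmask backtracking)."""
    full = (1 << n) - 1

    def place(cols: int, diag1: int, diag2: int) -> int:
        if cols == full:                          # every column holds a queen: one solution
            return 1
        count = 0
        free = full & ~(cols | diag1 | diag2)     # squares not attacked in this row
        while free:
            bit = free & -free                    # pick the lowest free square
            free ^= bit
            count += place(cols | bit,
                           (diag1 | bit) << 1 & full,
                           (diag2 | bit) >> 1)
        return count

    return place(0, 0, 0)

if __name__ == "__main__":
    for n in (8, 12):
        print(n, count_nqueens(n))                # 8 -> 92, 12 -> 14200
```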
- Open-Source Benchmarking Tools with Ad-Hoc Extension
Automation is essential for performance benchmarking because it ensures that results are reliable, repeatable, scalable, and comparable. Various open-source tools bring this standardization, accuracy, efficiency, and repeatability to performance evaluation, and several of them support ad-hoc extensibility, meaning they can be customized or extended without rebuilding or heavily modifying the core codebase. These tools provide flexibility in creating custom test scenarios, simulating various workloads, and adapting to new APIs or environments.

Tools we used for benchmarking:
- Phoronix Test Suite (PTS)
- PerfKit Benchmarker (PKB)

Phoronix Test Suite: Phoronix Test Suite is the most comprehensive open-source benchmarking platform available for Linux, macOS, and Windows systems. It is widely used for automated testing, performance analysis, and software comparisons.

What is a PTS Extension? A PTS extension is a plugin or add-on for the Phoronix Test Suite that extends its functionality. It allows users to add custom behaviors before, during, or after benchmark runs, which is ideal for automation, integration, or custom logging. PTS extensions are used to:
- Add full socket runs
- Add open-source Docker tests
- Integrate with other systems
- Perform system and hardware benchmarking

Why Shift from PTS to PerfKit Benchmarker? Phoronix Test Suite (PTS) is primarily a single-node benchmarking tool that runs on a single machine. To overcome this limitation, the PerfKit Benchmarker tool is used. PKB is specifically built for cloud platforms: it handles provisioning, benchmarking, monitoring, and cleanup automatically, whereas PTS requires manual test setup, especially for cloud VMs. PKB can push benchmark data to:
- InfluxDB
- Stackdriver
- Grafana
- JSON logs for CI/CD systems

PTS does offer HTML/JSON/CSV output but lacks native telemetry integrations.

PerfKit Benchmarker (PKB): PerfKit Benchmarker is an open-source tool developed by Google that automates the process of benchmarking cloud infrastructure across different cloud providers.

Main Stages of a PerfKit Benchmarker Run:

What Is a PerfKit Benchmarker Extension? Extensions allow users to define:
- Custom benchmarks
- Flags
- Providers
- Workloads

Top Benefits of a PerfKit Benchmarker Extension:
- PerfKit Benchmarker can run distributed benchmarks involving multiple VMs across one or more cloud zones or providers.
- It automatically handles VM provisioning, software installation, test execution, and teardown.
- It easily integrates with dashboards, analytics pipelines, or cost/performance reports.
- It is useful in capacity planning, performance regression testing, and SLI validation.

In addition, the PKB extension supports turbostat (useful for analyzing power and frequency behavior during benchmarks), lm-sensors (a Linux utility used to monitor hardware sensors), and sysstat (to analyze CPU, memory, disk I/O, networking, and other system-level performance metrics). The PKB extension also supports report generation, which is useful for producing a report with all results and peripheral data; it supports various formats such as TXT, CSV, and HTML.

Here's a set of workload charts for PerfKit Benchmarker (PKB), organized by category. These charts summarize the common benchmark workloads available in PKB, helping you choose the right tests for CPU, memory, disk, network, and database performance analysis across cloud platforms.
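As a sketch of how such automation can be scripted, the snippet below drives a PKB run from Python and reads back the emitted samples. The flag names (--cloud, --benchmarks, --machine_type, --json_path) and the sample fields come from PKB's documented interface but should be verified against the PKB version you use; the benchmark and machine type here are only examples.

```python
"""Minimal sketch of driving PerfKit Benchmarker from Python.

Assumptions (verify against your PKB checkout): pkb.py lives in the current
directory, and the --cloud, --benchmarks, --machine_type, and --json_path
flags behave as documented for your PKB version."""
import json
import subprocess

def run_pkb(benchmark: str, cloud: str, machine_type: str,
            json_path: str = "pkb_samples.json") -> list:
    cmd = [
        "./pkb.py",
        f"--cloud={cloud}",
        f"--benchmarks={benchmark}",
        f"--machine_type={machine_type}",
        f"--json_path={json_path}",      # newline-delimited JSON samples
    ]
    subprocess.run(cmd, check=True)      # PKB provisions, runs, and tears down
    samples = []
    with open(json_path) as fh:
        for line in fh:
            if line.strip():
                samples.append(json.loads(line))
    return samples

if __name__ == "__main__":
    for s in run_pkb("fio", "GCP", "n2-standard-16"):
        print(s.get("metric"), s.get("value"), s.get("unit"))
```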
Cloud Comparison Using PerfKit Benchmarker

Here's a comprehensive comparison of cloud providers (GCP, Azure, OCI) using PerfKit Benchmarker (PKB) as a common benchmarking framework:

Conclusion

PTS is excellent for deep technical benchmarking of a single system. PKB is a robust choice for cloud performance comparisons, cost evaluation, and infrastructure benchmarking at scale.
- Understanding DLRM with PyTorch
DLRM stands for Deep Learning Recommendation Model. It is a neural-network architecture developed by Facebook AI (Meta) for large-scale personalized recommendation systems. DLRM is widely used in real-world applications where personalized recommendations or ranking predictions are needed; it is designed for click-through-rate (CTR) prediction and ranking tasks. Examples: online advertising, e-commerce recommendations, social media feed ranking, streaming services, online marketplaces and classifieds, etc.

DLRM features: embedding tables for sparse (categorical) features, MLPs for dense features, and an explicit feature-interaction layer feeding a top MLP that produces the click probability.

DLRM Installation Options:
- Install the original Facebook DLRM (PyTorch) using git and Python.
- Install DLRM using TorchRec.
- Install NVIDIA DLRM.
- Install DLRM in Docker (CPU-only or GPU).

What Is the Relationship Between DLRM and PyTorch? DLRM is built using PyTorch. PyTorch serves as the foundational deep-learning framework that powers every component inside DLRM.

PyTorch Is the Framework; DLRM Is the Model. DLRM is not a framework; it is a specific neural-network architecture designed by Meta (Facebook) for large-scale recommendation systems. PyTorch provides the building blocks (tensors, autograd, neural-network layers, optimizers, and GPU acceleration), and DLRM uses these tools to construct its dense MLPs, embedding tables, and feature-interaction layers.

PyTorch Installation Options: PyTorch can be installed in several ways depending on your environment, hardware, and workflow:
- Install via pip (most common and easiest)
- Install via Conda (best for GPU environments)
- Install via Docker (isolated and production-friendly)
- Install from source (for developers and custom builds)
- Cloud-based PyTorch installation
- Install via package managers (limited OS support)

PyTorch Installation via Docker: Installing PyTorch through Docker is one of the most reliable and hassle-free ways to set up a deep learning environment. Instead of manually managing Python versions, CUDA toolkits, cuDNN libraries, and system dependencies, Docker provides a pre-configured container where everything already works out of the box. By pulling an official PyTorch image, either CPU-only or with CUDA support, you get an isolated and reproducible environment that runs identically on any machine.

Quick steps:

1. Pull an image
CPU-only: docker pull pytorch/pytorch:latest
GPU (CUDA 11.8 example): docker pull pytorch/pytorch:latest-cuda11.8-cudnn8-runtime

2. Run the container
CPU: docker run -it pytorch/pytorch:latest bash
GPU (with the NVIDIA Container Toolkit): docker run -it --gpus all pytorch/pytorch:latest-cuda11.8-cudnn8-runtime bash

3. Verify inside the container
python3 -c "import torch; print(torch.__version__); print('cuda:', torch.cuda.is_available())"

How to Run DLRM Inside a PyTorch Docker Container?
- Pull a PyTorch Docker image
- Start the container
- Install dependencies (inside the container)
- Clone the DLRM repository
- Run DLRM

DLRM Command: Running DLRM effectively requires understanding the key command-line options that control data loading, model architecture, training configuration, and performance tuning. DLRM accepts a rich set of flags that allow you to configure everything from batch sizes to embedding dimensions.
These options fall into four major categories:
- Data options
- Training options
- Model architecture options
- System / performance options

Frequently Used DLRM Command:

python dlrm_s_pytorch.py \
  --data-generation=synthetic \
  --mini-batch-size=2048 \
  --learning-rate=0.01 \
  --arch-sparse-feature-size=16 \
  --arch-mlp-bot="13-512-256-64-16" \
  --arch-mlp-top="512-256-1" \
  --print-freq=10

Conclusion

Using PyTorch Docker containers to run DLRM (Deep Learning Recommendation Model) provides a streamlined, consistent, and reproducible environment across different hardware platforms. Docker eliminates dependency conflicts, simplifies setup, and ensures that the exact software stack (PyTorch version, libraries, and optimizations) can be deployed seamlessly. In short, PyTorch Docker + DLRM offers a reliable, flexible, and efficient path to train, evaluate, and deploy recommendation models with minimal friction.
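To illustrate how the pieces named above (embedding tables, bottom/top MLPs, and feature interaction) fit together, here is a minimal DLRM-style model written directly in PyTorch. This is an illustrative sketch, not the official dlrm_s_pytorch.py implementation; the layer sizes loosely mirror the flags in the command above.

```python
import torch
import torch.nn as nn

class TinyDLRM(nn.Module):
    """Illustrative DLRM-style model: bottom MLP for dense features,
    embedding tables for sparse features, dot-product interaction, top MLP."""

    def __init__(self, num_dense=13, emb_dim=16,
                 table_sizes=(1000, 1000, 1000), bot=(512, 256, 64), top=(512, 256)):
        super().__init__()
        self.embs = nn.ModuleList([nn.Embedding(n, emb_dim) for n in table_sizes])
        # Bottom MLP projects the dense features down to the embedding dimension.
        dims = (num_dense, *bot, emb_dim)
        self.bot_mlp = nn.Sequential(*[layer for i in range(len(dims) - 1)
                                       for layer in (nn.Linear(dims[i], dims[i + 1]), nn.ReLU())])
        # Top MLP consumes the dense projection plus all pairwise dot products.
        n_vec = 1 + len(table_sizes)
        n_int = n_vec * (n_vec - 1) // 2
        dims = (emb_dim + n_int, *top, 1)
        layers = []
        for i in range(len(dims) - 1):
            layers += [nn.Linear(dims[i], dims[i + 1]), nn.ReLU()]
        layers[-1] = nn.Sigmoid()              # final activation predicts CTR
        self.top_mlp = nn.Sequential(*layers)

    def forward(self, dense, sparse):
        x = self.bot_mlp(dense)                                   # (B, emb_dim)
        vecs = [x] + [emb(sparse[:, i]) for i, emb in enumerate(self.embs)]
        T = torch.stack(vecs, dim=1)                              # (B, n_vec, emb_dim)
        Z = torch.bmm(T, T.transpose(1, 2))                       # pairwise dot products
        i, j = torch.triu_indices(T.shape[1], T.shape[1], offset=1)
        inter = Z[:, i, j]                                        # (B, n_int)
        return self.top_mlp(torch.cat([x, inter], dim=1))         # (B, 1)

if __name__ == "__main__":
    model = TinyDLRM()
    dense = torch.rand(8, 13)
    sparse = torch.randint(0, 1000, (8, 3))
    print(model(dense, sparse).shape)          # torch.Size([8, 1])
```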
- Tuning Compiler Flags for Custom Hardware
Benchmarking SPECint on FPGA

Introduction

With the growing interest in AI hardware for high-performance and power-efficient computing, understanding how industry-standard benchmarks perform on such platforms is critical. In this paper, we focus on SPECrate®2017 Integer workloads, a widely used CPU benchmark suite, and share a case study comparing various runs on an FPGA target: a base run and a tuned run that achieved better performance. This paper describes how the tuning and benchmarking procedure was executed, the challenges faced, and what we learned from this hands-on analysis.

Why SPECint on FPGA?

SPECrate®2017 Integer evaluates the integer processing capabilities of a CPU through a set of compute-intensive, single-threaded programs. Running these on an FPGA (with soft or hardened CPU cores) helps evaluate and tune how custom logic performs in realistic software scenarios, especially in workloads like compilers, compression, and AI preprocessing.

Benchmarking Setup
- Platform: FPGA
- Emulation: run on QEMU for pre-validation, native execution on the FPGA target
- Benchmark Suite: SPEC CPU2017
- Cross-compilation: all benchmarks built using a target toolchain with specmake, applying -static and a base set of flags
- Base Run: no tuning; baseline compiler flags, minimal memory tuning
- Optimized Run: enhanced compiler flags, better memory layout, cache tuning

Here's how the benchmarking was carried out:

Cross-Compilation of SPECrate®2017 Integer Benchmarks
- Ensured static linking for portability
- Verified ELF binaries using file and readelf

Execution with runspec
- Invoked with runspec --config=target.cfg --tune=base --size=test,train,ref for initial testing

Data Collection
- Captured runtime, SPEC score, and individual benchmark outputs
- Tracked CPU MHz and instruction counts using perf or hardware counters

Tricks
- Use math models to reduce run times of SPEC workloads.
- Get a sense of the test, train, and ref workloads and find a relation between them so there is no need to run ref every time (a minimal sketch of this idea follows below).
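A minimal sketch of the train-to-ref "relation" trick mentioned above: calibrate a per-benchmark scale factor once, then predict ref-size runtimes from fast train-size runs. The benchmark names are real SPECrate 2017 Integer components, but all runtimes below are hypothetical placeholders.

```python
"""Sketch: estimate SPEC 'ref' runtimes from 'train' runtimes.

The calibration numbers below are placeholders standing in for a few
measured runs on our target; they are illustrative, not SPEC data."""
calibration = {                       # benchmark: (train_seconds, ref_seconds)
    "505.mcf_r":  (120.0, 1450.0),    # hypothetical calibration measurements
    "525.x264_r": (95.0,  1105.0),
    "557.xz_r":   (88.0,   990.0),
}

scale = {name: ref / train for name, (train, ref) in calibration.items()}

def predict_ref(benchmark: str, train_seconds: float) -> float:
    """Predict a ref-size runtime from a measured train-size runtime."""
    return scale[benchmark] * train_seconds

if __name__ == "__main__":
    # A fresh tuning experiment: only the (much faster) train run is executed.
    print(f"predicted ref runtime: {predict_ref('505.mcf_r', 112.5):.0f} s")
```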
- Performance Modelling: How to Predict and Optimize System Efficiency
1. Introduction

In today's fast-paced digital world, system performance is critical to the success of applications ranging from cloud computing platforms to high-performance computing (HPC) workloads. Performance modelling is a powerful technique used to predict, analyze, and optimize the efficiency of computing systems. By simulating and understanding system behavior, developers, engineers, and IT managers can make informed decisions about design, scaling, and optimization strategies.

2. What is Performance Modelling?

Performance modelling is the process of creating abstract representations (models) of a system's behavior under various workloads and configurations. These models help predict how systems respond to changes in usage, hardware, software, or architecture. Performance models can be analytical, simulation-based, or empirical, each offering unique insights into system behavior.

3. Objectives of Performance Modelling
- Prediction: Estimate system behavior before deployment.
- Bottleneck Identification: Locate components that limit performance.
- Optimization: Inform design choices to improve efficiency.
- Capacity Planning: Guide resource allocation for current and future needs.
- Cost Efficiency: Avoid over-provisioning and reduce operational expenses.

4. Key Techniques in Performance Modelling
- Analytical Models: Use mathematical formulas to describe system performance.
- Simulation Models: Create detailed simulations to mimic system behavior over time. This could be as simple as equations with simple assumptions or using models available online.
- Empirical Models: Rely on real-world data and benchmarks to build predictive models. This is more involved, since it requires in-depth knowledge of the system architecture.

5. Steps in Developing a Performance Model
- Define Goals: Determine what you want to achieve (e.g., optimize response time, throughput).
- Collect Data: Gather metrics from logs, monitoring tools, or benchmarks.
- Choose a Modelling Technique: Decide between analytical, simulation, or empirical models.
- Build the Model: Construct the performance model using appropriate tools or software.
- Validate the Model: Compare predictions with actual performance to ensure accuracy.
- Analyze & Optimize: Use the model to explore different configurations and identify optimal settings.

6. Tools for Performance Modelling
- Queuing models for analyzing response times
- Simulators for detailed, event-based modeling
- Benchmarking suites for real-world performance data
- Profiling tools for low-level performance metrics

7. Applications of Performance Modelling
- High-Performance Computing (HPC): Optimize cluster performance and parallel job scheduling.
- Cloud Computing: Predict performance under varying loads and optimize resource allocation.
- Software Engineering: Improve application architecture and identify inefficient code paths.
- Enterprise IT: Plan for infrastructure upgrades and disaster recovery.

8. Challenges and Best Practices

Challenges:
- Model accuracy vs. complexity trade-off
- Data collection overhead
- Environmental variability

Best Practices:
- Keep models as simple as possible while maintaining accuracy
- Continuously validate models against real performance
- Use a combination of modelling techniques when necessary

9. Conclusion

Performance modelling is an indispensable approach for understanding, predicting, and optimizing system efficiency. Whether you're designing a new application, upgrading infrastructure, or managing a complex cloud environment, performance models can help you make better, data-driven decisions. By embracing the right modelling techniques and tools, organizations can improve performance, reduce costs, and deliver superior user experiences.
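As a concrete instance of the analytical models described in section 4, here is a minimal M/M/1 queuing sketch using the standard formulas (utilization rho = lambda/mu, mean response time W = 1/(mu - lambda)); the service rate and load values are purely illustrative.

```python
"""Minimal analytical performance model: an M/M/1 queue.

Standard results: utilization rho = lambda / mu, and mean response time
W = 1 / (mu - lambda), valid only while rho < 1."""
def mm1_response_time(arrival_rate: float, service_rate: float) -> float:
    if arrival_rate >= service_rate:
        raise ValueError("system is unstable: arrival rate >= service rate")
    return 1.0 / (service_rate - arrival_rate)

if __name__ == "__main__":
    mu = 120.0                         # requests/s a single server can handle
    for lam in (60, 90, 110, 118):     # offered load in requests/s
        rho = lam / mu
        w_ms = mm1_response_time(lam, mu) * 1000
        print(f"load {lam:>3} req/s  utilization {rho:4.2f}  mean response {w_ms:6.1f} ms")
```

The point of such a model is the shape of the curve: response time grows sharply as utilization approaches 1, which is exactly the kind of insight used for capacity planning before any hardware is provisioned.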
- Unleashing Performance Insights on ARM: Bringing Intel's PerfSpect to the Entire Ecosystem
Performance analysis can often feel like searching for a needle in a haystack. When your application isn't running as fast as you'd like, where do you even begin to look? Is it a memory bottleneck? Are you stalling in the CPU's front-end? Answering these questions is critical, but traditional tools can be complex and overwhelming. This is where Intel's PerfSpect comes in. And now, thanks to some recent contributions, this powerful tool is no longer just for x86 systems. I'm happy to share how I've been able to natively compile PerfSpect for the ARM architecture, enabling deep performance analysis on Neoverse-based platforms such as Ampere, AWS Graviton, Google Axion, NVIDIA Grace, and the Microsoft Cobalt series of processors.

Why PerfSpect? A Simpler Path to Performance Insights

PerfSpect is a lightweight, command-line performance analysis tool. Its primary strength lies in its use of the Top-Down Microarchitecture Analysis (TMA) methodology. Instead of drowning you in hundreds of raw performance counters, TMA provides a structured, hierarchical way to identify the primary bottleneck in your system. It breaks down CPU cycles into a few high-level categories:
- Front-End Bound: The CPU isn't getting instructions fast enough.
- Back-End Bound: Instructions are available, but the execution units are stalled. This is further broken down into:
  - Core Bound: The computation units are the bottleneck.
  - Memory Bound: The CPU is waiting on data from memory or caches.
- Retiring: The CPU is successfully executing instructions. This is the "good" category.
- Bad Speculation / Miss: The CPU wasted work on instructions that were ultimately discarded (e.g., due to branch misprediction).

By presenting performance data through this lens, PerfSpect makes it incredibly easy to pinpoint the character of your bottleneck and tells you exactly where to focus your optimization efforts.

The Competitive Landscape: How Does PerfSpect Compare?

PerfSpect doesn't exist in a vacuum. The Linux ecosystem is rich with powerful profiling tools like perf, Intel VTune Profiler, and AMD uProf. PerfSpect's unique value is its combination of simplicity, structured TMA methodology, and now, cross-architecture support. It provides actionable insights without the steep learning curve of raw perf or the complexity of a full-blown GUI profiler.

My Contribution: Native Support for ARM

My primary contribution was to port PerfSpect, enabling it to build and run natively on ARMv8/ARMv9 architectures. This involved mapping the ARM Performance Monitoring Unit (PMU) events to the TMA categories, allowing the same intuitive reporting to work seamlessly on platforms from Ampere, Amazon, and Microsoft. Now, developers can use a single, familiar tool to analyze workloads across different server fleets.

Get Started: Build and Run PerfSpect on ARM

Ready to try it on your ARM machine? Here's how you can get it up and running.

Prerequisites: ensure you have Python, pip, and the standard Linux performance tools installed.

# For Debian/Ubuntu-based systems
$ sudo apt-get update
$ sudo apt-get install -y python3 python3-pip linux-tools-common linux-tools-generic

Step 1: Clone the Repository

$ git clone -b Neoverse-native-support https://github.com/Whileone-Techsoft/PerfSpect.git
$ cd PerfSpect

Step 2: Build the Tools Docker Image for aarch64

$ ./builder/build.sh

Step 3: Build PerfSpect Natively on aarch64

$ make -j

(Figure: Sample TMA image for Graviton4)
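To show what the TMA methodology computes, here is an illustrative sketch of the generic, slot-based level-1 breakdown. Mapping real ARM PMU events onto these inputs is architecture-specific (that mapping is the substance of the port), and the counter values below are made-up placeholders.

```python
"""Illustrative top-down (TMA) level-1 breakdown from slot-based counters.

These are the generic level-1 equations; translating real ARM PMU events into
slots_issued / slots_retired / fetch_bubbles / recovery_bubbles is
architecture-specific. The numbers below are made-up placeholders."""
def tma_level1(total_slots, slots_issued, slots_retired,
               fetch_bubbles, recovery_bubbles):
    frontend_bound  = fetch_bubbles / total_slots
    bad_speculation = (slots_issued - slots_retired + recovery_bubbles) / total_slots
    retiring        = slots_retired / total_slots
    backend_bound   = 1.0 - frontend_bound - bad_speculation - retiring
    return {"Frontend Bound": frontend_bound,
            "Bad Speculation": bad_speculation,
            "Retiring": retiring,
            "Backend Bound": backend_bound}

if __name__ == "__main__":
    breakdown = tma_level1(total_slots=8_000_000, slots_issued=5_200_000,
                           slots_retired=4_800_000, fetch_bubbles=1_200_000,
                           recovery_bubbles=100_000)
    for category, share in breakdown.items():
        print(f"{category:16s} {share:6.1%}")
```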
- Building Observability-Driven Performance Benchmarking Frameworks
In complex computing environments spanning cloud, HPC, AI, and edge workloads, observability is no longer optional. With multiple layers of hardware and software working together, traditional monitoring alone cannot surface the insights needed for optimizing performance or preventing downtime. At Whileone Techsoft Pvt. Ltd., we help companies go beyond monitoring by building deep observability frameworks that connect performance benchmarking, system analytics, telemetry, and profiling. This integrated approach helps engineering teams gain complete visibility into their systems, enabling faster debugging, reduced operational costs, and enhanced end-user experiences.

Why Observability Matters

As infrastructures scale and workloads diversify, blind spots emerge, leading to undetected bottlenecks, slower root-cause analysis, and issues that surface only after they affect users. Observability addresses these challenges by providing end-to-end visibility into the state and behavior of your systems. This means you can detect issues earlier, understand their root causes, and fix them before they impact users.

Observability vs. Traditional Monitoring

Traditional monitoring answers the "what", for example, CPU utilisation or error counts. Observability goes deeper and answers the "why" behind performance issues. It focuses on three core pillars:
- Metrics – quantifiable measurements (e.g., latency, throughput)
- Logs – detailed event records for context
- Traces – understanding requests as they travel across distributed systems

At Whileone Techsoft, we layer these pillars with analytics, telemetry, and profiling to deliver actionable insights.

Performance Benchmarking: The Foundation

Our Performance Benchmarking Services form the cornerstone of observability, giving companies a reliable baseline of how their systems behave under realistic workloads. This data-driven approach uncovers bottlenecks early, before they become costly in production.

System Analytics for Deeper Understanding

Benchmarking generates performance data, but analytics transforms that data into insights. System analytics helps teams understand:
- How workloads utilize CPU, memory, I/O, and network resources
- Correlation between resource consumption and performance outcomes
- Trends and anomalies in system behavior over time

Our analytics frameworks leverage advanced models to identify optimization opportunities, ensuring your workloads perform consistently and reliably.

Telemetry for Real-Time Visibility

Telemetry extends observability by collecting live data from hardware, firmware, middleware, and applications. It:
- Captures fine-grained performance metrics continuously
- Enables proactive alerts for deviations from benchmarks
- Allows visualization of live system health through unified dashboards

Whileone's use of open standards like OpenTelemetry makes this telemetry layer scalable and interoperable with your existing tools.

Profiling for Root Cause Analysis

Even the best benchmarking and telemetry setups cannot replace profiling when you need detailed root cause analysis.
- System-level profiling: identifies hotspots in the kernel, drivers, or hardware interfaces
- Code-level profiling: finds inefficient functions, loops, or algorithms in the application stack

By correlating profiling data with benchmark and telemetry insights, we help engineering teams quickly diagnose and resolve performance regressions.
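As a small example of what the telemetry layer can look like in practice, here is a minimal sketch that emits benchmark metrics through the OpenTelemetry Python SDK. The metric names, the fake benchmark loop, and the console exporter are illustrative stand-ins for a real pipeline feeding OTLP-compatible dashboards; it assumes `pip install opentelemetry-sdk`.

```python
"""Minimal sketch: emitting benchmark telemetry with OpenTelemetry metrics."""
import random
import time

from opentelemetry import metrics
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import (ConsoleMetricExporter,
                                              PeriodicExportingMetricReader)

# Export accumulated metrics to the console every 5 seconds (swap for OTLP in production).
reader = PeriodicExportingMetricReader(ConsoleMetricExporter(),
                                       export_interval_millis=5000)
metrics.set_meter_provider(MeterProvider(metric_readers=[reader]))
meter = metrics.get_meter("benchmark.harness")

latency = meter.create_histogram("benchmark.request.latency", unit="ms",
                                 description="Per-request latency")
errors = meter.create_counter("benchmark.request.errors")

for _ in range(100):                        # stand-in for a benchmark loop
    ms = random.uniform(1.0, 20.0)
    latency.record(ms, {"workload": "redis-set"})
    if ms > 18.0:                           # arbitrary threshold for the example
        errors.add(1, {"workload": "redis-set"})
    time.sleep(0.01)
```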
An Integrated Observability Framework

At Whileone Techsoft, we integrate benchmarking, analytics, telemetry, and profiling into a single observability framework:
- Unified dashboards to correlate data across layers
- Automated workflows for continuous testing and monitoring
- Cross-silo visibility that spans hardware, system software, and applications

This holistic approach ensures reliable, high-performance outcomes for pre-silicon validation, cloud workload optimization, and edge deployments.

Benefits of Observability-Driven Benchmarking

Real-World Example

One of our semiconductor customers was struggling with inconsistent performance in their post-silicon validation phase. By deploying Whileone's observability-driven benchmarking framework, they were able to:
- Pinpoint compiler-level inefficiencies using code profiling: https://www.whileone.in/post/tuning-compiler-flags-for-custom-hardware
- Correlate memory bandwidth metrics from telemetry data with workload performance: https://www.whileone.in/post/investigating-performance-discrepancy-in-hpl-test-on-arm64-machines

Best Practices for Building Observability

Observability is no longer a "nice-to-have"; it's essential for ensuring reliable, high-performance systems. Whileone Techsoft Pvt. Ltd. brings together performance benchmarking, system analytics, telemetry, and profiling to build observability frameworks tailored for semiconductor companies, cloud providers, and software enterprises. Ready to take your performance engineering efforts to the next level? Reach out to us to learn how our observability-driven services can help you reduce costs, accelerate time-to-market, and achieve industry-leading performance.
- Understanding SPEC HPC Benchmarks: A Comprehensive Guide for Beginners
1. Introduction

High-Performance Computing (HPC) is at the core of solving complex computational problems in scientific research, engineering, and large-scale data analysis. Benchmarking plays a critical role in evaluating and optimizing HPC system performance. The Standard Performance Evaluation Corporation (SPEC) provides widely recognized benchmarking suites tailored for different computing environments, helping researchers, businesses, and hardware vendors assess system capabilities.

2. What is SPEC HPC Benchmarking?

SPEC HPC benchmarks are designed to measure the performance of high-performance computing systems under real-world workloads. Unlike general performance testing, SPEC HPC benchmarks focus on evaluating scalability, efficiency, and computational power across various hardware and software configurations. Key metrics include execution time, scalability efficiency, and energy consumption.

3. Why SPEC HPC Benchmarks Matter
- Evaluating Scalability & Efficiency: SPEC benchmarks measure how well HPC systems scale with increasing workloads.
- Benchmarking Real-World Applications: Unlike synthetic benchmarks, SPEC HPC benchmarks reflect real-world HPC workloads used in scientific and industrial applications.
- Standardization & Comparability: They enable fair performance comparisons between different architectures, compilers, and system configurations.

4. Key SPEC HPC Benchmark Suites
- SPEC MPI: Measures parallel computing performance using MPI-based workloads.
- SPEC OMP: Evaluates OpenMP-based applications for multi-threaded workloads.
- SPEC ACCEL: Assesses performance on GPUs and other accelerators.
- SPEC CPU: Focuses on single-thread and multi-thread performance in computational workloads.

5. How SPEC HPC Benchmarks Work
- Benchmark execution process: Benchmarks are executed under controlled conditions to ensure reproducibility.
- Setting up the testing environment: Includes configuring system parameters, compilers, and libraries.
- Running SPEC benchmarks on various HPC hardware: Executing the benchmark suite on HPC hardware to collect performance data.
- Collecting and analyzing results: Factors that impact benchmarking results include CPU, GPU, and memory performance; compiler optimizations and software configurations; and networking and storage bottlenecks.

6. Understanding Benchmark Results
- Interpreting SPEC Scores: Higher scores indicate better performance.
- Comparing Results: Performance ratios help compare different architectures and software configurations (a small sketch of how such scores are derived from ratios appears after the conclusion).
- Case Studies: SPEC benchmarks are widely used in industries like climate modeling, genomics, and engineering simulations to evaluate and improve HPC systems.

7. Best Practices for Running SPEC HPC Benchmarks
- Preparing an Optimized Benchmarking Environment: Ensure system settings and compiler options align with best practices.
- Choosing the Right SPEC Benchmark: Select the benchmark that aligns with the intended workload.
- Avoiding Common Mistakes: Properly setting up software and avoiding misinterpretation of results ensures accurate assessments.

8. Future Trends in SPEC HPC Benchmarking
- AI, ML, and Cloud Computing: Emerging workloads in artificial intelligence and machine learning are shaping future benchmarks.
- Heterogeneous Computing: SPEC is evolving to benchmark performance across GPUs, FPGAs, and new architectures like RISC-V.
- Upcoming Developments: Continuous updates in benchmarking methodologies are expected to keep pace with next-generation HPC innovations.

9. Conclusion

SPEC HPC benchmarks provide a standardized way to evaluate and compare HPC system performance. Businesses, researchers, and hardware vendors can leverage these benchmarks to optimize their computing infrastructure. For further exploration, SPEC's official website and research publications offer in-depth insights into benchmarking methodologies.
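As a concrete illustration of section 6, here is a small sketch of how SPEC-style scores are derived: each benchmark's ratio is the reference time divided by the measured time, and the overall score is the geometric mean of those ratios. The benchmark names and times below are placeholders, not official SPEC reference data.

```python
"""Illustrative SPEC-style scoring: per-benchmark ratio = reference time /
measured time; overall score = geometric mean of the ratios."""
from math import prod

reference = {"bench_a": 1000.0, "bench_b": 1600.0, "bench_c": 720.0}  # seconds (placeholder)
measured  = {"bench_a":  250.0, "bench_b":  400.0, "bench_c": 240.0}  # seconds (placeholder)

ratios = {name: reference[name] / measured[name] for name in reference}
score = prod(ratios.values()) ** (1.0 / len(ratios))   # geometric mean

for name, ratio in ratios.items():
    print(f"{name}: ratio {ratio:.2f}")
print(f"overall score (geomean): {score:.2f}")
```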
- YOLOX on RISC-V QEMU
Goal of this project: This project aims to determine RISC-V's readiness for running YOLOX for the latest edge requirements.

Target Application: Running YOLOX on RISC-V QEMU involves setting up a RISC-V virtual machine and then configuring the necessary environment to compile and run YOLOX. Please note that this is a complex process, and it's essential to have prior experience with virtualization and RISC-V development. The RISC-V website has a blog (https://riscv.org/blog/2023/07/yolox-for-object-detection/) that describes the steps to build and run YOLOX for a development board; these steps did not work as-is when running on QEMU. This blog assumes the reader is comfortable with a Linux-based host system (this guide is based on Ubuntu 22.04).

Step 1: Install QEMU and Set Up a RISC-V Virtual Machine

First, you need to install QEMU and the RISC-V toolchain. You can do this by running:

sudo apt-get install qemu-system-riscv

In this step, you'll create a RISC-V virtual machine using QEMU. You'll need a RISC-V disk image for this; you can find pre-built RISC-V images for various Linux distributions online, or build your own RISC-V image if you prefer.

wget https://cdimage.ubuntu.com/releases/22.04/release/ubuntu-22.04.3-preinstalled-server-riscv64+unmatched.img.xz
xz -d ubuntu-22.04.3-preinstalled-server-riscv64+unmatched.img.xz
# Rename the QEMU image
mv ubuntu-22.04.3-preinstalled-server-riscv64+unmatched.img riscv-ubuntu2204.img
qemu-img resize riscv-ubuntu2204.img +16G

Launch the QEMU VM as follows:

qemu-system-riscv64 -nographic -machine virt -m 16G -append "root=/dev/vda rw" -drive file=riscv-ubuntu2204.img,if=none,format=raw,id=hd0 -device virtio-blk-device,drive=hd0 -device virtio-net-device,netdev=net0 -netdev user,id=net0

This will boot the RISC-V VM with 16GB of RAM.

Step 2: Configure the Python Environment

Once the VM is up and running, log in and set up your RISC-V development environment. You may need to install additional dependencies, which vary depending on the distribution and the version. Most of the packages that Python software depends on can be installed with pip. You can run the following command to install pip:

apt install python3-pip

Before installing other Python packages, install the venv package, which is used to create a Python virtual environment:

apt install python3.11-venv

Create a Python virtual environment and activate it:

cd /root
python3 -m venv yolox
source /root/yolox/bin/activate

Step 3: Install the Necessary whl Packages

The Python ecosystem for the RISC-V architecture is still lacking. We created build packages so they can be installed directly on Python 3.11.

Step 4: Build and Run YOLOX

Next, clone the YOLOX repository inside your RISC-V QEMU VM:

git clone https://github.com/Megvii-BaseDetection/YOLOX

Navigate to the YOLOX directory and build the YOLOX code. This step may involve installing additional dependencies and configuring the build for the RISC-V architecture.

cd YOLOX
make

With YOLOX successfully built, you can now run it on your RISC-V system. You'll need to adapt the YOLOX commands to work with your specific use case and input data. Standard models: https://github.com/Megvii-BaseDetection/YOLOX#standard-models. In this example, yolox_s is downloaded.
wget https://github.com/Megvii-BaseDetection/YOLOX/releases/download/0.1.1rc0/yolox_s.pth -P /home/ubuntu/

python3 tools/demo.py image -n yolox-s -c /home/ubuntu/yolox_s.pth --path assets/demo.png --conf 0.25 --nms 0.45 --tsize 640 --save_result --device cpu

# Output Logs
2023-09-15 17:05:49.803 | INFO | __main__:main:269 - Model Summary: Params: 8.97M, Gflops: 26.93
2023-09-15 17:05:49.860 | INFO | __main__:main:282 - loading checkpoint
2023-09-15 17:05:53.884 | INFO | __main__:main:286 - loaded checkpoint done.
2023-09-15 17:06:24.598 | INFO | __main__:inference:165 - Infer time: 30.0775s
2023-09-15 17:06:24.708 | INFO | __main__:image_demo:202 - Saving detection result in ./YOLOX_outputs/yolox_s/vis_res/2023_09_15_17_05_53/demo.png

We would like to hear from you if this blog was useful to you. Please contact us at info@whileone.in. We would be happy to understand and discuss your requirements and showcase our expertise in a variety of cloud and edge technologies.
- Bring up Yocto for RISC-V deployment
We at Whileone Techsoft Pvt. Ltd. understood the requirements of our customer, who wanted a basic Yocto-based RISC-V deployment for their custom SoC chip. The customer intended to share this basic deployment with their clients, who wished to make use of our customer's SoC in their products. Our customer was unfamiliar with Yocto and what was needed to ensure a favorable deployment. They had their own custom-patched Linux kernel, root file system, toolchain, custom bootloader, and a custom simulator to boot the final image. Their client insisted on Yocto instead of their default Buildroot deployment. As Yocto has its own tools, compiler, and dependencies, the challenge was to ensure the final image generated by Yocto was compatible enough to be run by their custom simulator.

Introduction to Yocto: With the open-source Yocto Project, we can create custom Linux-based systems for embedded products. It is quite possible to tailor the Linux images as per requirements with a set of flexible tools and friendly, customizable scripts. Yocto provides a reference embedded distribution called 'Poky' that was used for this project.

The customer's custom-patched Linux kernel was much older than the current version available on kernel.org. So, when we initially went with the latest Yocto version (Mickledore, v4.2), which featured GCC 12.x, we got errors during the kernel build. The errors pointed to some unknown assembly instructions; the reason was that our custom kernel version was old and had not been updated. As the customer was already using GCC 11.x in their build infrastructure, we searched for the Yocto version that provided the nearest GCC 11.x, which turned out to be Yocto Honister (v3.4). An initial test build was successful with Honister, so we finalized this version before moving ahead.

Yocto uses BitBake as its build tool. Whenever we plan to create recipes in Yocto, we should create a separate folder inside poky that starts with "meta-", as per the Yocto manual. Referring to similar meta folders like meta, meta-yocto-bsp, and meta-poky, we came up with our own "meta-riscv-custom". To add a new meta layer, make use of the bitbake-layers command, such as the one given below:

$> bitbake-layers add-layer meta-riscv-custom

Yocto Configuration Options: As we were using the sample Poky distribution of Yocto, to let Poky know that we intend to use our custom "meta-riscv-custom" folder in the build process, we have to update the file "bblayers.conf" in the build/conf directory. This build/conf directory is generated after we initialize the environment by executing "source oe-init-build-env" in the Poky root folder. We also have to modify the variable "MACHINE", among others, in "build/conf/local.conf" to "qemuriscv64" and comment out the default value. There are other options in local.conf that we can modify to get image output in a desired format; setting IMAGE_FSTYPES = "tar cpio" will generate the image in both tar and cpio formats, which is especially useful when we want to generate a root file system in those formats.

Creating recipes: Recipes are like script files created under the meta- folders, such as "riscv-linux.bb", a recipe for building the Linux kernel, and "riscv-boot.bb" for building the bootloader.

Custom changes: The customer was also interested to know how one could add a custom directory and files and make custom changes to existing files in the file system.
Yocto has its own package group recipe file, "packagegroup-core-boot.bb", that can be modified. For example:
1. We can disable UDEV by commenting it out.
2. Similarly, we can also comment out HWCLOCK in the same file.

To create a custom folder "custom-riscv" inside root ("/") and a file named "custom.conf" with some configuration options and comments, we had to modify the recipe file "base-files_3.0.14.bb".

Build: Yocto uses BitBake as its build tool. To build, we make use of the following commands:

$> bitbake -cclean riscv-linux
$> bitbake riscv-linux

Note that the command takes the recipe name without the ".bb" extension. Also, if we had not added the folder path of "meta-riscv-custom" to bblayers.conf, we would get an error after running the above command.

Build artifacts: The artifacts are generated in the work directory under the path:
Poky/build/tmp/work/riscv64-poky-linux/riscv-linux/1.0-r0/custom-linux/*
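Putting the configuration steps above together, a minimal sketch of the two files we touched might look like the following; the layer path is a placeholder, and the values are the ones discussed above.

```
# build/conf/bblayers.conf -- after running `bitbake-layers add-layer meta-riscv-custom`
BBLAYERS += " \
  /path/to/poky/meta-riscv-custom \
"

# build/conf/local.conf
MACHINE = "qemuriscv64"         # replaces the commented-out default machine
IMAGE_FSTYPES = "tar cpio"      # emit the image in both tar and cpio formats
```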
- GCP Cloud Performance: Time-Based Score Variations
In May 2022, one of our customers asked us to tune Elasticsearch with Esrally across cloud providers. We started by trying multiple combinations of manual runs on all cloud providers, collecting scaling runs with 2/4/8/16 cores. In this data collection we could not see proportionate scores, so we decided to experiment with running the Elasticsearch Esrally benchmark throughout the day. As Esrally doesn't run for a particular duration, we carried out the runs 50 times so that they span a whole day. And here is what we saw!

Configurations used:
- Altra: t2a-standard-16
- Intel Icelake: n2d-standard-16
- Milan: c2d-standard-16
- Elasticsearch 8.4.1
- Esrally 2.6.0
- Server: Altra, Intel Icelake, or Milan
- Client: Altra in all cases

Variation is observed according to the time of day. AMD is the best, with the lowest standard deviation, but Intel and Altra show large standard deviations.

The NGINX-wrk benchmark also shows this behaviour on GCP. NGINX-wrk runs were carried out 1440 times, keeping each run's duration at 60 seconds. Variation in p95 latency is observed through the time of day: both Intel and Altra show a 10% standard deviation in p95 latency numbers.

Do consider time-based score variations before running network applications: time of day does affect latency, since neighboring VMs might be busy or idle depending on the time of day, and run-to-run variation is a function of the time of day. Eventually, we were able to help the customer figure out where the performance difference was coming from. To ensure consistent output throughout the day, scaling the VMs was suggested.
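A small sketch of how such repeated runs can be summarized: compute the mean, the standard deviation (as a percentage of the mean), and the p95 across all scores collected over a day. The random scores below merely stand in for the parsed Esrally/wrk results.

```python
"""Sketch: quantifying time-of-day variation across repeated benchmark runs.

`scores` would normally be parsed from the 50 Esrally (or 1440 wrk) result
files; random data stands in here purely to make the snippet runnable."""
import random
import statistics

scores = [random.gauss(100, 8) for _ in range(50)]    # placeholder run scores

mean = statistics.mean(scores)
sd = statistics.stdev(scores)
p95 = sorted(scores)[int(0.95 * len(scores)) - 1]     # simple p95 estimate

print(f"runs: {len(scores)}  mean: {mean:.1f}  stddev: {sd:.1f} "
      f"({100 * sd / mean:.1f}% of mean)  p95: {p95:.1f}")
```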
- Network compute agnostic Performance Analysis for Cloud workloads
At Whileone we take pride in our customers' success. We help customers achieve their goals and execute the out-of-the-box ideas that are necessary for success. One such project was to get IPC numbers for cloud applications on different architectures, completely omitting the network stack. This would give the RISC-V chip-designing customer a good picture of whether their architecture's IPC (Instructions per Cycle) is in line with competition like Intel or ARM.

To achieve this, we modified cloud applications to profile and benchmark performance with no network or socket calls. The idea was to compare the performance of different architectures using the vanilla versions and the lite (modified) versions. This lets the customer run these applications on their simulator, obtain the IPC number of their architecture for that application, and compare it with the competition.

To give you an example, one of the applications we picked was Redis, a cache server application. Redis takes SET/GET requests from clients and processes them internally to keep a cached copy for quick responses. To remove the network part, we simulated the client so that Redis behaves as if it has received N SET/GET requests and processes them. The resulting performance numbers are solely for the application's processing on that architecture. This eliminated network noise and gave a good picture of the IPC for core application processing.

The table below shows IPC for Redis vs. Redis-Lite. The differences can be attributed to the networking sockets being removed.

| SET | Redis (Graviton2) | Redis (Intel 8275 Cascade Lake) | Redis-Lite (Graviton2) |
|---|---|---|---|
| IPC | 0.94 | 0.69 | 1.76 |
| Icount / packet | ~39000 | ~30000 | ~20200 |

In doing so, we made sure that we did not modify the program logic or core behavior of the application in any way. We could see a similar call stack for Redis and Redis-Lite; the snapshots are below.

(Figure: Redis flame graph)
(Figure: Redis-Lite flame graph)

As is evident from the flame graphs, the call stack of the core application is not altered; in the Redis-Lite flame graph, the network component is simply absent. Redis is a single-threaded application, and we also helped the customer port various multi-threaded / multi-process applications. The customer was able to cross-compile these applications and run them on their RISC-V simulator. This was an interesting experiment from the performance-numbers point of view and useful for the customer in the early phase of chip development, helping them understand where they stand with respect to the competition.
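A minimal sketch of the arithmetic behind the table: IPC is instructions divided by cycles, and instruction count per packet is instructions divided by the number of simulated requests. The counter values below are hypothetical placeholders, chosen to reproduce the Redis-Lite column.

```python
"""Sketch: deriving IPC and instructions-per-request from captured counters.

The counter values are placeholders standing in for `perf stat` (or the
simulator's counter) output for a run that processes `num_requests`
simulated SET/GET requests."""
def summarize(instructions: int, cycles: int, num_requests: int) -> None:
    ipc = instructions / cycles
    icount_per_request = instructions / num_requests
    print(f"IPC: {ipc:.2f}   instructions/request: {icount_per_request:,.0f}")

if __name__ == "__main__":
    # Hypothetical Redis-Lite run: 1,000,000 simulated requests.
    summarize(instructions=20_200_000_000, cycles=11_477_000_000,
              num_requests=1_000_000)
```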












