- Adapting Strapi CMS for your implementation
Strapi is an open-source, API-first headless Content Management System (CMS) designed to manage structured data and expose it through RESTful services. In modern web development, Strapi excels at decoupling content management from the presentation layer. This allows teams to iterate on backend data models without requiring a full rebuild or redeploy of the frontend.

Why Strapi?

For this implementation, we required a solution that balanced developer flexibility with a user-friendly interface for operational teams. Key drivers included:

- Dynamic Modeling: Seamlessly managing site registries, milestones, and build-date metadata.
- Operational Autonomy: A built-in admin panel that allows non-technical users to manage data.
- Workflow Automation: Leveraging lifecycle hooks to automate internal notifications.
- Scalability: A robust plugin ecosystem for authentication and email integration.

Installation and Setup of Strapi

Prerequisites: Node.js, npm

Our environment utilizes Strapi v5 configured with TypeScript for type safety, custom plugins, and advanced lifecycle logic.

```bash
# Initialize the project
npx create-strapi-app@latest my-project
```

To ensure environment consistency, the service is containerized via Docker:

- Internal container port: 1337
- External host port: 8008

Content Modeling: Data Architecture

We designed the schema to ensure data integrity while remaining flexible enough for changing project requirements.

- Sites. Fields: site_name, dc_code, region, site_owner_name, site_owner_email. Purpose: site registry and site-ownership attribution.
- Milestone. Fields: display_name, sort_order, owner_person_name, owner_person_mail_id. Purpose: dynamic milestone definitions and milestone-owner mapping.
- Build Dates. Fields: site_name, dc_code, architecture, system_count, milestones. Purpose: build scheduling records per site/DC.
Component in Use

Milestone is a repeatable component attached to Build Dates. This structure supports dynamic milestone columns without redesigning the Build Dates schema whenever milestone definitions change. Its fields are:

- display_name
- date
- risk_enabled
- risk_notes (conditionally required when risk is enabled)

Authentication and Roles

Built-in authentication

The project includes Strapi authentication foundations through:

- Admin JWT secret configuration.
- The users & permissions plugin dependency.

Strapi supports registration/login and JWT-based flows. In this project, authentication is especially relevant for admin workflows and role-driven data visibility.

Invitation and onboarding workflow

The project extends admin user provisioning through email invite logic:

- An admin invite is sent when the admin creates a user.
- A column owner invite is sent when a milestone owner is assigned.
- A site owner invite is sent when a site owner is assigned.

Role-based access control used in project logic

The admin UI differs by role in two major ways:

- Left-navigation visibility for collection types.
- Data-level visibility in Build Dates (row filtering and milestone filtering).

Custom role-sensitive filtering is implemented in Build Dates lifecycles:

- Super Admins receive unrestricted access.
- Site owners are filtered to their owned sites.
- Milestone owners are filtered to assigned milestones.
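As an illustration, a repeatable component like the Milestone component described above might be declared in a Strapi component schema along these lines. The collectionName, types, and defaults here are assumptions based on the field list, not the project's actual schema; the "conditionally required when risk is enabled" rule for risk_notes would be enforced in custom validation or lifecycle logic, since the schema alone cannot express it.

```json
{
  "collectionName": "components_build_milestone",
  "info": {
    "displayName": "Milestone"
  },
  "attributes": {
    "display_name": { "type": "string", "required": true },
    "date": { "type": "date" },
    "risk_enabled": { "type": "boolean", "default": false },
    "risk_notes": { "type": "text" }
  }
}
```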
- Simplify Kubernetes Management with Python: Managing Kubernetes with Python
Kubernetes has become the de facto standard for container orchestration, powering modern cloud-native applications. However, managing Kubernetes clusters can be complex and time-consuming, especially when dealing with multiple environments or automating repetitive tasks. Fortunately, Python offers a powerful way to simplify Kubernetes management through automation and scripting. In this article, we will explore how you can leverage Python to streamline your Kubernetes operations. We will cover practical examples, tools, and best practices to help you get started with managing Kubernetes using Python effectively.

Managing Kubernetes with Python: Why It Matters

Kubernetes management involves tasks such as deploying applications, scaling workloads, monitoring cluster health, and managing resources. Doing these manually through the Kubernetes dashboard or `kubectl` commands can be error-prone and inefficient. Python, with its rich ecosystem and readability, provides an excellent option for automating these tasks. By using Python scripts, you can deploy applications repeatably, scale workloads on demand, and monitor cluster health programmatically.

One of the key enablers for this is the Python Kubernetes client, a comprehensive library that allows you to interact with the Kubernetes API directly from Python code. This client abstracts the complexity of the API calls and provides a user-friendly interface to manage your clusters.

Getting Started with the Python Kubernetes Client

To begin managing Kubernetes with Python, you first need to install the official Kubernetes client library. You can do this easily using pip:

```bash
pip install kubernetes
```

Once installed, you can write Python scripts that connect to your Kubernetes cluster. The client supports various authentication methods, including kubeconfig files and in-cluster configurations.
Here is a simple example that lists all pods in the default namespace:

```python
from kubernetes import client, config

# Load kubeconfig and initialize the client
config.load_kube_config()
v1 = client.CoreV1Api()

# List pods in the default namespace
pods = v1.list_namespaced_pod(namespace="default")
for pod in pods.items:
    print(f"Pod name: {pod.metadata.name}")
```

This script demonstrates how straightforward it is to interact with Kubernetes resources using Python. You can extend this approach to create, update, or delete resources as needed.

Automating Common Kubernetes Tasks with Python

Automation is where Python truly shines in Kubernetes management. Here are some practical examples of tasks you can automate:

1. Deploying Applications

You can write Python scripts to create deployment objects, set replicas, and manage container images. This is useful for continuous deployment pipelines.

```python
from kubernetes import client, config
from kubernetes.client import (
    V1Deployment, V1DeploymentSpec, V1PodTemplateSpec,
    V1ObjectMeta, V1Container, V1LabelSelector,
)

config.load_kube_config()

deployment = V1Deployment(
    metadata=V1ObjectMeta(name="nginx-deployment"),
    spec=V1DeploymentSpec(
        replicas=3,
        selector=V1LabelSelector(match_labels={"app": "nginx"}),
        template=V1PodTemplateSpec(
            metadata=V1ObjectMeta(labels={"app": "nginx"}),
            spec=client.V1PodSpec(
                containers=[V1Container(name="nginx", image="nginx:1.14.2")]
            ),
        ),
    ),
)

apps_v1 = client.AppsV1Api()
apps_v1.create_namespaced_deployment(namespace="default", body=deployment)
print("Deployment created successfully.")
```

2. Scaling Workloads

Adjusting the number of replicas in a deployment can be automated based on metrics or schedules.
```python
from kubernetes import client, config

# Load kubeconfig and initialize the client
config.load_kube_config()

def scale_deployment(name, namespace, replicas):
    apps_v1 = client.AppsV1Api()
    deployment = apps_v1.read_namespaced_deployment(name, namespace)
    deployment.spec.replicas = replicas
    apps_v1.patch_namespaced_deployment(name, namespace, deployment)
    print(f"Scaled deployment {name} to {replicas} replicas.")

scale_deployment("nginx-deployment", "default", 5)
```

3. Monitoring and Alerts

You can fetch pod statuses and send alerts if any pods are in a failed state.

```python
# Reuses the CoreV1Api client (v1) initialized in the first example
pods = v1.list_namespaced_pod(namespace="default")
for pod in pods.items:
    if pod.status.phase != "Running":
        print(f"Alert: Pod {pod.metadata.name} is in {pod.status.phase} state.")
```

These examples illustrate how Python scripts can replace manual commands, saving time and reducing errors.

Best Practices for Managing Kubernetes with Python

To make the most of Python in Kubernetes management, consider best practices such as:

- Handle Kubernetes API errors explicitly and add retries for transient failures.
- Keep credentials and kubeconfig files out of your scripts; rely on standard authentication mechanisms.
- Test automation against a non-production cluster or namespace before rolling it out.
- Version-control your automation scripts alongside your manifests.

By following these guidelines, you can build robust automation tools that enhance your Kubernetes management workflow.

Expanding Your Kubernetes Automation Toolkit

Beyond the basic client library, there are additional Python tools and frameworks that can further simplify Kubernetes management:

- Kopf: a Python framework for writing Kubernetes operators, allowing you to extend Kubernetes with custom controllers.
- Helm with Python: use Python scripts to automate Helm chart deployments and upgrades.
- kubectl wrapper libraries: some Python libraries wrap `kubectl` commands for easier scripting.

Exploring these tools can help you build more sophisticated automation solutions tailored to your specific needs. Managing Kubernetes clusters can be complex, but with Python, you gain a powerful ally to simplify and automate your workflows. Whether you are deploying applications, scaling services, or monitoring cluster health, Python scripts can save you time and reduce errors. Start by exploring the Python Kubernetes client and experiment with small automation tasks.
Over time, you can build a comprehensive toolkit that makes Kubernetes management more efficient and reliable.
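As a cluster-free sketch of the "scale based on metrics" idea from the scaling section above, here is the proportional rule Kubernetes' Horizontal Pod Autoscaler applies, in plain Python. The function and argument names are illustrative, not part of the kubernetes client library; a real automation script would feed this decision into `patch_namespaced_deployment`.

```python
import math

def desired_replicas(current_replicas: int, current_metric: float, target_metric: float) -> int:
    """HPA-style rule: scale replica count proportionally to metric pressure."""
    if target_metric <= 0:
        raise ValueError("target_metric must be positive")
    return max(1, math.ceil(current_replicas * current_metric / target_metric))

# If 3 replicas run at 90% CPU against a 60% target, scale up to 5.
print(desired_replicas(3, 90.0, 60.0))
```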
- Our experiences with running Strapi in cluster mode
One way to scale a Node-based system is to run multiple instances of the server. This approach also works well for Strapi because Strapi doesn't store anything in memory on the server side (no sticky sessions); the JWT tokens it issues persist in the database. So, any time we observe a Strapi setup struggling to handle the request load on the Node side, we add more instances running the same code. For setups with a predictable workload, pm2 offers a simple way to manage multiple Strapi server processes. However, when we ran Strapi in cluster mode via pm2, we realized we needed to be careful about a few things we hadn't encountered when running a single Strapi instance.

1. Issues encountered

1.1 Strapi schema changes at startup

With Strapi, the schema is stored within our code. As a result, every time Strapi starts, it ensures the underlying database schema is brought in sync with the schema defined in code. Additionally, any data-migration scripts (stored in the `database/migration/` folder within the code repository) are run at Strapi startup (if not previously run). Because of this design, running Strapi in cluster mode caused more than one Strapi process to trigger these database-side changes. This led to issues we had not observed when running a single Strapi instance.

1.2 Strapi cron jobs

Strapi can be [configured](https://docs.strapi.io/dev-docs/configurations/cron) to run cron jobs. This is a very helpful feature because it allows us to keep our task-scheduling setup alongside our CMS setup (no separate code repository, infrastructure, or devops for CMS-specific scheduled jobs). However, when we ran Strapi in cluster mode, each running Strapi instance triggered the scheduled cron jobs. As a result, on our setup with four Strapi instances, a scheduled task to trigger email alerts ended up sending four emails!

2.
Solution Approach

To solve the issues detailed above, we wanted a solution that would help us achieve the following:

- A way for a running Strapi instance to identify itself as either a `primary` or a `secondary` server. This would allow only the `primary` instance to perform tasks like triggering cron jobs.
- A way to initialize only a single Strapi instance first; the rest of the Strapi instances should start only after the first one is up and running. This ensures that initialization tasks like running database migration scripts or model schema sync aren't performed more than once.

3. Implementation

3.1 Segregating the running Strapi instances into primary & secondary

We achieved this with the `pm2` variable `NODE_APP_INSTANCE`. When `pm2` starts node processes, it assigns a unique, incrementing value to `process.env.NODE_APP_INSTANCE` in each of them: the first started instance gets the value `0`, the second gets `1`, and so on. So, with the following check, a running Strapi process can identify whether it is the `primary` or a `secondary`:

```javascript
const { sendEmailAlerts } = require('./cronSendEmailAlerts');

module.exports = {
  sendEmailAlerts: {
    task: async () => {
      // Only the primary pm2 instance (NODE_APP_INSTANCE === "0") sends alerts.
      // Outside pm2, NODE_APP_INSTANCE is undefined, so a standalone run also sends them.
      if (
        typeof process.env.NODE_APP_INSTANCE === 'undefined' ||
        parseInt(process.env.NODE_APP_INSTANCE) === 0
      ) {
        return await sendEmailAlerts();
      }
      return false;
    },
    options: {
      rule: '59 11 * * *',
    },
  },
};
```

3.2 Controlling the sequence of starting Strapi instances

To start only a single Strapi instance first and start the other instances later, we leveraged the `pm2` API `sendDataToProcessId()`. This API enables inter-process communication between pm2-initialized processes.
So, instead of starting Strapi via the regular `strapi start`, we wrote a script in which:

- The first Strapi instance starts right away, but the other Strapi instances wait for a signal from the first instance.
- The first Strapi instance sends a signal to the rest of the Strapi instances once it is up and running.

```javascript
#!/usr/bin/env node
'use strict';

const strapi = require('@strapi/strapi');
const pm2 = require('pm2');

let performStrapiStart = false;

// Logic for starting the primary instance
if (parseInt(process.env.NODE_APP_INSTANCE) === 0) {
  if (!performStrapiStart) {
    // Start the primary Strapi instance
    performStrapiStart = true;
    strapi().start();
  }
  pm2.list((err, list) => {
    const procStrapi = list.filter((p) => p.name == process.env.PM2_APP_NAME);
    // Check every 500 ms whether Strapi has started
    const intervalCheckPrimaryInit = setInterval(function () {
      // global.strapi.isLoaded turns true once Strapi is running
      if (global.strapi.isLoaded) {
        clearInterval(intervalCheckPrimaryInit);
        // Time to tell the rest of the running Strapi instance processes
        // to start Strapi
        for (let s = 0; s < procStrapi.length; s++) {
          if (parseInt(procStrapi[s].pm2_env.pm_id) !== parseInt(process.env.NODE_APP_INSTANCE)) {
            pm2.sendDataToProcessId(
              procStrapi[s].pm_id,
              {
                data: { primaryInitDone: true },
                topic: 'process:msg',
                type: 'process:msg',
              },
              (err, res) => {
                if (err) console.log(err);
              }
            );
          }
        }
      }
    }, 500);
  });
}
// Logic for starting the secondary Strapi instances
else {
  process.on('message', function (data) {
    if (!performStrapiStart && data.data.primaryInitDone) {
      performStrapiStart = true;
      strapi().start();
    }
  });
}
```

On starting Strapi via the above script using `pm2` in cluster mode, we could now control the startup sequence of the Strapi instances.

4. Conclusion

Running Strapi in cluster mode via `pm2` allows us to scale our CMS setup, but having more than one Strapi instance running can cause some issues.
Being able to uniquely identify each running Strapi instance and enable inter-process communication between them lets us adequately solve any issues resulting from a multiple-instance setup.
- AI assistant for Beaglebone using LLM
Introduction

For this project, I used llama.cpp as the local inference engine and TinyLlama-1.1B-Chat-v1.0 as the language model. llama.cpp is a lightweight C/C++ inference framework designed to run LLMs locally with minimal setup across CPUs and GPUs. It is well suited to embedded and edge-oriented workflows because it supports efficient local execution without depending on cloud APIs. The TinyLlama model used here is the chat-tuned 1.1B-parameter variant published on Hugging Face under the Apache 2.0 license. The model is available in GGUF format, which was introduced by the llama.cpp team as a replacement for GGML, a format llama.cpp no longer supports: https://huggingface.co/TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF

For controlled tool-planning and summarization tasks, we launch the model in non-conversation mode so that it behaves like a constrained completion engine rather than an interactive chatbot. In the llama.cpp tooling, the completion-style path is intended for prompt-to-output generation, while the chat-oriented tools enable conversation behavior and chat templates.
In my workflow, I use two prompt templates: a planner template for tool selection and a summarizer template for condensing tool output. A typical launch pattern looks like this:

```bash
llama-cli.exe ^
  -m .\tinyllama.gguf ^
  -f .\planner_prompt.txt ^
  -jf .\planner_schema.json ^
  -n 160 ^
  --temp 0.1 ^
  --top-p 0.9 ^
  --simple-io ^
  -no-cnv ^
  --no-display-prompt ^
  --no-warmup
```

Here, the important options are:

- -m loads the GGUF model
- -f passes a prompt template from a file
- -jf constrains output with a JSON schema
- -no-cnv disables conversation mode
- --simple-io makes subprocess integration cleaner
- --no-warmup reduces startup overhead during rapid testing
- --temp 0.1 makes the model highly deterministic, choosing the most likely token almost every time (great for summarization)
- --top-p 0.9 is a sampling safeguard

For summarization, the same pattern can be reused with a different template file:

```bash
llama-cli.exe ^
  -m .\tinyllama.gguf ^
  -f .\summarizer_prompt.txt ^
  -n 80 ^
  --temp 0.1 ^
  --top-p 0.9 ^
  --simple-io ^
  -no-cnv ^
  --no-display-prompt ^
  --no-warmup
```

This approach works well because the model is not asked to "know everything" about the target device. Instead, it performs two narrower jobs:

1. Study the user query and select the correct tools
2. Summarize the evidence returned by those tools

That makes the system much more reliable than allowing free-form generation. The Linux ground truth still comes from deterministic commands executed over SSH, while TinyLlama is used mainly for interpretation and orchestration.

System Architecture: Key Components

2.1 Llama Builder

The builder uses an LLM to decide which diagnostic tools should run. Example prompt:

User question: Is HDMI connected?
Allowed tools: hdmi.status

Planner output:

```json
{
  "tools": [ { "name": "hdmi.status", "args": {} } ],
  "confidence": 0.95
}
```

The builder does not generate commands. It only selects from predefined tools, which prevents unsafe command execution.

2.2 Tool Registry

Each diagnostic tool is defined in the engine.
Example: hdmi.status. Command executed on the device:

```bash
for f in /sys/class/drm/*HDMI*/status; do
  printf "%s: %s\n" "$f" "$(cat "$f")"
done
```

The registry maps the tool name to the command and an output parser.

2.3 SSH Execution Engine

The engine connects to the device using SSH. Instead of passing long shell commands through SSH arguments, the command is sent via standard input. Example:

```bash
ssh debian@192.168.1.11 bash -s --
```

The command script is then streamed to the remote shell. This approach avoids complex quoting issues across Windows, SSH, and Bash.

2.4 Parsing

Raw Linux output is converted into structured data. Example raw output:

/sys/class/drm/card0-HDMI-A-1/status: disconnected

Parsed result:

```json
{
  "connected": false,
  "entries": [
    { "path": "/sys/class/drm/card0-HDMI-A-1/status", "status": "disconnected" }
  ]
}
```

Structured evidence makes the results easier to analyze and summarize.

2.5 LLM Response

Finally, the LLM converts evidence into a concise explanation. Example output:

- HDMI is disconnected.
- Checked DRM HDMI status from sysfs.
- Connector /sys/class/drm/card0-HDMI-A-1/status reported disconnected.

The response is constrained to use only the evidence and not invent information.

Example Questions the System Can Answer

- 3.1 Display diagnostics: Is HDMI connected?
- 3.2 Sensor discovery: Are any sensors detected?
- 3.3 Performance analysis: Is the system overloaded?
- 3.4 Networking: What is the IP address of the device?
- 3.5 Service health: Are any system services failing?

GUI Screenshots
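As an aside on the parsing step in section 2.4, the raw-to-structured transformation can be sketched in a few lines of Python. This is an illustrative reimplementation with an invented function name, not the project's actual parser.

```python
def parse_hdmi_status(raw: str) -> dict:
    """Turn 'path: status' lines from sysfs into structured evidence."""
    entries = []
    for line in raw.strip().splitlines():
        path, _, status = line.partition(": ")
        entries.append({"path": path, "status": status.strip()})
    return {
        # HDMI counts as connected if any connector reports "connected"
        "connected": any(e["status"] == "connected" for e in entries),
        "entries": entries,
    }

raw = "/sys/class/drm/card0-HDMI-A-1/status: disconnected"
print(parse_hdmi_status(raw))
```

Feeding the parsed dict (rather than raw shell output) to the summarizer keeps the LLM constrained to structured evidence.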
- Success Story: How We Built a Trusted SRE Partnership with Our Client
In the world of Site Reliability Engineering (SRE), trust, knowledge, and execution matter more than anything else. When our team was presented with the opportunity to support one of the leading clients in the inference systems domain, we knew the competition would be fierce. Many well-established and much larger organizations were bidding for the same project. Yet, we saw this as an opportunity to prove that expertise, dedication, and the right approach can outweigh size and scale. Despite being a relatively small organization, we brought to the table something unique: deep benchmarking expertise and domain knowledge that matched the client's needs. Our ability to quickly understand complex systems, connect the dots across data center operations, and build solutions made us stand apart. This expertise, combined with our willingness to adapt and learn, enabled us to win the contract and take on the responsibility of L1 support for their uptime systems, a task critical to their business continuity.

Early Learning Curve: Building Strong Foundations for SRE

The first few months were not easy. As with any complex system, the uptime infrastructure required us to climb a steep learning curve. We had to quickly grasp:

- How incident workloads function in production.
- The architectural blocks within the inference ecosystem.
- The hosting mechanisms, including the structure of the client's data centers.
- The different ways the system could fail and the potential impact of each failure mode.

Every shift brought new learning opportunities. We immersed ourselves in understanding not just what went wrong, but why it went wrong. Slowly but steadily, our knowledge grew. Each incident became a case study, and each interaction with the client's engineers enriched our understanding. This was the foundation upon which the rest of our success was built.
Shadow-to-Primary: Transitioning to Responsibility

In the beginning, we worked in 24x7 rotational shifts, shadowing the client's engineers, who acted as the primary on-call. Whenever an incident occurred, we would huddle with their team for hours, studying every aspect of the problem. From root causes to resolution steps, we ensured that we not only solved the issue but also understood its overall architectural implications. This approach gave us a top-to-bottom view of the system. We became aware of dependencies, escalation paths, and the critical importance of maintaining near-zero downtime, especially since the client's end customers had strict SLAs.

A few weeks later, roles were reversed. We stepped into the position of primary on-call, while the client's engineers moved into a shadow role. This was a defining moment for us: it was proof of the trust the client had started to place in our abilities. From that point onward, we took ownership of incidents, evaluated dependencies, and escalated to higher-level (L2/L3) teams when necessary. Our timely and correct escalations saved the client from SLA violations in at least two critical cases. By reducing downtime significantly during these incidents, we demonstrated our ability to not only react but also safeguard business continuity.

Innovation: Building Dashboards & Monitoring Tools

As we settled into our responsibilities, we realized that the existing tools were not enough for the kind of proactive monitoring and reporting we envisioned. To bridge this gap, we took the initiative to build custom dashboards that provided visibility and actionable insights.

- Shift Dashboard: displayed current on-call engineers, open issues, resolved cases, and escalations in real time.
- Incident Dashboard: showed day-wise, model-wise, and data-center-wise incident trends, becoming an essential tool for weekly analysis.
- Weekly Summary Dashboard: automatically generated detailed reports of the past week's incidents, including escalation data and issue patterns.

These tools were not part of the original scope, but we believed they were necessary to add value. Over time, they became integral to the client's weekly analysis process, simplifying their workflows and enhancing decision-making.

Continuous Learning & Adapting to Change

Prediction management systems are dynamic by nature. Weekly deployments, new models, and constant updates meant that the environment was never static. We set up processes to stay on top of these changes, ensuring that our knowledge was always current. Regular huddles, review meetings, and knowledge-sharing sessions with the client's engineers became part of our routine. This collaborative approach kept both sides aligned and allowed us to respond quickly to changes in logs, architecture, or deployment practices. Within 5–6 months, we had grown from a team learning the ropes to a confident, trusted partner capable of handling L1 responsibilities independently while also delivering value-added innovations.

Challenges Faced and Overcome

The journey was not without challenges. We encountered:

- New types of incidents: each time we faced something new, we documented the issue and resolution steps, building a repository for future reference.
- Frequent deployments: these required us to stay agile and adapt our processes weekly.
- Multiple models and new data centers: these added layers of complexity to monitoring and incident handling.
- Incident spikes: at times, a single 8-hour shift would see a barrage of incidents. Our on-call engineers handled these calmly, prioritizing issues, escalating appropriately, and ensuring system stability.

Each challenge was an opportunity to refine our processes, strengthen our knowledge, and enhance the value we delivered to the client.
Conclusion: A Journey of Trust and Value

Looking back, what began as a competitive bid against larger players turned into a remarkable journey of trust, growth, and success. In just a few months, we evolved from observers to primary guardians of system reliability. Our contributions went beyond the scope of L1 support:

- We reduced downtime through effective incident management and timely escalations.
- We built custom dashboards that improved visibility, monitoring, and reporting.
- We set up a process of continuous learning and adaptation to keep up with dynamic deployments.
- We documented and standardized incident handling, making future resolutions faster and more reliable.

Most importantly, we became a trusted partner to our client, not just a support team. Our journey showcased that size is no barrier when expertise, dedication, and innovation come together. This success story is a testament to our team's resilience, ability to learn, and determination to deliver value. It reinforces the fact that in today's fast-moving technology landscape, reliability and trust are the cornerstones of any successful partnership.
- Benchmarking Meta Llama 4 Scout on CPU-Only Systems: Performance, Quantization, and Architecture Tuning
Meta's Llama 4 Scout, released in April 2025, is a 17-billion-parameter general-purpose language model that brings powerful reasoning to a broader range of applications, including those running without GPUs. This blog focuses on benchmarking Llama 4 Scout on CPU-only systems, covering:

- Tokens per second
- Latency per token
- Prompt handling efficiency
- Quantization techniques
- Architecture-specific optimization for x86, ARM, and RISC-V (RV64)
- Converting to GGUF format for efficient deployment

Why Benchmark on CPU?

While most LLMs are deployed on GPUs, CPU-only inference is often necessary for:

- Edge devices
- Cloud VMs with no GPU access
- Open hardware ecosystems (e.g., RISC-V)
- Cost-conscious deployments

That makes Llama 4 Scout a strong candidate, especially with quantized variants.

Key Benchmark Metrics

- Tokens/sec: overall throughput, critical for long completions
- Latency/token: time to generate one token; important for chats
- Prompt size sensitivity: how inference speed degrades with longer inputs
- Memory usage: RAM footprint determines if the model can run at all

Why Quantization Is Essential

Quantization reduces the memory and compute requirements of large models. Llama 4 Scout quantized to int4 or int8 can run comfortably on CPUs with 8–16 GB of RAM. Impact on Llama 4 Scout:

- Memory savings: from 34 GB (float16) to ~5–7 GB (int4)
- Speedup: up to 3× faster than float16
- Hardware fit: allows ARM & RV64 CPUs to host inference

Tools like ggml, llama.cpp, and MLC support quantized Llama 4 models, including CPU backends.
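To make the memory figures above concrete, a back-of-the-envelope weight footprint is just parameter count times bits per weight. This sketch is illustrative, not a measurement; real quantized GGUF files differ somewhat, since block scales add metadata and some tensors may stay at higher or lower precision.

```python
def approx_weight_footprint_gb(n_params: float, bits_per_weight: float) -> float:
    """Raw weight storage in GB (1e9 bytes): params * bits / 8."""
    return n_params * bits_per_weight / 8 / 1e9

n = 17e9  # 17B parameters
for label, bits in [("float16", 16), ("int8", 8), ("int4", 4)]:
    print(f"{label}: ~{approx_weight_footprint_gb(n, bits):.1f} GB")
```

This reproduces the ~34 GB float16 figure; plain int4 works out to ~8.5 GB of raw weights, so quoted int4 file sizes below that reflect quantization schemes that push some tensors to fewer effective bits.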
Architecture-Specific Performance Considerations

x86-64 (Intel, AMD)

- Vector support: AVX2 or AVX-512 preferred
- Threading: mature OpenMP and NUMA support
- Performance: high; well optimized for Llama models

ARM (Graviton, Apple Silicon, Neoverse)

- Vector ISA: NEON (128-bit) on all, SVE/SVE2 on newer chips
- Threading: requires tuning due to core heterogeneity
- Quantization: NEON handles int8 and int4 efficiently

Tip: Use taskset and numactl to pin threads for optimal performance.

RISC-V (RV64 with RVV)

- Vector ISA: RISC-V Vector Extension (RVV), variable width
- Quantization: essential; float32 models are impractical on RV64 edge devices
- Tooling: llama.cpp support is experimental but growing

For RV64, memory layout and cache-friendly quantization are critical due to limited bandwidth.

Sample Inference Results (Hypothetical)

| Architecture | Model Variant | Prompt Size | Tokens/sec | RAM Usage |
|---|---|---|---|---|
| x86_64 | Llama 4 Scout int4 | 512 | 11.2 | ~6.5 GB |
| ARM Neoverse | Llama 4 Scout int4 | 512 | 8.7 | ~6.5 GB |
| RISC-V RV64 | Llama 4 Scout int4 | 512 | 3.2 | ~6.5 GB |

These results assume multi-threaded CPU inference with quantized weights using llama.cpp or similar.

From Raw Model to GGUF: Why and How?

To run Meta Llama 4 Scout efficiently on CPU-only systems, especially with tools like llama.cpp, the model must be in GGUF format.

Why Convert to GGUF?

GGUF is a compact, memory-optimized model file format designed for CPU and edge inference using:

- llama.cpp
- mlc-llm
- text-generation-webui

GGUF advantages:

- Memory efficient: packs quantized weights and metadata
- Fast load times: no need to re-tokenize or parse configs
- Metadata preserved: tokenizer, vocab, and model type included
- Simplified use: a single file usable across many tools

How to Convert Llama 4 Scout to GGUF

1. Download the raw model (HF format): get the original model from Hugging Face (e.g., meta-llama/Meta-Llama-4-Scout-17B).
2. Install transformers and the llama.cpp tooling:

```bash
pip install transformers huggingface_hub
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
make
```

3. Run the GGUF conversion script from the llama.cpp/scripts directory:

```bash
python convert.py \
  --outfile llama4-scout.gguf \
  --model meta-llama/Meta-Llama-4-Scout-17B \
  --dtype q4_0
```

4. Load it in your inference tool. Once converted, the .gguf file can be run directly:

```bash
./main -m llama4-scout.gguf -p "Hello, world"
```

GGUF + Quantization = CPU Superpowers

Converting to GGUF lets you quantize during the conversion:

- q4_0, q4_K, q5_1, and q8_0 are supported
- You reduce size dramatically, from ~34 GB to ~5–7 GB for q4
- It ensures compatibility with CPU SIMD instructions like AVX, SVE, or RVV

On RISC-V or ARM boards with limited memory, GGUF + int4 is often the only way to get Llama 4 Scout running at all.

Pro Tip: GGUF Conversion Options

You can fine-tune conversion settings:

- --vocab-type to customize tokenizer structure
- --trust-remote-code if the Hugging Face repo uses custom loading
- --quantize q4_K for better int4 accuracy

Final Thoughts

Meta's Llama 4 Scout is one of the most practical open-source LLMs for CPU inference in 2025. With quantization and SIMD-aware deployment, it can serve:

- Edge applications (IoT, phones)
- Sovereign compute platforms (RISC-V)
- Cloud-native environments without GPUs

If you're interested in pushing the limits of open LLMs on CPU architectures, Llama 4 Scout is one of the best starting points.
- Beyond the Bill: Why Performance Benchmarking is the Secret to Sustainable Cloud Savings
Introduction

In our previous post, How CloudNudge Can Help You Optimize and Manage Your Cloud Expenses, we discussed how visibility is the first step toward financial control. However, for software and hardware engineers, a low cloud bill is a hollow victory if it comes at the cost of system latency. Saving money is great. Saving money without breaking your application is performance engineering.

The Performance-Cost Paradox

The most common mistake in cloud optimization is "blind downsizing": a team sees an underutilized instance and immediately scales it down to a cheaper tier. The result? Unexpected bottlenecks during peak traffic and a degraded user experience. To achieve true efficiency, you must find the sweet spot where cost and performance intersect.

How Whileone Techsoft Validates Your Savings

While CloudNudge identifies where you are overspending, Whileone Techsoft's benchmarking services tell you how low you can go without risking a crash. We bridge the gap through:

- Data-Driven Rightsizing: We use real-world stress tests to ensure that a smaller instance can actually handle your high-concurrency workloads.
- Code Optimization vs. Hardware Scaling: Sometimes the "cost" isn't the server; it's the code. Our benchmarking identifies "performance leaks," allowing you to fix the software rather than paying for more hardware.
- Sustainable Scaling: In line with the CloudFest 2026 theme of "The Sustainability of Everything," we believe the greenest cloud is the one that uses exactly what it needs: nothing more, nothing less.

Conclusion

True cloud management isn't just about cutting costs; it's about maximizing the ROI of every millisecond of compute time. By pairing CloudNudge's visibility with Whileone's performance validation, you aren't just saving money; you're building a leaner, faster, and more sustainable infrastructure.
- RISC-V: Accelerating Software Readiness for Numerical Computing
Introduction to RISC-V and Software Readiness

As RISC-V expands into accelerator domains, software readiness becomes as critical as hardware innovation. This work focuses on implementing a set of mathematical and BLAS primitives for a custom RISC-V architecture. These primitives form foundational building blocks for numerical computing. The implementation includes vector and matrix operations, with careful attention to numerical correctness and floating-point behavior.

Overcoming Challenges and Constraints

A key challenge was the absence of physical hardware. Development was carried out on a remote x86-based system with RISC-V cross-compilation toolchains. Functional validation relied on an emulation-based execution framework that provided pass/fail results against reference outputs. Despite these constraints, the work enabled early software validation, ABI compliance checks, and confidence in algorithmic correctness. This demonstrates a software-first approach to accelerator enablement: early investment in core math primitives significantly accelerates ecosystem readiness for emerging architectures like RISC-V.

Categorizing the Math Core: BLAS Level 1 vs. Level 2

To ensure the custom RISC-V architecture can handle diverse workloads, the implementation was divided into two fundamental tiers of the BLAS (Basic Linear Algebra Subprograms) hierarchy.

Level 1: Vector-Vector Primitives

Level 1 operations are the simplest building blocks. They perform O(n) operations on O(n) data. On a RISC-V architecture, these are often bandwidth-bound: the speed of the operation is limited by how fast the hardware can pull data from memory rather than by the raw speed of the floating-point units.

Level 2: Matrix-Vector and Rank Updates

Level 2 operations are significantly more complex. They perform O(n²) operations on O(n²) data. These routines are the backbone of most band-matrix solvers and are critical for engineering simulations where matrices have a specific structure, such as symmetric or triangular.

The Importance of Mathematical Primitives in RISC-V

Mathematical primitives are essential for efficient numerical computing: they provide the necessary tools for performing complex calculations. With the rise of data-driven applications, the demand for efficient computing solutions is higher than ever, and RISC-V architectures must be equipped with robust mathematical capabilities to meet it.

Enhancing Performance with Optimized Algorithms

Optimizing algorithms is crucial for maximizing performance. By refining mathematical operations, we can reduce computation time and resource usage. This is especially important in environments where speed and efficiency are paramount.

Future Directions for RISC-V in Numerical Computing

Looking ahead, RISC-V has the potential to revolutionize numerical computing. As more developers adopt this architecture, the ecosystem will continue to grow, and collaboration among researchers, engineers, and developers will drive innovation and improve software readiness.

Conclusion: The Path Forward for RISC-V

The journey toward software readiness for RISC-V is ongoing. By focusing on mathematical primitives and optimizing algorithms, we can pave the way for a more efficient future. The RISC-V architecture holds great promise for a range of applications, and continued investment in software development will be key to unlocking its full potential. For more information on RISC-V and its capabilities, visit the RISC-V Foundation.
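The two BLAS tiers described above can be made concrete with minimal pure-Python reference sketches. These are illustrative only, not the project's tuned RISC-V kernels:

```python
# Minimal reference sketches of the two BLAS tiers discussed above.
# Illustrative pure-Python versions, not optimized RISC-V kernels.

def axpy(alpha, x, y):
    """BLAS Level 1 (vector-vector): y := alpha*x + y. O(n) work on O(n) data."""
    return [alpha * xi + yi for xi, yi in zip(x, y)]

def gemv(alpha, A, x, beta, y):
    """BLAS Level 2 (matrix-vector): y := alpha*A*x + beta*y. O(n^2) work."""
    return [
        alpha * sum(a_ij * xj for a_ij, xj in zip(row, x)) + beta * yi
        for row, yi in zip(A, y)
    ]

print(axpy(2.0, [1.0, 2.0], [10.0, 20.0]))  # [12.0, 24.0]
print(gemv(1.0, [[1.0, 0.0], [0.0, 1.0]], [3.0, 4.0], 0.0, [0.0, 0.0]))  # [3.0, 4.0]
```

Note how `axpy` touches each element once (bandwidth-bound), while `gemv` performs a full dot product per output element, which is why Level 2 routines benefit far more from vector units.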
- Chaos Engineering in the Production Stack
Chaos Engineering: Enhancing System Resilience

Chaos engineering is the discipline of intentionally introducing controlled faults to validate system resilience. In any production ecosystem spanning silicon validation, system integration, and software stacks, it helps uncover performance, reliability, and scalability risks long before production deployment.

Understanding Kubernetes Pods

Modern validation and benchmarking workloads increasingly run on Kubernetes. Pods, the smallest deployable units in Kubernetes, encapsulate application containers, runtime dependencies, and resource constraints. This makes them ideal fault-injection targets for testing real-world system behavior.

The Role of Adversarial Agents

Adversarial agents simulate failure conditions such as resource exhaustion, pod restarts, network latency, I/O throttling, or node instability. These agents operate with precision, mimicking realistic stress scenarios across compute, memory, and interconnect layers. By using these agents, organizations can better prepare for unexpected failures.

Chaos Orchestrator: The Heart of Chaos Engineering

A chaos orchestrator coordinates experiments, schedules adversarial actions, collects telemetry, and evaluates system responses across silicon, system, and software boundaries. This orchestration is crucial for effective chaos engineering.

Architectural Overview: The Autonomous Feedback Loop

[Block diagram: setup for introducing chaos]

The implemented architecture establishes a closed-loop system where failure injection is not a static script but a dynamic response to the system's current state. At the core of this setup is the Chaos Orchestrator, which functions as the decision-making "brain" by interacting with two critical APIs:

- The Monitoring Provider
- The Chaos Orchestrator

The flow begins with the agent ingesting telemetry to define a "state": a snapshot of the environment's health, including latency percentiles and error rates.
Based on this state, the agent's internal reinforcement learning model selects an adversarial action designed to maximize system stress. This action is then translated into a Kubernetes Custom Resource, which the Chaos Mesh controller executes against the target microservices. This effectively bridges the gap between abstract AI logic and physical infrastructure manipulation.

Benefits of Chaos Engineering

Chaos engineering offers several benefits that can enhance system resilience:

- Proactive Identification of Weaknesses: By simulating real-world failures, organizations can identify potential weaknesses in their systems before they lead to significant issues.
- Improved System Reliability: Regular chaos testing helps ensure that systems can handle unexpected failures, leading to improved reliability in production environments.
- Enhanced Team Collaboration: Chaos engineering fosters a culture of collaboration among development, operations, and quality assurance teams. This shared responsibility enhances overall system health.
- Data-Driven Decision Making: The insights gained from chaos experiments enable data-driven decisions regarding system architecture and design.

Conclusion: The Future of Resilience Testing

This closed-loop chaos engineering architecture transforms resilience testing from a predefined exercise into an adaptive, intelligence-driven process. By continuously observing system behavior, learning from real-time telemetry, and dynamically selecting adversarial actions, the Chaos Orchestrator ensures that stress scenarios remain both realistic and impactful. The tight integration between monitoring, decision-making, and fault execution enables deeper visibility into failure modes that span silicon, system, and software layers. As a result, organizations can move beyond reactive validation toward proactive robustness, identifying performance bottlenecks, reliability risks, and recovery gaps early in the lifecycle.
Ultimately, this lays the foundation for building production-grade platforms and cloud-native systems that are not only functional under ideal conditions but resilient under real-world uncertainty. Embracing chaos engineering is essential for any organization aiming to thrive in today's complex digital landscape.
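The closed feedback loop described in this architecture can be sketched in a few lines. All class and method names below are illustrative (not a real Chaos Mesh client library), and the custom-resource dict only approximates Chaos Mesh's v1alpha1 shape:

```python
# Illustrative sketch of the observe -> decide -> inject loop described above.
# Names and the resource shape are hypothetical approximations.
import random

class ChaosOrchestrator:
    """Reduces telemetry to a state, picks an adversarial action, emits a fault spec."""

    def observe(self, telemetry):
        # Collapse raw telemetry into a coarse "state" snapshot of system health.
        return {
            "p99_latency_ms": telemetry["p99_latency_ms"],
            "error_rate": telemetry["error_rate"],
        }

    def select_action(self, state):
        # Stand-in for the RL policy: add network stress when the system looks
        # healthy, otherwise probe recovery paths with pod-level faults.
        if state["p99_latency_ms"] < 200:
            return "network-delay"
        return random.choice(["pod-kill", "memory-stress"])

    def to_custom_resource(self, action, target_app):
        # Translate the chosen action into a Chaos-Mesh-style custom resource.
        return {
            "apiVersion": "chaos-mesh.org/v1alpha1",
            "kind": "NetworkChaos" if action == "network-delay" else "PodChaos",
            "spec": {
                "action": action,
                "selector": {"labelSelectors": {"app": target_app}},
            },
        }

orch = ChaosOrchestrator()
state = orch.observe({"p99_latency_ms": 120, "error_rate": 0.002})
action = orch.select_action(state)
cr = orch.to_custom_resource(action, "checkout")
print(cr["kind"])  # NetworkChaos (p99 of 120 ms is under the 200 ms threshold)
```

In a real deployment, the emitted resource would be applied to the cluster for the Chaos Mesh controller to execute, closing the loop between the AI policy and the infrastructure.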
- Boost Software Efficiency with Software Performance Optimization
Software efficiency is more critical than ever. Users expect applications to be fast, reliable, and scalable. Achieving this requires more than just writing clean code; it demands a strategic approach known as software performance optimization. This process ensures that software not only meets functional requirements but also performs optimally under various conditions. By focusing on performance engineering, businesses can deliver superior user experiences, reduce operational costs, and stay competitive.

Understanding Software Performance Optimization

Software performance optimization involves analyzing and improving the speed, responsiveness, and stability of software applications. It covers a wide range of activities, from identifying bottlenecks in code to optimizing system architecture and infrastructure. The goal is to ensure that software runs efficiently, even under heavy loads or complex operations. Key aspects include identifying code-level bottlenecks, tuning system architecture, and optimizing the underlying infrastructure. By focusing on these areas, developers and engineers can create software that not only meets but exceeds user expectations.

The Role of Performance Engineering in Software Optimization

Performance engineering is a proactive approach that integrates performance considerations throughout the software development lifecycle. Unlike traditional testing, which often focuses on functionality, performance engineering emphasizes early detection and resolution of performance issues. This approach includes:

- Performance Modeling: Predicting how software will behave under different conditions.
- Load Testing: Simulating real-world usage to identify potential bottlenecks.
- Profiling and Monitoring: Continuously tracking software performance to detect anomalies.
- Optimization Techniques: Applying code refactoring, caching, and database tuning to enhance efficiency.

One of the significant benefits of performance engineering is its ability to reduce costly post-release fixes.
By addressing performance early, teams can avoid delays and improve overall software quality. For organizations looking to enhance their software's efficiency, partnering with performance engineering services can provide expert guidance and tailored solutions.

Practical Strategies to Boost Software Efficiency

Implementing software performance optimization requires a combination of best practices and tools. Here are some actionable strategies to consider:

- Early Performance Testing: Integrate performance tests into the development process from the start to catch issues early.
- Optimize Algorithms: Use efficient algorithms and data structures to reduce processing time.
- Implement Caching: Store frequently accessed data temporarily to reduce database hits.
- Minimize Network Calls: Reduce the number of requests between client and server to lower latency.
- Use Asynchronous Processing: Allow non-critical tasks to run in the background without blocking user interactions.
- Monitor and Analyze Logs: Regularly review logs to identify and troubleshoot performance problems.
- Automate Performance Testing: Use automated tools to run tests consistently and quickly.

By applying these strategies, teams can significantly improve software responsiveness and user satisfaction.

Enhancing Software Efficiency for Long-Term Success

Optimizing software performance is not a one-time task but an ongoing commitment. By adopting a performance engineering mindset and leveraging specialized services, businesses can ensure their software remains fast, reliable, and scalable. This leads to happier users, lower operational costs, and a stronger competitive edge. Investing in software performance optimization today sets the foundation for future growth and innovation. Whether you are developing new applications or maintaining existing ones, prioritizing performance will pay dividends in the long run.
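As a concrete illustration of the caching strategy listed above, Python's standard-library `functools.lru_cache` can memoize an expensive call so repeated requests skip the backend entirely. The function and data here are hypothetical stand-ins:

```python
# Sketch of the caching strategy: memoize an expensive lookup so that
# repeated calls with the same key never hit the slow backend again.
from functools import lru_cache

CALLS = {"count": 0}  # tracks how many times the "backend" was actually hit

@lru_cache(maxsize=256)
def expensive_lookup(key):
    # Stand-in for a slow database or network call.
    CALLS["count"] += 1
    return key.upper()

print(expensive_lookup("region-1"))  # computed: "REGION-1"
print(expensive_lookup("region-1"))  # served from cache: "REGION-1"
print(CALLS["count"])                # 1 -- the second call never reached the backend
```

The same principle applies at every layer: HTTP caches, database query caches, and CDN edges all trade a small amount of memory for a large reduction in repeated work.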
Frequently Asked Questions (FAQs)

What is the difference between performance testing and performance engineering?

Performance testing focuses on evaluating how a system behaves under specific conditions, usually at later stages of development. Performance engineering, on the other hand, is a proactive approach that integrates performance considerations throughout the entire software development lifecycle to prevent issues before they occur.

When should performance engineering be implemented in the SDLC?

Performance engineering should begin in the early design and architecture phase of the SDLC. Integrating performance modeling, early load testing, and continuous monitoring from the start helps reduce costly post-release fixes and ensures scalability.

How does performance engineering improve software scalability?

Performance engineering improves scalability by identifying bottlenecks early, optimizing algorithms, reducing latency, implementing caching strategies, and continuously monitoring system behavior. This ensures applications can handle increasing workloads efficiently without compromising speed or stability.
- Stop Starting, Start Resuming: Quickly Starting Docker Containers
Cold-starting Docker containers is expensive. Before a Dockerized application does anything useful, it pulls images, initializes the runtime, loads classes or modules, allocates memory, opens files and sockets, and slowly warms into a steady operating state. In modern infrastructure, this cost shows up everywhere: pod restarts, scale-outs, rollouts, autoscaling events. Each time, the same warm-up work is paid for again.

Capture and restore offers a different idea: instead of starting containers from scratch, we can resume them. When a container's state is frozen, the entire running container is captured at a specific moment. The processes inside it, their memory, threads, and execution state are frozen and written to disk. Restoring the container brings it back exactly where it left off. The container does not rerun initialization code or repeat warm-up logic; execution continues from the point where it was paused. From the application's perspective, this feels less like a restart and more like waking up after a brief pause.

The value for Docker environments is straightforward. Containers are often most expensive right after they start. If a container is already warm, stable, and ready to serve traffic, why throw that away? By saving a warm image at the right time, infrastructure pays the cost of warm-up once and reuses it many times. New containers can appear already prepared to handle work.

Some parts of execution still live outside the container boundary. CPU caches and branch predictors are rebuilt naturally after restore. Scheduling history is lost. Time moves forward while the container is paused, even if the process itself did not experience that passage. Network connections may persist, but the systems on the other end were never frozen. These limits are inherent, but they don't weaken the restore model; they simply define its boundaries, and for fast startup and readiness they rarely matter.
Most systems care far more about preserving progress and avoiding repeated work than about recreating a perfectly identical moment in time. State preservation isn't about recreating the universe; it's about resuming useful work with minimal friction.

And it raises an interesting question for the future: what if containers did not just start quickly, but started already warm? We're not there yet. But even today, resuming from captures lets systems bend time in useful ways by choosing when warm-up work happens and when it doesn't have to happen again.
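The capture/restore flow described above maps onto Docker's experimental, CRIU-backed `docker checkpoint` commands (the daemon must run with experimental features enabled). A small Python sketch that builds the two commands; the container and checkpoint names are hypothetical:

```python
# Sketch of the capture/restore flow via Docker's experimental checkpoint
# feature. Container name "web-1" and checkpoint name "warm" are examples.

def checkpoint_cmd(container, checkpoint):
    # Freeze the running container's process state and write it to disk.
    return ["docker", "checkpoint", "create", container, checkpoint]

def resume_cmd(container, checkpoint):
    # Start the container from the saved state instead of cold-booting it.
    return ["docker", "start", "--checkpoint", checkpoint, container]

print(" ".join(checkpoint_cmd("web-1", "warm")))
# docker checkpoint create web-1 warm
print(" ".join(resume_cmd("web-1", "warm")))
# docker start --checkpoint warm web-1
```

In practice these command lists would be handed to `subprocess.run`; the key point is that the warm-up cost is paid once at checkpoint time and skipped on every subsequent resume.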
- The Evolution of Software Performance with Agentic AI
From Tuner to Governor

This shift to agentic benchmarking means a developer is evolving from a tuner of loops into a governor of constraints. Interfacing with these agents is not just using a sophisticated profiler; it is collaborating with entities that can autonomously navigate a running system, identify deep-seated concurrency flaws, and propose architectural optimizations that would take a human team weeks to uncover. The focus can now remain on the "why" (SLAs and user experience) and the "what" (scalability targets), while the agents handle the "how" (caching strategies, database indexing, and memory management). It is a liberating experience that allows for optimizing at the speed of thought, turning a laptop into a simulation lab capable of stress-testing world-class software in a fraction of the time.

The Weight of Autonomous Optimization

Yet this sudden surge in efficiency brings a sobering realization about the nature of the craft. When agents can autonomously refactor code to squeeze out every microsecond, the potential for unforeseen regressions scales just as fast as the throughput. Always remember: with great power comes great responsibility. In the age of AI, the definition of the benchmark is the definition of the product's destiny.

I. Performance vs. Integrity

These SLOs prevent the agent from sacrificing correctness for speed.

Latency with Freshness Constraints:
- The Rule: Do not just set a target of "200 ms response time."
- The AI-Proof SLO: "99% of requests must be served within 200 ms, provided the data served is no older than 5 seconds."
- Why: This prevents the agent from implementing aggressive, stale caching strategies just to hit the speed target.

Throughput with Error Budget Coupling:
- The Rule: Maintain 10,000 RPS (requests per second).
- The AI-Proof SLO: "Maintain 10,000 RPS with a non-retriable error rate of < 0.1%."
- Why: An agent might drop complex requests to keep the request counter moving fast.
Coupling throughput with error rates forces it to process the hard tasks, too.

Functional Correctness Tests:
- The Rule: The API returns a 200 OK status.
- The AI-Proof SLO: "99.9% of responses must pass a checksum or schema validation test."
- Why: Agents optimizing code might accidentally simplify logic in ways that produce empty but "successful" (200 OK) responses.

II. Speed vs. Cost

AI agents often treat computing resources as infinite unless told otherwise.

III. Tail Latency

AI optimization often targets the average (P50) to look good on charts, ignoring the unhappy users at the outliers.

The P99.9 Variance Limit:
- The SLO: "The gap between P50 (median) and P99 latency must not exceed 3x."
- Why: This forces the agent to optimize the entire codebase, including edge cases, rather than just the "happy path" code.

Cold Start Constraints:
- The SLO: "First-byte latency after more than 5 minutes of inactivity must be < 500 ms."
- Why: Prevents the agent from optimizing run-time performance while ignoring startup and initialization heavy lifting.

IV. The Deployment Safety Net

When agents write and deploy code autonomously, the rollback strategy is your last line of defense.

Regression Tolerance:

Summary Table: The Human vs. The AI-Proof Approach

We are building such tools to make our customers' lives easier while enabling them to deploy faster.
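The "AI-proof" SLOs above can be expressed as executable checks rather than prose, so an agent's output is gated mechanically. A minimal sketch; the metric names and snapshot values are illustrative, not a production SLO framework:

```python
# Executable sketch of the coupled SLO checks described above.
# Metric names and thresholds are illustrative examples.

def check_slos(m):
    """Return a dict of SLO name -> pass/fail for one metrics snapshot."""
    return {
        # Latency with freshness: fast responses must also serve fresh data.
        "latency_freshness": m["p99_latency_ms"] <= 200 and m["max_staleness_s"] <= 5,
        # Throughput coupled to an error budget: speed can't hide dropped work.
        "throughput_errors": m["rps"] >= 10_000 and m["error_rate"] < 0.001,
        # Tail-variance limit: P99 may not exceed 3x the median.
        "tail_variance": m["p99_latency_ms"] <= 3 * m["p50_latency_ms"],
    }

snapshot = {
    "p99_latency_ms": 180, "max_staleness_s": 3,
    "rps": 12_000, "error_rate": 0.0005,
    "p50_latency_ms": 70,
}
print(all(check_slos(snapshot).values()))  # True -- every coupled SLO holds
```

Because each check couples a speed target with a correctness or fairness condition, an agent cannot satisfy one half by sacrificing the other, which is exactly the property the rules above are designed to enforce.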












