- The Architecture of Speed: Agentic AI & The New Era of Performance
The world is currently witnessing the most significant shift in the history of software performance. For decades, the life of a performance engineer was defined by the struggle against invisible bottlenecks and the constant interpretation of cryptic flame graphs. Optimization was often manual labor of intuition: ninety percent of the time spent instrumenting and reproducing race conditions, and only ten percent on actual resolution. Today, with agentic AI, that paradigm has flipped. It no longer feels like simply staring at dashboards; it feels like orchestrating a symphony of intelligent agents that understand latency and throughput as clearly as the architect does. The ability to take a hyper-scale, sub-millisecond application from a single conceptual spark is quite literally a power now held in the palm of the hand.

From Tuner to Governor
This shift to agentic benchmarking means a developer is evolving from a tuner of loops into a governor of constraints. Interfacing with these agents is not just using a sophisticated profiler; it is collaborating with entities that can autonomously navigate a running system, identify deep-seated concurrency flaws, and propose architectural optimizations that would take a human team weeks to uncover. The focus can now remain on the "why" (SLAs and user experience) and the "what" (scalability targets), while the agents handle the "how" (caching strategies, database indexing, and memory management). It is a liberating experience that allows for optimizing at the speed of thought, turning a laptop into a simulation lab capable of stress-testing world-class software in a fraction of the time.

The Weight of Autonomous Optimization
Yet this sudden surge in efficiency brings a sobering realization about the nature of the craft. When agents can autonomously refactor code to squeeze out every microsecond, the potential for unforeseen regressions scales just as fast as the throughput. Always remember: with great power comes great responsibility. In the age of AI, the definition of the benchmark is the definition of the product's destiny.

What would a good checklist of AI-based SLOs look like?

I. Performance vs. Integrity
These SLOs prevent the agent from sacrificing correctness for speed.
- Latency with Freshness Constraints. The Rule: do not just set a target of "200ms response time." The AI-Proof SLO: "99% of requests must be served within 200ms, provided the data served is no older than 5 seconds." Why: this prevents the agent from implementing aggressive, stale caching strategies just to hit the speed target.
- Throughput with Error Budget Coupling. The Rule: maintain 10,000 RPS (requests per second). The AI-Proof SLO: "Maintain 10,000 RPS with a non-retriable error rate of < 0.1%." Why: an agent might drop complex requests to keep the request counter moving fast. Coupling throughput with error rates forces it to process the hard tasks, too.
- Functional Correctness Tests. The Rule: the API returns a 200 OK status. The AI-Proof SLO: "99.9% of responses must pass a checksum or schema validation test." Why: agents optimizing code might accidentally simplify logic, producing empty but "successful" (200 OK) responses.

II. Speed vs. Cost
AI agents often treat computing resources as infinite unless told otherwise.

III. Tail Latency
AI optimization often targets the average (P50) to look good on charts, ignoring the unhappy users at the outliers.
- The P99.9 Variance Limit. The SLO: "The gap between P50 (median) and P99 latency must not exceed 3x." Why: this forces the agent to optimize the entire codebase, including edge cases, rather than just the "happy path" code.
- Cold Start Constraints. The SLO: "First-byte latency after more than 5 minutes of inactivity must be < 500ms." Why: prevents the agent from optimizing run-time performance while ignoring startup/initialization heavy lifting.

IV. The Deployment Safety Net
When agents write and deploy code autonomously, the rollback strategy is your last line of defense.
- Regression Tolerance.

Summary Table: The Human vs. The AI-Proof Approach
We are building such tools to make life easier for our customers while, at the same time, deploying faster. A minimal sketch of how a couple of these AI-proof SLOs could be evaluated over request logs follows.
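As a rough illustration (not part of the original checklist), here is a minimal Python sketch of how the first two AI-proof SLOs might be checked against a batch of request records. The record fields (latency_ms, data_age_s, error, retriable) and the default thresholds are assumptions chosen to mirror the numbers above.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Request:
    latency_ms: float   # end-to-end response time
    data_age_s: float   # age of the data that was served
    error: bool         # the request failed
    retriable: bool     # the failure was retriable (timeouts, 429s, ...)

def latency_freshness_slo(reqs: List[Request],
                          max_latency_ms: float = 200.0,
                          max_age_s: float = 5.0,
                          target: float = 0.99) -> bool:
    """99% of requests within 200 ms AND served data no older than 5 seconds."""
    good = sum(1 for r in reqs
               if r.latency_ms <= max_latency_ms and r.data_age_s <= max_age_s)
    return good / len(reqs) >= target

def throughput_error_slo(reqs: List[Request],
                         window_s: float,
                         min_rps: float = 10_000.0,
                         max_error_rate: float = 0.001) -> bool:
    """Sustain 10,000 RPS while keeping non-retriable errors under 0.1%."""
    rps = len(reqs) / window_s
    hard_errors = sum(1 for r in reqs if r.error and not r.retriable)
    return rps >= min_rps and hard_errors / len(reqs) <= max_error_rate
```

Coupling the freshness and error-rate conditions into the same pass/fail function is the point: an agent cannot satisfy the check by caching stale data or shedding hard requests.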
- From Innovation to Impact: Aligning ER&D with Marketing and Sales
Engineering R&D in a Changing Landscape
Engineering Research and Development has always been at the heart of innovation. But today, its role is evolving rapidly. What was once primarily about pushing technical boundaries is now equally about speed, efficiency, and alignment with business outcomes. As industries grow more complex and interconnected, Engineering R&D teams are being asked to deliver faster, smarter, and with fewer margins for error. From a marketing and sales point of view, this evolution changes how Engineering R&D capabilities must be understood, positioned, and communicated to the market.

From Technical Excellence to Market Expectations
While closely tracking industry developments and Engineering R&D narratives, a few consistent patterns stand out. First, customization is increasingly replacing generalization. The rise of AI and edge computing has made it clear that one-size-fits-all systems are no longer sufficient. Workloads are becoming more domain-specific, driving the need for tailored accelerators, customized architectures, and memory hierarchies designed with specific use cases in mind. This shift has implications far beyond silicon—it affects system design, validation strategies, and long-term software support.

Second, while hardware capability continues to advance, software readiness often determines whether that capability translates into real-world success. Performance alone is no longer the finish line. Debuggability, observability, and the ability to tune and validate software efficiently play a critical role in how usable and scalable a system actually is. In many cases, limitations are not discovered at design time, but much later, when systems are already expected to perform in production environments.

Observing these gaps has shaped how I think about Engineering R&D as a whole. Challenges rarely exist at just one layer of the stack. They span from silicon to systems to software, and addressing them in isolation often leads to short-term fixes rather than sustainable solutions. This is where a cross-layer understanding becomes important—not just to identify performance issues or validation gaps, but to understand how decisions at one level ripple across the entire system. In my experience, many players in the ecosystem tend to focus on specific parts of this chain. Some excel at silicon, others at systems, and others at software. What stood out to me was the importance of approaching these challenges with a connected mindset—one that recognizes how benchmarking, bare-metal testing, software porting, and validation inform each other rather than operate independently.

Market Research Beyond Strategy
But market research doesn't stop at identifying technical gaps. One of the biggest learnings for me has been realizing how much research influences communication, not just strategy. When communicating with the market, context matters. Messaging that resonates with an engineer often differs significantly from what resonates with a business leader. Over time, I've learned that effective communication depends on understanding who you're speaking to, what problems matter most to them, and how they frame success. Research helps guide not only what we say, but how we say it, whether through email, direct conversations, or broader content.

Turning Insight into Impact
Knowing which channels to use, the language that makes concepts accessible, and how to connect technical capabilities to real pain points is just as critical as understanding the technology itself.
In many ways, communication becomes the final bridge between insight and impact. As Engineering R&D continues to evolve, market research serves as a compass not just for identifying where the industry is headed, but for shaping how we think, build, and communicate along the way.
- Porting Math Primitives on a Custom RISC-V
Introduction
As RISC-V expands into accelerator domains, software readiness becomes as critical as hardware innovation. This work focused on implementing a set of mathematical and BLAS primitives for a custom RISC-V architecture, forming foundational building blocks for numerical computing. The implementation included vector and matrix operations with careful attention to numerical correctness and floating-point behavior.

Challenges and Constraints
A key challenge was the absence of physical hardware. Development was carried out using a remote x86-based system with RISC-V cross-compilation toolchains, while functional validation relied on an emulation-based execution framework that provided pass/fail results against reference outputs. Despite these constraints, the work enabled early software validation, ABI compliance checks, and confidence in algorithmic correctness, demonstrating a software-first approach to accelerator enablement. Early investment in core math primitives significantly accelerates ecosystem readiness for emerging architectures like RISC-V.

Categorizing the Math Core: BLAS Level 1 vs. Level 2
To ensure the custom RISC-V core can handle diverse workloads, the implementation was divided into two fundamental tiers of the BLAS (Basic Linear Algebra Subprograms) hierarchy.

Level 1: Vector-Vector Primitives
Level 1 operations are the simplest building blocks. They perform O(n) operations on O(n) data. On a RISC-V architecture, these are often "bandwidth-bound," meaning the speed of the operation is limited by how fast the hardware can pull data from memory rather than by the raw speed of the floating-point units.

Level 2: Matrix-Vector & Rank Updates
Level 2 operations are significantly more complex, performing O(n²) operations on O(n²) data. These routines are the backbone of most band-matrix solvers and are critical for engineering simulations where matrices have a specific structure (like symmetric or triangular). A small reference sketch of one routine from each level follows.
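To make the Level 1 vs. Level 2 distinction concrete, here is a minimal Python/NumPy sketch of one routine from each tier (AXPY and GEMV). This is a generic golden-model illustration rather than the actual port described above; the pass/fail assertion only mirrors the spirit of the emulation-based validation flow.

```python
import numpy as np

def axpy(alpha: float, x: np.ndarray, y: np.ndarray) -> np.ndarray:
    """BLAS Level 1 (vector-vector): y <- alpha*x + y, O(n) work on O(n) data."""
    return alpha * x + y

def gemv(alpha: float, A: np.ndarray, x: np.ndarray,
         beta: float, y: np.ndarray) -> np.ndarray:
    """BLAS Level 2 (matrix-vector): y <- alpha*A@x + beta*y, O(n^2) work."""
    return alpha * (A @ x) + beta * y

# Pass/fail check against a reference output, in the spirit of the
# emulation-based validation framework described above.
rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4))
x, y = rng.standard_normal(4), rng.standard_normal(4)
assert np.allclose(gemv(2.0, A, x, 0.5, y), 2.0 * A @ x + 0.5 * y, rtol=1e-6)
```

A production port would implement these loops in C or assembly against the target's vector extension; the NumPy form simply serves as the numerical reference to compare against.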
- How to Leverage Figma Like a Pro for Modern UI/UX Design
In today’s fast-paced product landscape, design teams need tools that are fast, collaborative, and adaptable. Figma has become the go-to platform for designers because it does exactly that—and more. As someone who works closely with design, I’ve seen firsthand how using Figma the right way can transform the entire workflow, from ideation to delivery. Here’s how you can take full advantage of Figma and elevate your design process.

1. Start With Clear Structure: Pages, Frames & Naming
Design becomes chaotic without structure. Figma makes it easy to organize your work if you:
- Break your project into separate pages (Wireframes, UI Screens, Components, Prototypes, etc.)
- Use meaningful frame names like Home – Logged In, Dashboard – Empty State, etc.
- Follow consistent naming conventions for layers.
A tidy file keeps teams aligned and improves handoff for developers.

2. Build a Strong Design System Early
Figma truly shines when you use components, variants, color styles, and typography styles. A solid design system helps you:
- Maintain UI consistency
- Reduce repetitive work
- Update global changes instantly
- Speed up onboarding for new designers
Whether it's a simple style guide or a full-blown atomic system, Figma makes it scalable.

3. Collaborate in Real Time
One of Figma’s biggest advantages is live collaboration. You can:
- Brainstorm together using FigJam
- Review designs with stakeholders in real time
- Leave comments directly on components
- Avoid endless file versions like Final_v4_revised2.fig
This drastically reduces communication gaps and accelerates decision-making.

4. Use Auto-Layout for Responsive, Smart Design
Auto-Layout is a game changer. It allows you to create designs that automatically adapt to content changes. Buttons resize, cards expand, and layouts adjust intelligently. This saves hours of manual adjustment and makes your designs feel more like real UI behavior.

5. Create Interactive Prototypes Easily
With Figma’s prototyping tools, you can showcase:
- Realistic user flows
- Micro-interactions
- Transitions and animations
- Mobile, tablet, and desktop behaviors
These prototypes help clients and developers understand functionality long before development starts.

6. Take Advantage of Plugins
Figma has an enormous plugin ecosystem that can speed up your work:
- Iconify – access thousands of icons
- Content Reel – generate text, avatars, and data
- Autoflow – create quick user flows
- Mockup plugins – place screens inside device frames
Plugins reduce manual effort and improve efficiency instantly.

7. Use Version Control and Component Libraries
Figma’s built-in version history lets you:
- Track changes
- Revert mistakes
- Maintain clean progress
Shared libraries help teams collaborate across multiple files without duplicating components.

8. Simplify Handoff With Inspect Mode
Developers can easily view:
- Exact CSS properties
- Spacing and sizing
- Assets ready for export
- Design tokens
Figma removes the typical “design-to-development gap” and ensures accurate implementation.

Final Thoughts
Figma isn’t just a design tool; it’s a complete ecosystem for UI/UX design, collaboration, prototyping, and delivery. When used strategically, it upgrades the entire workflow and helps teams deliver polished, user-centered products faster. If you want to design smarter, not harder, Figma is the tool to master.
- QEMU vs. FPGA: Understanding the Differences in Emulating and Prototyping Any ISA
With the evolution of hardware design and development, two tools have become fundamental for those working on Instruction Set Architectures (ISAs): QEMU and FPGA boards. Although both serve as key resources for developing, testing, and experimenting with different ISAs (such as RISC-V, ARM, x86, etc.), they operate in significantly different ways. This blog highlights the key distinctions between QEMU and FPGA boards and their use cases across various architectures.

Key Features of QEMU Across Architectures:
- Ease of Use: QEMU can be installed on standard systems (PCs or servers), enabling developers to work with different ISAs without needing specific hardware.
- Cost-Effective: As a free, open-source tool, QEMU provides a cost-effective solution for developers to emulate a wide range of ISAs.
- Software Emulation: QEMU simulates the target architecture’s instruction set, allowing developers to test code configurations and features of multiple ISAs without hardware limitations.

What are FPGA Boards?
FPGA (Field-Programmable Gate Array) boards are hardware devices designed to prototype and implement specific ISA designs at the hardware level. Unlike software emulation, FPGAs provide real-world testing platforms where developers can configure the architecture and observe its behavior in real time.

Key Features of FPGA Boards for Any Architecture:
- Hardware Prototyping: FPGAs allow the implementation of ISA-specific designs (e.g., RISC-V, ARM), providing accurate insights into the performance and real-time behavior of the hardware.
- Customization: FPGAs offer highly customizable environments where users can configure the hardware to match their specific ISA requirements and experiment with different core designs.
- Real-Time Processing: Since FPGAs execute instructions at the hardware level, they deliver real-time processing capabilities. This makes them ideal for applications that require low-latency response and performance tuning.
- Scalability: FPGA boards can scale to support various ISA implementations, ranging from simple cores to complex multi-core architectures.

Speed and Runtime Limitations of FPGA Boards
While FPGA boards support high clock frequencies (up to 300-400 MHz or more in certain designs), real-world performance is often constrained by factors like routing complexity, timing constraints, and resource usage. Achieving clock speeds consistently above 100 MHz can be challenging for complex designs. Hardware engineers often employ iterative cycles of compiling, testing, and optimizing clock speeds to reach desired performance levels. Additionally, runtime limitations on FPGA boards include constraints like memory bandwidth and resource bottlenecks, which can affect performance. Strategies such as pipelining, partitioning, and efficient resource management are often necessary to optimize designs for different ISAs.

Use Cases and Applications
QEMU is best suited for software engineers who need to test applications and firmware targeting different ISAs in a virtualized environment. Whatever the target architecture, QEMU provides a safe, cost-free platform for debugging and simulation. It is ideal for early-stage development where physical hardware is not necessary. FPGA boards, on the other hand, are invaluable for hardware engineers and researchers who need to prototype and verify ISA designs in real-world conditions. For example, if you are developing a custom RISC-V core or tuning an ARM design for a specific use case, FPGA boards allow you to test performance, latency, and resource utilization in a physical setting. The insights gained here are crucial for final hardware implementation. A short sketch of driving QEMU from a test script is included after the comparison below.

Comparing QEMU and FPGA Boards
Both QEMU and FPGA boards provide critical support for ISA development, but they serve different purposes. The choice between the two depends on whether you are focused on software or hardware development.

| Aspect | QEMU | FPGA Boards |
| --- | --- | --- |
| Nature | Software-based emulation | Hardware-based prototyping |
| Cost | Free and open-source | Requires investment in FPGA hardware |
| Setup Time | Quick setup on a standard PC | Requires hardware setup and configuration |
| Performance | Limited by host system capabilities | Real-time performance based on hardware design |
| Flexibility | Flexible software environment | Hardware customization based on project needs |
| Network Capabilities | Full network support and integration | Historically limited, with newer boards supporting it |
| Use Cases | Software testing, debugging, simulation | Hardware prototyping, real-world performance analysis |
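As a small illustration of the QEMU software-testing workflow (not from the original post), the sketch below runs a cross-compiled RISC-V binary under QEMU's user-mode emulator from a Python test harness and compares its output against a reference. The binary path and expected output are placeholders, and qemu-riscv64 (from the qemu-user package) is assumed to be installed on the host.

```python
import subprocess

# Hypothetical paths: a statically linked, cross-compiled RISC-V test binary
# and the output it is expected to produce. For dynamically linked binaries,
# pass the sysroot with "-L <sysroot>".
BINARY = "./build/riscv64/vector_add_test"
EXPECTED = "PASS\n"

def run_under_qemu(binary: str) -> str:
    """Run a cross-compiled RISC-V Linux binary under QEMU user-mode emulation."""
    result = subprocess.run(
        ["qemu-riscv64", binary],
        capture_output=True, text=True, timeout=60, check=True,
    )
    return result.stdout

if __name__ == "__main__":
    out = run_under_qemu(BINARY)
    print("PASS" if out == EXPECTED else "FAIL")
```

The same harness could later be pointed at an FPGA prototype over SSH or a serial console, which is one reason emulation-first test scripts transfer well to hardware bring-up.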
- Key Practices for Effective Full-stack Web Development
Developer experience
The full-stack development team spends a significant amount of time writing code, and a good developer experience (DX) implies greatly improved developer productivity. Some ways to improve the DX, and thereby the quality of life and hence the productivity, include:
- Setting up eslint/tslint/prettier so that the IDE can take care of mundane tasks like enforcing code formatting, highlighting possible code quality issues, and enabling early bug detection.
- Integrating eslint/tslint/prettier within CI/CD.
- Using tools like Storybook with Chromatic, enabling UI/UX reviews early on.
- Basic documentation that enables an easy onboarding process, viz. instructions on setting up projects for development in a local environment, database structure, API documentation, and mock data/API sandboxes.
- Depending on the project, if appropriate, adopting test-driven development.

Dependencies & Code maintainability
Choosing dependencies is crucial. Using libraries that are no longer maintained might limit upgrading core frameworks in the near future and might require forking repos to update those libraries before updating the core frameworks. Another issue with obsolete libraries is the security risk. From a productivity perspective, some libraries do not actively host documentation, and new developers find it difficult to use them by referring to code snippets alone. Furthermore, libraries without type safety are extremely difficult for new developers to use without documentation. As far as possible, long-term, stable libraries should be used; if only small fractions of a library are required, consider creating in-house libraries instead. Long-term maintainability with a large number of external dependencies can turn into a maintenance nightmare.

Team culture
Having a positive culture that facilitates open communication, respects individual differences, appreciates contributions, and makes people feel valued has a large impact on how products are built. Code reviews should be something developers look forward to as a chance to learn, instead of a dreadful process.

Design system
For frontends, having a design system, or components that can easily be abstracted into a library, can have a large impact on costs in the long term. When a product pivots, or when companies launch new products, the component libraries can easily be reused across products. As such, having a design system that facilitates themes can greatly reduce the repetitive effort of styling and maintaining component libraries.

Shipping often
Shipping often and early enables early feedback. In this agile world, regular shipping enables stakeholder participation, identifies issues early, and makes sure that the developers are aligned with the product goals. If releasing to production isn’t an option, having multiple environments can enable internal stakeholder feedback.
- Revamping Full-stack Projects: Strategies for Long-term Success
In the agile world, with dynamic requirements, large projects might need a revamp for numerous reasons. The need for a revamp might arise to improve performance and thereby user experience, to upgrade libraries that have security vulnerabilities, or to remove libraries that are no longer supported and port to new/better ones. Rebranding might require revamping as well. As products scale, cost optimization might need revamping. Revamps are also frequently driven by a need to merge multiple applications into a single app, usually internal-facing apps. With multiple such reasons overlapping, my team and I had to revamp some projects in the last few years. Here are some of our learnings and the challenges we faced:
- Moving to a design system
- Business temptations for shipping optics and associated operational hazards
- Content architecture vs design changes vs simultaneous content and design changes
- Dependencies that broke us
- Developer experience

Moving to a design system
One of our projects involved an admin panel for a product suite of 5 products. The product suite is a SaaS offering, and the admin panel for end users used StyledComponents. Moving to a design system and standardising all components meant great time savings for the long term. However, that implied porting an existing code base of 25+ man-years. We found a sweet spot where we migrated component by component to the design system. Very small PRs, where every PR involved changes to only one type of component on only one page, made code review, testing, and UX review easy. The new design system had a theme that styled the new components similarly to the old implementation. The idea is to switch the theme after all components are migrated. That way, the new design release is a large change visually, but a small code diff on the day the theme is flipped. If such small changes are not shipped often, the branches become stale, and in a large team that ships new features often, rebasing becomes a large task in itself. Hence the need to ship as often as possible.

Business temptations for shipping optics and associated operational hazards
Business teams demand large releases that are good optics for marketing. However, those come at a risk. Users face challenges when there are large changes in the app. Large releases can cause an increase in support tickets if there’s any confusion during app usage, especially from the internal stakeholders. When there are content architecture changes, a different set of fields is required for the old vs new versions during release. The new fields could be derived from the old ones, hence the need to maintain the old fields for a while. Such old fields become tech debt and have to be removed later. This brings us to another aspect: whether the changes are graphic design changes, content architecture changes, or both. For the large CMS-driven static websites we developed, where changes involved such redundant old fields and a large number of new fields, keeping Storybook updated enabled the content managers to check the old and new components on Storybook. We’ve been using Chromatic for the past few years, and it has helped us enable content managers to quickly experiment with components.

Content architecture vs design changes vs simultaneous content and design changes
Often, designers creatively add more fields to the UI while redesigning. This mixes two tasks up: graphic design changes and content architecture changes. Separating graphic design changes and content architecture changes helps.
Content architecture changes need to be shipped in extremely small increments, whereas large graphic design changes can be made in large releases.

Dependencies that broke us
NextJS shipped a stable app directory/app routing. We used ChakraUI, a CSS-in-JS library. This library does not support the app directory and only works with client-side rendering if used in the app directory. Had we used TailwindCSS, this would not have happened. However, using TailwindCSS and building components with some other library to provide accessibility would have taken more time (Radix is one way to use primitives together with TailwindCSS). The trade-off of using ChakraUI paid off in the first few years of the project, but it is now a blocker for upgrading NextJS and using the new features NextJS offers. Another roadblock is that the ChakraUI plugin for Storybook does not support newer Storybook versions. Essentially, we’ll need a complete re-write to exit ChakraUI at some point. In one of our projects, we used a navigation library, navi. This library does not support ReactJS v17 onwards, so we had to fork it and upgrade it ourselves since it is no longer maintained. While the dev team might want to take some risks and use such libraries, when they aren’t supported in newer versions, replacing such core features requires large refactoring.

Developer experience
When setting up a project, prettier and eslint need to be the priority. Skipping these and enforcing rules later is expensive. Having them set up during project setup also makes it easier to onboard new developers. Instead of maintaining documentation, addressing formatting rules during code reviews, or memorising rules without documentation, setting up prettier and eslint enforces rules on the codebase using IDE/IDE-plugin features. This saves time and reduces developer fatigue as well as irritation. While some teams consider writing types/interfaces in TypeScript an unnecessary overhead, when building large projects, knowing the types really helps in seeing what might break during development instead of at runtime. The learning curve for TypeScript isn’t steep, and GitHub Copilot has made things easier as well.
- Server Performance Prediction using ML Models - Part 1
This blog is the first part of the series on "Server Performance Prediction using Machine Learning Models".

OVERVIEW:
In the semiconductor industry, a silicon cycle takes nearly a year or more from conception to the silicon being available. As soon as the concept comes in, there is immense pressure on the marketing and sales teams to come up with performance numbers for this new generation of the silicon, i.e. the processor. There is a need for a way to find out what the score could be on this new generation. Since the new processor is not physically available, companies run several benchmarks on simulators/emulators to get the performance scores. This process has two major drawbacks:
- Running a benchmark on a simulator takes hours longer than running it on the physical processor
- Such simulators are very expensive and only a limited number of them are available
There should be some way of projecting new scores based on the scores of the older generation of processors. In this project, we aim to predict performance for a benchmark on newer generation processors by training machine learning models on older generation data.

METHODOLOGY:
Performance Parameters: We identified two performance parameters, namely IPC (instructions per cycle) and performance runtime. The runtime is derived from the instructions per cycle, assuming a particular clock speed. In the current iteration of the research, we limit our work to IPC and runtime prediction.

Data Gathering and pre-processing: We gathered our data on two generations of the Graviton processors, namely Graviton 2 and Graviton 3. We captured data for the benchmark suites SPECInt and SPECFp, designed by the Standard Performance Evaluation Corporation. For each benchmark in these SPEC benchmark suites, we captured the following counters at 100 millisecond intervals:
1. L1I_TLB_REFILL
2. L1D_TLB_REFILL
3. L3D_CACHE_ALLOCATE
4. L3D_CACHE_REFILL
5. L3D_CACHE
6. l1i_cache
7. l1i_cache_refill
8. l1d_cache
9. l1d_cache_refill
10. l2d_cache
11. l2d_cache_refill
12. br_mis_pred
13. br_pred
14. mem_access
15. stall_backend
16. stall_frontend
17. ASE_SPEC
18. VFP_SPEC
19. L1I_CACHE_REFILL
20. L2D_CACHE_REFILL
21. INST_SPEC
22. BR_RETIRED
23. BR_MIS_PRED_RETIRED
24. branch-loads
25. MEM_ACCESS_RD

Why SPEC? We captured data on the SPEC benchmark suites because these benchmarks are designed to exercise most areas of the CPU. For example, a benchmark like gcc stresses the memory, whereas another benchmark like x264 stresses the vectorization capabilities of the CPU.

Let us call a set of counters collectively "The Snapshot" of the system. We also capture instructions and cycles every 100 milliseconds. During post-processing, we calculate the instructions per cycle (IPC) for every snapshot. For the machine learning model discussed further on, we need a set of input variables and an output variable. The snapshot of the system serves as the input set of variables (X). The output variable is the ratio of the IPCs of the newer generation of CPUs to the older generation. Let us have a look at how this training data is generated.

Post-processing the counters: For a benchmark, the captured data is available in 7 CSV files. Each file has a few counters (e.g. L1D_TLB_REFILL) captured at 100 millisecond intervals over the duration of the benchmark. There are 7 files because only a limited number of counters can be captured during a given benchmark run.
Hence, we run the benchmark several times in order to capture all the necessary counters. Inside a counter file, there are many columns, of which only 3 are important to us for post-processing: time, count, and type. We filter the dataframe that is read from the CSV file down to only these 3 columns. For example, a row in the filtered dataframe would look like 0.300514373,1778078,L1D_TLB_REFILL. This row means that the counter L1D_TLB_REFILL was captured at 300 milliseconds from the start of the test, and its count was 1778078.

Once we have the counter files for a benchmark, we apply post-processing to the data, where we calculate the Per-Kilo-Instruction (PKI) metric for each counter. For example, we calculate L1D_TLB_REFILL_PKI for every capture of the L1D_TLB_REFILL counter. If the corresponding instructions capture for the above example is 0.300514373,618104386,instructions, it means that the instructions count was 618104386 at 300 milliseconds. To calculate the PKI value for each counter, we use the following formula:

PKI_Count = (Counter / instructions) * 1000

The formula gives us the per-kilo-instructions value of a particular counter. For the 300 milliseconds capture above, the PKI count would be calculated as follows:

L1D_TLB_REFILL_PKI = L1D_TLB_REFILL / instructions * 1000 = 1778078 / 618104386 * 1000 = 2.876662972

Hence the L1D_TLB_REFILL_PKI for this particular capture at 300 milliseconds is equal to 2.876662972. Additionally, an instructions-per-cycle (IPC) value is calculated for each capture of the counters. The formula for IPC simply divides the number of instructions by the number of cycles:

IPC = instructions / cycles

It tells you how many instructions were executed per clock cycle of the CPU, and we consider it a measure of the performance of the CPU. The PKI calculation (same as shown above) is done for all 25 captured counters. As a result, we have a post-processed dataframe with the following columns: time, IPC, and the 25 PKI counters. This is the 2-dimensional dataframe for a benchmark. The post-processing is repeated for all the benchmarks, creating a 2-dimensional dataframe for each. These dataframes are concatenated one after another and indexed by the benchmark name, which is our 3rd dimension. Hence, we stack all the 2D dataframes into a 3D dataframe indexed by the benchmark name. A short pandas sketch of this per-counter post-processing is shown below.
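The following is a minimal pandas sketch of the post-processing described above, not the project's actual code. The column names (time, count, type) follow the text; the assumption that the counter and instructions samples can be joined on the time column, and the file layout, are illustrative.

```python
import pandas as pd

def per_kilo_instructions(counter_csv: str, instructions_csv: str,
                          counter_name: str) -> pd.DataFrame:
    """Compute the <counter>_PKI series for one counter, per 100 ms capture."""
    # Keep only the three columns used in post-processing: time, count, type.
    ctr = pd.read_csv(counter_csv)[["time", "count", "type"]]
    ctr = ctr[ctr["type"] == counter_name]
    ins = pd.read_csv(instructions_csv)[["time", "count", "type"]]
    ins = ins[ins["type"] == "instructions"]
    merged = ctr.merge(ins, on="time", suffixes=("_ctr", "_ins"))
    # PKI_Count = (Counter / instructions) * 1000
    merged[f"{counter_name}_PKI"] = merged["count_ctr"] / merged["count_ins"] * 1000
    return merged[["time", f"{counter_name}_PKI"]]

def ipc(instructions: pd.Series, cycles: pd.Series) -> pd.Series:
    """IPC = instructions / cycles, per snapshot."""
    return instructions / cycles
```

Running this for each of the 25 counters and joining the results on time yields the per-benchmark 2D dataframe (time, IPC, 25 PKI columns) described above.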
Plotting a comparative IPC plot
When a benchmark runs on a newer generation processor, it is generally expected to take less time than on the older generation processor. For example, consider a benchmark named '500.perlbench_r_checkspam_0'. It takes less time to run on Graviton 3 than on Graviton 2. Below is a plot of the calculated IPC over the entire duration of the '500.perlbench_r_checkspam_0' benchmark. As seen in the plot, the benchmark on G3 runs in less time than the same benchmark on G2, and the IPC is higher as well.

Dynamic Time Warping and Generating Training Data
Let us consider an example, taking Graviton 2 (G2) as the older generation and Graviton 3 (G3) as the newer generation used while training the machine learning model. For each benchmark, we calculate a new column called the "IPC Ratio". This is the ratio of the IPC values of each row of the G3 dataframe mapped to a row of the G2 dataframe for that benchmark. We perform this mapping using Dynamic Time Warping.

Why IPC Ratio?
For a given snapshot of the system (X), the IPC ratio of G3 to G2 is the same irrespective of the benchmark being run. This is because the architectures are designed in such a way that if the "state" of the older system is the same at a point during any two benchmark runs, then the IPC ratios of the new generation to the older generation are equal for both benchmarks. Note that only the ratio is the same, not the raw IPC numbers. Hence, we choose the IPC Ratio as the output (Y) parameter.

Dynamic Time Warping
Since we need to find the ratio of the IPCs at each capture of the snapshot, there should be an intelligent way to map the benchmark run instances. If we observe the comparative IPC plot above, we notice that the benchmark runs in less time on G3 than on G2. This means that more instructions are executed in the same amount of time on G3 than on G2. We calculate the cumulative sum of the number of instructions executed up to each row, and store it in a column named "instructions_cumulative_sum". The graph of the cumulative instructions for G3 and G2 can be seen below (the numbers on the Y-axis are on the order of 10^12). The example belongs to the benchmark named '502.gcc_r_gcc-pp_3'.

We map the captured instances by matching the instructions_cumulative_sum column. This is because taking the IPC ratio makes sense only when we take the ratio of rows at nearly the same point during the benchmark run. Dynamic time warping is a method that calculates an optimal match between two given sequences under certain rules, for example:
- The first value of sequence 1 is mapped to the first value of sequence 2
- The last value of sequence 1 is mapped to the last value of sequence 2
- Intermediate mappings have monotonically increasing indices

Below is an example of the mapping path generated using the Dynamic Time Warping algorithm. We have time on the X-axis and instructions_cumulative_sum on the Y-axis. The line above represents G2 and the line below represents G3. The orange part of the graph is a dense set of lines, each representing a mapping from one row of G2 to a row of G3 based on the cumulative instructions count. Below are a few of the initial actual mappings for this example using the DTW algorithm, in the form ((row_g2, row_g3), cumulative_sum_g2, cumulative_sum_g3):

((0, 0), 460190954.0, 672156965.0), ((1, 0), 1106484590.0, 672156965.0), ((2, 1), 1724588976.0, 1657376501.0), ((3, 2), 2249696930.0, 2344648605.0), ((4, 3), 2780771235.0, 3021707178.0), ((5, 3), 3313773511.0, 3021707178.0), ((6, 4), 3797070083.0, 3687200150.0), ((7, 5), 4311980235.0, 4247959294.0), ((8, 6), 4851216305.0, 4889782076.0), ((9, 7), 5393885229.0, 5538932327.0), ((10, 8), 5943214302.0, 6192166460.0), ((11, 8), 6520858849.0, 6192166460.0), ((12, 9), 7152711138.0, 7087454578.0) ...

As seen in the example, the 0th row of G2 is mapped to the 0th row of G3, the 1st row of G2 to the 0th row of G3, the 2nd row of G2 to the 1st row of G3, and so on. This is done by matching the total number of instructions executed, as explained above.

Calculating IPC Ratio
Now that the mapping is done, calculating the IPC ratio is relatively straightforward. For row i belonging to the G2 dataframe and row j belonging to the G3 dataframe:

IPC_Ratio = IPC_j / IPC_i

This is done for each (i, j) mapping calculated above. A minimal sketch of this mapping and ratio calculation is shown below.
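Here is a minimal sketch of the mapping and ratio step, not the project's actual implementation. The column names instructions_cumulative_sum and IPC follow the text; the DTW here is a plain quadratic version with absolute difference as the distance, rather than an optimized library.

```python
import numpy as np
import pandas as pd

def dtw_path(a: np.ndarray, b: np.ndarray):
    """Classic DTW on two 1-D sequences; returns the list of (i, j) index pairs."""
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    # Backtrack from the end to recover the warping path.
    i, j, path = n, m, []
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = int(np.argmin([cost[i - 1, j - 1], cost[i - 1, j], cost[i, j - 1]]))
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return path[::-1]

def ipc_ratio_rows(df_g2: pd.DataFrame, df_g3: pd.DataFrame) -> pd.DataFrame:
    """Map G2 rows to G3 rows on cumulative instructions and compute IPC_Ratio."""
    path = dtw_path(df_g2["instructions_cumulative_sum"].to_numpy(dtype=float),
                    df_g3["instructions_cumulative_sum"].to_numpy(dtype=float))
    rows = []
    for i, j in path:
        row = df_g2.iloc[i].copy()                       # snapshot (X) from G2
        row["IPC_Ratio"] = df_g3.iloc[j]["IPC"] / df_g2.iloc[i]["IPC"]
        rows.append(row)
    return pd.DataFrame(rows)
```

Each output row keeps the G2 snapshot as the input features and attaches the G3/G2 IPC ratio as the target, which is exactly the pairing the training data needs.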
A few insights about IPC_Ratio for '502.gcc_r_gcc-pp_3': the IPC_Ratio distribution calculated for this benchmark looks accurate, because the improvement from G2 to G3 is approximately 30%, which means the ratio should be around 1.3. The comparative IPC plot for G3 vs G2 for this benchmark is shown below, followed by the IPC_Ratio calculated for each row after Dynamic Time Warping (DTW).

Final training data
The final training data has the columns 'time', 'instructions', 'instructions_cumulative_sum', and 'IPC' removed. Hence, the training data has the 25 PKI counters (the X variables) and the IPC_Ratio (Y). We use the K Neighbors Regression model for training and inference of the IPC Ratio for a given snapshot X; a short sketch of assembling X and Y appears at the end of this post.
X: the set of variables that define the state of the system at a particular time during the benchmark.
Y: IPC Ratio, the ratio of the instructions per cycle (IPC) of G3 to G2 at the time when approximately the same number of instructions had been executed for the benchmark on both CPUs.

Part 1 - Summary
In this part, we have covered the captured counters and their post-processing, and the generation of the final training data. In the next part, we will cover the machine learning algorithm for predicting the IPC Ratio given a snapshot of the system.
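To round off Part 1, here is a small illustrative sketch (not the project's code) of assembling the final training data from one post-processed, DTW-mapped benchmark dataframe, following the column names used above.

```python
import pandas as pd

def to_training_data(df: pd.DataFrame):
    """Split a post-processed benchmark dataframe into X (25 PKI counters) and y (IPC_Ratio)."""
    dropped = ["time", "instructions", "instructions_cumulative_sum", "IPC"]
    X = df.drop(columns=dropped + ["IPC_Ratio"], errors="ignore")
    y = df["IPC_Ratio"]
    return X, y
```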
- All about Python Kubernetes Client
Kubernetes, also known as K8s, is an open-source container orchestration engine for automating deployment, scaling, and management of containerized applications. The open source project is hosted by the Cloud Native Computing Foundation (CNCF). If you are a developer and not familiar with the Kubernetes CLI, then a Kubernetes client will help you interact with Kubernetes. Several language-specific Kubernetes clients are available, e.g. Python, Java, C#, JavaScript, etc. Let's dive into the Python Kubernetes client and understand it using the following steps and examples.

Step 1: Install kubernetes (see the installation guide)
    pip install kubernetes

Step 2: Import client and config from kubernetes in your Python code
    from kubernetes import client, config

Step 3: Load the Kubernetes cluster configuration
    try:
        config.load_incluster_config()
    except config.ConfigException:
        try:
            config.load_kube_config()
        except config.ConfigException:
            raise Exception("Could not configure kubernetes python client")

Step 4: Interact with Kubernetes resources
In step 3 we loaded the Kubernetes config, so we are ready to perform different operations and fetch different Kubernetes resources using the Python client APIs.

1. Get Nodes
kubectl is the CLI for Kubernetes; to get the nodes of a cluster you run a command. A node may be a virtual or physical machine, depending on the cluster.
Command:
    kubectl get nodes
The above command returns the nodes currently in the Kubernetes cluster. To get the same data using the Python client, we use the CoreV1Api class from the client we imported from kubernetes, as follows.
Using Client:
    v1 = client.CoreV1Api()
    v1.list_node()

2. Get Namespaces
A namespace in Kubernetes is something like a group in general terms: if we want to bind our pods, deployments, PVs, PVCs, etc. under one label or group, we create a namespace and, while creating each of those Kubernetes resources, add the --namespace your_namespace_name flag at the end of the command.
Command:
    kubectl get namespaces
Using Client:
    v1 = client.CoreV1Api()
    v1.list_namespace()

3. Get Pods in all Namespaces
Pods are the smallest deployable units of computing that you can create and manage in Kubernetes. A pod is a set of containers with shared namespaces and shared file system volumes.
Command:
    kubectl get pods --all-namespaces
Using Client:
    v1 = client.CoreV1Api()
    v1.list_pod_for_all_namespaces()

4. Get Pods in a Specific Namespace
To find pods deployed under a specific namespace we use the command below, and we can do the same using the Python Kubernetes client.
Command:
    kubectl get pods -n your_namespace_here
Using Client:
    v1 = client.CoreV1Api()
    pod_list = v1.list_namespaced_pod(pod_namespace)
    pods = [pod.metadata.name + " " + pod.status.phase for pod in pod_list.items]
Note: Kubernetes resources (such as pods, deployments, etc.) created without a namespace are placed under the default namespace.

5. Create a Pod in a Namespace
Command:
    kubectl apply -f your_pod_yaml_file.yaml
Using Client:
    import yaml
    with open(podYamlFilePath) as f:
        dep = yaml.safe_load(f)
    core_v1 = client.CoreV1Api()
    resp = core_v1.create_namespaced_pod(body=dep, namespace=pod_namespace)
    print("Pod created. name='%s'" % resp.metadata.name)

6. Get Pod Status
A Pod's status field is a PodStatus object, which has a phase field. The phase of a Pod is a simple, high-level summary of where the Pod is in its lifecycle.
A Pod's phase has five possible values: Pending, Running, Succeeded, Failed, and Unknown.
Note: After deleting a pod, some kubectl commands show a Terminating status, but this is not a pod phase.
Command:
    kubectl describe pod pod_name --namespace your_pod_namespace
Using Client:
    v1 = client.CoreV1Api()
    pod = v1.read_namespaced_pod(name=pod_name, namespace=pod_namespace)
    print(pod.status.phase)  # prints the pod's phase

7. Delete a Pod in a Namespace
Command:
    kubectl delete pod your_pod_name --namespace your_pod_namespace
Using Client:
    v1 = client.CoreV1Api()
    api_response = v1.delete_namespaced_pod(pod_name, pod_namespace)
    print(api_response)

For more information, see the official documentation of the Python Kubernetes client, and for more examples visit the official examples folder in the kubernetes-client repository. A complete, runnable example combining these steps is included as an appendix at the end of this post.

Conclusion:
The Kubernetes client helps you interact with a Kubernetes cluster to perform different tasks and operations programmatically, without running CLI commands. As we saw in this article, we can perform much more advanced things using the Kubernetes client, and one of its great strengths is that it is available in different programming languages.

References:
Kubernetes Docs: https://kubernetes.io/docs/home/
Kubernetes Client: https://github.com/kubernetes-client/
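Appendix: a small, self-contained script (not from the original post) that ties the configuration-loading and listing steps together. The namespace name is a placeholder.

```python
from kubernetes import client, config

def main(namespace: str = "default") -> None:
    # Step 3: load in-cluster config when running inside a pod, else ~/.kube/config.
    try:
        config.load_incluster_config()
    except config.ConfigException:
        config.load_kube_config()

    v1 = client.CoreV1Api()

    # List nodes (kubectl get nodes).
    for node in v1.list_node().items:
        print("node:", node.metadata.name)

    # List pods in one namespace (kubectl get pods -n <namespace>).
    for pod in v1.list_namespaced_pod(namespace).items:
        print("pod:", pod.metadata.name, pod.status.phase)

if __name__ == "__main__":
    main()
```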
- Server Performance Prediction using ML Models - Part 2
In the first part of the blog, we described the problem that we intend to solve, the data gathering, post-processing, and generating the final training data. In this second part, we look at the machine learning model we used for training and for inference on new data.

Correlation between various counters
We have captured various counters for various benchmarks. Here is a graph that shows the correlation of each counter with every other counter.

K Neighbors Regression
Given a snapshot of the system as a test data row, the K Neighbors algorithm first finds the K nearest neighbors among all training rows using a distance metric such as Euclidean distance (the default), Manhattan distance, Minkowski distance, etc. It then averages the Y values of the K nearest neighbors for the given test row and assigns the result as the predicted Y value of the test row.

Standard Normalization of Counters
In order for the K Neighbors Regression algorithm to calculate these distances in an unbiased manner, we bring all the counters to a comparable scale using standard normalization, which means that all the columns have values following a standard normal distribution with mean equal to 0 and standard deviation equal to 1.

Why did we use K=1?
Since we know that, given two snapshots whose X values are exactly the same, the ratios would also be the same, we chose K=1 to find the single closest neighbor whose input variables match the test data very closely, giving a nearly accurate prediction of the IPC ratio.

Shown below is a sample of the prediction made using K Neighbors Regression for the IPC Ratio. The IPC ratio prediction is for the '502.gcc_r_gcc-pp_3' benchmark. The "Actual" line is present in the graph because we have already calculated the IPC Ratio for '502.gcc_r_gcc-pp_3'; this dataframe was excluded from the training data for the K Neighbors Regression and was used as a test dataframe.

The runtime can be calculated from the predicted IPC by assuming a particular clock speed of the CPU. We calculate the total number of cycles first, followed by the runtime:

total_cycles = total_instructions / predicted_ipc
predicted_runtime = total_cycles / (2.5 * 10^9)

The formula for predicted runtime above assumes that the clock speed of the processor is 2.5 GHz. The predicted IPC and the runtime for the same benchmark can be seen in the following graph: it shows around a 30% improvement, which is close to the expected value. A compact sketch of the normalization, K=1 regression, and runtime calculation is given below.
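For illustration, here is a minimal scikit-learn sketch (not the original code) of the flow described above: standard-normalize the 25 PKI counters, fit a K=1 K Neighbors regressor on the IPC Ratio, and convert a predicted IPC into a runtime at an assumed 2.5 GHz clock. The variable names and the use of the mean predicted IPC are assumptions.

```python
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# X_train: 25 PKI counters per snapshot, y_train: IPC_Ratio (see Part 1).
def train_ipc_ratio_model(X_train, y_train):
    """Standard-normalize the counters, then fit a K=1 nearest-neighbour regressor."""
    model = make_pipeline(StandardScaler(), KNeighborsRegressor(n_neighbors=1))
    model.fit(X_train, y_train)
    return model

def predict_runtime(model, X_test, ipc_old, total_instructions,
                    clock_hz: float = 2.5e9) -> float:
    """Predicted runtime on the new generation at an assumed 2.5 GHz clock."""
    predicted_ipc = ipc_old * model.predict(X_test)   # IPC_new = IPC_old * ratio
    total_cycles = total_instructions / predicted_ipc.mean()
    return total_cycles / clock_hz
```

The pipeline keeps the scaler and the regressor together, so the same normalization learned on the training counters is applied automatically at prediction time.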
- Responsive Next.js Image Using Aspect Ratio
One of our customers at Whileone wanted to build cards for their website which contain an image and some other content. The image should cover its container and adjust its dimensions accordingly, without being cropped. While using the Next.js Image component and making it responsive, we always faced one challenge: we need to keep the aspect ratio of the image so that it looks neat and clean in the given space. We can do this by specifying the height and width of the image at different breakpoints, but that is a time-consuming, trial-and-error method. So we came up with a solution to this problem: the CSS property called aspect-ratio. Below is an example card; let's see it in two scenarios, with aspect ratio and without aspect ratio. I'm using Tailwind CSS for styling. (Fig. 1)

1. Without Aspect Ratio: Without the aspect ratio we'll have the same card on the mobile screen. If we compare the image below (Fig. 2) with the first image (Fig. 1), the bottom corner of the image gets cropped and disappears on the mobile screen. (Fig. 2)

2. With Aspect Ratio: With the aspect ratio we'll have the same card on the mobile screen. If we compare the image below (Fig. 3) with the first image (Fig. 1), both are rendered properly. That's the advantage of the aspect-ratio property.

Let's see what exactly changed in the code. We are using two Next.js Image properties, which means the image will take its width and height from its parent; and by default the Next.js Image is absolutely positioned, so its parent needs to be relatively positioned. We also need to add the aspect-ratio property to the parent, so the question arises: how can we calculate the aspect ratio of an image? The aspect ratio is simply the image's intrinsic width divided by its height, and for this particular image that works out to 1.5. (Fig. 3)

Conclusion: This blog intends to help you understand how aspect-ratio works with the Next.js Image component, and how it helps us build responsive images. Here is the CodeSandbox link for the example, for better understanding, where you can see the code, make changes, and see the difference: https://codesandbox.io/p/sandbox/next-image-with-aspect-ratio-l6fplz?file=%2Fapp%2Fpage.tsx%3A1%2C1
- AWS Lambda to generate SSH Keys
For the past few months, my team and I at WhileOne Techsoft Pvt. Ltd. have been helping our customer set up a system wherein access to a remote server in the cloud can be granted to users for testing. One of our client's requirements is to generate SSH keys from the JIRA board: a custom script in JIRA triggers SSH key generation, which helps our client with project automation. SSH key pairs are two cryptographically secure keys that can be used to authenticate a client to an SSH server. The private key is retained by the client and should be kept absolutely secret.

Why use an AWS Lambda function?
AWS Lambda is a serverless compute service that runs code in response to events and automatically manages the underlying compute resources. AWS Lambda automatically runs code in response to multiple events, such as HTTP requests via Amazon API Gateway, modifications to objects in Amazon Simple Storage Service (Amazon S3) buckets, table updates in Amazon DynamoDB, and state transitions in AWS Step Functions. With AWS Lambda, there are no new languages, tools, or frameworks to learn. You can use any third-party library, even native ones. You can also package any code (frameworks, SDKs, libraries, and more) as a Lambda Layer, and manage and share them easily across multiple functions. Lambda natively supports Java, Go, PowerShell, Node.js, C#, Python, and Ruby code, and provides a Runtime API allowing you to use any additional programming languages to author your functions.

Steps to generate SSH keys:
1. As a team we decided to use Ruby as the language for the AWS Lambda function.
- The AWS Lambda function uses Ruby 2.7, which supports the x86_64 and arm64 architectures.
- I first tried the OpenSSL Cipher approach, which requires the "openssl" gem in Ruby and generates random keys:
    cipher = OpenSSL::Cipher.new('AES-128-CBC')
- The above line generates a random cipher key, but it did not work as expected as a public/private key pair; OpenSSL's cipher keys are also limited to 256 bits, which is not suitable here.
- To overcome this problem, I generated the keys using the "sshkey" gem, which provides the built-in SSHKey.generate method. This method is quite easy to use and gives an accurate public and private key pair.
2. Once the keys are generated, the next task is to zip the 2 files. As per the client requirement, the keys should be zipped and sent to the customer as an attachment in an email.
- The Lambda environment has built-in support for "tar.gz", but the requirement is the ".zip" format.
- To zip the keys, I used the "zip" gem in Ruby, which is quite easy to use.
3. The next step is to create an email template and send the keys by email.
- First, I tried SES (Amazon Simple Email Service). I was able to send the emails, but they always landed in junk mail, so I needed to find another way to send them.
- Ruby supports SMTP via "net/smtp". This method is quite straightforward: add your credentials, build a template, and send the email. But it supports only one attachment per email, which is a drawback for me since I also need to send some PDF documents as attachments along with the keys.
- To overcome this, I used the "mail" gem in Ruby. It works over SMTP and also supports HTML templates for the email.
4. Since this AWS Lambda function is going to be called from the JIRA board, I need some data from the JIRA board, such as the client's name, email id, etc.
In Ruby, the "jira-ruby" gem is used to fetch information from the JIRA board. To get this information, an API token needs to be created in the JIRA board, which acts as a password; the Lambda function uses these JIRA credentials to fetch the data. In JIRA, each issue is created with a unique issue id, and all the required information is fetched from the issue id, which is the parameter passed in by the JIRA board.

Steps to call the Lambda function from the JIRA board:
1. First, create an API and add the Lambda function behind a POST method. Add any parameters needed; I need an issue id from the JIRA board, so the issue id is added as a URL query string parameter.
2. To call this API, JIRA supports webhooks. Create a JIRA webhook, add the API URL to it, and also add a JQL query defining when this API should get called.

Limitations:
The AWS Lambda function supports only 3 MB of space for each function. Since I am using a lot of gems for this functionality, I need more space for my function.
- To resolve this, I split the functionality into 2 separate functions of up to 3 MB each.
- The first function fetches the information from the JIRA board, creates a temporary file and saves it in an S3 bucket, and also creates the SSH keys and saves them in the S3 bucket.
- The second function generates the email template, fetches the keys from the S3 bucket, and sends the email to the customer.
- The main task then is how these 2 functions communicate with each other, since the API can call only 1 function. The AWS SDK for Ruby supports "invoke", with which one Lambda function can call the other.

This was a very good way for me to get hands-on with AWS Lambda and have a soft landing in understanding it. The cost of the Lambda functions is shown in the table below. These costs are negligible, so no IT department in the company is going to raise eyebrows over cost overheads. The same can also be done on other cloud providers: GCP using Google Cloud Functions, Azure using Azure Automation, or OCI using Oracle Functions.
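For readers who do not use Ruby, here is a rough Python equivalent of the key-generation and zipping steps (steps 1 and 2 above). This is purely illustrative and not the code used in this project; the filenames, RSA key size, and use of the cryptography package are assumptions.

```python
import io
import zipfile
from cryptography.hazmat.primitives import serialization
from cryptography.hazmat.primitives.asymmetric import rsa

def generate_ssh_keypair_zip() -> bytes:
    """Generate an RSA SSH key pair and return a .zip archive as bytes."""
    key = rsa.generate_private_key(public_exponent=65537, key_size=2048)
    private_openssh = key.private_bytes(
        serialization.Encoding.PEM,
        serialization.PrivateFormat.OpenSSH,
        serialization.NoEncryption(),
    )
    public_openssh = key.public_key().public_bytes(
        serialization.Encoding.OpenSSH,
        serialization.PublicFormat.OpenSSH,
    )
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        zf.writestr("id_rsa", private_openssh)      # placeholder filenames
        zf.writestr("id_rsa.pub", public_openssh)
    return buf.getvalue()
```

The returned bytes could then be uploaded to S3 or attached to an email, mirroring the split between the two Lambda functions described above.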












