Success Story: How We Built a Trusted SRE Partnership with Our Client
- Akshay Bhide
In the world of Site Reliability Engineering (SRE), trust, knowledge, and execution matter more than anything else. When our team was presented with the opportunity to support one of the leading clients in the inference systems domain, we knew the competition would be fierce. Many well-established and much larger organizations were bidding for the same project. Yet, we saw this as an opportunity to prove that expertise, dedication, and the right approach can outweigh size and scale.
Despite being a relatively small organization, we brought to the table something unique: deep benchmarking expertise and domain knowledge that matched the client’s needs. Our ability to quickly understand complex systems, connect the dots across data center operations, and build solutions made us stand apart. This expertise, combined with our willingness to adapt and learn, enabled us to win the contract and take on the responsibility of L1 support for their uptime systems, a task critical to their business continuity.
Early Learning Curve: Building Strong Foundations for SRE
The first few months were not easy. As with any complex system, the uptime infrastructure required us to climb a steep learning curve. We had to quickly grasp:
How inference workloads behave in production.
The architectural blocks within the inference ecosystem.
The hosting mechanisms, including the structure of the client’s data centers.
The different ways the system could fail and the potential impact of each failure mode.
Every shift brought new learning opportunities. We immersed ourselves in understanding not just what went wrong, but why it went wrong. Slowly but steadily, our knowledge grew. Each incident became a case study, and each interaction with the client’s engineers enriched our understanding. This was the foundation upon which the rest of our success was built.
Shadow-to-Primary: Transitioning to Responsibility
In the beginning, we worked in 24x7 rotational shifts, shadowing the client’s engineers, who acted as the primary on-call. Whenever an incident occurred, we would huddle with their team for hours, studying every aspect of the problem. From root causes to resolution steps, we ensured that we not only solved the issue but also understood its overall architectural implications.
This approach gave us a top-to-bottom view of the system. We became aware of dependencies, escalation paths, and the critical importance of maintaining near-zero downtime, especially since the client’s end customers had strict SLAs.
A few weeks later, roles were reversed. We stepped into the position of primary on-call, while the client’s engineers moved into a shadow role. This was a defining moment for us — it was proof of the trust the client had started to place in our abilities.
From that point onward, we took ownership of incidents, evaluated dependencies, and escalated to higher-level (L2/L3) teams when necessary. Our timely and correct escalations saved the client from SLA violations in at least two critical cases. By reducing downtime significantly during these incidents, we demonstrated our ability to not only react but also safeguard business continuity.
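The escalation discipline described above can be sketched as a simple time-based tiering rule. This is an illustrative model only: the actual thresholds, tier names, and SLA budgets from the engagement are not public, so every number and identifier below is an assumption.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# Illustrative thresholds -- the real SLA budgets are not public.
L2_THRESHOLD = timedelta(minutes=15)  # hand off to L2 after this long
L3_THRESHOLD = timedelta(minutes=30)  # pull in L3 before the SLA is at risk

def escalation_tier(opened_at: datetime, now: Optional[datetime] = None) -> str:
    """Return which support tier should own an incident,
    based on how long it has been open."""
    now = now or datetime.now(timezone.utc)
    elapsed = now - opened_at
    if elapsed >= L3_THRESHOLD:
        return "L3"
    if elapsed >= L2_THRESHOLD:
        return "L2"
    return "L1"
```

A rule like this makes the "timely and correct escalations" above mechanical rather than a judgment call made under pressure: the on-call engineer checks the elapsed time and the tier answer is unambiguous.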
Innovation: Building Dashboards & Monitoring Tools
As we settled into our responsibilities, we realized that the existing tools were not enough for the kind of proactive monitoring and reporting we envisioned. To bridge this gap, we took the initiative to build custom dashboards that provided visibility and actionable insights.

Shift Dashboard: Displayed current on-call engineers, open issues, resolved cases, and escalations in real-time.
Incident Dashboard: Showed day-wise, model-wise, and data center-wise incident trends — becoming an essential tool for weekly analysis.
Weekly Summary Dashboard: Automatically generated detailed reports of the past week’s incidents, including escalation data and issue patterns.
These tools were not part of the original scope, but we believed they were necessary to add value. Over time, they became integral to the client’s weekly analysis process, simplifying their workflows and enhancing decision-making.
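The kind of aggregation behind the Incident and Weekly Summary dashboards can be sketched in a few lines. The record fields (`model`, `data_center`, `escalated`) are hypothetical stand-ins for the client's actual schema, which is not public.

```python
from collections import Counter

def weekly_summary(incidents):
    """Roll up a week of incident records the way a summary dashboard might:
    counts by model, counts by data center, and total escalations."""
    return {
        "by_model": Counter(i["model"] for i in incidents),
        "by_data_center": Counter(i["data_center"] for i in incidents),
        "escalations": sum(1 for i in incidents if i["escalated"]),
    }

# Hypothetical sample records, for illustration only.
incidents = [
    {"model": "model-a", "data_center": "dc-east", "escalated": True},
    {"model": "model-a", "data_center": "dc-west", "escalated": False},
    {"model": "model-b", "data_center": "dc-east", "escalated": False},
]
summary = weekly_summary(incidents)
```

Even a roll-up this simple turns a pile of raw incident tickets into the day-wise and model-wise trends that drive a weekly review.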
Continuous Learning & Adapting to Change
Inference systems are dynamic by nature. Weekly deployments, new models, and constant updates meant that the environment was never static. We set up processes to stay on top of these changes, ensuring that our knowledge was always current.
Regular huddles, review meetings, and knowledge-sharing sessions with the client’s engineers became part of our routine. This collaborative approach kept both sides aligned and allowed us to respond quickly to changes in logs, architecture, or deployment practices.
Within 5–6 months, we had grown from a team learning the ropes to a confident, trusted partner capable of handling L1 responsibilities independently while also delivering value-added innovations.
Challenges Faced and Overcome
The journey was not without challenges. We encountered:

New types of incidents: Each time we faced something new, we documented the issue and resolution steps, building a repository for future reference.
Frequent deployments: Required us to stay agile and adapt our processes weekly.
Multiple models and new data centers: Added layers of complexity to monitoring and incident handling.
Incident spikes: At times, a single 8-hour shift would see a barrage of incidents. Our on-call engineers handled these calmly, prioritizing issues, escalating appropriately, and ensuring system stability.
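The incident repository mentioned above can be sketched as a small runbook keyed by incident type, where each new failure mode is recorded once and looked up on every repeat. The structure and the sample entry are illustrative assumptions, not the team's actual tooling.

```python
# Minimal runbook sketch: each incident type maps to its symptoms and
# the resolution steps that worked, so repeats are resolved faster.
runbook = {}

def record_incident(incident_type, symptoms, resolution_steps):
    """Add (or update) a runbook entry for an incident type."""
    runbook[incident_type] = {
        "symptoms": symptoms,
        "resolution_steps": resolution_steps,
    }

def lookup(incident_type):
    """Return known resolution steps, or None if this type is new."""
    entry = runbook.get(incident_type)
    return entry["resolution_steps"] if entry else None

# Hypothetical entry, for illustration only.
record_incident(
    "gpu-node-unreachable",
    symptoms=["health check timeout", "no heartbeat from node"],
    resolution_steps=["drain node", "restart agent", "escalate to L2 if unrecovered"],
)
```

The payoff is exactly the one the list above describes: a `None` from `lookup` signals a genuinely new incident type that needs documenting, while a hit means the resolution steps are already written down.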
Each challenge was an opportunity to refine our processes, strengthen our knowledge, and enhance the value we delivered to the client.
Conclusion: A Journey of Trust and Value
Looking back, what began as a competitive bid against larger players turned into a remarkable journey of trust, growth, and success. In just a few months, we evolved from observers to primary guardians of system reliability.
Our contributions went beyond the scope of L1 support:
We reduced downtime through effective incident management and timely escalations.
We built custom dashboards that improved visibility, monitoring, and reporting.
We set up a process of continuous learning and adaptation to keep up with dynamic deployments.
We documented and standardized incident handling, making future resolutions faster and more reliable.
Most importantly, we became a trusted partner to our client — not just a support team. Our journey showcased that size is no barrier when expertise, dedication, and innovation come together.
This success story is a testament to our team’s resilience, ability to learn, and determination to deliver value. It reinforced the fact that in today’s fast-moving technology landscape, reliability and trust are the cornerstones of any successful partnership.