CIOReview

Building a Modern-Day Site Reliability Engineering Team

Anjeneya Dubey, Senior Director of Performance & Site Reliability Engineering, McGraw-Hill

In today’s world, where speed matters most, we have built highly complex, fast-paced environments in which we are so focused on delivering features rapidly to our customers that reliability is sometimes overlooked. Keeping up with the reliability goals of an ever-changing technology landscape is of the utmost importance, especially once you have fast-tracked your development pipelines into cloud environments and moved to containerization with DevOps practices. Traditionally, development teams focused on building features while operations teams focused on making sure that changes would not break anything. The challenge was that the two teams did not work closely enough to understand each change and the risks involved, which either slowed down release cycles or caused reliability issues. DevOps addresses this by bringing Dev and Ops closer together, and Site Reliability Engineering takes it further by focusing on product reliability across every aspect of software development while keeping the same fast pace and agility.

Now, what is Site Reliability Engineering?

Site Reliability Engineering (SRE) was invented at Google, and by their definition SRE is what happens when a software engineer designs an operations function. As the name suggests, SRE focuses on the reliability of products and services by ensuring the product is built right from the beginning. An SRE team's high-level responsibilities are:

• Advocate for reliability and inject best practices into product development
• Measure and track SLIs, SLOs, SLAs and error budgets
• Build, manage, deploy and maintain the production environment
• Own the CI/CD pipeline and the operation of the production system
• Respond to incidents
• Own problem management and RCAs
• Practice Infrastructure as Code and Monitoring as Code (essentially Everything as Code), and solve operational issues with software

What’s the typical structure of the SRE team?

The whole objective of having SREs is to bring operations as close to the development cycle as possible and to make service reliability everyone’s problem. SREs are usually a central team, yet they are embedded deeply in agile teams, working closely with product engineers and product managers and looking at every user story through a reliability lens. They create infrastructure tools, platforms and services that make engineering more efficient, and they provide guiding principles and hands-on support to product engineers on infrastructure engineering best practices. The number of SREs assigned to each team varies with the reliability needs of the application. SREs have tremendous awareness of what could go wrong, coupled with a strong desire to prevent it. They connect the performance of the service with design decisions in a regular meeting, which in my opinion is an immensely powerful feedback loop.

Pillars of SREs

Making a business service more reliable requires a lot of work across all aspects of product and infrastructure engineering. The following are the key pillars of SRE:

• Availability

Availability and uptime of business services are the most important things for SREs. They work closely with the product engineering and performance engineering teams to understand the key failure points. They simulate failure and chaos scenarios through game days to learn how their systems behave under stress, and to find and fix issues before they ever reach production. To make products more resilient to failure and ensure graceful recovery from infrastructure or service disruptions, they build automation that auto-scales computing resources to meet demand, build self-healing systems, and mitigate disruptions such as misconfigurations or transient network issues.

• Monitoring

SREs keep track of service health, performance and availability throughout the service lifecycle. They spend a lot of time understanding the performance characteristics of every application component and set up proactive monitoring for anomalies. The monitoring needs to be intelligent enough to alert humans only when action is needed. SREs are software engineers, so they solve production problems through automation and continuously make systems self-heal. Each week, SREs review the alerts that were triggered and examine the most frequent ones to identify and prioritize candidates for automated self-healing.
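The weekly alert review can be sketched in a few lines of Python. This is only an illustration: the alert records and the `top_alerts` helper are hypothetical, and a real team would pull this data from its own monitoring system.

```python
from collections import Counter


def top_alerts(alerts, n=3):
    """Return the n most frequent alert names with their counts."""
    # Count how often each alert fired during the review window.
    counts = Counter(alert["name"] for alert in alerts)
    # most_common sorts by count, highest first.
    return counts.most_common(n)


# Hypothetical alerts collected over one week.
week_alerts = [
    {"name": "disk_full", "host": "web-1"},
    {"name": "pod_restart", "host": "api-2"},
    {"name": "disk_full", "host": "web-2"},
    {"name": "cert_expiry", "host": "lb-1"},
    {"name": "disk_full", "host": "web-1"},
    {"name": "pod_restart", "host": "api-1"},
]

for name, count in top_alerts(week_alerts, n=2):
    print(f"{name}: fired {count} times")
```

The alerts at the top of this list are the natural first candidates for self-healing automation.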

• Engineering Efficiency

Another facet of SRE is making the overall engineering process efficient by building self-service automated tools and frameworks that product engineers can use off the shelf to write, deploy and test secure, scalable and reliable application code. SREs own the CI/CD pipeline, release management and pass/fail decisions for production changes.

• Operational excellence

SREs apply the same engineering discipline and best practices used for application code to the entire environment. They focus on defining the entire workload (applications and infrastructure) as code and deploying it together to each environment. Along with that, they implement operations procedures as code and automate their execution by triggering them in response to events. Performing operations as code limits human error and enables consistent responses to events across the board.
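As a rough illustration of operations procedures as code, the Python sketch below registers runbooks against event types and runs the matching one when an event arrives. The event names, the `runbook` decorator and the handlers are all hypothetical, and the handlers only return a message instead of touching real hosts.

```python
from typing import Callable, Dict

# Registry mapping an event type to the runbook that handles it.
RUNBOOKS: Dict[str, Callable[[dict], str]] = {}


def runbook(event_type: str):
    """Decorator that registers a function as the runbook for an event type."""
    def register(func):
        RUNBOOKS[event_type] = func
        return func
    return register


@runbook("disk_usage_high")
def clean_temp_files(event: dict) -> str:
    # A real runbook would act on the host; this sketch only reports.
    return f"cleaned temp files on {event['host']}"


@runbook("service_unhealthy")
def restart_service(event: dict) -> str:
    return f"restarted {event['service']} on {event['host']}"


def handle_event(event: dict) -> str:
    """Run the registered runbook, or escalate when none exists."""
    handler = RUNBOOKS.get(event["type"])
    if handler is None:
        return f"escalated {event['type']} to on-call"
    return handler(event)


print(handle_event({"type": "service_unhealthy", "service": "api", "host": "web-1"}))
```

Because every response lives in version-controlled code, the same event gets the same treatment each time, and anything without a runbook is escalated to a human.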

• Change management

The majority of software issues are caused by changes, which creates a constant conflict of interest between traditional operations teams and development teams over change velocity and service stability. This is addressed by introducing SLIs, SLOs and error budgets. SREs constantly monitor and measure the service level objective (SLO) for each identified service level indicator (SLI). If a team consistently exceeds its SLOs (for example, 99.9% availability for all services), it may be able to move faster and take on more risk, which allows it to have more product engineers than SREs. By contrast, when a team is in danger of missing its SLOs, or is already missing them, that is a signal to back off and pause to focus on reliability, assigning more SREs to bring uptime back on track so that the team can resume fast feature development. SRE deployment methodologies encourage frequent, small, reversible changes.
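To make the error budget idea concrete, here is a minimal Python sketch. It assumes a request-based availability SLI (successful requests divided by total requests), and the `error_budget` function and its field names are made up for illustration. With a 99.9% SLO, 0.1% of requests in the window may fail before the budget is spent.

```python
def error_budget(slo_target: float, total_requests: int, failed_requests: int) -> dict:
    """Summarise error-budget consumption for one measurement window."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be a fraction between 0 and 1")
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    # The budget is the number of failures the SLO tolerates in this window.
    allowed_failures = (1.0 - slo_target) * total_requests
    availability = 1.0 - failed_requests / total_requests
    consumed = failed_requests / allowed_failures
    return {
        "availability": availability,
        "allowed_failures": allowed_failures,
        "budget_consumed": consumed,
        # Budget left: keep shipping. Budget spent: focus on reliability.
        "can_ship_features": consumed < 1.0,
    }


report = error_budget(slo_target=0.999, total_requests=1_000_000, failed_requests=400)
print(f"availability: {report['availability']:.4%}")
print(f"budget consumed: {report['budget_consumed']:.0%}")
print("keep shipping" if report["can_ship_features"] else "pause and fix reliability")
```

In this example, 400 failures against roughly 1,000 allowed consumes about 40% of the budget, so the team can keep releasing; at 1,500 failures the budget would be overspent and the signal would be to slow down.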

• Incident & Problem Management

SREs are metrics driven, and because their goal is to keep the uptime SLOs in check, it is very important to reduce the mean time to respond to and resolve production issues. SREs excel at both pre-mortem and post-mortem exercises. They conduct blameless RCAs and learn from operational failures to drive improvements.
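As a simple sketch of how these two metrics might be computed, the Python below averages response and resolution times over a set of incidents. The incident records and their field names (`opened`, `acknowledged`, `resolved`) are assumptions for illustration; teams define these timestamps differently.

```python
from datetime import datetime, timedelta


def mean_times(incidents):
    """Return (mean time to respond, mean time to resolve) as timedeltas."""
    if not incidents:
        raise ValueError("need at least one incident")
    # Time to respond: opened until a human acknowledged the incident.
    respond = [i["acknowledged"] - i["opened"] for i in incidents]
    # Time to resolve: opened until the incident was resolved.
    resolve = [i["resolved"] - i["opened"] for i in incidents]
    n = len(incidents)
    return sum(respond, timedelta()) / n, sum(resolve, timedelta()) / n


# Hypothetical incident log.
incidents = [
    {
        "opened": datetime(2024, 1, 8, 10, 0),
        "acknowledged": datetime(2024, 1, 8, 10, 5),
        "resolved": datetime(2024, 1, 8, 11, 0),
    },
    {
        "opened": datetime(2024, 1, 9, 12, 0),
        "acknowledged": datetime(2024, 1, 9, 12, 15),
        "resolved": datetime(2024, 1, 9, 12, 30),
    },
]

mttr_respond, mttr_resolve = mean_times(incidents)
print(f"mean time to respond: {mttr_respond}")
print(f"mean time to resolve: {mttr_resolve}")
```

Tracking both numbers over time shows whether incident response is actually improving.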

Conclusion

If your organization has embarked on the DevOps journey and wants to speed up product delivery without negatively impacting product reliability, think about building and implementing a Site Reliability Engineering (SRE) practice. The good news is that site reliability engineering is already present in all organizations under different names, forms and sizes. Some organizations call them DevOps teams that take care of reliability, some have operations teams, and some have production support teams. You just have to bring them under the SRE umbrella and make it an engineering-centric team with a sheer focus on reliability and agility. So, let us stop feeding machines human blood and automate away our operations function with SRE.
