Building a Modern-Day Site Reliability Engineering Team

Anjeneya Dubey, Senior Director of Performance & Site Reliability Engineering, McGraw-Hill

In today’s world, where speed matters most, we have built highly complex yet fast-paced environments, and we are so bogged down in delivering features rapidly to our customers that reliability is sometimes overlooked. Keeping up with the reliability goals of the ever-changing technological landscape of your digital environments is of utmost importance, especially when you have fast-tracked your development pipelines into cloud environments and moved to containerization with DevOps practices. Traditionally, development teams focused on building features and operations teams focused on making sure that changes would not break anything. The challenge was that the two teams did not work closely enough to understand each change and the risks involved, which either slowed down release cycles or caused reliability issues. DevOps practice addresses this by bringing Dev and Ops closer together, and Site Reliability Engineering takes it further: it focuses on the reliability of the product at its core, across all aspects of software development, while keeping the same fast pace and agility.

Now, what is Site Reliability Engineering?

SRE was invented at Google, and by their definition it is what you get when a software engineer designs an operations function. As the name suggests, the key thing SRE focuses on is the reliability of products and services, ensured by building the product right from the beginning. Below are the high-level responsibilities of SREs:

• Advocate for reliability and inject best practices into product development
• Measure and maintain SLIs, SLOs, SLAs and error budgets
• Build, manage, deploy and maintain the production environment
• Own the CI/CD pipeline and the operation of the production system
• Respond to incidents
• Own problem management and RCAs
• Practice Infrastructure as Code and Monitoring as Code (basically Everything as Code) and solve operational issues with software

What’s the typical structure of the SRE team?

The whole objective of having SREs is to bring operations as close to the development cycle as possible and to make service reliability everyone’s problem. SREs are usually a central team, yet they are embedded deeply in agile teams, working very closely with product engineers and product managers and looking at every user story through a reliability lens. They create infrastructure tools, platforms and services that make engineering more efficient, and they provide guiding principles and hands-on help to product engineers on infrastructure engineering best practices. The number of SREs assigned to each team may vary depending on the reliability needs of the application. SREs have a tremendous awareness of what could go wrong, coupled with a strong desire to prevent it. They connect the performance of the service with design decisions in a regular meeting, which in my opinion is an immensely powerful feedback loop.

Pillars of SREs

Making a business service more reliable requires a lot of work across all aspects of product and infrastructure engineering. The following are the key pillars of SRE:

• Availability

Availability and uptime of business services are the most important things for SREs. They work very closely with the product engineering and performance engineering teams to understand the key failure points. They simulate failure and chaos scenarios through game days to constantly learn how their systems behave under stress, and to find and fix issues before they ever reach production. They make products more resilient to failure and ensure graceful recovery from infrastructure or service disruptions: they build automation to auto-scale computing resources to meet demand, build self-healing systems, and mitigate disruptions such as misconfigurations or transient network issues.

• Monitoring

SREs keep track of a service’s health, performance and availability throughout its lifecycle. They spend a lot of time understanding the performance characteristics of every application component and set up proactive monitoring for anomalies. The monitoring needs to be intelligent enough to alert humans only when action is needed. SREs are software engineers, so they solve production problems through automation and continuously make systems self-heal. They review the alerts that were triggered week by week and look at the top ones to identify and prioritize candidates for self-healing automation.
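The weekly alert review can be sketched in a few lines of Python. This is only an illustration: it assumes that the "top" alerts are the ones that fired most often, and the function name and alert names are hypothetical rather than taken from any particular monitoring tool.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def top_alerts(alert_names: Iterable[str], n: int = 3) -> List[Tuple[str, int]]:
    """Return the n most frequently fired alerts with their counts."""
    return Counter(alert_names).most_common(n)


if __name__ == "__main__":
    # Hypothetical alert names fired during one week.
    week = [
        "disk_full", "pod_restart", "disk_full",
        "high_latency", "disk_full", "pod_restart",
    ]
    for name, count in top_alerts(week, n=2):
        print(f"{name}: fired {count} times")
```

The alerts at the top of this list are natural first candidates for automation, since each one automated away removes the most repeated manual work.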

• Engineering Efficiency

Another facet of SRE is making the overall engineering process efficient by building self-service automated tools and frameworks that product engineers can use off the shelf to write, deploy and test secure, scalable and reliable application code. SREs own the CI/CD pipeline, release management and pass/fail decision-making for production changes.

• Operational Excellence

SREs apply the same engineering discipline and best practices that we use for application code to the entire environment. They focus on defining the entire workload (applications and infrastructure) as code and deploying it together to environments. Along with that, they implement operations procedures as code and automate their execution by triggering them in response to events. Performing operations as code limits human error and enables consistent responses to events across the board.
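As a rough illustration of operations procedures as code, the Python sketch below registers runbook functions against event types and runs the matching one when an event arrives. Everything here is an assumption made for the example: the event types, the field names (`type`, `host`, `service`) and the runbook actions are hypothetical, and a real runbook would call actual infrastructure tooling instead of returning a description.

```python
from typing import Callable, Dict

# Registry mapping an event type to the runbook function that handles it.
RUNBOOKS: Dict[str, Callable[[dict], str]] = {}


def runbook(event_type: str):
    """Decorator that registers a function as the automated response to an event type."""
    def register(func: Callable[[dict], str]) -> Callable[[dict], str]:
        RUNBOOKS[event_type] = func
        return func
    return register


@runbook("disk_usage_high")
def clean_temp_files(event: dict) -> str:
    # Placeholder action: only describes what a real runbook would do.
    return f"cleaned temp files on {event['host']}"


@runbook("service_unhealthy")
def restart_service(event: dict) -> str:
    # Placeholder action: only describes what a real runbook would do.
    return f"restarted {event['service']} on {event['host']}"


def handle(event: dict) -> str:
    """Run the registered runbook, or escalate to a human when none exists."""
    action = RUNBOOKS.get(event["type"])
    if action is None:
        return f"no runbook for {event['type']}: page on-call"
    return action(event)


if __name__ == "__main__":
    print(handle({"type": "service_unhealthy", "service": "api", "host": "web-1"}))
    print(handle({"type": "certificate_expiring", "host": "web-1"}))
```

Because every response lives in version-controlled code, the same event gets the same response every time, and events with no runbook still reach a human.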

• Change Management

The majority of software issues are caused by changes, which creates a constant conflict of interest between traditional operations teams and development teams over change velocity and service stability. This is addressed by introducing SLIs, SLOs and error budgets. SREs constantly monitor and measure the service level objective for each identified service level indicator. The error budget is the amount of unreliability an SLO permits: a 99.9% availability target leaves a 0.1% budget for failures. If a team consistently exceeds its SLOs (for example, 99.9% availability for all services), it may be able to move faster and take on more risk, which allows it to have more product engineers than SREs. If, on the other hand, a team is in danger of missing or isn’t meeting its SLOs, that is a signal to back off and pause to focus on reliability, assigning more SREs to bring uptime back on track so that the team can resume fast feature development. SRE deployment methodologies encourage frequent, small, reversible changes.
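To make the error-budget idea concrete, here is a minimal Python sketch. It assumes availability is measured as the fraction of successful requests over a window, so the error budget is the number of failed requests the SLO tolerates. The function names, the request counts and the simple ship-or-pause policy are illustrative assumptions, not a prescribed method.

```python
def error_budget(slo_target: float, total_requests: int) -> float:
    """Number of failed requests the SLO allows over the window."""
    return (1.0 - slo_target) * total_requests


def budget_remaining(slo_target: float, total_requests: int, failed_requests: int) -> float:
    """Fraction of the error budget still unspent (negative when overspent)."""
    budget = error_budget(slo_target, total_requests)
    if budget == 0:
        # A 100% target leaves no budget at all: any failure overspends it.
        return 0.0 if failed_requests == 0 else float("-inf")
    return 1.0 - failed_requests / budget


def release_policy(remaining: float) -> str:
    """Hypothetical policy: ship while budget remains, otherwise focus on reliability."""
    return "ship features" if remaining > 0 else "pause and fix reliability"


if __name__ == "__main__":
    # Hypothetical window: 1,000,000 requests against a 99.9% availability SLO,
    # which allows roughly 1,000 failed requests.
    remaining = budget_remaining(0.999, 1_000_000, 400)
    print(f"budget remaining: {remaining:.0%} -> {release_policy(remaining)}")
```

With 400 failures against a budget of about 1,000, roughly 60% of the budget is left and the team can keep shipping; at 1,500 failures the budget is overspent and the policy switches to reliability work.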

• Incident & Problem Management

SREs are metrics driven, and because their goal is to keep uptime SLOs in check, it is very important to reduce the mean time to respond to and resolve production issues. SREs excel at both pre-mortem and post-mortem exercises. They conduct blameless RCAs and learn from operational failures to drive improvements.
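As a small illustration of tracking such a metric, the Python sketch below computes the mean time to resolve from a list of incidents. It assumes each incident is recorded as a pair of timestamps (detected, resolved); the function name and the sample incidents are hypothetical.

```python
from datetime import datetime, timedelta
from typing import List, Tuple


def mean_time_to_resolve(incidents: List[Tuple[datetime, datetime]]) -> timedelta:
    """Average of (resolved - detected) across all incidents."""
    if not incidents:
        raise ValueError("need at least one incident")
    total = sum((resolved - detected for detected, resolved in incidents), timedelta())
    return total / len(incidents)


if __name__ == "__main__":
    # Hypothetical incidents: one resolved in 30 minutes, one in 90 minutes.
    incidents = [
        (datetime(2021, 1, 4, 9, 0), datetime(2021, 1, 4, 9, 30)),
        (datetime(2021, 1, 6, 14, 0), datetime(2021, 1, 6, 15, 30)),
    ]
    print(f"mean time to resolve: {mean_time_to_resolve(incidents)}")
```

Tracking this number over time shows whether automation and post-mortem follow-ups are actually shortening incidents.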

Conclusion

If your organization has embarked on the DevOps journey and wants to speed up product delivery without negatively impacting the reliability of its products, think about building and implementing a Site Reliability Engineering (SRE) practice. The good news is that site reliability engineering is already present in all organizations under different names, forms and sizes. Some organizations call them DevOps teams that take care of reliability, some have operations teams, and some have production support teams. You just have to bring them under the SRE umbrella and make it an engineering-centric team with a sheer focus on reliability and agility. So, let us stop feeding machines human blood and automate away our operations function with SRE.
