January 19, 2021

DevOps 2.0: An Insight Into Site Reliability Engineering (SRE)

How often do you adopt a new fashion trend and stick with it forever?

We bet that hasn’t happened even once. Before one trend has even faded, a new one comes along, and you are all hyped about it. Isn’t that right?

The same thing happens with technology. Once a new one comes on board, it becomes the hottest trend! Take Site Reliability Engineering (SRE), the much-adored bridge between development and operations nowadays. By now, there must be a lot of questions in your mind.

Some of them may be:
“What is Site Reliability Engineering?”
“What are the SRE principles?”
“Is site reliability engineer a good job?”
“What is the role of a Site Reliability Engineer?”
“What are the similarities between SRE and DevOps?”

In this blog, we are going to answer all of the questions mentioned above. If you have any more questions, you can always type them in the comment section.

Do you know how much a Site Reliability Engineer gets paid? Salaries start from $136,836 per year. Can you believe this? Source: Indeed

But why are Site Reliability Engineers in such high demand? Let’s start with the definition.

What Is Site Reliability Engineering (SRE)?

Site Reliability Engineering is basically about creating a bridge between the development and operations departments. It is a discipline that incorporates aspects of software engineering and applies them to infrastructure and operations problems. The main goal is to create scalable and highly reliable software systems.

According to Benjamin Treynor, founder of Google’s Site Reliability Team, SRE is “what happens when a software engineer is tasked with what used to be called operations.”

So, where did the concept of SRE come from? To answer that, we have to go back to the year 2003. In that year, Benjamin Treynor was in charge of a production team whose end goal was to make Google websites more available so that they were always able to provide service.

Being a software engineer, Benjamin designed and ran the team the way he would have wanted it to work if he were a Site Reliability Engineer himself. He tasked the team with spending half of their time with the operations team so that they could understand the problems and contribute to development in a better way. The team Benjamin Treynor managed is now Google’s SRE team.

You might ask now: we already have DevOps dealing with both development and operations, so why do we need SRE? Is there any similarity between the two? Let’s look into the principles and key aspects of both to find out!

What is the relationship between SRE and DevOps?

From our previous blog, ITIL Vs. DevOps, you all know about DevOps already. Right? DevOps is basically a set of practices for building a culture of collaboration between the development and operations teams.

DevOps aims to achieve these 5 key points:

  • Reduce organizational silos
  • Accept failure as normal
  • Implement gradual changes
  • Leverage tooling and automation
  • Measure everything

The SRE principles are aligned so that each of the points above can be achieved. Let’s see how that can be done!

1. Reduce organizational silos:

  • SRE shares ownership with developers to create shared responsibility
  • SREs use the same tools that developers use, and vice versa

2. Accept failure as normal:

  • SREs embrace risk
  • SRE quantifies failure and availability in a prescriptive manner using Service Level Indicators and Service Level Objectives
  • SRE mandates blameless post mortems

3. Implement gradual changes:

  • SRE allows developers and product owners to function faster by reducing the cost of failure

4. Leverage tooling and automation:

  • SREs have the charter to automate menial tasks away

5. Measure everything:

  • SRE defines prescriptive ways to measure values
  • SRE fundamentally believes that systems operation is a software problem
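
To make the Service Level Indicators and Service Level Objectives from point 2 a little more concrete, here is a minimal Python sketch. It assumes an availability SLI defined as the share of successful requests in a measurement window; the `RequestCounts` type, the function names, and the numbers are all hypothetical and only there for illustration.

```python
from dataclasses import dataclass


@dataclass
class RequestCounts:
    """Hypothetical request totals for one measurement window."""
    total: int
    successful: int


def availability_sli(counts: RequestCounts) -> float:
    """Availability SLI: fraction of requests that were served successfully."""
    if counts.total == 0:
        # Assumption: a window with no traffic counts as fully available.
        return 1.0
    return counts.successful / counts.total


def meets_slo(sli: float, slo_target: float) -> bool:
    """The SLO is met when the measured SLI is at or above the target."""
    return sli >= slo_target


# Made-up numbers for illustration only.
window = RequestCounts(total=1_000_000, successful=999_250)
sli = availability_sli(window)
print(f"Measured availability: {sli:.4%}")
print("Meets a 99.9% SLO:", meets_slo(sli, slo_target=0.999))
```

Real services usually track more than one indicator (latency and error rate, for example), but the idea stays the same: measure a number, then compare it against an agreed target.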

We hope that clears up the confusion. Now, let’s see what a Site Reliability Engineer has to take care of.

What is the role of an SRE?

We have given you a brief idea of the job role of a Site Reliability Engineer.
Take a look at the following points for the details:

Site reliability engineers communicate with other engineers, product owners, and customers to come up with targets and measures. This helps them ensure system availability. Once everyone has agreed upon a system’s uptime and availability targets, it becomes easy to tell when it is time to take action.

They introduce error budgets to measure risk and to balance availability against feature development. When reliability targets are realistic, a team has the flexibility to deliver updates and improvements to a system.
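
As a rough illustration of how an error budget can be worked out, the Python sketch below treats the budget as the number of failed requests an SLO tolerates in a window. The function names, the 99.9% target, and the request counts are assumptions made up for this example, and the "ship only while budget remains" rule is one simple policy, not the only one.

```python
def error_budget(slo_target: float, total_requests: int) -> float:
    """Number of failed requests the SLO tolerates in the window."""
    if not 0.0 < slo_target < 1.0:
        # A 100% target leaves no budget at all, which is the kind of
        # unrealistic reliability target the text warns about.
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    return (1.0 - slo_target) * total_requests


def budget_remaining(slo_target: float, total_requests: int,
                     failed_requests: int) -> float:
    """Fraction of the error budget still unspent (negative if overspent)."""
    budget = error_budget(slo_target, total_requests)
    return 1.0 - failed_requests / budget


def can_ship_features(remaining: float) -> bool:
    """Hypothetical policy: keep releasing only while budget is left."""
    return remaining > 0.0


# Made-up numbers: a 99.9% SLO over one million requests.
remaining = budget_remaining(0.999, 1_000_000, failed_requests=400)
print(f"Error budget remaining: {remaining:.0%}")
print("OK to ship new features:", can_ship_features(remaining))
```

With a 99.9% target over one million requests, roughly 1,000 failures are tolerated, so 400 failures leave a bit more than half of the budget for risky changes such as new releases.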

SREs believe in reducing toil, which means automating tasks that would otherwise require a human operator to work manually.

A site reliability engineer should have an in-depth understanding of the systems and their connectivity.

Site reliability engineers have the task of discovering the problems early to reduce the cost of failure.

Conclusion

Remember how the Hand of the King used to handle everything on the king’s behalf? Back then, kings chose the most intelligent person on the council to be the Hand, because the Hand made sure everything was running smoothly and worked out strategy in collaboration with the king.

A Site Reliability Engineer plays the same role: the one on whom the entire project depends.

Do you think you’d be interested in choosing this career path? All you need to do is take up SRE training and apply for an SRE certification! So what do you think? Ready to get started?