By Adrian Gonciarz, QA / SRE Director
The origin of Site Reliability Engineering
Site Reliability Engineering (SRE) has its roots at Google, where Ben Treynor Sloss started the practice in 2003 when he was tasked with improving the performance, availability, and stability of the company’s systems.
Having a programmer’s background, Ben approached the task the way a software engineer would and applied methods common in code development rather than in operations. At that time DevOps culture had not yet taken hold, and the traditional division of labor still applied: developers handled only programming tasks, while deployment and maintenance were run by a separate team of administrators. He assigned half of his team members’ time to operational work; in the remaining time they concentrated on improving the codebase with tools that would make the software easier to monitor in the production environment. Site Reliability Engineering quickly became a standard approach across the whole organization and, years later, is one of the most important branches of the IT industry.
If we were to summarize in one sentence the aim of SRE in modern organizations, it would be to provide all necessary means to achieve the required availability (for example 99.99% of the time) of the system.
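To make a target such as 99.99% more tangible, it helps to translate it into the amount of downtime it still allows. The snippet below is a minimal sketch of that arithmetic; the function name and the 30-day default period are our own choices for illustration, not part of any particular tool.

```python
def allowed_downtime_minutes(availability_percent: float, period_days: float = 30) -> float:
    """Return how many minutes of downtime a target still permits in a period."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability_percent must be between 0 and 100")
    total_minutes = period_days * 24 * 60
    # The unavailable share of the period is whatever the target leaves over.
    return total_minutes * (1 - availability_percent / 100)


if __name__ == "__main__":
    for target in (99.0, 99.9, 99.99):
        budget = allowed_downtime_minutes(target)
        print(f"{target}% over 30 days allows {budget:.2f} minutes of downtime")
```

With these assumptions, 99.99% leaves roughly 4.3 minutes of downtime in a 30-day period, or about 52.6 minutes over a 365-day year.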
Who is a Site Reliability Engineer and what do they do?
A Site Reliability Engineer is someone who comes from a very wide background covering system administration, application development, software testing, and business analysis. Their main goal is to work closely with the development and operations (DevOps) teams to improve the resiliency, observability, and overall reliability of the system using programming methods.
Site Reliability Engineers analyze different layers of the system: from the infrastructure of the underlying machines and databases on which the environment runs, through deployment orchestration (Kubernetes) and applications’ memory and CPU consumption, to the highest layers of application functions such as HTTP requests, their latency, and errors.
They also utilize other sources of data such as logs, metrics, and error reporting tools: pretty much everything that gives them meaningful insight into the system’s health and performance. Four metrics in particular, latency, traffic, errors, and saturation, are commonly known as the Four Golden Signals of SRE.
More sophisticated statistical tools turn the gathered data into mathematical formulas that check specific parts of the system against potential outages, namely Service Level Indicators (SLIs) and Service Level Objectives (SLOs). An SLI is a measured value, such as the share of successful requests, and an SLO is the target that value has to meet. Using these, engineers can continuously monitor the status of particular endpoints or a whole system with easy-to-read green/yellow/red statuses.
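As a rough illustration of how an SLI, an SLO, and a traffic-light status fit together, here is a minimal sketch. It assumes an availability SLI defined as successful requests divided by all requests, plus a hypothetical warning threshold that sits above the objective; the endpoint name, the numbers, and the class and function names are made up for the example and do not reflect any specific monitoring product.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Slo:
    name: str       # hypothetical endpoint or service name
    target: float   # objective, e.g. 0.999 means 99.9% of requests succeed
    warning: float  # stricter threshold; below it we show yellow


def availability_sli(good_requests: int, total_requests: int) -> float:
    """SLI: the fraction of requests that succeeded."""
    if total_requests == 0:
        return 1.0  # assumption: no traffic counts as meeting the objective
    return good_requests / total_requests


def status(sli: float, slo: Slo) -> str:
    """Map a measured SLI onto a green/yellow/red status."""
    if sli < slo.target:
        return "red"      # objective violated
    if sli < slo.warning:
        return "yellow"   # objective met, but close to the limit
    return "green"


if __name__ == "__main__":
    slo = Slo(name="POST /orders", target=0.999, warning=0.9995)
    for good in (99_990, 99_930, 99_800):
        sli = availability_sli(good, 100_000)
        print(f"{slo.name}: SLI={sli:.4f} status={status(sli, slo)}")
```

Keeping the warning threshold separate from the objective is a design choice in this sketch: it lets a team see that an endpoint is drifting towards its limit before the objective is actually broken.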
One of the key assumptions of SRE is “zero toil”. In practice, it means that we, the engineers, want to eliminate the manual part of the work, which can be repetitive, boring, and prone to human error. Everything that can be automated should be automated. In a complex system with a huge number of components that scale up automatically, it would not be possible to properly implement the necessary mechanisms by hand.
How is SRE implemented at Kitopi?
Even though our SRE team consists of only a few engineers, we make sure that the principles of SRE are firmly applied across all components of the system. At the center of our activities lies giving ownership and power to the relevant teams. In our case, each development team is responsible for a part of Kitopi’s architecture, so it is critical that every part is properly monitored and that any potential degradation is quickly picked up and alerted on.
We use Dynatrace as a tool for most of our activities. We divided the components of the system into so-called Management Zones, each zone belonging to the relevant team. This way, the components owned by a team are monitored separately and their alerts are sent to the proper Slack channel. We rely heavily on automatically detected anomalies (such as an increased failure rate or slower response times), but for tracking applications' health we also use SLOs for the most important endpoints of the system. We had a lot of problems with false alarms triggered by oversensitive anomaly detection settings, so we cooperated closely with the development teams to make sure only meaningful problems were picked up and alerted on. Building a culture of reliability ownership in the teams significantly reduced the time needed to react to a degradation.
Recently, we’ve been using a new feature of Dynatrace, the Grail Engine, which allows us to use events and logs as sources of analytical information. In other words, we can get meaningful observability input by running computations on logs and events collected upon certain user actions in the system. It gives us a huge advantage in terms of observing trends in quickly fluctuating data.
Summary
Site Reliability Engineering is still a young, dynamically growing discipline of the IT industry. It focuses mainly on improving the reliability of systems through better observability, solid alerting, and tools such as SLIs and SLOs. An important part of a Site Reliability Engineer’s job is automating tasks and learning from past outages. They cooperate closely with developers and DevOps engineers.
At Kitopi we’re proud to have grown a mature culture of SRE in teams that take ownership of the reliability of their applications. And we’re not stopping there!