In part one of this mini-series, I wrote about how the old ways of systems administration is broken and shared a story from my personal experiences. Today I’ll be talking about how it’s evolved with Site Reliability Engineering (also known as Service Reliability Engineering or SRE).
SRE is born
Interestingly, this is a concept I was pushing back around 2008-2010, but unfortunately my ideas fell on deaf ears. How do Netflix, Facebook, Dropbox, and Google provide such reliable service? First I can tell you what they don’t do – they don’t throw more hardware or ram at the problem.
So how do you do it?
First everyone has to agree that service reliability is everyone’s problem. It is the reason our jobs exist! Management must agree, because the way a company thinks and operates needs to change from the top down, and the focus needs to be on service reliability.
Operations and Development teams need to be aligned so that outages are everyone’s problem. Remember the story in my previous blog post? What do you think would have happened if I had forced the developers to reboot the servers every 12 hours, then every 24 hours? Would it have taken weeks or months to resolve the problem? Of course not.
You need your Ops team to know how to code, and understand development. I’m not saying they have to actually be on Dev team, but they need to understand coding, speak the language, and script their way out of issues. Most importantly, they must be on the hook if things go wrong.
At Google and Facebook, their developers are on call. If a release suddenly causes a significant amount of paging/failures, they don’t hire more Ops employees. The developers fill the gap until the issue is fixed, which clearly happens very quickly.
No one wants to be getting paged all night, especially developers. If they’re getting paged due to a bug in their software, you can bet the issue will be resolved in days, not weeks. Making developers responsible for the service reliability means they are required to think about failure in their software development – they’ll have to design in graceful failure modes, and expect their underlying infrastructure to be unreliable. If their little widget takes down the entire site, you can be sure they’re going to be up until it’s corrected.
The bottom line is that software and hardware come together to provide a service, and one isn’t more important than the other. Major wins in reliability can be realized if you align software design, automation, and hardware design.
At Pythian, we offer this as a managed service. Our team has the experience, background, and skills to help get you to this promise land of high reliable, scalable systems. We are building a team of Site Reliability Engineers to help companies benefit and transition to this new paradigm in systems administration, where service reliability is everyone’s problem.