Service Reliability: How Systems Administration is Evolving – Part Two

Posted in: Technical Track

In part one of this mini-series, I wrote about how the old ways of systems administration is broken and shared a story from my personal experiences. Today I’ll be talking about how it’s evolved with Site Reliability Engineering (also known as Service Reliability Engineering or SRE).

SRE is born

Interestingly, this is a concept I was pushing back around 2008-2010, but unfortunately my ideas fell on deaf ears.  How do Netflix, Facebook, Dropbox, and Google provide such reliable service? First I can tell you what they don’t do – they don’t throw more hardware or ram at the problem.

So how do you do it?

First everyone has to agree that service reliability is everyone’s problem.  It is the reason our jobs exist! Management must agree, because the way a company thinks and operates needs to change from the top down, and the focus needs to be on service reliability.

Operations and Development teams need to be aligned so that outages are everyone’s problem. Remember the story in my previous blog post? What do you think would have happened if I had forced the developers to reboot the servers every 12 hours, then every 24 hours? Would it have taken weeks or months to resolve the problem? Of course not.

You need your Ops team to know how to code, and understand development. I’m not saying they have to actually be on Dev team, but they need to understand coding, speak the language, and script their way out of issues. Most importantly, they must be on the hook if things go wrong.

At Google and Facebook, their developers are on call. If a release suddenly causes a significant amount of paging/failures, they don’t hire more Ops employees. The developers fill the gap until the issue is fixed, which clearly happens very quickly.

No one wants to be getting paged all night, especially developers. If they’re getting paged due to a bug in their software, you can bet the issue will be resolved in days, not weeks. Making developers responsible for the service reliability means they are required to think about failure in their software development – they’ll have to design in graceful failure modes, and expect their underlying infrastructure to be unreliable. If their little widget takes down the entire site, you can be sure they’re going to be up until it’s corrected.

The bottom line is that software and hardware come together to provide a service, and one isn’t more important than the other. Major wins in reliability can be realized if you align software design, automation, and hardware design.

At Pythian, we offer this as a managed service. Our team has the experience, background, and skills to help get you to this promise land of high reliable, scalable systems. We are building a team of Site Reliability Engineers to help companies benefit and transition to this new paradigm in systems administration, where service reliability is everyone’s problem.

email

Author

Want to talk with an expert? Schedule a call with our team to get the conversation started.

2 Comments. Leave new

Hemant K Chitale
June 26, 2014 4:18 am

“everyone has to agree that service reliability is everyone’s problem”. Good rule.
As for developers fixing production issues, I see the resolution different. Production / Operations should not accept code / configuration that they haven’t validated and proven themselves. See my comments in the previous blog post — have an “Interface” team in Operations that reviews, monitors and measures what is running in Test before it is permitted to go into Production. As you rightly say the Operations team must be able to understand Development and should NOT be “I don’t care to see what you do in Development”.

Reply

Agreeing that Service Reliability is every one’s problem – is only the beginning. People should come out of their hard-shells and accept it whole-heartedly and discuss and sort out the issue at hand. Every one should understand that the service availability/reliability depends both on software and hardware. And these two together are the basic building blocks to provide a rugged and reliable service.

Reply

Leave a Reply

Your email address will not be published. Required fields are marked *