The world of systems administration is changing, and it’s affecting everyone involved. Today’s blog post is the first of two in mini-series dedicated to Service Reliability: How Systems Administration is Evolving.
The days of formulas that tell us, You need “X” system admins, for every “Y” physical server, and every “Z” VMs are coming to a close. Even the world of IT management is changing.
Why? Because as scale increases, it’s simply impossible to continue at the pace we were at. Google saw this in the mid 2000’s and began the next evolution of systems administration. They recognized that there was no way they could scale up the way things had been managed for decades. In fact, Netflix came to a similar conclusion.
Interestingly, what Google did has nothing to do with technology, rather to do with the philosophy of systems administration. They started a new group which they originally called Production Engineering, and was later renamed Site Reliability Engineering, also known as Service Reliability Engineering or SRE. At its core, SRE changes the fundamental thinking of IT management. It recognizes site reliability as everyone’s responsibility. Some might say that’s obvious, but in the past it wasn’t.
The old way is broken
Most companies have two very separate and distinct groups. Operations and Development. Historically these two groups are highly siloed, and in some cases, do not get along very well. Why? It comes down to philosophy, really.
Operations folks are driven to ensure systems are up, secure, and reliable. Developers, on the other hand, are driven to create cool new features and applications. Here lies one of the biggest problems.
Years back I worked as an Operations Director, and had a counterpart on the development side who was the Software Engineering Director. We had just completed releasing a major update for one of our platforms, and very quickly we saw we had major issues. Our primary application servers (25+ physical boxes) were becoming unstable after about 12 hours of production load (I won’t go into why this happened, that’s a story for another day.) We quickly identified this, so the Ops team began rebooting these boxes in a rolling fashion. They were boxes that had some specialized hardware in them, and starting/stopping, then testing them took about 15-30 minutes each. We had a team of about 5 people, which was not a 24/7 group. Clearly this caused significant pain for our Operations staff. We determined that part of the problem was a memory leak. Due to the nature of the release, rolling back simply was not an option.
The initial response I received was that we would just have to deal with it for now, as there were a few other pressing issues they wanted to resolve first. After many sleepless nights and lost weekends, we finally were able to get a update so the systems only needed to be rebooted daily, 7 days a week. It stayed this way for months.
But why? It was because the software team, and the management we both reported to, was far more interested in hitting deadlines for features, and new functionality – not how much sleep, or how many days off our Ops employees were getting. I was told on more than one occasion that high availability and recovery were Ops problems, not Development problems.
The core of this problem is simple. Development felt that service reliability was 100% an Operations problem. Our new release takes 2x more ram? Add more ram to 100 servers! Our new application requires 20 new servers? Sure, with some work it could be cut down to 2-3, but just get the 20 servers. That’s easy!
Without naming names, has anyone else faced this issue? Comment below. Stay tuned for part two, where I’ll be discussing the birth of SRE, how it’s allowed systems administration to evolve, and how to achieve it.