The world of systems administration is changing, and it’s affecting everyone involved. Today’s blog post is the first of two in a mini-series dedicated to Service Reliability: How Systems Administration is Evolving.
The days of formulas that tell us “you need X system admins for every Y physical servers and every Z VMs” are coming to a close. Even the world of IT management is changing.
Why? Because as scale increases, it’s simply impossible to continue at the pace we were at. Google saw this in the mid-2000s and began the next evolution of systems administration. They recognized that there was no way they could scale up the way things had been managed for decades. In fact, Netflix came to a similar conclusion.
Interestingly, what Google did had less to do with technology than with the philosophy of systems administration. They started a new group, which they originally called Production Engineering and later renamed Site Reliability Engineering, also known as Service Reliability Engineering or SRE. At its core, SRE changes the fundamental thinking of IT management. It recognizes site reliability as everyone’s responsibility. Some might say that’s obvious, but in the past it wasn’t.
The old way is broken
Most companies have two very separate and distinct groups: Operations and Development. Historically these two groups have been highly siloed and, in some cases, have not gotten along very well. Why? It comes down to philosophy, really.
Operations folks are driven to ensure systems are up, secure, and reliable. Developers, on the other hand, are driven to create cool new features and applications. Here lies one of the biggest problems.
Years back I worked as an Operations Director and had a counterpart on the development side, the Software Engineering Director. We had just released a major update for one of our platforms, and very quickly we saw we had major issues. Our primary application servers (25+ physical boxes) were becoming unstable after about 12 hours of production load. (I won’t go into why this happened; that’s a story for another day.) Once we identified this, the Ops team began rebooting the boxes in a rolling fashion. The boxes had some specialized hardware in them, and stopping, starting, and then testing each one took about 15-30 minutes. Our team was about 5 people and was not staffed 24/7, so this clearly caused significant pain for our Operations staff. We determined that part of the problem was a memory leak. Due to the nature of the release, rolling back simply was not an option.
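To put rough numbers on that pain, here is a back-of-the-envelope sketch in Python. It uses only the figures from the story (25 boxes as a lower bound for "25+", 15-30 minutes per box, about 12 hours before instability) and assumes, purely for illustration, that boxes were rebooted strictly one at a time. The function name is hypothetical.

```python
def rolling_reboot_hours(boxes: int, minutes_per_box: float) -> float:
    """Hours needed to reboot every box once, one box at a time (assumed)."""
    return boxes * minutes_per_box / 60


BOXES = 25          # the story says "25+", so treat this as a lower bound
STABLE_HOURS = 12   # approximate time under load before a box became unstable

for minutes in (15, 30):
    hours = rolling_reboot_hours(BOXES, minutes)
    share = hours / STABLE_HOURS
    print(f"{minutes} min/box: {hours:.2f} h per pass, "
          f"{share:.0%} of the {STABLE_HOURS} h window")
```

Under those assumptions a single pass takes between 6.25 and 12.5 hours, so at the slow end one pass barely finishes before the first box needs attention again. Rebooting several boxes in parallel would shorten this, but the story does not say how many could be taken out of service at once.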
The initial response I received was that we would just have to deal with it for now, as there were a few other pressing issues they wanted to resolve first. After many sleepless nights and lost weekends, we finally got an update so the systems only needed to be rebooted once a day, 7 days a week. It stayed this way for months.
But why? It was because the software team, and the management we both reported to, were far more interested in hitting deadlines for features and new functionality than in how much sleep, or how many days off, our Ops employees were getting. I was told on more than one occasion that high availability and recovery were Ops problems, not Development problems.
The core of this problem is simple. Development felt that service reliability was 100% an Operations problem. Our new release takes 2x more RAM? Add more RAM to 100 servers! Our new application requires 20 new servers? Sure, with some work it could be cut down to 2-3, but just get the 20 servers. That’s easy!
Without naming names, has anyone else faced this issue? Comment below. Stay tuned for part two, where I’ll be discussing the birth of SRE, how it’s allowed systems administration to evolve, and how to achieve it.
I have seen similar issues.
Also, I have worked both sides of this issue, Dev and Ops, many times simultaneously. Many times I have thought that having some experience in Ops is a good thing for developers to have, and vice versa.
That kind of experience leads to insights into issues that might otherwise only be learned of after the fact (firefighting).
Barring that, there needs to be more collaboration between the two groups.
This does work both ways.
One of the limitations that Dev teams run into for testing is a lack of resources.
While Dev boxes may have limited data sets, not as much storage as prod, etc., testing and QA systems should be pretty much mirror copies of production, with requisite amounts of data.
If the Dev effort is for a new system, the future Prod environment could be used.
Testing is often inadequate not due to lack of diligence of Dev, but due to lack of sufficient resources provided by Ops.
And Ops may not have the budget to provide this.
Solution? Ops and Dev collaborate and present this to management, with projected costs.
The costs should include the cost of HW, setup, licenses, etc., but more importantly, the cost of *not* doing so.
War stories: Major SAP upgrade. SAP team decides to eliminate a HW app server and replace it with a Virtual Server. This has not been tested.
What could go wrong? Ops objected strenuously, but to no avail.
It took two weeks to fix this after production went live.
Same upgrade: I found an issue with some materialized views in a reporting database. These were refreshed daily from the SAP system. So the SAP team lead said to drop them. “No one uses those.”
I happened to know differently, but again my objections were to no avail.
A couple of days later the R&D engineers were asking me why their data was now 3 days old.
Hubris strikes again.
I have worked in smaller organisations where the Production / Operations team also manages the Development environment. There, the Production environment is much better prepared for whatever is delivered by Development. Unfortunately, the silo mentality is what I see in my current organisation. It leads to frustration (intense frustration) on both sides. Why can’t Operations have an “Interface” team that is prepped a few weeks in advance, so that Operations actually knows what is coming down the line and gets access to and monitors the Test environment to see the applications / code / methods being used before they are ported to Production? Why is it expected that Operations doesn’t understand what Development is doing?