I’ll be blogging more about reliability in the coming days and weeks, and I’m warming up with this post by pointing out two worthwhile reads on the subject.
First, Lifehacker.com has a great post entitled Fix the Machine, not the Person, and from my vantage point, they are totally right. Here’s a sample quote of the kind of valuable and clear thinking on display:
An organization is not just a pile of people, it’s also a set of structures. It’s almost like a machine made of men and women. Think of an assembly line. If you just took a bunch of people and threw them in a warehouse with a bunch of car parts and a manual, it’d probably be a disaster. Instead, a careful structure has been built: car parts roll down on a conveyor belt, each worker does one step of the process, everything is carefully designed and routinized. Order out of chaos.
And when the system isn’t working, it doesn’t make sense to just yell at the people in it—any more than you’d try to fix a machine by yelling at the gears. True, sometimes you have the wrong gears and need to replace them, but more often you’re just using them in the wrong way. When there’s a problem, you shouldn’t get angry with the gears—you should fix the machine.
This was published just a few weeks ago on a very high-profile site, so you’ve probably seen it. It reminded me, though, of a more important, beautiful, and hugely inspirational piece of writing that influenced me a great deal in the founding days of Pythian, one you probably haven’t seen because it’s a bit older. It is anything but obsolete, so I’m happy to draw your attention to it. It’s called They Write the Right Stuff and was published in Fast Company in 1996. Here’s a taste:
The group writes software this good because that’s how good it has to be. Every time it fires up the shuttle, their software is controlling a $4 billion piece of equipment, the lives of a half-dozen astronauts, and the dreams of the nation. Even the smallest error in space can have enormous consequences: the orbiting space shuttle travels at 17,500 miles per hour; a bug that causes a timing problem of just two-thirds of a second puts the space shuttle three miles off course.
NASA knows how good the software has to be. Before every flight, Ted Keller, the senior technical manager of the on-board shuttle group, flies to Florida where he signs a document certifying that the software will not endanger the shuttle. If Keller can’t go, a formal line of succession dictates who can sign in his place.
Bill Pate, who’s worked on the space flight software over the last 22 years, says the group understands the stakes: “If the software isn’t perfect, some of the people we go to meetings with might die.”
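The "three miles off course" figure in that excerpt is easy to sanity-check. Here is a minimal sketch in Python, assuming a constant speed and that the entire timing error translates into distance travelled; the function name `drift_miles` is my own, not something from the article:

```python
# Back-of-the-envelope check of the quoted figures. Assumes constant speed
# and that the whole timing error shows up as distance travelled.
SECONDS_PER_HOUR = 3600


def drift_miles(speed_mph: float, timing_error_s: float) -> float:
    """Distance covered, in miles, during a timing error at constant speed."""
    return speed_mph / SECONDS_PER_HOUR * timing_error_s


if __name__ == "__main__":
    # 17,500 mph and two-thirds of a second, as quoted in the article
    print(round(drift_miles(17_500, 2 / 3), 2))
```

Under those assumptions the result is about 3.24 miles, which matches the article's rounded figure.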
Imagine how these circumstances would focus your mind on the disciplines and processes you need to adopt to create genuinely reliable systems. And imagine how little “yelling at someone” would do to improve reliability when the stakes are that high.
Then, remember how the megatrends shaping our industry have relentlessly ratcheted up the stakes in database infrastructure management in the fifteen years since They Write the Right Stuff was written. (I’ll write more on this in the coming days.) And realize that our success as production engineers, today and in the coming decade, will depend on adopting the systems and disciplines needed to create breakthrough quality and reliability. That’s a big part of what we’re working on at Pythian.