An interesting question on human mistakes was posted in the DBA Managers Forum discussions today.
As human beings, we sometimes make mistakes. How do you make sure that your employees won't make mistakes and cause downtime, data loss, etc. on your critical production systems?
I don't think we can avoid this technically; probably working procedures are the solution.
I’d like to hear your thoughts.
I typed up my thoughts, and as I was finishing I figured it made sense to post them on the blog too, so here we go…
The keys to preventing mistakes are low stress levels, clear communication, and established processes. That's not a complete list, but I think these are the top things that reduce the number of mistakes we make managing data infrastructure, or, for that matter, working in any critical environment, be it IT administration, aviation engineering, or surgical medicine. It's also a matter of personality fit: depending on the balance you need between mistake tolerance and agility, you will favor hiring one individual over another.
Regardless of how much you try, there are still going to be human errors, and you have to account for them in your infrastructure design and processes. The real disasters happen when many things align, like several failures combined with a few human mistakes. The challenge is to find the right balance between the effort invested in making no mistakes and the effort invested in making your environment error-proof, to the point where the risk of a human mistake is acceptable to the business.
Those are the general ideas.
Just a few examples of practical solutions that prevent mistakes when it comes to Oracle DBA work:
- test production actions on a test system before applying them in production
- have a policy that every production change is reviewed by another senior member of the team
- a "watch over my shoulder" policy when working on production environments, i.e. a second pair of eyes at all times
- employee training, such as a database recovery bootcamp
- the discipline of performing routine work under non-privileged accounts (a minimal sketch of this follows the list)
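A minimal sketch of that last point, assuming an Oracle environment (the role and user names are made up for illustration): routine monitoring and reporting sessions can run under a restricted account instead of SYSDBA, so a fat-fingered command has far less reach.

    -- hypothetical restricted role for day-to-day monitoring work
    CREATE ROLE monitoring_dba;
    GRANT CREATE SESSION TO monitoring_dba;
    GRANT SELECT ANY DICTIONARY TO monitoring_dba;
    -- routine sessions use this account, not SYS or a SYSDBA connection
    CREATE USER jsmith_ro IDENTIFIED BY change_me;
    GRANT monitoring_dba TO jsmith_ro;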
Some of the items that limit the impact of mistakes:
- multiple controlfiles for an Oracle database, in case a DBA manually does something bad to one of them; I have seen this happen (see the example after this list)
- a standby database with delayed recovery, or Flashback Database (for Oracle)
- an architecture with no single point of failure (SPOF)
- Oracle RAC, a MySQL high-availability setup (such as sharding or replication), or a SQL Server cluster: architecture examples that limit the impact of a human mistake affecting a single hardware component
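To illustrate the first two items (the paths are invented, and the exact steps depend on version and environment): controlfiles can be multiplexed across separate locations, and Flashback Database gives you a way to rewind past a logical mistake.

    -- multiplex controlfiles across separate mount points (takes effect after a restart;
    -- the controlfile copies must exist at the new paths before that restart)
    ALTER SYSTEM SET control_files =
      '/u01/oradata/PROD/control01.ctl',
      '/u02/oradata/PROD/control02.ctl',
      '/u03/oradata/PROD/control03.ctl'
      SCOPE=SPFILE;

    -- enable Flashback Database (requires ARCHIVELOG mode and a flash recovery area)
    ALTER DATABASE FLASHBACK ON;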
Both lists could go on for a very long time. An older article authored by Paul Vallee is very relevant to this topic: The Seven Deadly Habits of a DBA…and how to cure them.
Feel free to post your thoughts and examples. How do you approach human mistakes in managing production data infrastructure?
14 Comments
Business Automation Software has helped in this respect, since it allows you to "automate" many of the mundane, often-repeated processes that sysadmins tend to perform on autopilot; that is when mistakes are typically made, as the brain is not focused on the task at hand.
Whilst these software solutions (from HP, BMC and others) don't eliminate human error completely, they allow you to templatize many things, and when things do go awry, you have checkpoints to roll back to a sane state. These systems can also be used to enforce adherence to security policy and business process.
They're often complex to implement and relatively costly, but when implemented properly they can have a very strong return on investment, and they are effective not just in the DB domain but across entire chunks of the datacenter.
Thanks for the reply, Tony. I agree.
There are also much cheaper and simpler solutions addressing a smaller scope: selective automation of certain tasks. For example, space management is one of the most routine tasks, yet I have observed numerous incidents caused by a storage administrator's fat finger during the mundane operation of adding more space.
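As one hedged illustration of that point (the tablespace name, path, and sizes are invented): letting a datafile grow automatically in controlled increments up to a hard cap removes some of the repeated manual resize operations where fat fingers tend to strike.

    -- illustrative only: grow in fixed increments with a hard upper bound
    ALTER TABLESPACE users ADD DATAFILE '/u02/oradata/PROD/users02.dbf'
      SIZE 1G AUTOEXTEND ON NEXT 256M MAXSIZE 32G;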
In addition to testing production actions on a test system, always include rollback procedures. If something manages to go wrong in spite of your testing, being able to return quickly to a steady state minimizes the damage and allows for thorough, non-panic mode research into the problem.
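A minimal sketch of what T.J. describes, with hypothetical object names: every change script is stored next to a matching rollback script, and both get a run on the test system before the production window.

    -- change.sql (reviewed and rehearsed on the test system first)
    ALTER TABLE orders ADD (status VARCHAR2(20) DEFAULT 'NEW');

    -- rollback.sql (kept alongside the change and tested the same way)
    ALTER TABLE orders DROP COLUMN status;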
Good point, T.J.
Another relevant policy is to always have a clear escalation process and to make sure everyone knows when to trigger it and ask for help.
Yeah – and you need people who actually care…
Pride is a major driver. People who pride themselves as professionals take steps on their own to prevent failures. No matter how much protection you build, a DBA without that pride factor will find a hole to slip through. What you have mentioned so far is true, but it is not enough when you have people who rely on finding the loopholes.
One of the DBAs I used to manage didn't attend to a critical situation. When asked, he retorted, "But I was not given training on how to operate the BlackBerry." True story!
Human error is what logical backups are for.
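For example (the schema, directory, and file names here are hypothetical, and credentials are omitted), a scheduled Data Pump export gives you something to restore selectively after someone deletes or mangles the wrong data:

    # logical export of a single application schema
    expdp system schemas=APP_OWNER directory=DATA_PUMP_DIR \
          dumpfile=app_owner.dmp logfile=app_owner_exp.log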
There is no conflict between letting developers do whatever they want and guarding the database from it. It is entirely possible for gentle communication to reconcile the two.
What about SPQR architecture? (An obscure joke: long ago, Joe Celko pointed out that the Romans did major construction that has lasted millennia without an engineering discipline.)
I should also add that there should be "a healthy fear of the return key", a concept coined internally at Pythian by our CEO Andrew Waitman. It is another way of saying: think twice, act once.
https://www.oxforddictionaries.com/definition/ohnosecond
After 10 years as a DBA, I agree with Tanel's and Arup's approach. I can add two things to that:
– from a manager's perspective: people will act smarter and more carefully if they reckon there will be a prize for good work and some kind of punishment if they do something wrong. Of course, both have to be reasonable and appropriate for the situation.
– from a technical point of view: scripts can help prevent fat-finger errors (a small sketch follows).
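One hedged way to build that into a script, assuming SQL*Plus and an invented script name: make the operator re-type the instance name before anything destructive runs.

    -- cleanup.sql: confirm the target instance before proceeding
    ACCEPT confirm_sid CHAR PROMPT 'Type the instance name to confirm: '
    WHENEVER SQLERROR EXIT FAILURE
    BEGIN
      IF '&confirm_sid' <> SYS_CONTEXT('USERENV', 'INSTANCE_NAME') THEN
        RAISE_APPLICATION_ERROR(-20001, 'Instance name mismatch - aborting.');
      END IF;
    END;
    /
    -- ...the destructive statements would only follow here...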
Re "a healthy fear of the return key" => this comes only with sound AAA (authentication, authorization, auditing) at both the OS and DB levels. Just the fact that you can go back in time and "replay" what has been done on the systems significantly reduces incidents in the "caused by change" category. Of course, not entirely => it has happened to me in every job at least once ("shutdown immediate… damn, that was the other session/window").
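One small habit that helps with exactly that "wrong window" scenario (a suggestion, not necessarily the commenter's own setup): put the user and connect identifier into the SQL*Plus prompt, for example via glogin.sql, so every window announces where it is connected.

    -- glogin.sql: show who and where before every command
    SET SQLPROMPT "_USER'@'_CONNECT_IDENTIFIER> "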
Re "Oracle RAC… architecture examples that limit impact of human mistakes affecting a single hardware component" => I tend to disagree on this one. The number of errors (including human errors) introduced by the Oracle RAC components far outweighs the increased availability due to protection against hardware failure. Even at the theoretical level, adding one more component with a non-zero chance of failure reduces the availability calculation. Real-life examples more than confirm it. BTW, this topic is also mentioned in "Pro Oracle Database 10g RAC on Linux: Installation, Administration, and Performance" (ISBN 978-1590595244), page 12: "…In fact, it is highly likely that the level of availability achieved for a clustered database will be less than that achieved for a single-instance database, because there are more components in the product stack for a clustered database…"
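To put rough numbers on the serial-availability argument (the figures are invented for illustration): if two components that must both work are each 99.9% available, the combined availability is about 0.999 × 0.999 ≈ 99.8%, roughly doubling the expected downtime compared with either component on its own.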
I would add one additional generic “preventive measure” => standardize-test-document-communicate-repeat.
I actually wouldn't disagree that introducing RAC is more complex than I pictured it, but that's a whole different discussion. I referenced it more as an example because it is a commonly known technical solution for reducing the impact of failure of certain components. Again, I used it mostly because of the common perception that RAC works this way (and I agree there is more to it).
[…] err is human and to manage err is management. Alex Gorbachev fascinates us by adding his thoughts to a forum […]
[…] to read these 2 good articles on the net. Handling Human Errors – Alex Gorbachev The Seven Deadly Habits of a DBA…and how to cure them – Paul […]