Kevin recently mentioned one very nice blog. I was going through some posts there, and this entry reminded me of a story. I’m sure many of you can recall similar cases.
I worked on one site for a while, and in 2.5 years it didn’t face a single media corruption of Oracle datafiles. Not that it was a low-profile site – quite the opposite, and the storage infrastructure was set up very well there – no SPOF, mirroring inside SAN boxes and between boxes, redundant switches, HBAs, controllers, you name it – a DBA’s dream. Even change management procedures were followed thoroughly.
But one day, my fellow DBA (who is usually extremely cautious and reviews his actions at least twice) overwrote a controlfile with some crap. Even the fact that the controlfiles were on raw devices didn’t prevent this disaster. It was a trivial error, as we found out later – the DBA mistakenly swapped the arguments of a tar command (like “tar cvf * file.tar” instead of “tar cvf file.tar *”), and tar happily used the controlfile as a tape device. :) End result – a 10-minute outage while I was figuring out what had happened, dd’ing the controlfile image from another mirror and starting the instance. By the way, it was a RAC database and, of course, RAC didn’t help – to the surprise of some managers.
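To see why the swapped tar arguments were so destructive, here is a minimal Python sketch of what the shell hands to tar after expanding the unquoted `*`. The file names are hypothetical (I don’t have the real device names any more), and the sort is a simplification of the shell’s locale-aware ordering. The point is only that with “cvf”, whatever word lands right after the option cluster becomes the archive that tar writes to.

```python
def shell_expand(words, directory_listing):
    """Roughly mimic how a POSIX shell expands an unquoted * before running a command."""
    argv = []
    for word in words:
        if word == "*":
            # The shell substitutes the sorted names that do not start with a dot.
            # (Simplification: a real shell sorts by locale collation.)
            matches = sorted(n for n in directory_listing if not n.startswith("."))
            # If nothing matches, the pattern is passed through unchanged.
            argv.extend(matches if matches else [word])
        else:
            argv.append(word)
    return argv


def tar_archive_target(argv):
    """With 'tar cvf ...', the word right after the option cluster is the archive."""
    return argv[2]


# Hypothetical directory contents, for illustration only.
listing = ["control01.ctl", "control02.ctl", "redo01.log"]

wrong = shell_expand(["tar", "cvf", "*", "file.tar"], listing)
right = shell_expand(["tar", "cvf", "file.tar", "*"], listing)

print("swapped:", tar_archive_target(wrong))   # swapped: control01.ctl
print("intended:", tar_archive_target(right))  # intended: file.tar
```

In the swapped form, the first name the glob matches takes the place of the archive, so tar opens it for writing and overwrites it. tar never sees the `*` itself, so it has no way to tell that anything is wrong.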
So they were kind of protected by multiplexed controlfiles, even though recovery wasn’t transparent (wouldn’t it be nice if Oracle could survive the loss of a minority of multiplexed controlfiles – just like CRS with voting disks?). Interestingly, the online redo logs were not multiplexed, and recovery could have been a bit trickier had the current redo log been overwritten. The reason was that they already had quadruple mirroring, and people were blindly ignoring the human factor and Mr. Murphy – “it must be enough if we already mirrored it 4 times”.
What do we see? Well-implemented protection against one class of problems, while obvious threats from another side are ignored. Perhaps that’s because all kinds of vendors make a fuss about their technology and its importance, while nobody focuses attention on the areas that require little investment but are just as important, or even more so.
In my experience, human factor risk is one area that is heavily underestimated most of the time.
So what are your stories?