While I was at that fine conference in Scotland, one of our clients did some maintenance on their Windows server where several databases were running. We had just begun supporting this machine, and hadn’t had a chance to test a reboot. And for some reason, backup/recovery wasn’t — until recently — a DBA responsibility at that organization, so that wasn’t under our supervision.
So one fine evening, there was a scheduled maintenance, and one of the databases didn’t shut down cleanly (thanks to misconfigured Windows services, if I recall correctly). Consequently, the database crashed and later didn’t come back up. That’s a bit odd: crash recovery should have worked with no problems, but instead the database required media recovery.
My team-mate Neil tried recovery, and found that the database was requesting two-week-old archivelogs. Weird. We tried to restore them from tape, but you know how it goes when someone else does the backups and knows how the tape manager is configured, and all those details. After a while it was clear that rushing didn’t make any sense in the middle of the night, and the storage people were not available until the morning.
When I looked at it in the morning, the error message rang a bell: “Datafile 1 needs media recovery” in combination with the request for very old archivelogs. My immediate guess was that the database had been started with an old copy of the controlfile (I had seen that happen before, after someone messed around with relocations and screwed up init.ora).
On closer examination, we figured out that the controlfile SCN was actually current while the SCNs of the datafiles were way off. There were no copies of the datafiles on the server so it seemed like someone had restored the datafiles. Weird…
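After more detailed investigation, Alex Fatkulin figured out that the database had been put into backup mode two weeks earlier and, bingo, the datafile headers were frozen. With the header checkpoint SCNs stuck at the moment backup mode began, recovery after the crash needed every archivelog generated since then. (By the way, Alex has just joined Pythian and my team. A great addition, I should say!)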
An attempt to restore all the archivelogs failed: a few gaps couldn’t be restored from tape. What a surprise! To this day, we can’t fully explain why that happened, or what was going on with the backups. But at least the responsibility for backup/recovery is moving to the DBA team. Who would have thought of that? ;-)
The moral of the story: do not leave datafiles in backup mode. If you use hot backups outside of RMAN, such as snapshot technologies, take care to implement monitoring so that the database doesn’t stay in backup mode for long. We usually set up this check in our monitoring tool when backup mode is used.
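For illustration, here is a rough sketch of what such a check could look like. It is not the check from our monitoring tool. It assumes you have already fetched rows from Oracle's `V$BACKUP` view with whatever driver you use; the function name `stale_backup_mode_files` and the four-hour threshold are made up for this example.

```python
from datetime import datetime, timedelta

# Rows are assumed to come from a query like this one, run through
# whatever Oracle driver is in use (the driver call is not shown here).
# V$BACKUP reports STATUS = 'ACTIVE' for files in backup mode, and
# TIME is when backup mode began for that file.
BACKUP_MODE_QUERY = """
SELECT file#, time
  FROM v$backup
 WHERE status = 'ACTIVE'
"""


def stale_backup_mode_files(rows, now, max_age=timedelta(hours=4)):
    """Return (file_no, age) pairs for files in backup mode longer than max_age.

    rows    -- iterable of (file_no, began_at) tuples, began_at a datetime
    now     -- current time as a datetime (passed in so the check is testable)
    max_age -- hypothetical threshold; pick one that fits your backup window
    """
    stale = []
    for file_no, began_at in rows:
        age = now - began_at
        if age > max_age:
            stale.append((file_no, age))
    return stale
```

A monitoring job could run the query on a schedule and raise an alert whenever the function returns a non-empty list. The threshold should sit comfortably above the longest legitimate backup or snapshot window, otherwise the check will fire during normal backups.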
Another moral: let everyone do his job. Database backup/recovery is part of the DBA’s responsibilities.
Another interesting story is how someone lost 5 databases, but that might be a good topic for another post.