While I was at that fine conference in Scotland, one of our clients did some maintenance on their Windows server where several databases were running. We had just begun supporting this machine, and hadn’t had a chance to test a reboot. And for some reason, backup/recovery wasn’t — until recently — a DBA responsibility at that organization, so that wasn’t under our supervision.
So one fine evening, there was a scheduled maintenance, and one of the databases didn’t shutdown cleanly (thanks to mis-configured Windows services, if I recall correctly). Consequently, the database crashed and later didn’t come back up. That’s a bit odd — crash-recovery should have worked with no problems, but instead it required media recovery.
My team-mate Neil tried recovery, and found that the database was requesting two-weeks-old archivelogs. Weird. We tried to restore from tape, but you know how it goes if it’s someone else who does the backups and knows how tape manager is configured, and all those details. After a while is was clear that rushing didn’t make any sense in the middle of the night, and the storage people were not available until the morning.
When I looked at it in the morning, the error message rang a bell: “Datafile 1 needs media recovery” in combination with the request for very old archivelogs. My immediate guess — the database is started with an old copy of the controlfile (I had seen that happen before, after someone messed around with relocations and screwed up init.ora).
On closer examination, we figured out that the controlfile SCN was actually current while the SCNs of the datafiles were way off. There were no copies of the datafiles on the server so it seemed like someone had restored the datafiles. Weird…
After more detailed investigation, Alex Fatkulin figured out that the database had been put into backup mode two weeks ago and, bingo, the datafile headers were frozen. (By the way, Alex has just joined Pythian and started in my team. A great addition I should say!)
An attempt to restore all archivelogs failed: a few gaps couldn’t be restored from tape. What a surprise! Anyway, to this day, we can’t fully explain why that happened, or what was going on with backups. But, at least the responsibility for backup/recovery is moving to the DBA team. Who would have thought of that? ;-)
The moral of the story: do not leave datafiles in backup mode. If you use hot backups outside of RMAN, such as snapshot technologies, take care to implement monitoring so that the database doesn’t stay in backup mode for much time. We usually set up this check in our monitoring tool when backup mode is used.
Another moral: let everyone do his job. Database backup/recovery is part of the DBA’s responsibilities.
Another interesting story is how someone lost 5 databases, but that might be a good topic for another post.
13 Comments. Leave new
All you probably had to do was take the datafiles out of backup mode :
alter database datafile x end backup ;
and then open the db …
Seen this many times and it’s one of the reasons you should always use RMAN
of failover clusters !
Imaging a fail-over during backup using begin/end backup commands … it would
fail ;-)
cheers,
Kurt
If the datafile was in backup mode it will show up in v$backup view.
Then you can just mount the database and issue
alter database datafile ‘xxxxx’ end backup;
for all the datafiles in backup mode
and the issue
alter database open;
If the datafile was in backup mode it will show up in v$backup view with status=’YES’
Then you can just mount the database and issue
alter database datafile ‘xxxxx’ end backup;
for all the datafiles in backup mode
and the issue
alter database open;
You can also just say
alter database end backup;
if all the files were in backup mode
at that point oracle will copy the hot backup scn over to checkpoint scn.
When a datafile is put in backup mode the full header is not frozen. Only the checkpoint scn is frozen. There is a hot backup scn in the header which keeps incrementing. When end backup is issued this hot backup scn is copied over to the checkpoint scn.
“My immediate guess — the database is started with an old copy of the controlfile ”
Wont’t that result in
Datafile is too new or datafile is in future kind of error instead of file needs recovery error ?
Did you guys really try to recover a database before checking to see if it was still in backup mode?
Making me a little nervous here people.
Amit,
Regarding end backup – I should have mentioned that end backup obviously was tried but ended up with ORA-600 during crash recovery.
Wont’t that result in Datafile is too new or datafile is in future kind of error instead of file needs recovery error ?
I recall that the message is confusingly the same “Datafile 1 needs media recovery” during open resetlogs.
Kurt,
Indeed, failover point you made is very good. Cheers.
Did you guys really try to recover a database before checking to see if it was still in backup mode?
We had all reasons to believe that RMAN had been used for backups. Actually, placing database in the backup mode wasn’t part of regular backup activity. Looks like someone was trying to perform something extraordinary at that time and left DB in backup mode. In any case, the decision was to postpone actions until the right people are available and to have some time for more investigations. And rightly so.
Making me a little nervous here people.
Heh… Easy to say when you already know the outcome and you can always say – “wasn’t that obvious to any knowledgeable DBA?”. Few times I have faced the situations when I was mad on myself after the issue was found – how did I manage to miss such an obvious detail.
The only reason to be nervous that I can see is that DBA’s were intentionally not involved into backup and recovery procedures and it seems that responsible group didn’t exercise test restore. Other than that, there was no crazy actions done in the middle of the night. Investigation was done and the root cause determined. Existing pieces of the database were preserved. The most appropriate restore strategy was followed.
I do have few concerns and suspicious why we had ORA-600 on crash recovery but I believe I should leave it aside.
Have had that a couple of time but not for such a long period.
in 9i on mount mode you do have
alter database end backup;
I have checks in my shutdown scripts that specifically check for this condition before shutting down adn starting up.
also have reports that come out daily in my daily checklists to validate the fact.
since we use emc bcv’s we have had a number of incidences where the scripts got hung due to bad establish causing us issues of all sort. better be proactive then get stuck finding tapes
Fuad,
Agree – monitoring is the way to go and we do the same with all databases that we know have a chance to be in backup mode. There are more disadvantages of having a database in backup mode than just problem with crash recovery.
>ended up with ORA-600 during crash recovery
Chocolate ORA-600’s ?
[…] (try it I am sure that you wont leave it) script for performance tuning. Alex Gorbachev wrote about what will happen when you leave database in backup mode with a real story. Another stylish DBA of Pythian Group Christo Kutrovsky wrote about his […]
>ended up with ORA-600 during crash recovery
Chocolate ORA-600’s ?
It seems to be contagious!