The story you are about to read is based on actual events. Names and paths have been changed to protect the innocent. I call this scenario “The Perfect Storm” because it took just the right combination of events and configurations. Sadly, this doesn’t make it an unlikely occurrence, so I’m posting it here in hopes that you’ll be able to save yourselves before it’s too late.
I have always had a preternatural dislike for using REDUNDANCY as a retention policy for Oracle RMAN, greatly preferring RECOVERY WINDOW instead, simply because REDUNDANCY doesn’t really guarantee anything valuable to me, whereas RECOVERY WINDOW guarantees that I’ll be able to do a point-in-time recovery to any time within the past x days. Plus, I had already been burned once by a different client using REDUNDANCY. With the story I’m about to tell, this dislike has turned into violent hatred. I’m going to be light on the technical details, but I hope you’ll still feel the full pain.
First, some table setting:
- Standalone 10.2.0.2 instance (no RAC, no DataGuard/Standby)
- RMAN retention policy set to REDUNDANCY 2
- Backups stored in the Flash Recovery Area (FRA)
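For the record, that combination boils down to configuration along these lines (the FRA path and size here are illustrative, not the actual values):

RMAN> CONFIGURE RETENTION POLICY TO REDUNDANCY 2;
SQL> ALTER SYSTEM SET db_recovery_file_dest_size = 100G;
SQL> ALTER SYSTEM SET db_recovery_file_dest = '/path/to/flash_recovery_area';

With REDUNDANCY 2, any backup of a datafile becomes obsolete as soon as two newer backups of that file exist, and Oracle considers obsolete files in the FRA fair game for automatic deletion when space runs low.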
A few months ago, we had a datafile corruption on this relatively new instance (data had been migrated from an old server about a week prior). The on-call DBA followed up the page by checking for corruptions in the datafile with this command:
RMAN> backup check logical datafile '/path/to/foobar_data.dbf';
This, my friends, led to the major fall, though we did not know it for many hours. You see, the FRA was already almost full, and when the FRA is nearly full, Oracle automatically deletes obsolete files to free up space. That last backup command, while only intended to check for logical corruption, actually performed a backup of the file, which rendered the earliest backup of the file obsolete since there were now two newer copies. That earliest backup happened to be from the level 0 backup from which we would later want to restore.
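For what it’s worth, if all you want is the corruption check and not another backup piece, the VALIDATE option reads and checks the datafile without writing any backup at all, so nothing new is created to push older backups into obsolescence:

RMAN> backup validate check logical datafile '/path/to/foobar_data.dbf';

Same corruption check, no new copy counted against the retention policy.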
Of course, at first we didn’t know why the file was missing. Logs showed that it was on disk no less than two hours before the problem started. Later, scanning the alert log for the missing backup filename yielded this:
Deleted Oracle managed file /path/to/flash_recovery_area/FOO_DB/backupset/2008_12_01/o1_xxxx.bkp
Oracle deleted the one backup file that we needed!
Even worse, it wasn’t until that Monday night that we realized the level 0 backup taken the previous weekend had never made it to tape because of a failure on the NetBackup server. The failure was reported as part of Monday morning’s routine log checks, but the missing files had still not been pushed to tape.
In the end, we were able to drop the tablespace and restore it to a previous point in time on a test instance from another backup file, then exp/imp the data back over. It was ugly, but it got things back online. Many DBAs better than I gave their all on this mission.
To summarize, the ingredients:
- Oracle RMAN
- CONFIGURE RETENTION POLICY TO REDUNDANCY 2;
- Flash Recovery Area near full, obediently deleting obsolete files.
- Tape backup failure
Add in an innocent backup command and . . . BOOM! Failure Surprise.
The two biggest points to take away are:
- Tape backup failures are still serious backup failures and should be treated as such, even if you back up to disk first.
- REDUNDANCY is not a viable retention policy. In my house, it is configuration non grata.
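Switching is a one-liner. A recovery window (7 days here as an example; pick a number that matches your recovery SLA) guarantees point-in-time recovery to anywhere within that window, no matter how many backups happen to exist:

RMAN> CONFIGURE RETENTION POLICY TO RECOVERY WINDOW OF 7 DAYS;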
Memories, like the corner of my mind.
Misty water-colored memories.
Of the way we were
Of the smiles we left behind
Smiles we gave to one another
For the way we were
Can it be that it was all so simple then?
Or has time re-written every line?
If we had the chance to do it all again
Tell me, would we? Could we?
No. We definitely could not, should not do it all again.
I’ve never liked redundancy-based retention; recovery window always seemed more precise. There are a couple of oddities here, though. My understanding is that the datafile backup should have failed when it found the corruption unless configured to ignore it; also, having the backupsets contain one file each seems to be required.
I’ve seen the tape backup thing as well, working at a site with a 4-day SLA to fix tape failures and a three-day archivelog disk retention policy. Had some interesting conversations with the storage guys on that one.
what about – backup validate check logical …?
Thank you so much for this precious information.
I am starting to use recovery window instead of redundancy for my RMAN backups but I have a concern:
Every once in a while my backups will fail and it takes a couple of days for a technician to notice this. If my recovery window is 2 days and I don’t take a backup for 3 days then it will automatically delete my only good backup!!! Is this incorrect? Will it have the intelligence not to delete the last backup?
Late to the party, but in answer to Michael B above: RMAN should *not* delete any backup files (backups of archive redo logs, data files, control files, etc.) that would be required to meet your “recovery window”.
For example, if you have a recovery window of 3 days and your last good backup of the datafiles was a week ago, that backup will not be deleted until you have a newer backup that is itself more than 3 days old and can therefore satisfy the window on its own.
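If you want to convince yourself of this, you can ask RMAN to show its work. With a recovery window configured, REPORT OBSOLETE lists only the backups that are no longer needed to recover to any point inside the window, so a sole surviving backup older than the window will not appear in it:

RMAN> CONFIGURE RETENTION POLICY TO RECOVERY WINDOW OF 3 DAYS;
RMAN> REPORT OBSOLETE;
RMAN> DELETE OBSOLETE;

(DELETE OBSOLETE prompts for confirmation by default and only removes what REPORT OBSOLETE lists.)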