RMAN Redundancy is not a Viable Retention Policy


The story you are about to read is based on actual events. Names and paths have been changed to protect the innocent. I call this scenario “The Perfect Storm” because it took just the right combination of events and configurations. Sadly, this doesn’t make it an unlikely occurrence, so I’m posting it here in hopes that you’ll be able to save yourselves before it’s too late.

I have always had a preternatural dislike for using REDUNDANCY as a retention policy for Oracle RMAN, greatly preferring RECOVERY WINDOW instead. REDUNDANCY only promises to keep a fixed number of backups of each file, which doesn’t guarantee anything valuable to me, whereas RECOVERY WINDOW guarantees that I’ll be able to do a point-in-time recovery to any time within the past x days. Plus, I had already been burned once by a different client using REDUNDANCY. With the story I’m about to tell, this dislike has turned into violent hatred. I’m going to be light on the technical details, but I hope you’ll still feel the full pain.

First, some table setting:

  • Standalone 10.2.0.2 instance (no RAC, no Data Guard/Standby)
  • RMAN retention policy set to REDUNDANCY 2
  • Backups stored in the Flash Recovery Area (FRA)

A few months ago, we had a datafile corruption on this relatively new instance (data had been migrated from an old server about a week prior). The on-call DBA followed up the page by checking for corruptions in the datafile with this command:

RMAN> backup check logical datafile '/path/to/foobar_data.dbf';

This, my friends, led to the major fall, though we did not know it for many hours. You see, the FRA was already almost full, and when space runs short Oracle automatically deletes obsolete files from the FRA to make room. That last backup command, while only intended to check for logical corruption, did actually perform a backup of the file. With REDUNDANCY 2, that new copy rendered the earliest backup of the file obsolete, since there were now two newer copies. That earliest backup happened to be part of the level 0 backup from which we would later want to restore.
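
For comparison, here is a sketch of a validate-only form of the same check (not what was run that night; the path is the same placeholder as above). Adding VALIDATE makes RMAN read the datafile and check it for physical and logical corruption without writing a backup piece, so no new copy is created to count against the retention policy. Any corrupt blocks it finds are recorded in V$DATABASE_BLOCK_CORRUPTION.

```
# VALIDATE: read and check the file only; no backup piece is written
RMAN> backup validate check logical datafile '/path/to/foobar_data.dbf';
```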

Of course, at first we didn’t know why the file was missing. Logs showed that it was on disk no less than two hours before the problem started. Later, scanning the alert log for the missing backup filename yielded this:

Deleted Oracle managed file /path/to/flash_recovery_area/FOO_DB/backupset/2008_12_01/o1_xxxx.bkp

Oracle deleted the one backup file that we needed!

Even worse, it wasn’t until that point, on a Monday night, that we realized the level 0 backup taken the previous weekend had never been pushed to tape because of a failure on the NetBackup server. The problem had been reported as part of Monday morning’s routine log checks, but the missing files had still not been pushed to tape.

In the end, we were able to drop and restore the tablespace to a previous point in time on a test instance from another backup file and exp/imp data back over. It was ugly, but it got things back online. Many DBAs better than myself gave their all on this mission.

To summarize, the ingredients:

  1. Oracle RMAN
  2. CONFIGURE RETENTION POLICY TO REDUNDANCY 2;
  3. Flash Recovery Area near full, obediently deleting obsolete files.
  4. Tape backup failure

Add in an innocent backup command and . . . BOOM! Failure Surprise.

The two biggest points to take away are:

  1. Tape backup failures are still serious backup failures and should be treated as such, even if you back up to disk first.
  2. REDUNDANCY is not a viable retention policy. In my house, it is configuration non grata.
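
A minimal sketch of the alternative, assuming a 7-day window (the number is only an illustration; pick one that matches your own recovery requirements and disk space). REPORT OBSOLETE lists the backups that the current policy considers obsolete. Those are candidates for automatic deletion when the FRA runs short of space, so it is worth a look before taking any extra ad hoc backup.

```
# Show the policy currently in effect
RMAN> show retention policy;

# Switch to a recovery window; 7 days is an assumed value
RMAN> configure retention policy to recovery window of 7 days;

# List backups the current policy no longer needs
RMAN> report obsolete;
```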

About the Author

Oracle database administrator for The Pythian Group, headquartered in Ottawa, Ontario, Canada. I am located in Manitowoc, Wisconsin, USA. OCP 10gR2 DBA

8 Comments

Bradd Piontek
March 4, 2009 5:57 pm

Memories, like the corner of my mind.
Misty water-colored memories.
Of the way we were
Scattered pictures,
Of the smiles we left behind
Smiles we gave to one another
For the way we were
Can it be that it was all so simple then?
Or has time re-written every line?
If we had the chance to do it all again
Tell me, would we? could we?

No. We definitely could not, should not do it all again.


I’ve never liked redundancy-based retention; recovery window always seemed more precise, although there are a couple of oddities here. My understanding is that the datafile backup should have failed when it found the corruption unless configured to ignore it. Also, having the backupsets contain one file each seems to be required.

I’ve seen the tape backup thing as well, working at a site with a 4-day SLA to fix tape failures and a three-day archivelog disk retention policy. Had some interesting conversations with the storage guys on that one.

Log Buffer #138: A Carnival of the Vanities for DBAs
March 6, 2009 12:05 pm

[…] Having covered IBM, let’s switch to Oracle and Eric Emrick, where he talks about database continuity for a while, and then let’s go over to Pythian’s Don Seiler and his talk on how RMAN isn’t good for retention policies, in his blog RMAN Redundancy is not a Viable Policy. […]

RMAN Redundancy is not a Viable Retention Policy « die Seilerwerks
August 19, 2009 2:17 pm

[…] leave a comment » Originally posted on The Pythian Group blog. […]


what about – backup validate check logical …?


Thank you so much for this precious information.


I am starting to use recovery window instead of redundancy for my RMAN backups but I have a concern:

Every once in a while my backups will fail and it takes a couple of days for a technician to notice this. If my recovery window is 2 days and I don’t take a backup for 3 days then it will automatically delete my only good backup!!! Is this incorrect? Will it have the intelligence not to delete the last backup?


Late to the party but in answer to Michael B above, RMAN should *not* delete any backup files (backups of Archive Redo Logs, data files, control files etc) that would be required to meet your “recovery window”.

For example, if you have a recovery window of 3 days and your last good backup of the datafiles was a week ago, this backup will not be deleted (ever) until you’ve got another one that is older than 3 days but newer than the previous one.

Make sense?

Cheers,

Ian
