Fun with Oracle Data Guard Broker


Issues during the switchover to standby database

The beginning

So, there we were, attending to our jobs and tending to our customers, when an emergency alert came in: a production server had gone down. All bells and whistles were sounded and an emergency bridge call was opened. A hardware failure had brought the server down and the repair time was unknown, so we had to fail over to the standby database to restore the service. This is a simple physical standby database with Data Guard Broker in place to take care of the basics.

Simply said and simply done. For reasons out of the scope of this post, we disabled the Broker and manually opened the database and all went well, except that the Broker had at some point changed the log_archive_dest_1 initialisation parameter to 'location=USE_DB_RECOVERY_FILE_DEST valid_for=(STANDBY_LOGFILE,STANDBY_ROLE)'. This was the first problem, but a known one from our previous tests, so we simply changed it to 'location=USE_DB_RECOVERY_FILE_DEST valid_for=(ALL_LOGFILES,ALL_ROLES)' and everything worked as expected. Kids, test your DR scenarios while you still can :).
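For reference, the fix amounted to something like the following, run on the newly opened primary (a sketch; the destination string is the one from our configuration):

```sql
-- The Broker had left the destination valid only for the standby role;
-- widen it so the new primary can archive its own logs.
ALTER SYSTEM SET log_archive_dest_1=
  'location=USE_DB_RECOVERY_FILE_DEST valid_for=(ALL_LOGFILES,ALL_ROLES)'
  SCOPE=BOTH;
```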

The rebuilding

For reasons out of the scope of this blog post, flashback database was not enabled on the primary side, so after the crash and a few hours of production data being added to the new primary database, we decided that the best way to continue from here was to rebuild the old primary as a standby and execute a switchover to get everything back to where it was before the outage.

The main caveat here is that the servers are located on opposite sides of the Atlantic Ocean and this is quite a database, with some 2 TiB of data that does not compress much during backup, plus a few GiB of archived logs generated daily. This means that the standard procedure to rebuild a standby database would incur a severe delay: the time to take a new full backup of the new primary, copy it over to the new standby site, and then ship the subsequent logs to get both sides in sync.

Enter Oracle documentation. I find that Oracle documentation is some of the best I've had the opportunity to work with, although it is getting a bit cluttered and occasionally incoherent due to the many different features, options, new versions and partial updates that are only visible in MOS. Well, it may not be the best anymore, but I still use it a lot as a trusted source of information.

So there is a procedure for rebuilding a primary database as a standby from its own backup and then syncing it up with the primary. You can see the documentation in the following link.

And off we go. Get the restore scripts ready, kick off the restore and wait for it to complete.

In case someone is curious, here are the scripts we used: simple, sloppy scripts that do the job.

First, the bash script that wraps the RMAN call and allows us to execute it in nohup mode. You know, the network connection always drops at the most interesting moment.


#!/bin/bash
# Wrapper around RMAN so the restore can run under nohup.
# Assumes ORACLE_SID is already set in the environment and takes the
# name of the RMAN command file (without the .rcv extension) as $1.

RMAN_COMMAND=$1
EMAIL_RECIPIENT=[email protected]

export NLS_DATE_FORMAT='YYYY-MM-DD:hh24:mi:ss'

if [ ! -d "log" ]; then
  mkdir log
fi

. oraenv <<< $ORACLE_SID

TS=`date +'%Y%m%d-%H.%M'`
WORK_DIR="$( cd "$( dirname "${BASH_SOURCE[0]}" )" && pwd )"
RMAN_LOG="${WORK_DIR}/log/${RMAN_COMMAND}-${TS}.log"

echo "Begin RMAN ${RMAN_COMMAND} on ${ORACLE_SID}"
rman cmdfile=${WORK_DIR}/${RMAN_COMMAND}.rcv log=${RMAN_LOG}
echo "#"
echo "End RMAN ${RMAN_COMMAND} on ${ORACLE_SID}"
mail -s "RMAN script ${RMAN_COMMAND} completed on ${ORACLE_SID}" ${EMAIL_RECIPIENT} < ${RMAN_LOG}

exit 0

And this is the very basic RMAN script.

connect target /
connect catalog rman/[email protected]
run {
  SET UNTIL SCN 123456789;
  RESTORE DATABASE;
}

After the restore completed, it was time for the recovery in which, yes, you guessed right, an archived log was missing and more fun was to be had.

Of course, there was a backup of the missing archived log on the site we were trying to rebuild, so we restored it and applied it to the standby but then, surprise! See the ORA-00308 "cannot open archived log" error towards the end of the snippet below:

SQL> recover standby database;
ORA-00279: change 14943816301 generated at 04/09/2018 00:02:23 needed for thread 1
ORA-00289: suggestion : +FRA/conpro_uk2/archivelog/2018_04_11/thread_1_seq_15211.2442.973138947
ORA-00280: change 14943816301 for thread 1 is in sequence #15211

Specify log: {=suggested | filename | AUTO | CANCEL}
ORA-00279: change 14943827295 generated at 04/09/2018 00:07:18 needed for thread 1
ORA-00289: suggestion : +FRA/conpro_uk2/archivelog/2018_04_11/thread_1_seq_15212.458.973138947
ORA-00280: change 14943827295 for thread 1 is in sequence #15212
ORA-00278: log file '+FRA/conpro_uk2/archivelog/2018_04_11/thread_1_seq_15211.2442.973138947' no longer needed for this recovery

ORA-00279: change 14943841236 generated at 04/09/2018 00:13:56 needed for thread 1
ORA-00289: suggestion : +FRA
ORA-00280: change 14943841236 for thread 1 is in sequence #15213
ORA-00278: log file '+FRA/conpro_uk2/archivelog/2018_04_11/thread_1_seq_15212.458.973138947' no longer needed for this recovery

ORA-00308: cannot open archived log '+FRA'
ORA-17503: ksfdopn:2 Failed to open file +FRA   <-- at this point logs from 15213 are gone from FRA
ORA-15045: ASM file name '+FRA' is not in reference form

ORA-01547: warning: RECOVER succeeded but OPEN RESETLOGS would get error below
ORA-01152: file 1 was not restored from a sufficiently old backup
ORA-01110: data file 1: '+DATA/conpro_uk2/datafile/system.298.973129969'

Enter Oracle bugs. Yes, ladies and gentlemen, upon restoring the missing archived log from the backup, the database started some FRA clean-up process that deleted the newest archived logs! For these, of course, there was no backup, as they had been created between the last backup and the crash. For those interested in the details, here is the bug we believe we may have swallowed while riding our bikes: "Bug 17370174 : NEW ARCHIVELOG FILES ARE DELETED FIRST FROM THE FRA".
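Restoring the missing archived log from its backup was itself a one-liner, something along these lines (the sequence number here is hypothetical):

```sql
RMAN> restore archivelog sequence 15210;
```

It was this apparently harmless restore that triggered the FRA clean-up described above.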

So, here we are: the old primary, now an almost-new standby, is missing some archived logs needed to sync up with the current primary.
Fortunately, these archived logs had been shipped to and applied on the old standby, now the primary, before the crash, so we were able to restore them and ship them to the other side of the Atlantic, and everything was up and running again, including Data Guard Broker… after a few hours.

And, after applying some eighty archived logs, here we go again: ORA-00365.

Media Recovery Log +FRA/PRIMARY/archivelog/2018_04_11/thread_1_seq_15291.2126.973169799
Errors with log +FRA/PRIMARY/archivelog/2018_04_11/thread_1_seq_15291.2126.973169799
MRP0: Background Media Recovery terminated with error 365
Errors in file /apps/oracle/diag/rdbms/PRIMARY/PRIMARY/trace/PRIMARY_pr00_13538.trc:
ORA-00365: the specified log is not the correct next log
Managed Standby Recovery not using Real Time Apply
Recovery interrupted!
MRP0: Background Media Recovery process shutdown (PRIMARY)

This issue was more complicated. Another bug entered the game, "Bug 19181583 – Allow Recover Until Time for a Redo Log that has the activation bit set (Doc ID 19181583.8)", and a severity 1 SR was opened with Oracle. We also escalated internally to our IPC to take a look.

Working around this one required creating a fresh standby controlfile from the current primary, shipping it over to the standby server and replacing the current controlfiles with it.
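In outline, that part of the workaround looked something like this (the file path is illustrative):

```sql
-- On the current primary:
SQL> ALTER DATABASE CREATE STANDBY CONTROLFILE AS '/tmp/standby.ctl';

-- After shipping the file to the standby server, with the standby
-- instance started in NOMOUNT:
RMAN> restore standby controlfile from '/tmp/standby.ctl';
RMAN> alter database mount;
```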

Also, as ASM is involved, we had to do some RMAN magic to get the controlfile pointing to the right datafiles.

Not the cleanest process; it might have been better to start a bit further down the ASM directory tree to avoid trying to catalog files like the ASM password file but, again, it did the job.
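A narrower pattern would have skipped the non-database files altogether; something like this (the path below is illustrative):

```sql
-- Catalog only the datafile copies, leaving the password file,
-- spfile and online logs out of the scan.
RMAN> catalog start with '+DATA/PRIMARY/DATAFILE/';
```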

RMAN> catalog start with '+DATA';

Starting implicit crosscheck backup at 11-APR-18
using target database control file instead of recovery catalog
allocated channel: ORA_DISK_1
channel ORA_DISK_1: SID=1037 device type=DISK
allocated channel: ORA_DISK_2
channel ORA_DISK_2: SID=3 device type=DISK
allocated channel: ORA_DISK_3
channel ORA_DISK_3: SID=118 device type=DISK
allocated channel: ORA_DISK_4
channel ORA_DISK_4: SID=234 device type=DISK
Crosschecked 1111 objects
Finished implicit crosscheck backup at 11-APR-18

Starting implicit crosscheck copy at 11-APR-18
using channel ORA_DISK_1
using channel ORA_DISK_2
using channel ORA_DISK_3
using channel ORA_DISK_4
Finished implicit crosscheck copy at 11-APR-18

searching for all files in the recovery area
cataloging files...
cataloging done

List of Cataloged Files
File Name: +FRA/PRIMARY/ARCHIVELOG/2018_04_11/thread_1_seq_15211.2442.973138947
File Name: +FRA/PRIMARY/ARCHIVELOG/2018_04_11/thread_1_seq_15212.458.973138947
File Name: +FRA/PRIMARY/ARCHIVELOG/2018_04_11/thread_1_seq_15291.2126.973169799
File Name: +FRA/PRIMARY/CONTROLFILE/current.256.922462525

searching for all files that match the pattern +DATA

List of Files Unknown to the Database
File Name: +DATA/orapwasm
File Name: +DATA/PRIMARY/spfileCONPRO.ora
File Name: +DATA/TEST/CONTROLFILE/Current.257.922321133

Do you really want to catalog the above files (enter YES or NO)? YES
cataloging files...
cataloging done

List of Cataloged Files
File Name: +DATA/PRIMARY/TEMPFILE/TEMP2.272.973170913
File Name: +DATA/PRIMARY/TEMPFILE/TEMP2.271.973170913

List of Files Which Where Not Cataloged
File Name: +DATA/orapwasm
  RMAN-07517: Reason: The file header is corrupted
File Name: +DATA/PRIMARY/spfileCONPRO.ora
  RMAN-07518: Reason: Foreign database file DBID: 0  Database Name:
File Name: +DATA/PRIMARY/ONLINELOG/group_20.381.948561469
  RMAN-07529: Reason: catalog is not supported for this file type
File Name: +DATA/PRIMARY/ONLINELOG/group_21.380.948561487

RMAN> list copy of database;

List of Datafile Copies

Key     File S Completion Time Ckp SCN    Ckp Time
------- ---- - --------------- ---------- ---------------
248     1    A 11-APR-18       14947795005 09-APR-18
        Name: +DATA/PRIMARY/datafile/system.298.973129969

231     2    A 11-APR-18       14947795005 09-APR-18
        Name: +DATA/PRIMARY/datafile/sysaux.315.973127749

251     3    A 11-APR-18       14947795005 09-APR-18
       Name: +DATA/PRIMARY/datafile/sfsdata.302.973129281

246     93   A 11-APR-18       14947795005 09-APR-18
        Name: +DATA/PRIMARY/datafile/sfsdata.300.973129815

RMAN> switch database to copy;

datafile 1 switched to datafile copy "+DATA/PRIMARY/datafile/system.298.973129969"
datafile 2 switched to datafile copy "+DATA/PRIMARY/datafile/sysaux.315.973127749"
datafile 3 switched to datafile copy "+DATA/PRIMARY/datafile/bodata_new.295.973130253"
datafile 4 switched to datafile copy "+DATA/PRIMARY/datafile/users.358.973111005"
datafile 5 switched to datafile copy "+DATA/PRIMARY/datafile/wsidx.285.973130519"
datafile 6 switched to datafile copy "+DATA/PRIMARY/datafile/wsdata.284.973130527"
datafile 7 switched to datafile copy "+DATA/PRIMARY/datafile/gt_data.283.973130533"

RMAN> report schema;

RMAN-06139: WARNING: control file is not current for REPORT SCHEMA
Report of database schema for database with db_unique_name PRIMARY

List of Permanent Datafiles
File Size(MB) Tablespace           RB segs Datafile Name
---- -------- -------------------- ------- ------------------------
1    9000     SYSTEM               ***     +DATA/PRIMARY/datafile/system.298.973129969
2    27610    SYSAUX               ***     +DATA/PRIMARY/datafile/sysaux.315.973127749
3    1876     BODATA_NEW           ***     +DATA/PRIMARY/datafile/bodata_new.295.973130253
4    32715    USERS                ***     +DATA/PRIMARY/datafile/users.358.973111005
5    100      WSIDX                ***     +DATA/PRIMARY/datafile/wsidx.285.973130519
92   10240    SFSDATA              ***     +DATA/PRIMARY/datafile/sfsdata.302.973129281
93   10240    SFSDATA              ***     +DATA/PRIMARY/datafile/sfsdata.300.973129815

List of Temporary Files
File Size(MB) Tablespace           Maxsize(MB) Tempfile Name
---- -------- -------------------- ----------- --------------------
1    8000     TEMP2                8000        +DATA/STANDBY/tempfile/temp2.360.967677721
2    1000     LMTEAMTEMP           30000       +DATA/STANDBY/tempfile/lmteamtemp.318.967677721
3    500      MGMTTEMP             30000       +DATA/STANDBY/tempfile/mgmttemp.309.967677721
4    500      LICMANTEMP           30000       +DATA/STANDBY/tempfile/licmantemp.345.967

After this, a restart of the database got it back up, working and applying archived logs again, although with a little noise in the alert log.

The Data Guard switchover

So, we finally got to the point where we were in the normal DR situation; that is, the original standby is now our primary and vice versa.

After getting a maintenance window, the applications were brought down, the switchover was about to be initiated and… here we go again.

DGMGRL> show database old-primary

Database - old-primary

  Role:            PHYSICAL STANDBY
  Intended State:  APPLY-ON
  Transport Lag:   11 minutes 3 seconds <-- This means that something is not right, but let's hope that DG Broker takes care of it.
  Apply Lag:       0 seconds
  Real Time Query: OFF
  Instance(s):
    PRIMARY

Database Status:
SUCCESS

DGMGRL> switchover to old-primary;
Performing switchover NOW, please wait...
Error: ORA-16775: target standby database in broker operation has potential data loss <-- No, it doesn't

Unable to switchover, primary database is still "new-primary"

It was the Broker changing log_archive_dest_1 again. We fixed it, the missing archived logs were shipped and now, yes, this is the one… no.
The old primary failed to shut down. We believe it was something like "Bug 20139391 – Dataguard Broker calls CRS to restart for Databases not being managed by CRS (Doc ID 20139391.8)", although that was not exactly our case. We were running out of time and the maintenance window had already been stretched way too far.
Our IPC got involved here again and helped complete a manual switchover after getting rid of the Broker, and all was fine.

The end, well… not really

During all this hassle we were unable to actually troubleshoot what was wrong with Data Guard Broker. Was it a series of bugs? Or was there something wrong in our, quite old, configuration?

So we set up a dummy Data Guard Broker configuration with an empty database and started our testing. Another IPC was involved here to help us figure out what might be wrong, and the first finding hit right in the forehead: we were using a bequeath connection for dgmgrl, which is the source of many issues during a switchover.
The Broker requires a static listener service registered on both ends of the replica and uses it to connect to the remote database to perform shutdowns and startups during the change of roles.
The funny thing is that it is able to connect to the remote database to shut it down, but it is not able to connect to start it up again, hence the role switch fails.
During our tests, the Broker was able to simply continue the process after we manually started/mounted the remote database but we didn’t get a chance to test this during the production outage.
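The static registration the Broker relies on looks roughly like this in listener.ora on each site (the names and path below are made up for illustration; the GLOBAL_DBNAME follows the db_unique_name_DGMGRL convention):

```
SID_LIST_LISTENER =
  (SID_LIST =
    (SID_DESC =
      (GLOBAL_DBNAME = CONPRO_DGMGRL)
      (ORACLE_HOME = /apps/oracle/product/11.2.0/dbhome_1)
      (SID_NAME = CONPRO)
    )
  )
```

With this in place, the Broker can reach the remote instance through the listener even when the database itself is down, which is exactly what a startup during a role change requires.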

So: avoid bequeath connections with dgmgrl; it will do you no harm and will save you a few headaches.

The second important finding is related to how the Data Guard Broker collects database configuration parameters and sets them upon startup.

In the alert log of the databases we can see the following commands, executed by DGB as part of its startup process.

ALTER SYSTEM SET log_archive_format='PRIMARY_%t_%r_%s.dbf' SCOPE=SPFILE SID='PRIMARY';
ALTER SYSTEM SET standby_file_management='AUTO' SCOPE=BOTH SID='*';
ALTER SYSTEM SET archive_lag_target=1800 SCOPE=BOTH SID='*';
ALTER SYSTEM SET log_archive_max_processes=4 SCOPE=BOTH SID='*';
ALTER SYSTEM SET log_archive_min_succeed_dest=1 SCOPE=BOTH SID='*';

So, it is the Broker that changes the parameters related to a standby configuration. But why these?
According to Oracle documentation:

Associated with each database are various properties that the DMON process uses to control the database’s behavior. The properties are recorded in the configuration file as a part of the database’s object profile that is stored there. Many database properties are used to control database initialization parameters related to the Data Guard environment.

To ensure that the broker can update the values of parameters in both the database itself and in the configuration file, you must use a server parameter file to control static and dynamic initialization parameters. The use of a server parameter file gives the broker a mechanism that allows it to reconcile property values selected by the database administrator (DBA) when using the broker with any related initialization parameter values recorded in the server parameter file.

When you set values for database properties in the broker configuration, the broker records the change in the configuration file and propagates the change to all of the databases in the Data Guard configuration.

So, we've learned the hard way that once the Broker is in charge, we must use the Broker to change standby-related parameters or it will revert the changes at the next startup. In this case, we were unlucky to have a setting in place that added even more noise to the overall situation.
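In other words, instead of ALTER SYSTEM, standby-related settings should be changed through the corresponding Broker properties, for example (the database name below is a placeholder):

```sql
DGMGRL> edit database 'new-primary' set property ArchiveLagTarget=1800;
DGMGRL> edit database 'new-primary' set property StandbyFileManagement='AUTO';
```

The Broker then records the change in its configuration file and keeps the matching initialization parameters in sync across all databases in the configuration.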

Final thoughts

At the end of the day, even the best make assumptions and mistakes. The first assumption we made was that everything was properly set up and ready for a DR scenario. We didn't have much opportunity to test it, so we had to work with this assumption. As it turned out, the setup was quite good, but not good enough.
The main mistake was not a mistake per se, in the sense of something done the wrong way; the mistake was not questioning what we were doing and, more importantly, how we were doing it. We simply assumed, and here it is again, that a bequeath connection that allows us to change the configuration and even shut down the remote database would be valid for a switchover operation, which it was not.

So, lessons learned: test your DR scenarios, never make assumptions, and question what you are doing, even more so if you are not doing it frequently or are following instructions given by others, and that includes Oracle Support. There is always room to learn something new and the bigger the failure, the better the experience.
