Fun with Oracle Data Guard Broker

Jose Rodriguez

May 3, 2018

Tags: Microsoft Sql Server, Technical Track, Postgres, Postgresql

Issues during the switchover to standby database

The beginning

So, there we were attending to our jobs and tending to our customers when an emergency alert came in: a production server had gone down. All bells and whistles were sounded and an emergency bridge phone was opened. A hardware failure had brought the server down and the repair time was unknown, so we have to failover to the standby database to restore the service. This is a simple physical standby database with Data Guard Broker in place to take care of the basic things. Simply said and simply done. For reasons out of the scope of this post, we disabled the Broker and manually opened the database and all went well, except that the Broker had at some point changed the log_archive_dest_1 initialisation parameter to be " location="USE_DB_RECOVERY_FILE_DEST", valid_for=(STANDBY_LOGFILE,STANDBY_ROLE)". This was the first problem, but known from our previous tests, so we simply changed it to be " location="USE_DB_RECOVERY_FILE_DEST", valid_for=(All_LOGFILES,ALL_ROLES)" and everything worked as expected. Kids, test your DR scenarios while you still can :).

The rebuilding

For reasons out of the scope of this blog post, flashback database was not enabled on the primary side, so after the crash and a few hours of production data being added to the new primary database, we decided that the best way to continue from here was to rebuild the old primary as a standby and execute a switchover to get everything back to where it was before the outage. The main caveat here is that the servers are located on opposite sides of the Atlantic ocean and this is quite a database with some 2TiBs of data that does not get compressed much during the backup, plus a few GiBs of archived logs generated daily. This means that using the standard procedure to rebuild a standby database will incur a severe delay due to the time it takes to take a new full backup of the new-primary, copy it over to the new-standby site and then subsequent log shipping to get both sides in sync. Enter Oracle documentation. I find that Oracle documentation is some of the best I've had the opportunity to work with, although it's getting a bit cluttered and incoherent sometimes due to the many different features, options, new versions and partial updates that are only visible in MOS. Well, it may not be the best anymore, but I still use it a lot as a trusted source of information. So there is the procedure of rebuilding a primary database as a standby from its own backup and then sync it up with the primary. You can see the documentation in the following link. And off we go. Get the restore scripts ready, kick off the restore and wait for it to complete. In case someone is curious, here are the scripts we used. Simple sloppy scripts that do the job. First the bash script that wraps the RMAN call and allows us to execute it in nohup mode. You know, the network connection always drops at the most interesting moment.

export ORACLE_SID=PRIMARY
 
 RMAN_COMMAND=restore_PRIMARY
 EMAIL_RECIPIENT=no-reply@pythian.com
 
 export NLS_DATE_FORMAT='YYYY-MM-DD:hh24:mi:ss'
 
 if [ ! -d "log" ]; then
  mkdir log 
 fi
 
 . oraenv <<< $ORACLE_SID
 
 TS=`date +'%Y%m%d-%H.%M'`
 WORK_DIR="$( cd "$( dirname "${BASH_SOURCE[0]}" )" && pwd )"
 RMAN_LOG=${WORK_DIR}/log/RMAN-${RMAN_COMMAND}_${TS}.log
 date
 echo "Begin RMAN ${RMAN_COMMAND} on $SID"
 rman cmdfile=${WORK_DIR}/${RMAN_COMMAND}.rcv log=${RMAN_LOG}
 echo "#"
 date
 echo "End RMAN ${RMAN_COMMAND} on $SID"
 mail -s "RMAN script ${RMAN_COMMAND} completed on ${SID}" ${EMAIL_RECIPIENT} < ${RMAN_LOG}
 
 exit 0

And this is the very basic RMAN script.

connect target /
 connect catalog rman/XXXXXX@RMANCAT
 RUN
  {
  SET UNTIL SCN 123456789;
  ALLOCATE CHANNEL D1 TYPE DISK;
  ALLOCATE CHANNEL D2 TYPE DISK;
  ALLOCATE CHANNEL D3 TYPE DISK;
  ALLOCATE CHANNEL D4 TYPE DISK;
  ALLOCATE CHANNEL D5 TYPE DISK;
  ALLOCATE CHANNEL D6 TYPE DISK;
  ALLOCATE CHANNEL D7 TYPE DISK;
  ALLOCATE CHANNEL D8 TYPE DISK;
  ALLOCATE CHANNEL D9 TYPE DISK;
  ALLOCATE CHANNEL D10 TYPE DISK;
  ALLOCATE CHANNEL D11 TYPE DISK;
  ALLOCATE CHANNEL D12 TYPE DISK;
  ALLOCATE CHANNEL D13 TYPE DISK;
  ALLOCATE CHANNEL D14 TYPE DISK;
  ALLOCATE CHANNEL D15 TYPE DISK;
  RESTORE DATABASE;
  RECOVER DATABASE;
  }

After the restore completed it was time for the recover in which, yes you guessed right, there was a missing archived log and there is more fun to be had. Of course, there was a backup of the missing archived log on the site we were trying to rebuild so we recovered it and applied it to the standby but then, surprise! See the line 22 in the snippet below:

SQL> recover standby database;
 ORA-00279: change 14943816301 generated at 04/09/2018 00:02:23 needed for thread 1
 ORA-00289: suggestion : +FRA/conpro_uk2/archivelog/2018_04_11/thread_1_seq_15211.2442.973138947
 ORA-00280: change 14943816301 for thread 1 is in sequence #15211
 
 
 Specify log: {=suggested | filename | AUTO | CANCEL}
 AUTO
 ORA-00279: change 14943827295 generated at 04/09/2018 00:07:18 needed for thread 1
 ORA-00289: suggestion : +FRA/conpro_uk2/archivelog/2018_04_11/thread_1_seq_15212.458.973138947
 ORA-00280: change 14943827295 for thread 1 is in sequence #15212
 ORA-00278: log file '+FRA/conpro_uk2/archivelog/2018_04_11/thread_1_seq_15211.2442.973138947' no longer needed for this recovery
 
 
 ORA-00279: change 14943841236 generated at 04/09/2018 00:13:56 needed for thread 1
 ORA-00289: suggestion : +FRA
 ORA-00280: change 14943841236 for thread 1 is in sequence #15213
 ORA-00278: log file '+FRA/conpro_uk2/archivelog/2018_04_11/thread_1_seq_15212.458.973138947' no longer needed for this recovery
 
 
 ORA-00308: cannot open archived log '+FRA'
 ORA-17503: ksfdopn:2 Failed to open file +FRA <-- at this point logs from 15213 are gone from FRA
 ORA-15045: ASM file name '+FRA' is not in reference form
 
 
 ORA-01547: warning: RECOVER succeeded but OPEN RESETLOGS would get error below
 ORA-01152: file 1 was not restored from a sufficiently old backup
 ORA-01110: data file 1: '+DATA/conpro_uk2/datafile/system.298.973129969'

Enter Oracle bugs. Yes ladies and gentlemen, upon restoring the missing archived log from the backup, the database started some FRA clean up process that deleted the newest archived logs! For these, of course, there was no backup as they had been created between the last backup and the crash. For those interested in the details, here is the bug we believe we may have swallowed while riding our bikes: "Bug 17370174 : NEW ARCHIVELOG FILES ARE DELETED FIRST FROM THE FRA". So, here we are, the old-primary, now an almost new-standby, is missing some archived logs to be able to sync up with the current primary. Fortunately, these archived logs had been shipped and applied to the old-standby, now the primary, before the crash, so we were able to restore and ship them to the other side of the Atlantic and everything was up and running again, including Data Guard Broker ... after a few hours. And, after applying some eighty archived logs, here we go again: ORA-00365

Media Recovery Log +FRA/PRIMARY/archivelog/2018_04_11/thread_1_seq_15291.2126.973169799
 Errors with log +FRA/PRIMARY/archivelog/2018_04_11/thread_1_seq_15291.2126.973169799
 MRP0: Background Media Recovery terminated with error 365
 Errors in file /apps/oracle/diag/rdbms/PRIMARY/PRIMARY/trace/PRIMARY_pr00_13538.trc:
 ORA-00365: the specified log is not the correct next log
 Managed Standby Recovery not using Real Time Apply
 Recovery interrupted!
 MRP0: Background Media Recovery process shutdown (PRIMARY)
 Completed: ALTER DATABASE RECOVER MANAGED STANDBY DATABASE THROUGH ALL SWITCHOVER DISCONNECT USING CURRENT LOGFILE

This issue was more complicated. Another bug entered the game " Bug 19181583 - Allow Recover Until Time for a Redo Log that has the activation bit set (Doc ID 19181583.8)" and a severity 1 SR was opened with Oracle. Also, we escalated internally to our IPC for him to take a look. Working around this one required creating a fresh standby controlfile from the current primary, ship it over to the standby server and replace the current controlfiles with it. Also, as ASM is involved, we had to do some RMAN magic to have the controlfile pointing to the right datafiles. Not the cleanest process, it may have been better to start a bit down in the ASM directory tree to avoid trying to catalog files like the ASM password file but, again, it did the job.

RMAN> catalog start with '+DATA';
 
 Starting implicit crosscheck backup at 11-APR-18
 using target database control file instead of recovery catalog
 allocated channel: ORA_DISK_1
 channel ORA_DISK_1: SID=1037 device type=DISK
 allocated channel: ORA_DISK_2
 channel ORA_DISK_2: SID=3 device type=DISK
 allocated channel: ORA_DISK_3
 channel ORA_DISK_3: SID=118 device type=DISK
 allocated channel: ORA_DISK_4
 channel ORA_DISK_4: SID=234 device type=DISK
 Crosschecked 1111 objects
 Finished implicit crosscheck backup at 11-APR-18
 
 Starting implicit crosscheck copy at 11-APR-18
 using channel ORA_DISK_1
 using channel ORA_DISK_2
 using channel ORA_DISK_3
 using channel ORA_DISK_4
 Finished implicit crosscheck copy at 11-APR-18
 
 searching for all files in the recovery area
 cataloging files...
 cataloging done
 
 List of Cataloged Files
 =======================
 File Name: +FRA/PRIMARY/ARCHIVELOG/2018_04_11/thread_1_seq_15211.2442.973138947
 File Name: +FRA/PRIMARY/ARCHIVELOG/2018_04_11/thread_1_seq_15212.458.973138947
 File Name: +FRA/PRIMARY/ARCHIVELOG/2018_04_11/thread_1_seq_15291.2126.973169799
 File Name: +FRA/PRIMARY/CONTROLFILE/current.256.922462525
 
 searching for all files that match the pattern +DATA
 
 List of Files Unknown to the Database
 =====================================
 File Name: +DATA/orapwasm
 File Name: +DATA/PRIMARY/spfileCONPRO.ora
 File Name: +DATA/PRIMARY/TEMPFILE/LMTEAMTEMP.275.973170913
 (...)
 File Name: +DATA/TEST/CONTROLFILE/Current.257.922321133
 File Name: +DATA/ASM/ASMPARAMETERFILE/REGISTRY.253.922315255
 
 Do you really want to catalog the above files (enter YES or NO)? YES
 cataloging files...
 cataloging done
 
 List of Cataloged Files
 =======================
 File Name: +DATA/PRIMARY/TEMPFILE/LMTEAMTEMP.275.973170913
 File Name: +DATA/PRIMARY/TEMPFILE/MGMTTEMP.274.973170913
 File Name: +DATA/PRIMARY/TEMPFILE/LICMANTEMP.273.973170913
 File Name: +DATA/PRIMARY/TEMPFILE/TEMP2.272.973170913
 File Name: +DATA/PRIMARY/TEMPFILE/TEMP2.271.973170913
 (...)
 
 List of Files Which Where Not Cataloged
 =======================================
 File Name: +DATA/orapwasm
  RMAN-07517: Reason: The file header is corrupted
 File Name: +DATA/PRIMARY/spfileCONPRO.ora
  RMAN-07518: Reason: Foreign database file DBID: 0 Database Name:
 File Name: +DATA/PRIMARY/ONLINELOG/group_20.381.948561469
  RMAN-07529: Reason: catalog is not supported for this file type
 File Name: +DATA/PRIMARY/ONLINELOG/group_21.380.948561487
 (...)
 
 RMAN> list copy of database;
 
 List of Datafile Copies
 =======================
 
 Key File S Completion Time Ckp SCN Ckp Time
 ------- ---- - --------------- ---------- ---------------
 248 1 A 11-APR-18 14947795005 09-APR-18
  Name: +DATA/PRIMARY/datafile/system.298.973129969
 
 231 2 A 11-APR-18 14947795005 09-APR-18
  Name: +DATA/PRIMARY/datafile/sysaux.315.973127749
 
 251 3 A 11-APR-18 14947795005 09-APR-18
 (...)
  Name: +DATA/PRIMARY/datafile/sfsdata.302.973129281
 
 246 93 A 11-APR-18 14947795005 09-APR-18
  Name: +DATA/PRIMARY/datafile/sfsdata.300.973129815
 
 
 RMAN> switch database to copy;
 
 datafile 1 switched to datafile copy "+DATA/PRIMARY/datafile/system.298.973129969"
 datafile 2 switched to datafile copy "+DATA/PRIMARY/datafile/sysaux.315.973127749"
 datafile 3 switched to datafile copy "+DATA/PRIMARY/datafile/bodata_new.295.973130253"
 datafile 4 switched to datafile copy "+DATA/PRIMARY/datafile/users.358.973111005"
 datafile 5 switched to datafile copy "+DATA/PRIMARY/datafile/wsidx.285.973130519"
 datafile 6 switched to datafile copy "+DATA/PRIMARY/datafile/wsdata.284.973130527"
 datafile 7 switched to datafile copy "+DATA/PRIMARY/datafile/gt_data.283.973130533"
 (...)
 
 
 RMAN> report schema;
 
 RMAN-06139: WARNING: control file is not current for REPORT SCHEMA
 Report of database schema for database with db_unique_name PRIMARY
 
 List of Permanent Datafiles
 ===========================
 File Size(MB) Tablespace RB segs Datafile Name
 ---- -------- -------------------- ------- ------------------------
 1 9000 SYSTEM *** +DATA/PRIMARY/datafile/system.298.973129969
 2 27610 SYSAUX *** +DATA/PRIMARY/datafile/sysaux.315.973127749
 3 1876 BODATA_NEW *** +DATA/PRIMARY/datafile/bodata_new.295.973130253
 4 32715 USERS *** +DATA/PRIMARY/datafile/users.358.973111005
 5 100 WSIDX *** +DATA/PRIMARY/datafile/wsidx.285.973130519
 (...)
 92 10240 SFSDATA *** +DATA/PRIMARY/datafile/sfsdata.302.973129281
 93 10240 SFSDATA *** +DATA/PRIMARY/datafile/sfsdata.300.973129815
 
 List of Temporary Files
 =======================
 File Size(MB) Tablespace Maxsize(MB) Tempfile Name
 ---- -------- -------------------- ----------- --------------------
 1 8000 TEMP2 8000 +DATA/STANDBY/tempfile/temp2.360.967677721
 2 1000 LMTEAMTEMP 30000 +DATA/STANDBY/tempfile/lmteamtemp.318.967677721
 3 500 MGMTTEMP 30000 +DATA/STANDBY/tempfile/mgmttemp.309.967677721
 4 500 LICMANTEMP 30000 +DATA/STANDBY/tempfile/licmantemp.345.967
 (...)

After this, a reboot of the database got it back up and working and applying archived logs again, although with a little noise in the alert log.

The Data Guard switch over

So, we finally got to the point where we were at the normal DR situation, that is, the original standby is now our primary and vice versa. After getting a maintenance window, the applications were brought down and the switch over was about to be initiated and... here we go again.

GMGRL> show database old-primary
 
 Database - old-primary
 
  Role: PHYSICAL STANDBY
  Intended State: APPLY-ON
  Transport Lag: 11 minutes 3 seconds <-- This means that something is not right, but let's hope that DG Broker takes care of it. Apply Lag: 0 seconds Real Time Query: OFF Instance(s): PRIMARY Database Status: SUCCESS DGMGRL> switchover to old-primary;
 Performing switchover NOW, please wait...
 Error: ORA-16775: target standby database in broker operation has potential data loss <-- No, it doesn't
 
 Failed.
 Unable to switchover, primary database is still "new-primary"

It was the Broker changing log_archive_dest_1 again. Fixed it, the missing archived logs where shipped and now, yes, this is the one... no. The old-primary failed to shutdown. We believe that due to something like "Bug 20139391 - Dataguard Broker calls CRS to restart for Databases not being managed by CRS (Doc ID 20139391.8)" although this is not our case. We were running out of time and the maintenance window had been stretched way too far already. Our IPC got involved here again and helped complete a manual switchover after getting rid of the Broker and all was fine.

The end, well... not really

During all this hassle we were unable to actually troubleshoot what was wrong with Data Guard Broker. Was it a series of bugs? Or was there something wrong in our, quite old, configuration? So we set up a dummy Data Guard Broker configuration with an empty database and started our testing. Another IPC was involved here to help us figure out what may be wrong and first hit right in the forehead: we were using a bequeath connection for dgmgrl which is the source of many issues during a switch over. The Broker requires a static listener service registered on both ends of the replica and uses it to connect to the remote database to perform shutdowns and startups during the change of roles. Funny thing is that it is able to connect to the remote database to shut it down but it won't be able to connect to start it up again, hence the role switch fails. During our tests, the Broker was able to simply continue the process after we manually started/mounted the remote database but we didn't get a chance to test this during the production outage. So: try to avoid bequeath connections with dgmgrl it will do you no harm and will save you a few headaches. The second important finding is related to how the Data Guard Broker collects database configuration parameters and sets them upon startup. In the alert log of the databases we can see the following commands executed by DGB as part of its startup process. Note the highlighted line number 8.

ALTER SYSTEM SET log_archive_trace=0 SCOPE=BOTH SID='PRIMARY';
 ALTER SYSTEM SET log_archive_format='PRIMARY_%t_%r_%s.dbf' SCOPE=SPFILE SID='PRIMARY';
 ALTER SYSTEM SET standby_file_management='AUTO' SCOPE=BOTH SID='*';
 ALTER SYSTEM SET archive_lag_target=1800 SCOPE=BOTH SID='*';
 ALTER SYSTEM SET log_archive_max_processes=4 SCOPE=BOTH SID='*';
 ALTER SYSTEM SET log_archive_min_succeed_dest=1 SCOPE=BOTH SID='*';
 ALTER SYSTEM SET fal_server='STANDBY' SCOPE=BOTH;
 ALTER SYSTEM SET log_archive_dest_1='location="USE_DB_RECOVERY_FILE_DEST"',' valid_for=(STANDBY_LOGFILE,STANDBY_ROLE)' SCOPE=BOTH SID='PRIMARY';

So, it is the Broker that changes the parameters related to a standby configuration but, why these? According to Oracle documentation:

Associated with each database are various properties that the DMON process uses to control the database's behavior. The properties are recorded in the configuration file as a part of the database's object profile that is stored there. Many database properties are used to control database initialization parameters related to the Data Guard environment. To ensure that the broker can update the values of parameters in both the database itself and in the configuration file, you must use a server parameter file to control static and dynamic initialization parameters. The use of a server parameter file gives the broker a mechanism that allows it to reconcile property values selected by the database administrator (DBA) when using the broker with any related initialization parameter values recorded in the server parameter file. When you set values for database properties in the broker configuration, the broker records the change in the configuration file and propagates the change to all of the databases in the Data Guard configuration.

So, we've learnt the hard way that once the Broker is in charge, we must use the Broker to change standby related parameters or it will revert the changes during the next startup time. In this case, we were unlucky to have a setting in place that added even more noise to the overall situation.

Final thoughts

At the end of the day, even the best make assumptions and mistakes. The first assumption we made is that everything was properly set up and ready in case of a DR scenario. We didn't have much opportunity to test it, so we had to work with this assumption. It meant that the setup was quite good, but not good enough. The main mistake here was not a mistake per se in the sense that is was something done the wrong way, the mistake was not questioning what we were doing and, more importantly, how were we doing it. We simply assumed, and here it is again, that a bequeath connection that allows us to change the configuration and even a shutdown of the remote database will be valid for a switchover operation, which it was not. So, lessons learned here: test your DR scenarios, never make assumptions and question what you are doing, even more if you are not doing it frequently or following instructions given by others, and that includes Oracle support. There is always room to learn something new and the bigger the failure, the better the experience.

Insight and analysis of technology and business strategy