How to accurately measure Data Guard lag events

If you work with Data Guard, the lag between the primary and standby database is an informative but difficult metric to measure. I’ve encountered some “solutions” that use the COMPLETION_TIME in (G)V$ARCHIVED_LOG for the standby destination as the time of apply, but this isn’t accurate. COMPLETION_TIME on the standby represents the time the log was written to the filesystem, not when it was applied.

To verify this, compare a query of V$ARCHIVED_LOG to the alert log entries on the standby:

SQL> select sequence#, dest_id, applied, completion_time from v$archived_log where sequence# = 47 order by 1, 2;

 SEQUENCE#    DEST_ID APPLIED   COMPLETION_TIME
---------- ---------- --------- -------------------
        47          1 NO        2019-03-13 21:13:02
        47          2 YES       2019-03-13 21:13:02

[oracle@ol7-183-dg2 trace]$ egrep -B1 'Media Recovery Log.*o1\_mf\_1\_47' alert_cdb1.log
2019-03-13T21:21:15.148922+00:00
Media Recovery Log /u01/app/oracle/CDB1_STBY/archivelog/2019_03_13/o1_mf_1_47_g8lwlyws_.arc
--
2019-03-13T21:22:22.658726+00:00
PR00 (PID:4198): Media Recovery Log /u01/app/oracle/CDB1_STBY/archivelog/2019_03_13/o1_mf_1_47_g8lwlyws_.arc
--
2019-03-13T21:24:24.180440+00:00
Media Recovery Log /u01/app/oracle/CDB1_STBY/archivelog/2019_03_13/o1_mf_1_47_g8lwlyws_.arc
[oracle@ol7-183-dg2 trace]$

[oracle@ol7-183-dg2 trace]$ ls -l /u01/app/oracle/CDB1_STBY/archivelog/2019_03_13/o1_mf_1_47_g8lwlyws_.arc
-rw-r-----. 1 oracle oinstall 79360 Mar 13 21:13 /u01/app/oracle/CDB1_STBY/archivelog/2019_03_13/o1_mf_1_47_g8lwlyws_.arc
[oracle@ol7-183-dg2 trace]$ 

According to V$ARCHIVED_LOG, this log “completed” at 21:13:02. However, the last entry for this redo log in the alert log is timestamped 21:24:24. That’s a difference of over 11 minutes! (The multiple entries in the alert log are due to interruptions in transport and failed apply. The timestamp of the last entry will be used later as the “apply date” for the log.) The timestamp of the log on the filesystem matches COMPLETION_TIME.

I wanted a way to report the true lag—the delta between the primary’s completion time and the apply time in the standby’s alert log. The ideal solution would be accessible to a client and allow reporting on multiple Data Guard configurations.

Surely Oracle wouldn’t make this too difficult, would they?

It turns out the solution is not that simple.

Testing was done using Oracle 18c on Oracle Linux 7.6. This solution should work for other versions and operating systems, as well. I used Tim Hall’s helpful Vagrant Builds to quickly set up and configure my test environment.

This method requires that both the primary and standby databases are available. History can only be reported for events present in the current alert log. Keep this in mind if ADR or a similar process is used to rotate the standby's alert log.

The V$ARCHIVED_LOG view on the primary is aware of logs on both the primary and standby databases. The same view on the standby database doesn’t know about logs on the primary. The content of standby entries differs slightly, too. On the primary, the NAME field shows the file name for files on the primary. It’s unaware of file names on the standby, which will be different with Oracle Managed Files. I get information on the archive log files from both primary and standby via a database link:

create database link cdb1_stby connect to MYUSER identified by "MyPassword" using 'cdb1_stby';

The meat of this technique takes advantage of X$DBGALERTEXT, a fixed table that provides an in-memory view of the database alert log. V$DIAG_ALERT_EXT may also be used, but that view sometimes exhibits performance problems. However, X$ tables can only be read by privileged accounts, and SELECT on them can't be granted:

SQL> conn myuser/MyPassword@cdb1
Connected.
SQL> select count(*) from sys.X$DBGALERTEXT;
select count(*) from sys.X$DBGALERTEXT
                         *
ERROR at line 1:
ORA-00942: table or view does not exist

SQL> conn / as sysdba
Connected.
SQL> grant select on sys.X$DBGALERTEXT to MYUSER;
grant select on sys.X$DBGALERTEXT to MYUSER
                    *
ERROR at line 1:
ORA-02030: can only select from fixed tables/views

Fortunately, there’s a workaround. Create a view on the table and grant select on the view:

create view v_X$DBGALERTEXT as select * from sys.X$DBGALERTEXT;
select count(*) from v_X$DBGALERTEXT;
grant select on v_X$DBGALERTEXT to MYUSER;

If performance isn’t a concern and the existing view is acceptable, grant SELECT privileges on V$DIAG_ALERT_EXT:

grant select on V$DIAG_ALERT_EXT to MYUSER;

The next step is to get the records from the standby's alert log associated with redo log apply. These contain "Media Recovery Log" in the MESSAGE_TEXT column of both X$DBGALERTEXT and V$DIAG_ALERT_EXT. Each alert log record then needs to be joined to a matching row in the standby's V$ARCHIVED_LOG view.

It might seem obvious to extract the thread and sequence directly, but this isn't always straightforward: the message format depends on the value of LOG_ARCHIVE_FORMAT, which can change over the life of the alert log, and parsing it can get complex. Instead, I chose to extract the file name using substr(a.message_text, instr(a.message_text, 'Log ') + 4). This pulls everything after the word "Log " in the message:

Media Recovery Log /u01/app/oracle/CDB1_STBY/archivelog/2019_03_13/o1_mf_1_47_g8lwlyws_.arc

That’s the same value found for NAME in the standby V$ARCHIVED_LOG view. Or, nearly so. There’s an extra character at the end:

SQL> select distinct length(substr(a.message_text, instr(a.message_text, 'Log ') + 4)) from sys.v_X$DBGALERTEXT@cdb1_stby a where a.message_text like '%Media Recovery Log %';

LENGTH(SUBSTR(A.MESSAGE_TEXT,INSTR(A.MESSAGE_TEXT,'LOG')+4))
------------------------------------------------------------
                                                          73

SQL> select distinct length(name) from v$archived_log@cdb1_stby where dest_id = 1;

LENGTH(NAME)
------------
          72

That non-printing character can be removed with REGEXP_REPLACE: regexp_replace(substr(a.message_text, instr(a.message_text, 'Log ') + 4), '[[:space:]]', '').
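For readers who prefer to see the string handling outside SQL, here's a minimal Python sketch of the same extraction and cleanup, using a sample message from the alert log output above:

```python
import re

def extract_log_name(message_text: str) -> str:
    # instr(message_text, 'Log ') + 4: everything after the first 'Log '
    name = message_text[message_text.index('Log ') + 4:]
    # regexp_replace(..., '[[:space:]]', ''): strip the trailing
    # non-printing character (and any other whitespace)
    return re.sub(r'\s', '', name)

msg = ("PR00 (PID:4198): Media Recovery Log "
       "/u01/app/oracle/CDB1_STBY/archivelog/2019_03_13/o1_mf_1_47_g8lwlyws_.arc\n")
print(extract_log_name(msg))
# → /u01/app/oracle/CDB1_STBY/archivelog/2019_03_13/o1_mf_1_47_g8lwlyws_.arc
```

The result matches the 72-character NAME stored in the standby's V$ARCHIVED_LOG.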

The timestamp in the alert log must also be cast as a date, with any time zone differences accounted for. There may be multiple entries for the same file, so only the last (MAX) timestamp is needed.

I now have a query to extract the actual apply time and redo log name from the standby alert log:

col name for a72
alter session set nls_date_format='YYYY-MM-DD HH24:MI:SS';

  select max(cast((from_tz(cast(a.originating_timestamp as timestamp), 'GMT') at time zone DBTIMEZONE) as date)) as apply_time
,        regexp_replace(substr(a.message_text, instr(a.message_text, 'Log ') + 4), '[[:space:]]', '') as name
    from sys.v_X$DBGALERTEXT@cdb1_stby a
   where a.message_text like '%Media Recovery Log %'
group by regexp_replace(substr(a.message_text, instr(a.message_text, 'Log ') + 4), '[[:space:]]', '');

APPLY_TIME	    NAME
------------------- ------------------------------------------------------------------------
2019-03-13 21:24:24 /u01/app/oracle/CDB1_STBY/archivelog/2019_03_13/o1_mf_1_49_g8lwo8xr_.arc
2019-03-13 21:24:24 /u01/app/oracle/CDB1_STBY/archivelog/2019_03_13/o1_mf_1_51_g8lwoct0_.arc
2019-03-13 21:24:24 /u01/app/oracle/CDB1_STBY/archivelog/2019_03_13/o1_mf_1_48_g8lwo81n_.arc
2019-03-13 21:24:24 /u01/app/oracle/CDB1_STBY/archivelog/2019_03_13/o1_mf_1_47_g8lwlyws_.arc
2019-03-13 21:24:24 /u01/app/oracle/CDB1_STBY/archivelog/2019_03_13/o1_mf_1_50_g8lwocsv_.arc
...
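The FROM_TZ / AT TIME ZONE / CAST chain in that query can be tricky to read. Here's a small Python sketch of the equivalent conversion; the +01:00 offset is only an example DBTIMEZONE, not the actual one from this environment:

```python
from datetime import datetime, timezone, timedelta

def alert_ts_to_db_date(ts: datetime, db_tz: timezone) -> datetime:
    # from_tz(cast(ts as timestamp), 'GMT'): attach GMT to the naive stamp
    as_gmt = ts.replace(tzinfo=timezone.utc)
    # ... at time zone DBTIMEZONE: shift to the database time zone
    local = as_gmt.astimezone(db_tz)
    # cast(... as date): drop the time zone and fractional seconds
    return local.replace(tzinfo=None, microsecond=0)

ts = datetime(2019, 3, 13, 21, 24, 24, 180440)
print(alert_ts_to_db_date(ts, timezone(timedelta(hours=1))))
# → 2019-03-13 22:24:24
```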

The redo log name can be used to join to V$ARCHIVED_LOG on the standby database:

  select max(cast((from_tz(cast(a.originating_timestamp as timestamp), 'GMT') at time zone DBTIMEZONE) as date)) as apply_time
,        s.sequence#
,        s.thread#
,        regexp_replace(substr(a.message_text, instr(a.message_text, 'Log ') + 4), '[[:space:]]', '') as name
    from sys.v_X$DBGALERTEXT@cdb1_stby a
,        v$archived_log@cdb1_stby     s
   where a.message_text like '%Media Recovery Log %'
     and s.name = regexp_replace(substr(a.message_text, instr(a.message_text, 'Log ') + 4), '[[:space:]]', '')
group by s.sequence#
,        s.thread#
,        regexp_replace(substr(a.message_text, instr(a.message_text, 'Log ') + 4), '[[:space:]]', '')
order by s.sequence#, 1;

APPLY_TIME           SEQUENCE#    THREAD# NAME
------------------- ---------- ---------- ------------------------------------------------------------------------
2019-03-13 21:24:24         47          1 /u01/app/oracle/CDB1_STBY/archivelog/2019_03_13/o1_mf_1_47_g8lwlyws_.arc
2019-03-13 21:24:24         48          1 /u01/app/oracle/CDB1_STBY/archivelog/2019_03_13/o1_mf_1_48_g8lwo81n_.arc
2019-03-13 21:24:24         49          1 /u01/app/oracle/CDB1_STBY/archivelog/2019_03_13/o1_mf_1_49_g8lwo8xr_.arc
2019-03-13 21:24:24         50          1 /u01/app/oracle/CDB1_STBY/archivelog/2019_03_13/o1_mf_1_50_g8lwocsv_.arc
2019-03-13 21:24:24         51          1 /u01/app/oracle/CDB1_STBY/archivelog/2019_03_13/o1_mf_1_51_g8lwoct0_.arc
...

(If the standby’s alert log is rotated or truncated, this query can be used as a basis to periodically insert historic redo timestamps into a separate history table.)

The thread and sequence numbers are then joined to V$ARCHIVED_LOG on the primary database, and the primary's COMPLETION_TIME is used to calculate the true lag between primary and standby for each log:

    with lag as (
  select p.thread#
,        p.sequence#
,        x.applied
,        p.first_time
,        p.next_time
,        p.completion_time as primary_apply_time
,        decode(x.applied, 'YES', x.completion_time, sysdate) as standby_completion_time
,        x.standby_apply_time
,        round(to_number(decode(x.applied, 'YES', x.standby_apply_time, sysdate) - p.completion_time) * 1440, 2) as apply_delta_minutes
    from v$archived_log                 p
,        v$archive_dest                 pd
,(select max(cast((from_tz(cast(a.originating_timestamp as timestamp), 'GMT') at time zone DBTIMEZONE) as date)) as standby_apply_time
,        s.sequence#
,        s.thread#
,        s.applied
,        s.completion_time
    from sys.v_X$DBGALERTEXT@cdb1_stby  a
,        v$archived_log@cdb1_stby       s
   where a.message_text like '%Media Recovery Log %'
     and s.name = regexp_replace(substr(a.message_text, instr(a.message_text, 'Log ') + 4), '[[:space:]]', '')
group by s.sequence#
,        s.thread#
,        s.applied
,        s.completion_time
,        regexp_replace(substr(a.message_text, instr(a.message_text, 'Log ') + 4), '[[:space:]]', '')
         ) x
   where p.thread#                      = x.thread#
     and p.sequence#                    = x.sequence#
     and p.dest_id                      = pd.dest_id
     and pd.target                      = 'PRIMARY'
     and p.standby_dest                 = 'NO'
order by 9 desc, p.sequence#)
  select *
    from lag
order by apply_delta_minutes desc;

   THREAD#  SEQUENCE# APPLIED   FIRST_TIME          NEXT_TIME           PRIMARY_APPLY_TIME  STANDBY_COMPLETION_ STANDBY_APPLY_TIME  APPLY_DELTA_MINUTES
---------- ---------- --------- ------------------- ------------------- ------------------- ------------------- ------------------- -------------------
         1         47 YES       2019-03-13 21:10:36 2019-03-13 21:13:02 2019-03-13 21:13:02 2019-03-13 21:13:02 2019-03-13 21:24:24               11.37
         1         48 YES       2019-03-13 21:13:02 2019-03-13 21:14:15 2019-03-13 21:14:15 2019-03-13 21:14:16 2019-03-13 21:24:24               10.15
         1         49 YES       2019-03-13 21:14:15 2019-03-13 21:14:16 2019-03-13 21:14:16 2019-03-13 21:14:16 2019-03-13 21:24:24               10.13
         1         50 YES       2019-03-13 21:14:16 2019-03-13 21:14:19 2019-03-13 21:14:19 2019-03-13 21:14:19 2019-03-13 21:24:24               10.08
         1         51 YES       2019-03-13 21:14:19 2019-03-13 21:14:19 2019-03-13 21:14:19 2019-03-13 21:14:19 2019-03-13 21:24:24               10.08

If APPLIED for the log on the standby is NO, SYSDATE is substituted to show the current delay.
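The delta arithmetic itself is simple: Oracle DATE subtraction yields days, and multiplying by 1440 converts to minutes. A minimal Python sketch of the same calculation, using the sequence 47 timestamps from the output above:

```python
from datetime import datetime
from typing import Optional

def apply_delta_minutes(primary_completion: datetime,
                        standby_apply: Optional[datetime],
                        now: datetime) -> float:
    # decode(x.applied, 'YES', x.standby_apply_time, sysdate):
    # fall back to the current time if the log isn't applied yet
    effective = standby_apply if standby_apply is not None else now
    # DATE subtraction gives days; * 1440 converts to minutes
    delta_days = (effective - primary_completion).total_seconds() / 86400.0
    return round(delta_days * 1440, 2)

# Sequence 47: primary COMPLETION_TIME vs. apply time from the alert log
p = datetime(2019, 3, 13, 21, 13, 2)
s = datetime(2019, 3, 13, 21, 24, 24)
print(apply_delta_minutes(p, s, datetime.now()))
# → 11.37
```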

Out of curiosity, I checked the lag reported by the standby's COMPLETION_TIME by adding round(to_number(decode(x.applied, 'YES', x.completion_time, sysdate) - p.completion_time) * 1440, 2) as completion_delta_minutes. The results demonstrate that this isn't an accurate way to report lag:

   THREAD#  SEQUENCE# APPLIED   FIRST_TIME          NEXT_TIME           PRIMARY_APPLY_TIME  STANDBY_COMPLETION_ STANDBY_APPLY_TIME  COMPLETION_DELTA_MINUTES APPLY_DELTA_MINUTES
---------- ---------- --------- ------------------- ------------------- ------------------- ------------------- ------------------- ------------------------ -------------------
         1         47 YES       2019-03-13 21:10:36 2019-03-13 21:13:02 2019-03-13 21:13:02 2019-03-13 21:13:02 2019-03-13 21:24:24                        0               11.37
         1         48 YES       2019-03-13 21:13:02 2019-03-13 21:14:15 2019-03-13 21:14:15 2019-03-13 21:14:16 2019-03-13 21:24:24                      .02               10.15
         1         49 YES       2019-03-13 21:14:15 2019-03-13 21:14:16 2019-03-13 21:14:16 2019-03-13 21:14:16 2019-03-13 21:24:24                        0               10.13
         1         50 YES       2019-03-13 21:14:16 2019-03-13 21:14:19 2019-03-13 21:14:19 2019-03-13 21:14:19 2019-03-13 21:24:24                        0               10.08
         1         51 YES       2019-03-13 21:14:19 2019-03-13 21:14:19 2019-03-13 21:14:19 2019-03-13 21:14:19 2019-03-13 21:24:24                        0               10.08

I even saw examples of negative lag, where the log appeared to be written earlier on the standby than on the primary:

   THREAD#  SEQUENCE# APPLIED   FIRST_TIME          NEXT_TIME           PRIMARY_APPLY_TIME  STANDBY_COMPLETION_ STANDBY_APPLY_TIME  COMPLETION_DELTA_MINUTES APPLY_DELTA_MINUTES
---------- ---------- --------- ------------------- ------------------- ------------------- ------------------- ------------------- ------------------------ -------------------
         1         33 YES       2019-03-13 14:28:08 2019-03-13 14:28:13 2019-03-13 14:28:38 2019-03-13 14:28:13 2019-03-13 14:31:28                     -.42                2.83

I must give credit and thanks to Franck Pachot, whose post Data Guard Gap History inspired me to find a way to retrieve this information from a client using only SQL. This avoids the need to be on the host and doesn’t require access to the filesystem. It also shouldn’t be difficult to extend this technique to centralize reporting and monitoring for multiple Data Guard configurations.
