Reviewing the operation modes of Oracle GoldenGate BigQuery Handler

Posted in: Big Data, Cloud, Oracle, Technical Track

GoldenGate for Big Data 12.3.2.1.1 introduces a new target – Google BigQuery.

BigQuery handler can work in two Audit log modes:
1. auditLogMode = true
2. auditLogMode = false

I want to review the differences between these two operation modes supported by the BigQuery handler. I’m going to use a simple, single table using Oracle -> BigQuery replication.

Let’s create the source table:

SQL> create table rasskazov.tst(id number primary key, name1 varchar2(20), name2 varchar2(20), name3 varchar2(20));

Table created.

I’m going to use the following extract and pump configuration files. This configuration is suitable for testing purposes only. Production systems should use password encryption. I’m using a non-default remote host port because GoldenGate for Big Data is running on the same test VM.

[oracle@ol6-121-rac1 dirprm]$ cat bigext.prm 
EXTRACT bigext
USERID ggate@testdb, PASSWORD ...
EXTTRAIL ./dirdat/or
TABLE RASSKAZOV.TST;

oracle@ol6-121-rac1 dirprm]$ cat bigpump.prm 
EXTRACT bigpump
userid ggadmin@testdb, password "..."
RMTHOST ol6-121-rac1, MGRPORT 7839
RMTTRAIL /u01/app/oracle/product/oggbd/dirdat/or
TABLE RASSKAZOV.TST;

Let’s enable supplemental logging and create the extracts:

GGSCI (ol6-121-rac1.localdomain) 1> dblogin USERID ggate@testdb, PASSWORD ...
Successfully logged into database.

GGSCI (ol6-121-rac1.localdomain as ggate@testdb) 2> add trandata rasskazov.tst                       

Logging of supplemental redo data enabled for table RASSKAZOV.TST.
TRANDATA for scheduling columns has been added on table 'RASSKAZOV.TST'.
TRANDATA for instantiation CSN has been added on table 'RASSKAZOV.TST'.
GGSCI (ol6-121-rac1.localdomain as ggate@testdb) 3> add extract bigext, INTEGRATED TRANLOG, BEGIN NOW
EXTRACT (Integrated) added.


GGSCI (ol6-121-rac1.localdomain as ggate@testdb) 4> add exttrail ./dirdat/or, extract bigext
EXTTRAIL added.

GGSCI (ol6-121-rac1.localdomain as ggate@testdb) 5> ADD EXTRACT bigpump, EXTTRAILSOURCE ./dirdat/or
EXTRACT added.

GGSCI (ol6-121-rac1.localdomain as ggate@testdb) 6> ADD RMTTRAIL /u01/app/oracle/product/oggbd/dirdat/or, EXTRACT bigpump
RMTTRAIL added.

GGSCI (ol6-121-rac1.localdomain as ggate@testdb) 7> start extract bigext 

Sending START request to MANAGER ...
EXTRACT BIGEXT starting


GGSCI (ol6-121-rac1.localdomain as ggate@testdb) 8> start extract bigpump

Sending START request to MANAGER ...
EXTRACT BIGPUMP starting

Now I’m going to prepare the replicats. Sample parameter files are located in $OGGHOME/AdapterExamples/big-data/bigquery directory. I’m going to create two replicats, one for each audit mode. The auditLogMode = true replicat uses the following bigquery.props file:

[oracle@ol6-121-rac1 dirprm]$ cat bigquery.props 
gg.handlerlist=bigquery

gg.handler.bigquery.type=bigquery
gg.handler.bigquery.projectId=goldengate-test-project
gg.handler.bigquery.datasetId=testdataset
gg.handler.bigquery.credentialsFile=/u01/app/oracle/product/oggbd/dirprm/credentials.json
gg.handler.bigquery.connectionTimeout=20000
gg.handler.bigquery.readTimeout=30000
gg.handler.bigquery.pkUpdateHandling=delete-insert

# If audit log mode is set to true, specify the metacolumns template as well
gg.handler.bigquery.auditLogMode=true
gg.handler.bigquery.metaColumnsTemplate=${optype},${position}

goldengate.userexit.writers=javawriter
javawriter.stats.display=TRUE
javawriter.stats.full=TRUE

gg.log=log4j
gg.log.level=INFO
gg.report.time=30sec

# Set the classpath here to the BigQuery Java Client driver 
# Link to the Google BigQuery Client Library website
# https://developers.google.com/api-client-library/java/apis/bigquery/v2
# Download the other required jars from Maven Central website
# http://search.maven.org/

gg.classpath=/home/oracle/bigquery/libs/*:/home/oracle/bigquery/*

javawriter.bootoptions=-Xmx512m -Xms32m -Djava.class.path=.:ggjava/ggjava.jar

The second replicat is using the noaudbq.props parameter file. The only difference between bigquery.props and noaudbq.props files is gg.handler.bigquery.auditLogMode. It’s set to “true” in bigquery.props and to “false” in noaudbq.props.

We also need to download required Java classes, set up the gg.classpath parameter and download the JSON credentials file. I’ve also created the project and the BigQuery dataset.

These are replicat’s parameter files:

[oracle@ol6-121-rac1 dirprm]$ cat rbigq.prm 
REPLICAT rbigq
TARGETDB LIBFILE libggjava.so SET property=dirprm/bigquery.props 
DDL INCLUDE ALL
GROUPTRANSOPS 1000
MAPEXCLUDE dbo.excludetable
MAP rasskazov.tst, TARGET A.TST;

[oracle@ol6-121-rac1 dirprm]$ cat rnoaud.prm 
REPLICAT rnoaud
TARGETDB LIBFILE libggjava.so SET property=dirprm/noaudbq.props 
DDL INCLUDE ALL
GROUPTRANSOPS 1000
MAPEXCLUDE dbo.excludetable
MAP rasskazov.tst, TARGET B.TST;

Lets’ create and start our replicats:

GGSCI (ol6-121-rac1.localdomain) 2> add replicat rbigq, exttrail dirdat/or
REPLICAT added.


GGSCI (ol6-121-rac1.localdomain) 3> start rbigq

Sending START request to MANAGER ...
REPLICAT RBIGQ starting

GGSCI (ol6-121-rac1.localdomain) 2> add replicat rnoaud, exttrail dirdat/or
REPLICAT added.


GGSCI (ol6-121-rac1.localdomain) 3> start replicat rnoaud

Sending START request to MANAGER ...
REPLICAT RNOAUD starting

GGSCI (ol6-121-rac1.localdomain) 5> info all

Program     Status      Group       Lag at Chkpt  Time Since Chkpt

MANAGER     RUNNING                                           
REPLICAT    RUNNING     RBIGQ       00:00:00      00:00:05    
REPLICAT    RUNNING     RNOAUD      00:03:17      00:00:06    

Let’s insert the first record.

SQL> insert into rasskazov.tst values(1, 'a', 'b', 'c');

1 row created.

SQL> commit;

Commit complete.

I did not create the destination BigQuery tables but GoldenGate created A_TST and B_TST tables automatically:

rasskazov@cloudshell:~ (goldengate-test-project)$ bq show goldengate-test-project:testdataset.A_TST
Table goldengate-test-project:testdataset.A_TST

   Last modified                Schema               Total Rows   Total Bytes   Expiration   Time Partitioning   Labels
 ----------------- -------------------------------- ------------ ------------- ------------ ------------------- --------
  22 Jan 18:39:51   |- optype: string (required)     0            0
                    |- position: string (required)
                    |- ID: string
                    |- NAME1: string
                    |- NAME2: string
                    |- NAME3: string

rasskazov@cloudshell:~ (goldengate-test-project)$ bq show goldengate-test-project:testdataset.B_TST
Table goldengate-test-project:testdataset.B_TST

   Last modified                Schema               Total Rows   Total Bytes   Expiration   Time Partitioning   Labels
 ----------------- -------------------------------- ------------ ------------- ------------ ------------------- --------
  22 Jan 18:43:01   |- deleted: boolean (required)   0            0
                    |- ID: string
                    |- NAME1: string
                    |- NAME2: string
                    |- NAME3: string

rasskazov@cloudshell:~ (goldengate-test-project)$ bq query --nouse_legacy_sql 'select * from `goldengate-test-project.testdataset.A_TST` order by position'
Waiting on bqjob_re6ae9ce752ffa19_0000016874bd2153_1 ... (0s) Current status: DONE   
+--------+----------------------+----+-------+-------+-------+
| optype |       position       | ID | NAME1 | NAME2 | NAME3 |
+--------+----------------------+----+-------+-------+-------+
| I      | 00000000000000002101 | 1  | a     | b     | c     |
+--------+----------------------+----+-------+-------+-------+

rasskazov@cloudshell:~ (goldengate-test-project)$ bq head goldengate-test-project:testdataset.B_TST
+---------+----+-------+-------+-------+
| deleted | ID | NAME1 | NAME2 | NAME3 |
+---------+----+-------+-------+-------+
|   false | 1  | a     | b     | c     |
+---------+----+-------+-------+-------+

A_TST table is populated by RBIGQ replicat running in log mode. There is the “optype” column that stores operation type, ‘I’ for insert in our case. The “position” column determines the operations order. The B_TST table has a “deleted” column which replicat sets to true if a record has been deleted.

Let’s try to update our record:

SQL> update rasskazov.tst set name1 = 'd';

1 row updated.

SQL> commit;

Commit complete.

rasskazov@cloudshell:~ (goldengate-test-project)$ bq query --nouse_legacy_sql 'select * from `goldengate-test-project.testdataset.A_TST` order by position'
Waiting on bqjob_r12c48e37f4f19771_0000016874d4c43b_1 ... (0s) Current status: DONE   
+--------+----------------------+----+-------+-------+-------+
| optype |       position       | ID | NAME1 | NAME2 | NAME3 |
+--------+----------------------+----+-------+-------+-------+
| I      | 00000000000000002101 | 1  | a     | b     | c     |
| U      | 00000000000000002251 | 1  | d     | NULL  | NULL  |
+--------+----------------------+----+-------+-------+-------+

rasskazov@cloudshell:~ (goldengate-test-project)$ bq head goldengate-test-project:testdataset.B_TST
+---------+----+-------+-------+-------+
| deleted | ID | NAME1 | NAME2 | NAME3 |
+---------+----+-------+-------+-------+
|   false | 1  | d     | NULL  | NULL  |
+---------+----+-------+-------+-------+

We can see the first problem: B_TST table running in auditLogMode = false mode lost the NAME2 and NAME3 values. To prevent this, we have to enable supplemental logging for all columns.

Let’s do it and run the same update statement:

GGSCI (ol6-121-rac1.localdomain as ggate@testdb) 2> add trandata rasskazov.tst allcols

Logging of supplemental redo log data is already enabled for table RASSKAZOV.TST.
TRANDATA for all columns has been added on table 'RASSKAZOV.TST'.
TRANDATA for instantiation CSN has been added on table 'RASSKAZOV.TST'.

SQL> update rasskazov.tst set name1 = 'd';

1 row updated.

SQL> commit;

Commit complete.

rasskazov@cloudshell:~ (goldengate-test-project)$ bq query --nouse_legacy_sql 'select * from `goldengate-test-project.testdataset.A_TST` order by position'
Waiting on bqjob_r516eb3d42790f52_0000016874dbf7a4_1 ... (0s) Current status: DONE   
+--------+----------------------+----+-------+-------+-------+
| optype |       position       | ID | NAME1 | NAME2 | NAME3 |
+--------+----------------------+----+-------+-------+-------+
| I      | 00000000000000002101 | 1  | a     | b     | c     |
| U      | 00000000000000002251 | 1  | d     | NULL  | NULL  |
| U      | 00000000000000002461 | 1  | d     | b     | c     |
+--------+----------------------+----+-------+-------+-------+

rasskazov@cloudshell:~ (goldengate-test-project)$ bq head goldengate-test-project:testdataset.B_TST
+---------+----+-------+-------+-------+
| deleted | ID | NAME1 | NAME2 | NAME3 |
+---------+----+-------+-------+-------+
|   false | 1  | d     | b     | c     |
+---------+----+-------+-------+-------+

Everything looks much better now. We have three records in A_TST table running in log mode and a single record in B_TST table.

Let’s test the delete operation:

SQL> delete rasskazov.tst;

1 row deleted.

SQL> commit;

Commit complete.

rasskazov@cloudshell:~ (goldengate-test-project)$ bq query --nouse_legacy_sql 'select * from `goldengate-test-project.testdataset.A_TST` order by position'
Waiting on bqjob_r1155550c7cdaaf07_0000016874de3322_1 ... (0s) Current status: DONE   
+--------+----------------------+----+-------+-------+-------+
| optype |       position       | ID | NAME1 | NAME2 | NAME3 |
+--------+----------------------+----+-------+-------+-------+
| I      | 00000000000000002101 | 1  | a     | b     | c     |
| U      | 00000000000000002251 | 1  | d     | NULL  | NULL  |
| U      | 00000000000000002461 | 1  | d     | b     | c     |
| D      | 00000000000000002629 | 1  | d     | b     | c     |
+--------+----------------------+----+-------+-------+-------+

rasskazov@cloudshell:~ (goldengate-test-project)$ bq head goldengate-test-project:testdataset.B_TST
+---------+----+-------+-------+-------+
| deleted | ID | NAME1 | NAME2 | NAME3 |
+---------+----+-------+-------+-------+
|    true | 1  | d     | b     | c     |
+---------+----+-------+-------+-------+

Everything works as expected. There is one new record with optype=’D’ in A_TST table and the deleted field is set to true in B_TST table.

But wait, how this can work for B_TST table running in auditLogMode = false, assuming that GoldenGate does not use DML statements because of BigQuery limitations? The answer is Streaming Buffer.

Let’s insert another row, but update it after a while, when the Streaming Buffer is disconnected from a table.

SQL> insert into rasskazov.tst values(2, 'x', 'y', 'z');

1 row created.

SQL> commit;

Commit complete.

rasskazov@cloudshell:~ (goldengate-test-project)$ bq query --nouse_legacy_sql 'select * from `goldengate-test-project.testdataset.A_TST` order by position'
Waiting on bqjob_r62df928333cdddc7_0000016874e09267_1 ... (0s) Current status: DONE   
+--------+----------------------+----+-------+-------+-------+
| optype |       position       | ID | NAME1 | NAME2 | NAME3 |
+--------+----------------------+----+-------+-------+-------+
| I      | 00000000000000002101 | 1  | a     | b     | c     |
| U      | 00000000000000002251 | 1  | d     | NULL  | NULL  |
| U      | 00000000000000002461 | 1  | d     | b     | c     |
| D      | 00000000000000002629 | 1  | d     | b     | c     |
| I      | 00000000000000002775 | 2  | x     | y     | z     |
+--------+----------------------+----+-------+-------+-------+

rasskazov@cloudshell:~ (goldengate-test-project)$ bq head goldengate-test-project:testdataset.B_TST
+---------+----+-------+-------+-------+
| deleted | ID | NAME1 | NAME2 | NAME3 |
+---------+----+-------+-------+-------+
|    true | 1  | d     | b     | c     |
|   false | 2  | x     | y     | z     |
+---------+----+-------+-------+-------+

We can see Streaming Buffer statistics by “bq show –format=prettyjson” command:

rasskazov@cloudshell:~ (goldengate-test-project)$ bq show --format=prettyjson goldengate-test-project:testdataset.B_TST
{
  "creationTime": "1548146579832",
  "etag": "AORyfYL9EKY4VyHTy5tmhQ==",
  "id": "goldengate-test-project:testdataset.B_TST",
  "kind": "bigquery#table",
  "lastModifiedTime": "1548146581019",
  "location": "US",
  "numBytes": "0",
...
  "streamingBuffer": {
    "estimatedBytes": "64",
    "estimatedRows": "5",
    "oldestEntryTime": "1548146580000"
  },
...

rasskazov@cloudshell:~ (goldengate-test-project)$ bq show goldengate-test-project:testdataset.B_TST
Table goldengate-test-project:testdataset.B_TST

   Last modified                Schema               Total Rows   Total Bytes   Expiration   Time Partitioning   Labels
 ----------------- -------------------------------- ------------ ------------- ------------ ------------------- --------
  22 Jan 18:43:01   |- deleted: boolean (required)   0            0
                    |- ID: string
                    |- NAME1: string
                    |- NAME2: string
                    |- NAME3: string

We can see five estimated rows in the Streaming Buffer, but 0 rows for our B_TST table. I assume that GoldenGate uses streaming inserts in de-duplication mode providing insertId property. But this doesn’t work if the record is not in the Streaming Buffer.

Let’s wait for a while and do an update when the Streaming Buffer does not have our row

rasskazov@cloudshell:~ (goldengate-test-project)$ bq show goldengate-test-project:testdataset.B_TST
Table goldengate-test-project:testdataset.B_TST

   Last modified                Schema               Total Rows   Total Bytes   Expiration   Time Partitioning   Labels
 ----------------- -------------------------------- ------------ ------------- ------------ ------------------- --------
  22 Jan 20:01:59   |- deleted: boolean (required)   2            26
                    |- ID: string
                    |- NAME1: string
                    |- NAME2: string
                    |- NAME3: string

SQL> update rasskazov.tst set name1 = 'd' where id = 2;

1 row updated.

SQL> commit;

Commit complete.

rasskazov@cloudshell:~ (goldengate-test-project)$ bq head goldengate-test-project:testdataset.B_TST
+---------+----+-------+-------+-------+
| deleted | ID | NAME1 | NAME2 | NAME3 |
+---------+----+-------+-------+-------+
|    true | 1  | d     | b     | c     |
|   false | 2  | x     | y     | z     |
|   false | 2  | d     | y     | z     |
+---------+----+-------+-------+-------+

We can see an inconsistency here. There are two records with the same ID and there is no way to determine which is the correct one. Let’s wait again until the Streaming Buffer is flushed and run a delete statement:

SQL> delete rasskazov.tst;

1 row deleted.

SQL> commit;

Commit complete.

rasskazov@cloudshell:~ (goldengate-test-project)$ bq head goldengate-test-project:testdataset.B_TST
+---------+----+-------+-------+-------+
| deleted | ID | NAME1 | NAME2 | NAME3 |
+---------+----+-------+-------+-------+
|    true | 1  | d     | b     | c     |
|   false | 2  | x     | y     | z     |
|   false | 2  | d     | y     | z     |
|    true | 2  | d     | y     | z     |
+---------+----+-------+-------+-------+

Here is the third record with ID=2.

In summary, “auditLogMode = false” behavior depends on whether the record is in the Streaming Buffer or not. This mode should be used very carefully because it can lead to inconsistent data.

email

Interested in working with Vyacheslav? Schedule a tech call.

1 Comment. Leave new

Valentin Nikotin
January 31, 2019 3:34 pm

Awesome insight!

Reply

Leave a Reply

Your email address will not be published. Required fields are marked *