GoldenGate 12.2 big data adapters: part 1 – HDFS


December 2015 brought us a new version of GoldenGate, along with a new version of the Big Data adapters for it. Let's have a look at what we have now and how it works. I am going to start with the HDFS adapter.

As a first step, we need to prepare our source database for replication. It becomes easier with every new GoldenGate version. We will need to perform several steps:
a) Enable archive logging on our database. This particular step requires downtime.

orcl> alter database mount;
Database altered.
orcl> alter database archivelog;
Database altered.
orcl> alter database open;

b) Enable force logging and minimal supplemental logging. There is no need to shut down the database for this.

orcl> alter database add supplemental log data;
Database altered.
orcl> alter database force logging;
Database altered.
orcl> SELECT supplemental_log_data_min, force_logging FROM v$database;
SUPPLEME FORCE_LOGGING
-------- ---------------------------------------
YES	 YES

c) Set the parameter "enable_goldengate_replication" to "TRUE". This can be done online.

orcl> alter system set enable_goldengate_replication=true sid='*' scope=both;
System altered.
orcl>

And we are almost done. Now we can create a schema for a GoldenGate administrator and grant the required privileges. I've simply granted the DBA role to the user to keep the process simple; you will need elevated privileges anyway for integrated capture. For a production installation I advise you to check the documentation to verify the necessary privileges and roles.

orcl> create user ogg identified by welcome1 default tablespace users temporary tablespace temp;
orcl> grant connect, dba to ogg;

Let's create a test schema to be replicated. The schema on the source will be named ggtest, and the destination schema bdtest. This will also let us check how the schema mapping works in our replication.

orcl> create tablespace ggtest; -- optional step
orcl> create user ggtest identified by welcome1 default tablespace ggtest temporary tablespace temp;
orcl> grant connect, resource to ggtest;

Everything is ready on our source database for the replication.
Now we install Oracle GoldenGate for Oracle on our database server. We can get the software from the Oracle download page, in the Middleware section, under GoldenGate, Oracle GoldenGate for Oracle databases. We are going to use version 12.2.0.1.1 of the software. The installation is easy: you unzip the software and run the installer, which guides you through a couple of simple steps. The installer unpacks the software to the destination location, creates the subdirectories, and registers GoldenGate in the Oracle central inventory.

[oracle@sandbox distr]$ unzip fbo_ggs_Linux_x64_shiphome.zip
[oracle@sandbox distr]$ cd fbo_ggs_Linux_x64_shiphome/Disk1/
[oracle@sandbox Disk1]$ ./runInstaller

We continue by setting up parameters for the Oracle GoldenGate (OGG) manager and starting it up. You can see that I've used the default Blowfish encryption for the password. In a production environment you may consider a stronger algorithm such as AES256. I've also used a non-default port for the manager, since I have more than one GoldenGate installation on my test sandbox.

[oracle@sandbox ~]$ export OGG_HOME=/u01/oggora
[oracle@sandbox ~]$ cd $OGG_HOME
[oracle@sandbox oggora]$ ./ggsci
GGSCI (sandbox.localdomain) 1> encrypt password welcome1 BLOWFISH ENCRYPTKEY DEFAULT
Using Blowfish encryption with DEFAULT key.
Encrypted password:  AACAAAAAAAAAAAIARIXFKCQBMFIGFARA
Algorithm used:  BLOWFISH
GGSCI (sandbox.localdomain) 2> edit params mgr
PORT 7829
userid ogg@orcl,password AACAAAAAAAAAAAIARIXFKCQBMFIGFARA, BLOWFISH, ENCRYPTKEY DEFAULT
purgeoldextracts /u01/oggora/dirdat/*, usecheckpoints
GGSCI (sandbox.localdomain) 3> start manager
Manager started.

Let’s prepare everything for initial load, and later online replication.
I've decided to use a GoldenGate initial-load extract for the initial load, for the sake of consistency of the resulting dataset on Hadoop.
I prepared a parameter file to replicate my ggtest schema and upload all the data to a trail file on the remote site. I've used a minimal number of options for all my processes, providing only the handful of parameters required for replication. Extract options are a subject deserving a dedicated blog post. Here is my simple initial-load extract:

[oracle@sandbox oggora]$ cat /u01/oggora/dirprm/ini_ext.prm
SOURCEISTABLE
userid ogg@orcl,password AACAAAAAAAAAAAIARIXFKCQBMFIGFARA, BLOWFISH, ENCRYPTKEY DEFAULT
--RMTHOSTOPTIONS
RMTHOST sandbox, MGRPORT 7839
RMTFILE /u01/oggbd/dirdat/initld, MEGABYTES 2, PURGE
--DDL include objname ggtest.*
TABLE ggtest.*;

Then we run the initial-load extract in passive mode. It creates a trail file with the data, which will be used later for our initial load on the target side.

[oracle@sandbox oggora]$ ./extract paramfile dirprm/ini_ext.prm reportfile dirrpt/ini_ext.rpt
[oracle@sandbox oggora]$ ll /u01/oggbd/dirdat/initld*
-rw-r-----. 1 oracle oinstall 3028 Feb 16 14:17 /u01/oggbd/dirdat/initld
[oracle@sandbox oggora]$

We can also prepare our extract on the source site. I haven't used a datapump in my configuration, limiting the topology to the simplest, most straightforward extract-to-replicat setup. Of course, in any production configuration I would advise using a datapump on the source to stage the data.
Here are my extract parameters, and how I added the extract. I am not starting it yet, because the Oracle GoldenGate Manager must be running on the target first, and the directory for the trail file has to exist. You may have guessed that the Big Data GoldenGate will be located in the /u01/oggbd directory.

[oracle@sandbox oggora]$ ggsci
Oracle GoldenGate Command Interpreter for Oracle
Version 12.2.0.1.1 OGGCORE_12.2.0.1.0_PLATFORMS_151211.1401_FBO
Linux, x64, 64bit (optimized), Oracle 12c on Dec 12 2015 02:56:48
Operating system character set identified as UTF-8.
Copyright (C) 1995, 2015, Oracle and/or its affiliates. All rights reserved.
GGSCI (sandbox.localdomain) 1> edit params ggext
extract ggext
userid ogg@orcl,password AACAAAAAAAAAAAIARIXFKCQBMFIGFARA, BLOWFISH, ENCRYPTKEY DEFAULT
--RMTHOSTOPTIONS
RMTHOST sandbox, MGRPORT 7839
RMTFILE /u01/oggbd/dirdat/or, MEGABYTES 2, PURGE
DDL include objname ggtest.*
TABLE ggtest.*;
GGSCI (sandbox.localdomain) 2> dblogin userid ogg@orcl,password AACAAAAAAAAAAAIARIXFKCQBMFIGFARA, BLOWFISH, ENCRYPTKEY DEFAULT
Successfully logged into database.
GGSCI (sandbox.localdomain as ogg@orcl) 3> register extract GGEXT database
2016-02-16 15:37:21  INFO    OGG-02003  Extract GGEXT successfully registered with database at SCN 17151616.
GGSCI (sandbox.localdomain as ogg@orcl) 4> add extract ggext, INTEGRATED TRANLOG, BEGIN NOW
EXTRACT (Integrated) added.

Let's leave the source site for a while and switch to the target. Our target is going to be a box with a Hadoop client and all the required Java classes.
I used the same box just to save resources in my sandbox environment. You may run different GoldenGate versions on the same box, provided the Manager port for each of them is different.
Essentially we need a Hadoop client on the box that can connect to HDFS and write data there. Installing the Hadoop client is out of scope for this article, but you can easily find all the necessary information on the Hadoop home page.

Having all the required Hadoop classes, we continue by installing Oracle GoldenGate for Big Data, then configuring and starting it up. In the past I received several questions from people struggling to find the exact place where the adapters could be downloaded. The adapters were well "hidden" on Oracle eDelivery, but now it is much simpler. Go to the GoldenGate download page on the Oracle site and find the section "Oracle GoldenGate for Big Data 12.2.0.1.0", where you can choose OGG for Linux x86_64, Windows or Solaris. You will need an Oracle account to get it. We upload the file to our Linux box, unzip it, and unpack the tar archive. I created the directory /u01/oggbd as our GoldenGate home and unpacked the tar archive there.
The next step is to create all the necessary directories. We start the command-line utility and create all the subdirectories.

[oracle@sandbox ~]$ cd /u01/oggbd/
[oracle@sandbox oggbd]$ ./ggsci
Oracle GoldenGate Command Interpreter
Version 12.2.0.1.0 OGGCORE_12.2.0.1.0_PLATFORMS_151101.1925.2
Linux, x64, 64bit (optimized), Generic on Nov 10 2015 16:18:12
Operating system character set identified as UTF-8.
Copyright (C) 1995, 2015, Oracle and/or its affiliates. All rights reserved.
GGSCI (sandbox.localdomain) 1> create subdirs
Creating subdirectories under current directory /u01/oggbd
Parameter files                /u01/oggbd/dirprm: created
Report files                   /u01/oggbd/dirrpt: created
Checkpoint files               /u01/oggbd/dirchk: created
Process status files           /u01/oggbd/dirpcs: created
SQL script files               /u01/oggbd/dirsql: created
Database definitions files     /u01/oggbd/dirdef: created
Extract data files             /u01/oggbd/dirdat: created
Temporary files                /u01/oggbd/dirtmp: created
Credential store files         /u01/oggbd/dircrd: created
Masterkey wallet files         /u01/oggbd/dirwlt: created
Dump files                     /u01/oggbd/dirdmp: created
GGSCI (sandbox.localdomain) 2>

We change the port for our manager process from the default and start it up. As I've already mentioned, the port was changed because of the several GoldenGate managers running from different directories.

GGSCI (sandbox.localdomain) 2> edit params mgr
PORT 7839
.....
GGSCI (sandbox.localdomain) 3> start manager
Manager started.

Now we have to prepare the parameter files for our replicat processes. Let's assume the environment variable OGGHOME represents the GoldenGate home; in our case it is /u01/oggbd.
Examples of the parameter files can be taken from the $OGGHOME/AdapterExamples/big-data directories. There you will find examples for flume, kafka, hdfs, hbase, and the metadata providers. Today we are going to work with the HDFS adapter.
I copied the files to my parameter-file directory ($OGGHOME/dirprm) and modified them accordingly:

[oracle@sandbox oggbd]$ cp /u01/oggbd/AdapterExamples/big-data/hdfs/* /u01/oggbd/dirprm/
[oracle@sandbox oggbd]$ vi /u01/oggbd/dirprm/hdfs.props

Here are my values for the hdfs.props file:

[oracle@bigdata dirprm]$ cat hdfs.props
gg.handlerlist=hdfs
gg.handler.hdfs.type=hdfs
gg.handler.hdfs.includeTokens=false
gg.handler.hdfs.maxFileSize=1g
gg.handler.hdfs.rootFilePath=/user/oracle/gg
gg.handler.hdfs.fileRollInterval=0
gg.handler.hdfs.inactivityRollInterval=0
gg.handler.hdfs.fileSuffix=.txt
gg.handler.hdfs.partitionByTable=true
gg.handler.hdfs.rollOnMetadataChange=true
gg.handler.hdfs.authType=none
gg.handler.hdfs.format=delimitedtext
#gg.handler.hdfs.format.includeColumnNames=true
gg.handler.hdfs.mode=tx
goldengate.userexit.timestamp=utc
goldengate.userexit.writers=javawriter
javawriter.stats.display=TRUE
javawriter.stats.full=TRUE
gg.log=log4j
gg.log.level=INFO
gg.report.time=30sec
gg.classpath=/usr/lib/hadoop/*:/usr/lib/hadoop/lib/*:/usr/lib/hadoop-hdfs/*:/usr/lib/hadoop/etc/hadoop/:/usr/lib/hadoop/lib/native/*
javawriter.bootoptions=-Xmx512m -Xms32m -Djava.class.path=ggjava/ggjava.jar

You can find information about all these parameters in the Oracle documentation, but here are the parameters you will most likely need to change from their defaults:

  • gg.handler.hdfs.rootFilePath – tells the handler where the directories and files are to be written on HDFS.
  • gg.handler.hdfs.format – sets one of the four output formats supported by the adapter.
  • goldengate.userexit.timestamp – depends on your preference for the transaction timestamps written to the HDFS files.
  • gg.classpath – depends on the location of your Hadoop jar classes and native libraries.

You can see I've commented out the gg.handler.hdfs.format.includeColumnNames parameter. It puts the column name before each value in the output file on HDFS. That may be helpful in some cases, but at the same time it makes the files bigger. If you are planning to create an external Hive table, you may consider commenting it out, as I have.
The next parameter file is for our data-initialization replicat. You may consider using Sqoop or another method for the initial load of your tables, but I think it makes sense to use a GoldenGate replicat if the tables are relatively small. It makes the resulting file set more consistent with the subsequent replication, since the same engine and format are used. So, here is my replicat for the initial load:

[oracle@sandbox dirprm]$ cat /u01/oggbd/dirprm/irhdfs.prm
--passive REPLICAT for initial load irhdfs
-- Trail file for this example is located in "dirdat/initld"
-- Command to run REPLICAT:
-- ./replicat paramfile dirprm/irhdfs.prm reportfile dirrpt/ini_rhdfs.rpt
SPECIALRUN
END RUNTIME
EXTFILE /u01/oggbd/dirdat/initld
--DDLERROR default discard
setenv HADOOP_COMMON_LIB_NATIVE_DIR=/usr/lib/hadoop/lib/native
DDL include all
TARGETDB LIBFILE libggjava.so SET property=dirprm/hdfs.props
REPORTCOUNT EVERY 1 MINUTES, RATE
GROUPTRANSOPS 10000
MAP ggtest.*, TARGET bdtest.*;

I ran the initial load in passive mode, without creating a managed process, just running the replicat from the command line. Here is an example:

[oracle@sandbox oggbd]$ ./replicat paramfile dirprm/irhdfs.prm reportfile dirrpt/ini_rhdfs.rpt
[oracle@sandbox oggbd]$ hadoop fs -ls /user/oracle/gg/
Found 2 items
drwxr-xr-x   - oracle oracle          0 2016-02-16 14:37 /user/oracle/gg/bdtest.test_tab_1
drwxr-xr-x   - oracle oracle          0 2016-02-16 14:37 /user/oracle/gg/bdtest.test_tab_2
[oracle@sandbox oggbd]$ hadoop fs -ls /user/oracle/gg/bdtest.test_tab_1
Found 1 items
-rw-r--r--   1 oracle oracle        624 2016-02-16 14:37 /user/oracle/gg/bdtest.test_tab_1/bdtest.test_tab_1_2016-02-16_14-37-43.376.txt
[oracle@sandbox oggbd]$ hadoop fs -tail /user/oracle/gg/bdtest.test_tab_1/bdtest.test_tab_1_2016-02-16_14-37-43.376.txt
IBDTEST.TEST_TAB_12016-02-16 19:17:40.7466992016-02-16T14:37:43.37300000000000-100000020121371O62FX2014-01-24:19:09:20RJ68QYM52014-01-22:12:14:30
IBDTEST.TEST_TAB_12016-02-16 19:17:40.7466992016-02-16T14:37:44.89600000000000-100000021552371O62FX2014-01-24:19:09:20HW82LI732014-05-11:05:23:23
IBDTEST.TEST_TAB_12016-02-16 19:17:40.7466992016-02-16T14:37:44.89600100000000-100000022983RXZT5VUN2013-09-04:23:32:56RJ68QYM52014-01-22:12:14:30
IBDTEST.TEST_TAB_12016-02-16 19:17:40.7466992016-02-16T14:37:44.89600200000000-100000024414RXZT5VUN2013-09-04:23:32:56HW82LI732014-05-11:05:23:23
[oracle@sandbox oggbd]$

You can see the Hadoop directories and files created by the initial load.
As soon as the initial load has completed, we can start our extract and replicat to keep the destination side updated.
We move back to the source and start the extract prepared earlier.

GGSCI (sandbox.localdomain as ogg@orcl) 6>start extract ggext
Sending START request to MANAGER ...
EXTRACT GGEXT starting

So, moving back to the target, let's prepare our replicat. I used a replicat with the following parameters:

[oracle@sandbox oggbd]$ cat /u01/oggbd/dirprm/rhdfs.prm
REPLICAT rhdfs
-- Trail file for this example is located in "dirdat/or" directory
-- Command to add REPLICAT
-- add replicat rhdfs, exttrail dirdat/or
--DDLERROR default discard
setenv HADOOP_COMMON_LIB_NATIVE_DIR=/usr/lib/hadoop/lib/native
DDL include all
TARGETDB LIBFILE libggjava.so SET property=dirprm/hdfs.props
REPORTCOUNT EVERY 1 MINUTES, RATE
GROUPTRANSOPS 10000
MAP ggtest.*, TARGET bdtest.*;

We add the replicat to our configuration, and it will carry on the replication.

GGSCI (sandbox.localdomain) 1> add replicat rhdfs, exttrail dirdat/or
REPLICAT added.
GGSCI (sandbox.localdomain) 2> start replicat rhdfs
Sending START request to MANAGER ...
REPLICAT RHDFS starting
GGSCI (sandbox.localdomain) 3>

Our replication is up and running, the initial load worked fine, and we can test and see what we have on source and target.
Here is the data in one of our source tables:

orcl> select * from ggtest.test_tab_1;
	   PK_ID RND_STR    USE_DATE	      RND_STR_1  ACC_DATE
---------------- ---------- ----------------- ---------- -----------------
	       1 371O62FX   01/24/14 19:09:20 RJ68QYM5	 01/22/14 12:14:30
	       2 371O62FX   01/24/14 19:09:20 HW82LI73	 05/11/14 05:23:23
	       3 RXZT5VUN   09/04/13 23:32:56 RJ68QYM5	 01/22/14 12:14:30
	       4 RXZT5VUN   09/04/13 23:32:56 HW82LI73	 05/11/14 05:23:23
orcl>

I've created an external Hive table over test_tab_1 to have a better look at the data.

hive> CREATE EXTERNAL TABLE BDTEST.TEST_TAB_1  (tran_flag string, tab_name string, tran_time_utc timestamp, tran_time_loc string,something string, something1 string,
    > PK_ID INT, RND_STR VARCHAR(10),USE_DATE string,RND_STR_1 string, ACC_DATE string)
    > stored as textfile location '/user/oracle/gg/bdtest.test_tab_1';
OK
Time taken: 0.327 seconds
hive> select * from BDTEST.TEST_TAB_1;
OK
I	BDTEST.TEST_TAB_1	2016-02-16 19:17:40.746699	2016-02-16T14:37:43.373000	00000000-10000002012		1	371O62FX	2014-01-24:19:09:20	RJ68QYM5	2014-01-22:12:14:30
I	BDTEST.TEST_TAB_1	2016-02-16 19:17:40.746699	2016-02-16T14:37:44.896000	00000000-10000002155		2	371O62FX	2014-01-24:19:09:20	HW82LI73	2014-05-11:05:23:23
I	BDTEST.TEST_TAB_1	2016-02-16 19:17:40.746699	2016-02-16T14:37:44.896001	00000000-10000002298		3	RXZT5VUN	2013-09-04:23:32:56	RJ68QYM5	2014-01-22:12:14:30
I	BDTEST.TEST_TAB_1	2016-02-16 19:17:40.746699	2016-02-16T14:37:44.896002	00000000-10000002441		4	RXZT5VUN	2013-09-04:23:32:56	HW82LI73	2014-05-11:05:23:23
Time taken: 0.155 seconds, Fetched: 4 row(s)
hive>

You can see the table definition is a bit different from what we have on the source side, and you will see why.
We get additional columns on the destination side. Interestingly, while some of them have a pretty clear purpose, others are not entirely clear and contain null values.
The first column is the operation flag: it shows what kind of operation produced the row. It can be "I" for insert, "D" for delete, or "U" for update. The second column is the table name. The third is a timestamp in UTC showing when the transaction occurred. The next one is the commit time in the local timezone, and the column after that holds a commit sequence number. Those columns can help you construct the proper dataset for any given point in time.
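To make that record layout concrete, here is a minimal Python sketch of how such a delimited-text record could be split into its metadata and table columns. The delimiter choice, the parse_record helper, and the exact metadata field order are assumptions for illustration (the handler's actual delimiter is not visible in the output above); this is not the adapter's API.

```python
# Hedged sketch: splitting one change record into metadata and table columns.
# Assumption: fields separated by a single delimiter character, with the
# metadata columns first, in the order described above.

def parse_record(line, delimiter="\u0001"):
    fields = line.rstrip("\n").split(delimiter)
    meta = {
        "op_type": fields[0],          # I = insert, U = update, D = delete
        "table": fields[1],            # e.g. BDTEST.TEST_TAB_1
        "op_ts_utc": fields[2],        # transaction time in UTC
        "commit_ts_local": fields[3],  # commit time in the local timezone
        "csn": fields[4],              # commit sequence number
    }
    return meta, fields[5:]            # the rest are the table's own columns

# Synthetic sample row mirroring the values shown in the Hive output above.
sample = "\u0001".join([
    "I", "BDTEST.TEST_TAB_1",
    "2016-02-16 19:17:40.746699", "2016-02-16T14:37:43.373000",
    "00000000-10000002012",
    "1", "371O62FX", "2014-01-24:19:09:20", "RJ68QYM5", "2014-01-22:12:14:30",
])
meta, cols = parse_record(sample)
print(meta["op_type"], meta["table"], cols[0])  # I BDTEST.TEST_TAB_1 1
```

A real consumer would also have to account for the extra null-valued metadata columns noted above, whose purpose is unclear.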
Let’s insert and update some row(s) on source and see how it will show up on the target:

orcl> insert into ggtest.test_tab_1 values (5,'TEST_1',sysdate,'TEST_1',sysdate);
1 row created.
orcl> commit;
orcl> update ggtest.test_tab_1 set RND_STR='TEST_1_1' where PK_ID=5;
1 row updated.
orcl> commit;
Commit complete.
orcl>

Let's check how it is reflected on the destination site. We see two new rows, each representing a DML operation: one for the insert and the second for the update.

hive> select * from BDTEST.TEST_TAB_1;
OK
I	BDTEST.TEST_TAB_1	2016-02-16 19:17:40.746699	2016-02-16T14:37:43.373000	00000000-10000002012		1	371O62FX	2014-01-24:19:09:20	RJ68QYM5	2014-01-22:12:14:30
I	BDTEST.TEST_TAB_1	2016-02-16 19:17:40.746699	2016-02-16T14:37:44.896000	00000000-10000002155		2	371O62FX	2014-01-24:19:09:20	HW82LI73	2014-05-11:05:23:23
I	BDTEST.TEST_TAB_1	2016-02-16 19:17:40.746699	2016-02-16T14:37:44.896001	00000000-10000002298		3	RXZT5VUN	2013-09-04:23:32:56	RJ68QYM5	2014-01-22:12:14:30
I	BDTEST.TEST_TAB_1	2016-02-16 19:17:40.746699	2016-02-16T14:37:44.896002	00000000-10000002441		4	RXZT5VUN	2013-09-04:23:32:56	HW82LI73	2014-05-11:05:23:23
I	BDTEST.TEST_TAB_1	2016-02-16 20:43:32.000231	2016-02-16T15:43:37.199000	00000000000000002041		5	TEST_1	2016-02-16:15:43:25	TEST_1	2016-02-16:15:43:25
U	BDTEST.TEST_TAB_1	2016-02-16 20:43:53.000233	2016-02-16T15:43:56.056000	00000000000000002243		5	TEST_1_1
Time taken: 2.661 seconds, Fetched: 6 row(s)
hive>

It works for deletes too; the flag will simply be "D" instead of "I" for inserts or "U" for updates.

What about DDL support? Let’s truncate the table.

orcl> truncate table ggtest.test_tab_1;
Table truncated.
orcl>

And here, there is nothing in our HDFS files. Maybe I've missed something, but it looks like the truncate operation does not create any record. I need to dig a bit more; I will try to write a separate blog post about DDL support in the Big Data adapters.
DDL replication works pretty well when we create a new table and insert new rows.

It also works if we change one of the existing tables, adding or dropping a column. Let's try creating a new table:

orcl> create table ggtest.test_tab_4 (pk_id number, rnd_str_1 varchar2(10),acc_date date);
Table created.
orcl> insert into ggtest.test_tab_4 select * from ggtest.test_tab_2;
1 row created.
orcl> commit;
Commit complete.
orcl>

You can see that it has created a new directory and file for the new table. Additionally, if you add a column, a new file will be used for all subsequent DML operations on the altered table. This helps separate files with different structures.

[oracle@sandbox oggbd]$ hadoop fs -ls /user/oracle/gg/
Found 3 items
drwxr-xr-x   - oracle oracle          0 2016-02-16 15:43 /user/oracle/gg/bdtest.test_tab_1
drwxr-xr-x   - oracle oracle          0 2016-02-16 14:37 /user/oracle/gg/bdtest.test_tab_2
drwxr-xr-x   - oracle oracle          0 2016-02-16 15:56 /user/oracle/gg/bdtest.test_tab_4
[oracle@sandbox oggbd]$ hadoop fs -ls /user/oracle/gg/bdtest.test_tab_4/
Found 1 items
-rw-r--r--   1 oracle oracle        127 2016-02-16 15:56 /user/oracle/gg/bdtest.test_tab_4/bdtest.test_tab_4_2016-02-16_15-56-50.373.txt
[oracle@sandbox oggbd]$ hadoop fs -tail /user/oracle/gg/bdtest.test_tab_4/bdtest.test_tab_4_2016-02-16_15-56-50.373.txt
IBDTEST.TEST_TAB_42016-02-16 20:56:47.0009532016-02-16T15:56:50.371000000000000000000068327IJWQRO7T2013-07-07:08:13:52
[oracle@sandbox oggbd]$

At first glance it looks good, but let's try a create table as select.

orcl> create table ggtest.test_tab_3 as select * from ggtest.test_tab_2;
Table created.
orcl>

I'm not sure whether it is expected behavior or a bug, but apparently it does not work. Our replicat abends and complains that "DDL Replication is not supported for this database".

[oracle@sandbox oggbd]$ tail -5 ggserr.log
2016-02-16 15:43:37  INFO    OGG-02756  Oracle GoldenGate Delivery, rhdfs.prm:  The definition for table GGTEST.TEST_TAB_1 is obtained from the trail file.
2016-02-16 15:43:37  INFO    OGG-06511  Oracle GoldenGate Delivery, rhdfs.prm:  Using following columns in default map by name: PK_ID, RND_STR, USE_DATE, RND_STR_1, ACC_DATE.
2016-02-16 15:43:37  INFO    OGG-06510  Oracle GoldenGate Delivery, rhdfs.prm:  Using the following key columns for target table bdtest.TEST_TAB_1: PK_ID.
2016-02-16 15:48:49  ERROR   OGG-00453  Oracle GoldenGate Delivery, rhdfs.prm:  DDL Replication is not supported for this database.
2016-02-16 15:48:49  ERROR   OGG-01668  Oracle GoldenGate Delivery, rhdfs.prm:  PROCESS ABENDING.
[oracle@sandbox oggbd]$
[oracle@sandbox oggbd]$ hadoop fs -ls /user/oracle/gg/
Found 2 items
drwxr-xr-x   - oracle oracle          0 2016-02-16 15:43 /user/oracle/gg/bdtest.test_tab_1
drwxr-xr-x   - oracle oracle          0 2016-02-16 14:37 /user/oracle/gg/bdtest.test_tab_2
[oracle@sandbox oggbd]$

What can we say in summary? The replication works, supporting all DML and some DDL commands. To get consistent datasets for any given point in time, you will need to use the operation flag and the transaction time. In my next few posts, I will cover the other Big Data adapters for GoldenGate.
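As a hedged illustration of that last point, here is a small Python sketch of how a table's state at a given time could be reconstructed from the I/U/D change-log records, using the operation flag and commit time described earlier. The record structure (dicts with op, commit_ts, pk_id, cols) and the state_at helper are simplified assumptions for illustration, not the adapter's actual file format.

```python
# Hedged sketch: replaying change-log records to rebuild table state as of a
# point in time. Assumes each record carries an operation flag ('I'/'U'/'D'),
# a sortable commit timestamp, the primary key, and the column values
# (None for columns an update record did not carry).

def state_at(changes, as_of):
    """Replay records with commit_ts <= as_of, keyed by primary key."""
    state = {}
    for rec in sorted(changes, key=lambda r: r["commit_ts"]):
        if rec["commit_ts"] > as_of:
            break
        pk = rec["pk_id"]
        if rec["op"] == "I":
            state[pk] = dict(rec["cols"])
        elif rec["op"] == "U":
            # merge only the columns actually present in the update record
            state.setdefault(pk, {}).update(
                {k: v for k, v in rec["cols"].items() if v is not None})
        elif rec["op"] == "D":
            state.pop(pk, None)
    return state

# Mirrors the insert/update pair shown for test_tab_1 above.
changes = [
    {"op": "I", "commit_ts": "2016-02-16T15:43:37", "pk_id": 5,
     "cols": {"RND_STR": "TEST_1", "RND_STR_1": "TEST_1"}},
    {"op": "U", "commit_ts": "2016-02-16T15:43:56", "pk_id": 5,
     "cols": {"RND_STR": "TEST_1_1", "RND_STR_1": None}},
]
print(state_at(changes, "2016-02-16T15:43:40"))  # state before the update
print(state_at(changes, "2016-02-16T16:00:00"))  # state after the update
```

In practice you would do this with a batch job (Hive, Spark, and so on) over the HDFS files rather than in-memory Python, but the replay logic is the same.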


About the Author

Regarded by his peers as an Oracle guru, Gleb is known for being able to resolve any problem related to Oracle. He loves the satisfaction of troubleshooting, and his colleagues even say that seeking Gleb’s advice regarding an issue is more efficient than looking it up. Gleb enjoys the variety of challenges he faces while working at Pythian, rather than working on the same thing every day. His areas of speciality include Oracle RAC, Exadata, RMAN, SQL tuning, high availability, storage, performance tuning, and many more. When he’s not working, running, or cycling, Gleb can be found reading.

33 Comments

Jim Collings
August 9, 2016 2:54 pm

Hello, nice article. I have one question about the GoldenGate adapters. Does the GG product see the adapters the same way it sees a replicat process? In other words, can you show the status of an adapter that is being sent data, or is the adapter a separate process that has to be monitored outside of GG? For example, let's say GG is replicating data fine to the adapter, but for some reason nothing is being written to the target. Do you get an error back from GG? And if the adapter fails, does GG show the status of the running adapters, or do you just have to know that something is wrong and check the adapter's logs to find out what the problem is?

Reply
Gleb Otochkin
August 18, 2016 1:09 pm

Hi Jim,
You have an OGG replicat on the adapter side and can see its status and details directly from GoldenGate:
GGSCI (bigdatalite.localdomain) 5> info rhbase

REPLICAT RHBASE Last Started 2016-03-29 12:56 Status RUNNING
Checkpoint Lag 00:00:00 (updated 00:00:08 ago)
Process ID 27277
Log Read Checkpoint File dirdat/or000045
2016-03-24 13:09:02.000274 RBA 1018490

The same will be about errors and logging for the adapter side. You work with it as with any other replicat.
Hope it helps,
Gleb

Reply

Hello Gleb –

I'm Reddy, and it was a pleasure reading through your article… very informative! I have a situation where I have to replicate from an Oracle DB into HDFS using GoldenGate, so my question is: is all DDL supported if I were to enable DDL replication into HDFS?

By ddl, I’m referring to the following situations

altering existing tables to add new columns
altering existing tables to modify the datatype of existing columns
flexibility to add tables that are created as new (we are going to perform schema-level replication)

can you please let us know from prior experience if DDL replication is even possible into hdfs?

Reply

Hi Reddy,
Yes, DDL commands that alter a table's structure are supported in OGG for Big Data. So, if you add or change a column, it will be reflected with the next DML transaction. Just keep in mind that the change becomes visible and is replicated only with the following DML.
Also, it will look slightly different for different output formats. For csv text on HDFS you get a new row with the changed number of columns, and for JSON you get the schema/table definition with the new values.
Hope it helps.
Gleb

Reply

Hi,
How do you configure high availability for Oracle GoldenGate for Big Data? Let's assume a 10-node Hadoop cluster. I have configured OGG BD on node5 (one of the data nodes) and I am running my replicat process there. The trail files are pumped to node5. What if node5 goes down? What happens to the trail files and the replicat process? What are the options for high availability?

Desperately waiting for your reply,
Rony

Reply

Hi Rony,
If you want GoldenGate to be highly available, you need to think about creating a highly available environment for GoldenGate itself. I would not install it on any Hadoop node, but would prepare an HA solution for GoldenGate. As an example, you can create an Oracle RAC, put GoldenGate and the trail files on a shared filesystem, and configure high availability there. OGG requires the Hadoop client software to push data to the Hadoop cluster, but that doesn't mean OGG has to be installed on any of your Hadoop nodes.

Gleb

Reply

I tried to replicate the process, but I keep getting errors about “OGG-00423 Could not find definition for GGTEST.RAINFALL”

I try to add the DDL option in the param file but seems it is incompatible with the SOURCEISTABLE option.

Reply

I got that error when I was trying to do the replicat process in the destination box

Reply

I tried to use the ASSUMETARGETDEFS option in the irhdfs.prm for the replicat process, but then it claims that “table bdtest.RAINFALL does not exist in target database”

Reply

I think I sorted it out by using SOURCEDEFS and defgen to generate a definition file.

And as the title indicates, OGG 12.2 supports trail files that contain the definitions directly, so no other steps are needed in the parameter file.

Reply

It was working for me, and all the DDL had been uploaded to the trail file. You may try running a simple extract without SOURCEISTABLE but with DDL included; it should have all the definitions too. Let me know if that works for you.
The defgen approach would certainly work, but it is not flexible and requires recreating the definitions file every time you change the objects' definitions.

Regards,
Gleb

Hi Tim,

I am getting the same error message, how was the issue resolved?

You have to have the DDL uploaded to a trail file (for OGG 12.2), or use a defgen file if you don't use the DDL option for your extract. With earlier versions of OGG, such as 12.1, you have to use defgen.

Regards,
Gleb

Reply

Hi Gleb,
But how can you maintain a non-transactional copy of the table? As you mentioned, each action (insert/update/delete) creates a new row with an op_type column of I/U/D. However, I just want to replicate the table, which means 3 inserts + 1 delete should end up as 2 records in the Hadoop txt file instead of 4. The same applies to updates.

If I want to achieve the goal above, how can I configure it?

Thanks,

Reply
Gleb Otochkin
October 17, 2016 9:29 am

Hi Tim,
I am not sure you can do that with a text file. If I have understood you correctly, you want to skip some operations, for example when you insert and delete the same row inside one transaction.
I don't think it is possible in the current version using a parameter. The handler has two modes, "tx" and "op", but they don't filter any of your DML; they only enable or disable grouping of messages.
GoldenGate is not batch but rather streaming replication, and it replicates all committed changes.
Thanks,
Gleb

Hi Gleb,
Sorry, I might not have made my point clear. What I want is a result set (in HDFS) exactly the same as the source (in terms of content).
In the blog you did an update, but instead of an updated record on the HDFS side, you got a new row with op_type = U (U BDTEST.TEST_TAB_1 2016-02-16 20:43:53.000233 2016-02-16T15:43:56.056000 00000000000000002243 5 TEST_1_1).
What I want is, instead of a record of the update, for the update to actually happen in the resulting txt file (your update should not be a new record, but should really get updated in the HDFS file).
If I want this to happen, is there anything else I should configure?

Reply
Gleb Otochkin
October 18, 2016 8:34 am

Hi Tim,
I think you may need to consider another platform as the destination. HDFS is not really suited to keeping a mutable dataset: you create and append data to files in HDFS, but you don't update information there.
I would suggest looking at HDFS (or a Kafka topic) as a log of changes, something more like CDC than a replication of state for an object. You can of course construct the state of an object for any given time from the initial state and the log of changes, but that requires some programming.
You might look at HBase instead as a target for your replication, to see the state of the object rather than a log of changes.

Thanks,
Gleb

Reply
Balu Nimmagadda
November 1, 2016 9:36 am

With the above configuration I was not able to replicate updates. Is there any extra configuration that needs to be done?

Reply
Balu Nimmagadda
November 1, 2016 9:40 am

2016-11-01 09:29:21 WARNING OGG-01431 Aborted grouped transaction on ‘hrhdfs.EMPLOYEES’, Mapping error.

2016-11-01 09:29:21 WARNING OGG-01003 Repositioning to rba 2542 in seqno 0.

2016-11-01 09:29:21 WARNING OGG-01151 Error mapping from HR.EMPLOYEES to hrhdfs.EMPLOYEES.


Hi Balu,
It shouldn't require any extra configuration, but your HDFS target has to be ready.
Are you getting errors only for updates, or for inserts and deletes as well?
Unfortunately, the information you have provided doesn't give any clue about the cause. I assume the log is from the replicat side.
A couple of things to verify:
Check your parameter files for the replicat again.
Check the existence and permissions of the destination HDFS directory to make sure the user running OGG has rights to write to it.
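For example, the permission check could look like this (the target directory path is an assumption; run the commands as the OS user that runs OGG):

```shell
# Does the target directory exist, and who owns it?
hadoop fs -ls /user/oracle/ogg

# Verify the OGG OS user can actually write there
hadoop fs -touchz /user/oracle/ogg/_ogg_write_test
hadoop fs -rm /user/oracle/ogg/_ogg_write_test
```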

Regards,
Gleb


I have encountered the error OGG-15051, which says: oracle.goldengate.util.GGException: Unable to commit transaction, STATUS=ABEND.
What can I do?


Hi Bob,
Unfortunately, the information is not sufficient to troubleshoot your problem. I can only guess; assuming you are getting the error on the replicat while trying to write to HDFS, you may need to check whether the user is able to write to your designated location in HDFS.
Thanks,
Gleb


I have the same issue as Bob with the initial load in passive mode. The trail file is OK. The gg.classpath is set to include the following items (according to Oracle's docs):
Hadoop_Home/etc/hadoop
Hadoop_Home/share/hadoop/common/lib/*
Hadoop_Home/share/hadoop/common/*
Hadoop_Home/share/hadoop/hdfs/lib/*
Hadoop_Home/share/hadoop/hdfs/*

However, the replicat process ABENDs with the error OGG-15051.


Hi Luke,
I would be glad to help, but the information you have provided is not enough to troubleshoot. What exactly does the error say? Usually the cause is more or less clear from the error (you have an error log and report files for the replicat).

Thanks,
Gleb


Hello Gleb,

I have implemented this HDFS adapter and am able to load data into Hadoop HDFS. Thanks for your detailed explanation. But I am facing an issue applying transform functions in the replicat using COLMAP(). I am not sure whether this is an issue with the HDFS handler or something else. I am not able to use COLMAP() for any new columns on the target. For example:
MAP "SRC_DB"."SRC_TABLE", TARGET TGT_DB."TGT_TABLE", COLMAP(USEDEFAULTS, tran_date = @DATENOW())
does not work, failing with "ERROR OGG-00919 Oracle GoldenGate Delivery, TST.prm: Error in COLMAP clause." But if I use COLMAP(USEDEFAULTS) alone, it works. I am not specifically looking to populate the current timestamp, just checking how I can use the transform functions in Big Data 12.2. Please help me resolve this.


This is what it shows in the report:
…(USEDEFAULTS, TEST='ABCD')…
^
Error in COLMAP clause. Unrecognized clause or element.

Gleb Otochkin
March 21, 2017 10:27 am

Hi Kumar,
To use COLMAP functionality you have to configure a Metadata Provider to define the target.
Here is a link to the documentation:
https://docs.oracle.com/goldengate/bd123010/gg-bd/GADBD/GUID-F90702A9-0FFB-42D2-9261-89E2BB952C65.htm#GADBD569

So far I see documentation for the Avro, Java, and Hive metadata providers.
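For example, enabling the Avro Metadata Provider is a properties change along these lines (the schema directory path is an assumption for illustration):

```properties
# Enable the Avro Metadata Provider so the replicat knows the
# target table definitions and COLMAP can resolve new columns
gg.mdp.type=avro

# Directory holding the .avsc schema files for the target tables
# (the path is an assumption)
gg.mdp.schemaFilesPath=/u01/oggbd/dirdef
```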

Thanks,
Gleb


Hello, nice article. I have one question about the GoldenGate adapters: how do I set the time format for writing to HDFS? The time format is now 'YYYY-MM-DD:HH24:MI:SS', but I would like it to be 'YYYY-MM-DD HH24:MI:SS', i.e. with one colon removed. How can I configure that?

Gleb Otochkin
July 12, 2017 9:25 am

Hi Jim,
What version of OGG do you run? Interestingly, in my current version I see exactly the YYYY-MM-DD HH24:MI:SS format for dates. And the OGG documentation says:
"Datetime (DATE, TIME, and TIMESTAMP) columns can accept datetime and character columns, as well as string literals. Literals must be enclosed within single quotes. To map a character column to a datetime column, make certain it conforms to the Oracle GoldenGate external SQL format of YYYY-MM-DD HH:MI:SS.FFFFFF."

In any case, you can use the @DATE function in your mapping, like:

MAP ggtest.*, TARGET ggtest.*,
COLMAP ( USEDEFAULTS,
ACC_DATE = @DATE('DD/MM/YY HH:MI:SS','YYYY-MM-DD HH:MI:SS',ACC_DATE)
);

In this case it converts 'YYYY-MM-DD HH24:MI:SS' to 'DD/MM/YY HH:MI:SS', and here is the result:
[oracle@bigdatalite oggbd]$ hadoop fs -cat ogg/ggtest.test_tab_1/ggtest.test_tab_1_2017-07-12_10-10-13.677.txt
IGGTEST.TEST_TAB_12017-07-12 14:08:15.0001762017-07-12T10:10:13.674000000000002400076288234Test042017-07-12 10:08:12Test0412/07/17 10:08:12

Thanks,
Gleb

Andrew Tindle
March 13, 2018 3:51 pm

Hi Gleb,

Thanks for the post – it helped me very quickly to set up a Proof of Concept environment.

One question I have: is there any way to configure the GG adapter to place the files into sub-directories for each table, so that a Hive partitioned table can be created?

Or is the best approach to achieve this to:
1. Create a non-partitioned external table on top of the replicated HDFS data
2. Create a partitioned table
3. Run a job to do an INSERT … SELECT FROM the external table?

Thanks
Andrew

Venkata Bommineni
June 17, 2018 2:48 pm

Hi Gleb,

Thanks for the post; I am just following the same steps for my initial load. I ran into issues while creating the extract file for a 100 GB table (there is a 2 GB limit on file size). I added this command on the extract side: RMTFILE /mnt/ogg/dirdat/initld, MAXFILES 999, MEGABYTES 1500, and it created multiple files. Issue solved on the source extract.

Now the issue is processing those files on the target side. As per this document, how do I process multiple files in irhdfs.prm? Your param file specifically calls a single file name, initld: EXTFILE /u01/oggbd/dirdat/initld

Thanks,
Venkat

Gleb Otochkin
June 18, 2018 9:11 am

Hi Venkat,
Can you tell me the file names you got for the dataset? Do they look like the trail file name plus a sequential number?
When you specify a trail file name, you put only the base name of the file, without the sequential numbers. The GoldenGate replicat will by default try to process all the files, starting from sequence 0.
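So, assuming the extract produced files such as initld000001, initld000002, and so on, the replicat parameter can keep just the base name; a sketch (the path follows the comment above):

```
-- The replicat reads all files matching the base name, in sequence;
-- do not include the numeric suffix here
EXTFILE /mnt/ogg/dirdat/initld
```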

Thanks,
Gleb


Hi Gleb,
nice article, it helped me a lot.

But I have a question. I noticed that when you update the row with id = 5, you get some information about the transaction and the updated value, but I want to know if there's a way to get all the values, including the non-updated ones.

For example:

when you insert -> you have 10 columns
I BDTEST.TEST_TAB_1 2016-02-16 20:43:32.000231 2016-02-16T15:43:37.199000 00000000000000002041 5 TEST_1 2016-02-16:15:43:25 TEST_1 2016-02-16:15:43:25

when you update -> you have 8 columns; the last two are missing
U BDTEST.TEST_TAB_1 2016-02-16 20:43:53.000233 2016-02-16T15:43:56.056000 00000000000000002243 5 TEST_1_1
