December 2015 brought us a new version of GoldenGate, and a new version for Big Data adapters for the GoldenGate. Let’s have a look at what we have now and how it works. I am going to start from the HDFS adapter.
As a first step, we need to prepare our source database for replication. It becomes easier with every new GoldenGate version. We will need to perform several steps:
a) Enable archive logging on our database. This particular step requires downtime.
orcl> alter database mount; Database altered. orcl> alter database archivelog; Database altered. orcl> alter database open;
b) Enable force logging and minimal supplemental logging. No need to shutdown database for this.
orcl> alter database add supplemental log data; Database altered. orcl> alter database force logging; Database altered. orcl> SELECT supplemental_log_data_min, force_logging FROM v$database; SUPPLEME FORCE_LOGGING -------- --------------------------------------- YES YES
c) Switch parameter “enable_goldengate_replication” to “TRUE”. Can be done online.
orcl> alter system set enable_goldengate_replication=true sid='*' scope=both; System altered. orcl>
And we are almost done. Now we can create a schema for a GoldenGate administrator, and provide required privileges. I’ve just granted DBA role to the user to simplify process. In any case you will need it in case of integrated capture. For a production installation I advise you to have a look at the documentation to verify necessary privileges and roles.
orcl> create user ogg identified by welcome1 default tablespace users temporary tablespace temp; orcl> grant connect, dba to ogg;
Let’s create a test schema to be replicated. We will call it schema on the source as ggtest and I will name the destination schema as bdtest. It will allow us also to check how the mapping works in our replication.
orcl> create tablespace ggtest; -- optional step orcl> create user ggtest identified by welcome1 default tablespace ggtest temporary tablespace temp; orcl> grant connect, resource to ggtest;
Everything is ready on our source database for the replication.
Now we are installing Oracle GoledenGate for Oracle to our database server. We can get the software from the Oracle site on the download page in the Middleware section, GoldenGate, Oracle GoldenGate for Oracle databases. We are going to use 126.96.36.199.1 version of the software. The installation is easy – you need to unzip the software and run Installer which will guide you through couple of simple steps. The installer will unpack the software to the destination location, create subdirectories, and register GoldenGate in the Oracle global registry.
[[email protected] distr]$ unzip fbo_ggs_Linux_x64_shiphome.zip [[email protected] distr]$ cd fbo_ggs_Linux_x64_shiphome/Disk1/ [[email protected] Disk1]$ ./runInstaller
We continue by setting up parameters for Oracle GoldenGate (OGG) manager and starting it up. You can see that I’ve used a default blowfish encryption for the password. In a production environment you may consider another encryption like AES256. I’ve also used a non-default port for the manager since I have more than one GoldenGate installation on my test sandbox.
[[email protected] ~]$ export OGG_HOME=/u01/oggora [[email protected] ~]$ cd $OGG_HOME [[email protected] oggora]$ ./ggsci GGSCI (sandbox.localdomain) 1> encrypt password welcome1 BLOWFISH ENCRYPTKEY DEFAULT Using Blowfish encryption with DEFAULT key. Encrypted password: AACAAAAAAAAAAAIARIXFKCQBMFIGFARA Algorithm used: BLOWFISH GGSCI (sandbox.localdomain) 2> edit params mgr PORT 7829 userid [email protected],password AACAAAAAAAAAAAIARIXFKCQBMFIGFARA, BLOWFISH, ENCRYPTKEY DEFAULT purgeoldextracts /u01/oggora/dirdat/*, usecheckpoints GGSCI (sandbox.localdomain) 3> start manager Manager started.
Let’s prepare everything for initial load, and later online replication.
I’ve decided to use GoldenGate initial load extract as the way for initial load for the sake of consistency for the resulted dataset on Hadoop.
I prepared the parameter file to replicate my ggtest schema and upload all data to the trail file on remote site. I’ve used a minimum number of options for all my processes, providing only handful of parameters required for replication. Extract options is a subject deserving a dedicated blog post. Here is my simple initial extract:
[[email protected] oggora]$ cat /u01/oggora/dirprm/ini_ext.prm SOURCEISTABLE userid [email protected],password AACAAAAAAAAAAAIARIXFKCQBMFIGFARA, BLOWFISH, ENCRYPTKEY DEFAULT --RMTHOSTOPTIONS RMTHOST sandbox, MGRPORT 7839 RMTFILE /u01/oggbd/dirdat/initld, MEGABYTES 2, PURGE --DDL include objname ggtest.* TABLE ggtest.*;
Then we run the initial load extract in passive node and it will create a trail file with the data. The trail file will be used later for our initial load on the target side.
[[email protected] oggora]$ ./extract paramfile dirprm/ini_ext.prm reportfile dirrpt/ini_ext.rpt [[email protected] oggora]$ ll /u01/oggbd/dirdat/initld* -rw-r-----. 1 oracle oinstall 3028 Feb 16 14:17 /u01/oggbd/dirdat/initld [[email protected] oggora]$
We can also prepare our extract on the source site as well. I haven’t used datapump in my configuration limiting the topology only by simplest and strait-forward extract to replicat configuration. Of course, in any production configuration I would advise using datapump on source for staging our data.
Here are my extract parameters, and how I added it. I am not starting it yet because I must have an Oracle GoldenGate Manager running on the target, and the directory for the trail file should be created. You may have guessed that the Big Data GoldenGate will be located in /u01/oggbd directory.
[[email protected] oggora]$ ggsci Oracle GoldenGate Command Interpreter for Oracle Version 188.8.131.52.1 OGGCORE_184.108.40.206.0_PLATFORMS_151211.1401_FBO Linux, x64, 64bit (optimized), Oracle 12c on Dec 12 2015 02:56:48 Operating system character set identified as UTF-8. Copyright (C) 1995, 2015, Oracle and/or its affiliates. All rights reserved. GGSCI (sandbox.localdomain) 1> edit params ggext extract ggext userid [email protected],password AACAAAAAAAAAAAIARIXFKCQBMFIGFARA, BLOWFISH, ENCRYPTKEY DEFAULT --RMTHOSTOPTIONS RMTHOST sandbox, MGRPORT 7839 RMTFILE /u01/oggbd/dirdat/or, MEGABYTES 2, PURGE DDL include objname ggtest.* TABLE ggtest.*; GGSCI (sandbox.localdomain) 2> dblogin userid [email protected],password AACAAAAAAAAAAAIARIXFKCQBMFIGFARA, BLOWFISH, ENCRYPTKEY DEFAULT Successfully logged into database. GGSCI (sandbox.localdomain as [email protected]) 3> register extract GGEXT database 2016-02-16 15:37:21 INFO OGG-02003 Extract GGEXT successfully registered with database at SCN 17151616. GGSCI (sandbox.localdomain as [email protected]) 4> add extract ggext, INTEGRATED TRANLOG, BEGIN NOW EXTRACT (Integrated) added.
Let’s leave our source site for a while and switch to the target . Our target is going to be a box where we have hadoop client and all requirement java classes.
I used the same box just to save resources on my sandbox environment. You may run different GoldeGate versions on the same box considering, that Manager ports for each of them will be different.
Essentially we need a Hadoop client on the box, which can connect to HDFS and write data there. Installation of Hadoop client is out of the scope for this article, but you can easily get all necessary information from the Hadoop home page .
Having all required Hadoop classes we continue by installing Oracle GoldenGate for Big Data, configuring and starting it up. In the past I received several questions from people struggling to find the exact place where all adapters could be uploaded. The Adapters were well “hidden” on “Oracle edelivery”, but now it is way simpler. You are going to GoldenGate download page on Oracle site and find the section “Oracle GoldenGate for Big Data 220.127.116.11.0” where you can choose the OGG for Linux x86_64, Windows or Solaris. You will need an Oracle account to get it. We upload the file to our linux box, unzip and unpack the tar archive. I created a directory /u01/oggbd as our GoldenGate home and unpacked the tar archive there.
The next step is to create all necessary directories. We start command line utility and create all subdirectories.
[[email protected] ~]$ cd /u01/oggbd/ [[email protected] oggbd]$ ./ggsci Oracle GoldenGate Command Interpreter Version 18.104.22.168.0 OGGCORE_22.214.171.124.0_PLATFORMS_151101.1925.2 Linux, x64, 64bit (optimized), Generic on Nov 10 2015 16:18:12 Operating system character set identified as UTF-8. Copyright (C) 1995, 2015, Oracle and/or its affiliates. All rights reserved. GGSCI (sandbox.localdomain) 1> create subdirs Creating subdirectories under current directory /u01/oggbd Parameter files /u01/oggbd/dirprm: created Report files /u01/oggbd/dirrpt: created Checkpoint files /u01/oggbd/dirchk: created Process status files /u01/oggbd/dirpcs: created SQL script files /u01/oggbd/dirsql: created Database definitions files /u01/oggbd/dirdef: created Extract data files /u01/oggbd/dirdat: created Temporary files /u01/oggbd/dirtmp: created Credential store files /u01/oggbd/dircrd: created Masterkey wallet files /u01/oggbd/dirwlt: created Dump files /u01/oggbd/dirdmp: created GGSCI (sandbox.localdomain) 2>
We are changing port for our manager process from default and starting it up. I’ve already mentioned that the port was changed due to the existence off several GoldenGate managers running from different directories.
GGSCI (sandbox.localdomain) 2> edit params mgr PORT 7839 ..... GGSCI (sandbox.localdomain) 3> start manager Manager started.
Now we have to prepare parameter files for our replicat processes. Let’s assume the environment variable OGGHOME represents the GoldenGate home and in our case it is going to be /u01/oggbd.
Examples for the parameter files can be taken from $OGGHOME/AdapterExamples/big-data directories. There you will find examples for flume, kafka, hdfs, hbase and for metadata providers. Today we are going to work with HDFS adapter.
I copied files to my parameter files directory ($OGGHOME/dirprm) and modified them accordingly:
[email protected] oggbd]$ cp /u01/oggbd/AdapterExamples/big-data/hdfs/* /u01/oggbd/dirprm/ [email protected] oggbd]$ vi /u01/oggbd/dirprm/hdfs.props
Here are my values for the hdfs.props file:
[[email protected] dirprm]$ cat hdfs.props gg.handlerlist=hdfs gg.handler.hdfs.type=hdfs gg.handler.hdfs.includeTokens=false gg.handler.hdfs.maxFileSize=1g gg.handler.hdfs.rootFilePath=/user/oracle/gg gg.handler.hdfs.fileRollInterval=0 gg.handler.hdfs.inactivityRollInterval=0 gg.handler.hdfs.fileSuffix=.txt gg.handler.hdfs.partitionByTable=true gg.handler.hdfs.rollOnMetadataChange=true gg.handler.hdfs.authType=none gg.handler.hdfs.format=delimitedtext #gg.handler.hdfs.format.includeColumnNames=true gg.handler.hdfs.mode=tx goldengate.userexit.timestamp=utc goldengate.userexit.writers=javawriter javawriter.stats.display=TRUE javawriter.stats.full=TRUE gg.log=log4j gg.log.level=INFO gg.report.time=30sec gg.classpath=/usr/lib/hadoop/*:/usr/lib/hadoop/lib/*:/usr/lib/hadoop-hdfs/*:/usr/lib/hadoop/etc/hadoop/:/usr/lib/hadoop/lib/native/* javawriter.bootoptions=-Xmx512m -Xms32m -Djava.class.path=ggjava/ggjava.jar
You can find information about all those parameters in oracle documentation here, but there are parameters you will most likely need to change from default:
- gg.handler.hdfs.rootFilePath – it will tell where the directories and files have to be written on HDFS.
- gg.handler.hdfs.format – you can setup one of the four formats supported by adapter.
- goldengate.userexit.timestamp – it will depend from your preferences for transactions timestamps written to your hdfs files.
- gg.classpath – it will depend from location for your hadoop jar classes and native libraries.
You can see I’ve mentioned the gg.handler.hdfs.format.includeColumnNames parameter. It is supposed to put column name before each value in the output file on HDFS. It may be helpful in some cases, but at the same time it makes the file bigger. If you are planning to create an external Hive table, you may consider commenting on it as I have.
The next parameter file is for our data initialization replicat file. You may consider using a Sqoop or another way to make the initial load for your tables, but I think it makes sense to use the GG replicat if the table size is relatively small. It makes the resulting file-set more consistent with the following replication, since it will be using the same engine and format. So, here is my replicat for initial load:
[[email protected] dirprm]$ cat /u01/oggbd/dirprm/irhdfs.prm --passive REPLICAT for initial load irhdfs -- Trail file for this example is located in "dirdat/initld" -- Command to run REPLICAT: -- ./replicat paramfile dirprm/irhdfs.prm reportfile dirrpt/ini_rhdfs.rpt SPECIALRUN END RUNTIME EXTFILE /u01/oggbd/dirdat/initld --DDLERROR default discard setenv HADOOP_COMMON_LIB_NATIVE_DIR=/usr/lib/hadoop/lib/native DDL include all TARGETDB LIBFILE libggjava.so SET property=dirprm/hdfs.props REPORTCOUNT EVERY 1 MINUTES, RATE GROUPTRANSOPS 10000 MAP ggtest.*, TARGET bdtest.*;
I was running the initial load in passive mode, without creating a managed process and just running it from command line. Here is an example:
[[email protected] oggbd]$ ./replicat paramfile dirprm/irhdfs.prm reportfile dirrpt/ini_rhdfs.rpt [[email protected] oggbd]$ hadoop fs -ls /user/oracle/gg/ Found 2 items drwxr-xr-x - oracle oracle 0 2016-02-16 14:37 /user/oracle/gg/bdtest.test_tab_1 drwxr-xr-x - oracle oracle 0 2016-02-16 14:37 /user/oracle/gg/bdtest.test_tab_2 [[email protected] oggbd]$ hadoop fs -ls /user/oracle/gg/bdtest.test_tab_1 Found 1 items -rw-r--r-- 1 oracle oracle 624 2016-02-16 14:37 /user/oracle/gg/bdtest.test_tab_1/bdtest.test_tab_1_2016-02-16_14-37-43.376.txt [[email protected] oggbd]$ hadoop fs -tail /user/oracle/gg/bdtest.test_tab_1/bdtest.test_tab_1_2016-02-16_14-37-43.376.txt IBDTEST.TEST_TAB_12016-02-16 19:17:40.7466992016-02-16T14:37:43.37300000000000-100000020121371O62FX2014-01-24:19:09:20RJ68QYM52014-01-22:12:14:30 IBDTEST.TEST_TAB_12016-02-16 19:17:40.7466992016-02-16T14:37:44.89600000000000-100000021552371O62FX2014-01-24:19:09:20HW82LI732014-05-11:05:23:23 IBDTEST.TEST_TAB_12016-02-16 19:17:40.7466992016-02-16T14:37:44.89600100000000-100000022983RXZT5VUN2013-09-04:23:32:56RJ68QYM52014-01-22:12:14:30 IBDTEST.TEST_TAB_12016-02-16 19:17:40.7466992016-02-16T14:37:44.89600200000000-100000024414RXZT5VUN2013-09-04:23:32:56HW82LI732014-05-11:05:23:23 [[email protected] oggbd]$
You can see the Hadoop directories and files created by the initial load.
As soon as the initial load has run we can start our extract and replicat to keep the destination side updated.
We are moving to source and starting our extract prepared earlier.
GGSCI (sandbox.localdomain as [email protected]) 6>start extract ggext Sending START request to MANAGER ... EXTRACT GGEXT starting
So, moving back to target and preparing our replicat. I used the replicat with the following parameters:
[[email protected] oggbd]$ cat /u01/oggbd/dirprm/rhdfs.prm REPLICAT rhdfs -- Trail file for this example is located in "dirdat/or" directory -- Command to add REPLICAT -- add replicat rhdfs, exttrail dirdat/or --DDLERROR default discard setenv HADOOP_COMMON_LIB_NATIVE_DIR=/usr/lib/hadoop/lib/native DDL include all TARGETDB LIBFILE libggjava.so SET property=dirprm/hdfs.props REPORTCOUNT EVERY 1 MINUTES, RATE GROUPTRANSOPS 10000 MAP ggtest.*, TARGET bdtest.*;
We are adding the replicat to our configuration and it is going to carry on the replication.
GGSCI (sandbox.localdomain) 1> add replicat rhdfs, exttrail dirdat/or REPLICAT added. GGSCI (sandbox.localdomain) 2> start replicat rhdfs Sending START request to MANAGER ... REPLICAT RHDFS starting GGSCI (sandbox.localdomain) 3>
Our replication is up and running, the initial load worked fine, and we can test and see what we have on source and target.
Here is the data in one of our source tables:
orcl> select * from ggtest.test_tab_1; PK_ID RND_STR USE_DATE RND_STR_1 ACC_DATE ---------------- ---------- ----------------- ---------- ----------------- 1 371O62FX 01/24/14 19:09:20 RJ68QYM5 01/22/14 12:14:30 2 371O62FX 01/24/14 19:09:20 HW82LI73 05/11/14 05:23:23 3 RXZT5VUN 09/04/13 23:32:56 RJ68QYM5 01/22/14 12:14:30 4 RXZT5VUN 09/04/13 23:32:56 HW82LI73 05/11/14 05:23:23 orcl>
I’ve created an external Hive table for the table test_tab_1 to have better look.
hive> CREATE EXTERNAL TABLE BDTEST.TEST_TAB_1 (tran_flag string, tab_name string, tran_time_utc timestamp, tran_time_loc string,something string, something1 string, > PK_ID INT, RND_STR VARCHAR(10),USE_DATE string,RND_STR_1 string, ACC_DATE string) > stored as textfile location '/user/oracle/gg/bdtest.test_tab_1'; OK Time taken: 0.327 seconds hive> select * from BDTEST.TEST_TAB_1; OK I BDTEST.TEST_TAB_1 2016-02-16 19:17:40.746699 2016-02-16T14:37:43.373000 00000000-10000002012 1 371O62FX 2014-01-24:19:09:20 RJ68QYM5 2014-01-22:12:14:30 I BDTEST.TEST_TAB_1 2016-02-16 19:17:40.746699 2016-02-16T14:37:44.896000 00000000-10000002155 2 371O62FX 2014-01-24:19:09:20 HW82LI73 2014-05-11:05:23:23 I BDTEST.TEST_TAB_1 2016-02-16 19:17:40.746699 2016-02-16T14:37:44.896001 00000000-10000002298 3 RXZT5VUN 2013-09-04:23:32:56 RJ68QYM5 2014-01-22:12:14:30 I BDTEST.TEST_TAB_1 2016-02-16 19:17:40.746699 2016-02-16T14:37:44.896002 00000000-10000002441 4 RXZT5VUN 2013-09-04:23:32:56 HW82LI73 2014-05-11:05:23:23 Time taken: 0.155 seconds, Fetched: 4 row(s) hive>
You can see the table definition is a bit different from what we have on the source site and you will see why.
We got additional columns on the destination side. Interesting that while some of them have a pretty clear purpose, the other columns are not totally clear and have null values.
The first column is a flag for operation, and it shows what kind of operation we have gotten in this row. It can be “I” for insert, “D” for delete and “U” for an update. The second column is table name. The third one is a timestamp in UTC showing when the transaction occurred. The next one is another time in local timezone informing the time of commit, and the next column has a commit sequence number. Those columns can help you to construct the proper data set for any given time.
Let’s insert and update some row(s) on source and see how it will show up on the target:
orcl> insert into ggtest.test_tab_1 values (5,'TEST_1',sysdate,'TEST_1',sysdate); 1 row created. orcl> commit; orcl> update ggtest.test_tab_1 set RND_STR='TEST_1_1' where PK_ID=5; 1 row updated. orcl> commit; Commit complete. orcl>
Let’s check how it is reflected on the destination site. We see two new rows, where each row represents a DML operation. One was for the insert and the second one was for the update.
[highlight="7,8" hive> select * from BDTEST.TEST_TAB_1; OK I BDTEST.TEST_TAB_1 2016-02-16 19:17:40.746699 2016-02-16T14:37:43.373000 00000000-10000002012 1 371O62FX 2014-01-24:19:09:20 RJ68QYM5 2014-01-22:12:14:30 I BDTEST.TEST_TAB_1 2016-02-16 19:17:40.746699 2016-02-16T14:37:44.896000 00000000-10000002155 2 371O62FX 2014-01-24:19:09:20 HW82LI73 2014-05-11:05:23:23 I BDTEST.TEST_TAB_1 2016-02-16 19:17:40.746699 2016-02-16T14:37:44.896001 00000000-10000002298 3 RXZT5VUN 2013-09-04:23:32:56 RJ68QYM5 2014-01-22:12:14:30 I BDTEST.TEST_TAB_1 2016-02-16 19:17:40.746699 2016-02-16T14:37:44.896002 00000000-10000002441 4 RXZT5VUN 2013-09-04:23:32:56 HW82LI73 2014-05-11:05:23:23 I BDTEST.TEST_TAB_1 2016-02-16 20:43:32.000231 2016-02-16T15:43:37.199000 00000000000000002041 5 TEST_1 2016-02-16:15:43:25 TEST_1 2016-02-16:15:43:25 U BDTEST.TEST_TAB_1 2016-02-16 20:43:53.000233 2016-02-16T15:43:56.056000 00000000000000002243 5 TEST_1_1 Time taken: 2.661 seconds, Fetched: 6 row(s) hive>
It work for deletes too, only flag will be “D” instead of “I” for insert and “U” for updates.
What about DDL support? Let’s truncate the table.
orcl> truncate table ggtest.test_tab_1; Table truncated. orcl>
And here, there is nothing in our hdfs files. Maybe I’ve missed something, but it looks like the truncate operation is not creating any record. I need to dig a bit more. I will try to make a separate blog post about DDL support for the Big Data Adapters.
It works pretty well when we create a new table and insert new rows.
It also works if we change one of the existing tables, adding or dropping a column. Let’s try to create a new table here:
orcl> create table ggtest.test_tab_4 (pk_id number, rnd_str_1 varchar2(10),acc_date date); Table created. orcl> insert into ggtest.test_tab_4 select * from ggtest.test_tab_2; 1 row created. orcl> commit; Commit complete. orcl>
You can see that it has created a new directory and file for the new table. Additionally, if you add a column the new file will be used for all DML operations for the updated table. It will help to separate tables with different structure.
[[email protected] oggbd]$ hadoop fs -ls /user/oracle/gg/ Found 3 items drwxr-xr-x - oracle oracle 0 2016-02-16 15:43 /user/oracle/gg/bdtest.test_tab_1 drwxr-xr-x - oracle oracle 0 2016-02-16 14:37 /user/oracle/gg/bdtest.test_tab_2 drwxr-xr-x - oracle oracle 0 2016-02-16 15:56 /user/oracle/gg/bdtest.test_tab_4 [[email protected] oggbd]$ hadoop fs -ls /user/oracle/gg/bdtest.test_tab_4/ Found 1 items -rw-r--r-- 1 oracle oracle 127 2016-02-16 15:56 /user/oracle/gg/bdtest.test_tab_4/bdtest.test_tab_4_2016-02-16_15-56-50.373.txt [[email protected] oggbd]$ hadoop fs -tail /user/oracle/gg/bdtest.test_tab_4/bdtest.test_tab_4_2016-02-16_15-56-50.373.txt IBDTEST.TEST_TAB_42016-02-16 20:56:47.0009532016-02-16T15:56:50.371000000000000000000068327IJWQRO7T2013-07-07:08:13:52 [[email protected] oggbd]$
At first glance it looks good, but let’s try to create a table as select.
orcl> create table ggtest.test_tab_3 as select * from ggtest.test_tab_2; Table created. orcl>
Not sure if it is an expected behavior or bug, but apparently it is not working. Our replicat is broken and complains that “DDL Replication is not supported for this database”.
[[email protected] oggbd]$ tail -5 ggserr.log 2016-02-16 15:43:37 INFO OGG-02756 Oracle GoldenGate Delivery, rhdfs.prm: The definition for table GGTEST.TEST_TAB_1 is obtained from the trail file. 2016-02-16 15:43:37 INFO OGG-06511 Oracle GoldenGate Delivery, rhdfs.prm: Using following columns in default map by name: PK_ID, RND_STR, USE_DATE, RND_STR_1, ACC_DATE. 2016-02-16 15:43:37 INFO OGG-06510 Oracle GoldenGate Delivery, rhdfs.prm: Using the following key columns for target table bdtest.TEST_TAB_1: PK_ID. 2016-02-16 15:48:49 ERROR OGG-00453 Oracle GoldenGate Delivery, rhdfs.prm: DDL Replication is not supported for this database. 2016-02-16 15:48:49 ERROR OGG-01668 Oracle GoldenGate Delivery, rhdfs.prm: PROCESS ABENDING. [[email protected] oggbd]$ [[email protected] oggbd]$ hadoop fs -ls /user/oracle/gg/ Found 2 items drwxr-xr-x - oracle oracle 0 2016-02-16 15:43 /user/oracle/gg/bdtest.test_tab_1 drwxr-xr-x - oracle oracle 0 2016-02-16 14:37 /user/oracle/gg/bdtest.test_tab_2 [[email protected] oggbd]$
What can we say in summary? The replication works, and supports all DML, and some DDL commands. You will need to prepare to get consistent datasets for any given time using flag for operation, and time for the transaction. In my next few posts, I will cover other Big Data adapters for GoldenGate.
Hello, nice article. I have one question about the GoldenGate Adapters. Does the GG product see the adapters in the same way that it would see the replicat process. In other words, can you show the status of an adapter that is being sent data or is the adapter kind of like a separate process that has to monitored outside of GG. For example, lets say that GG is replicating data fine to the adapter, but for some reason nothing is being written to the target. From GG do you get back an error. What happens if the adapter fails, does GG show the status of the adapters that are running or do you just have to know that something is wrong and go check the logs from the adapter to find out what the problem is.
You have a OGG replicat on adapter side and can see the status and details directly from GoldenGate:
GGSCI (bigdatalite.localdomain) 5> info rhbase
REPLICAT RHBASE Last Started 2016-03-29 12:56 Status RUNNING
Checkpoint Lag 00:00:00 (updated 00:00:08 ago)
Process ID 27277
Log Read Checkpoint File dirdat/or000045
2016-03-24 13:09:02.000274 RBA 1018490
The same will be about errors and logging for the adapter side. You work with it as with any other replicat.
Hope it helps,
Hello Gleb –
I’m Reddy, it was a pleasure reading thru your article… very informative! I have a situation where I have to replicate using goldengate from oracle DB into hdfs so my question is are all ddl supported if I were to have ddl replication into hdfs?
By ddl, I’m referring to the following situations
altering existing tables to add new columns
altering existing tales to modify datatype of existing columns
Flexibility to add tables that are created as new ( we are going to perform schema lavel replication )
can you please let us know from prior experience if DDL replication is even possible into hdfs?
Yes, the DDL commands altering a table structure are supported in OGG to BD. So, if you add or change a column it will be reflected in the next DML transaction. Just keep in mind that the change will be visible and replicated only with the following DML.
Also for different output format it will look slightly different. Like for the csv text on hdfs you are going to have a new row with change number of columns and for JSON you are getting the schema/table definition with new values.
Hope it helps.
How to configure high availability for oracle golden gate for big data .lets assume 10 node hadoop cluster . now i have configured oggbd on node5(one of the data node) I am running my replicat process on node5 .The trail files are pumped on node5 . What if this node5 goes down ??. (What about the trail files and the replicat process). What are the options for high availability..?
Desperately waiting for your reply,
I you want GoldenGate to be high available you need to think about creating a high available environment for the GoldenGate itself. I would not install it to any Hadoop node but prepare a HA solution for GoldenGate. As example you can create an Oracle RAC , put GoldenGate and trail files to a shared FS and configure high availability. The OGG requires clients hadoop software to push data to hadoop cluster but it doesn’t mean the OGG has to be installed to any of your hadoop nodes.
I tried to replicate the process, but I keep getting errors about “OGG-00423 Could not find definition for GGTEST.RAINFALL”
I try to add the DDL option in the param file but seems it is incompatible with the SOURCEISTABLE option.
I got that error when I was trying to do the replicat process in the destination box
I tried to use the ASSUMETARGETDEFS option in the irhdfs.prm for the replicat process, but then it claims that “table bdtest.RAINFALL does not exist in target database”
I think I sorted it out by using SOURCEDEFS and defgen to generate a definition file.
And as the title specified, OGG 12.2 support trail files contains definition directly, thus there’s no other steps in the parameter file.
It was working for me and all DDL have been uploaded to the trail file. You may try to run a simple extract without SOURCEISTABLE but with DDL included and it should have all the definition too. Let me know if it works for you.
The defgen would be working for sure but it is not flexible and require to recreate the file every time when you change definitions for the objects.
I am getting the same error message, how was the issue resolved?
You have to have DDL uploaded to a trail file (for OGG 12.2) or defgen file if you don’t use DDL option for your extract. In case of earlier version of OGG, like 12.1, you have to use defgen.
But how can you maintain a non-transaction table? As you mentioned each action (Insert.Update.Delete) would create a new row with a colume op_type I/U/D. However, I just want to replicate the table, which means 3 inserts + 1 delete should end up with 2 records in the hadoop txt file instead of 4 records. Same apply to update.
If I want to achieve the goal above, how can I configure it?
I am not sure if you can do it with text file. If I have understood you correctly you want to skip some operations like if your have inserted and deleted the same row inside one transaction.
I don’t think it is possible in the current version using a parameter. You have two different mode “tx” and “op” for the handler but is not going to filter any of your DML but enable or disable grouping for messages.
GoldenGate is not batch but rather streaming replication and replicate all committed changes.
Sorry I might not make my point clear. What I want to do is to have a result set (in the HDFS) exactly same as the source (in terms of the content).
In the blog you did an update. But instead of a updated record on the HDFS side, you have a new row with op_type = U (as U BDTEST.TEST_TAB_1 2016-02-16 20:43:53.000233 2016-02-16T15:43:56.056000 00000000000000002243 5 TEST_1_1)
What I want is instead of having a record of the update, I want the update happened in the result txt file (Your update should not be a new record, but should really get updated in the HDFS file).
If I want this to happen, anything else I should configure?
I think you may need to consider another platform as destination. HDFS is not really suited to keep a dataset. You create and append data to files in HDFS but not update information there.
I would suggest to look to HDFS (or to Kafka topic) as a log of changes, something more like CDC than a replication of state for an object. You off course can construct state for a object for any given time having initial state and log of changes but it will require some programming.
You may have a look to HBASE instead as a target for your replication to see a state for the object instead of log of changes.
With the above configuration i was not able to do updates on extract. Is there any extra configuration that needs to be done.
2016-11-01 09:29:21 WARNING OGG-01431 Aborted grouped transaction on ‘hrhdfs.EMPLOYEES’, Mapping error.
2016-11-01 09:29:21 WARNING OGG-01003 Repositioning to rba 2542 in seqno 0.
2016-11-01 09:29:21 WARNING OGG-01151 Error mapping from HR.EMPLOYEES to hrhdfs.EMPLOYEES.
It shouldn’t require any extra configuration but your hdfs target should be ready.
Are you getting errors only for updates or to inserts and deletes?
Unfortunately information you have provided doesn’t provide any clue for the error.
I assume the log is from replicat side.
Couple of things to verify:
Check again your parameter files for replicat.
Check existence and permissions for the destination hdfs directory to make sure the user running OGG has rights to write to the directory.
i have met the error OGG-15051
that said:oracle.goldengate.util.GGException: Unable to commit transaction, STATUS=ABEND.
what can i do
Unfortunately the information is not sufficient to troubleshoot your problem. I can only guess assuming you are getting the error for the replicat and trying to write to HDFS. You may need to check if the user able to write to your designated location in HDFS.
I have the same issue as Bob in the initial load in passive mode. The trailfile is OK. the gg.classpath is set to include the following items (according to Oracle’s doc) :
However the replica process is ABENDed with the error OGG-15051.
I would be glad to help you but the information you have provided was not enough to troubleshoot. What exactly the error say ? Usually it is more or less clear from the error (you have a error log and report files for the replicat).
I have implemented this hdfs adapter and able to load data to hadoop hdfs. Thanks for your detailed explanation. But I am facing an issue in applying transform fuctions in the replicat using colmap(). Not sure whether this some issue with hdfs handler or something else. I am not able to use that colmap() for any new columns to target. eg : MAP “SRC_DB”.”SRC_TABLE” , TARGET TGT_DB.”TGT_TABLE”, COLMAP(USEDEFAULTS,tran_date = @DATENOW()) this does not work saying ” ERROR OGG-00919 Oracle GoldenGate Delivery, TST.prm: Error in COLMAP clause.” But if I use COLMAP(USEDEFAULTS) alone, it will work. I am not looking or populating current_timestamp, but checking how can I use the transform functions in Bigdata 12.2? Please help me resolve this
This is what it shows in the report
Error in COLMAP clause. Unrecognized clause or element.
To use colmap functionality you have to use Metadata Provider in your configuration to define the target.
Here is link to documentation:
So far i see documentation for Avro, java and Hive metadata providers.
Hello, nice article. I have one question about the GoldenGate Adapters.How do I set the time format for writing HDFS?The time format is now ‘YYYY-MM-DD:HH24:MI:SS’, but I would like to set it as’ YYYY-MM-DD HH24:MI:SS’. Reduce a colon.how can I configure it?
What version of OGG do you run? Interesting that in my current version I see exactly YYYY-MM-DD HH24:MI:SS format of the date. And OGG documentation says:
“Datetime (DATE, TIME, and TIMESTAMP) columns can accept datetime and character columns, as well as string literals. Literals must be enclosed within single quotes. To map a character column to a datetime column, make certain it conforms to the Oracle GoldenGate external SQL format of YYYY-MM-DD HH:MI:SS.FFFFFF.”
Anyway you can use @DATE function in your mapping like :
MAP ggtest.*, TARGET ggtest.*,
COLMAP ( USEDEFAULTS,
ACC_DATE = @DATE(‘DD/MM/YY HH:MI:SS’,’YYYY-MM-DD HH:MI:SS’,ACC_DATE)
In this case it converts ‘YYYY-MM-DD HH24:MI:SS’ to ‘DD/MM/YY HH:MI:SS’ and here is the result:
[[email protected] oggbd]$ hadoop fs -cat ogg/ggtest.test_tab_1/ggtest.test_tab_1_2017-07-12_10-10-13.677.txt
IGGTEST.TEST_TAB_12017-07-12 14:08:15.0001762017-07-12T10:10:13.674000000000002400076288234Test042017-07-12 10:08:12Test0412/07/17 10:08:12
Thanks for the post – it helped me very quickly to set up a Proof of Concept environment.
One question I have – is there any way to configure the GG Adapter to place the files into sub-directories for each table, so that a Hive Partitioned table can be created ?
Or is the best approach to achieve this to:
1. Create a non-partitioned External table on top of the replicated HDFS data
2. Create a partitioned table
3. Run a Job to do:
Thanks for the post and I am just following the same steps in doing my initial load. I ran into issues while creating the extract file for a 100 GB table ( 2 gig limit in size). I added this command on extract side – RMTFILE /mnt/ogg/dirdat/initld, MAXFILES 999, MEGABYTES 1500 and it created multiple files. Issue solved on Source Extract.
Now the issue is in processing those files on target side. As per this document in irhdfs.prm, how to process multiple files? You param file specifically calls a file name initld – EXTFILE /u01/oggbd/dirdat/initld
Can you tell me files names you have gotten for the dataset? Are they look like trail file name + sequential number?
When you specify a trail file name you put only name for the file without the sequential numbers. GoldenGate replicat by default will try to run all the files stating from sequence 0.
nice article, help me a lot.
But, I have a question. I noticed that when you update the row with id = 5, you got some informations about the transaction and the updated value, but i want to know if there’s a way to get all values, included the non-updated values.
when you insert -> you have 10 columns
I BDTEST.TEST_TAB_1 2016-02-16 20:43:32.000231 2016-02-16T15:43:37.199000 00000000000000002041 5 TEST_1 2016-02-16:15:43:25 TEST_1 2016-02-16:15:43:25
when you update -> you have 8 columns, the two last is missing
U BDTEST.TEST_TAB_1 2016-02-16 20:43:53.000233 2016-02-16T15:43:56.056000 00000000000000002243 5 TEST_1_1