Reviewing the operation modes of Oracle GoldenGate BigQuery Handler

GoldenGate for Big Data 12.3.2.1.1 introduces a new target: Google BigQuery. The BigQuery handler can work in two audit log modes: auditLogMode = true and auditLogMode = false. I want to review the differences between these two operation modes…

How to schedule weekdays only on Airflow

Consider the following situation: You have a data ingestion pipeline where the data comes in real-time on weekdays and is stored in a dated folder.  The day’s data needs to be ingested within four hours. An instant response may be…

Analyzing BigQuery via Excel and Google Sheets

Both MS Excel and Google Sheets offer ways to connect directly to BigQuery data, run queries, pull the results back into Excel/Sheets, and analyze them further with options such as pivot tables, charts and drilling up/down. MS Excel The…

Data modeling for cloud DW

In this blog post, I would like to share some options that you can consider when modeling your cloud DW for better query performance. With a traditional EDW, we would come up with a star, snowflake or similar schema. These…

Azure Data Lake basics for the SQL Server DBA / developer and… for everyone!

The basics If you’re a Microsoft SQL Server DBA or developer and have not been introduced to the Microsoft Azure Data Lake and would like to understand what it’s all about and how to get started, this article is for YOU….

Big Data on Microsoft Azure – HDInsight

Introduction The best definition you are going to find for data is that data is the new oil in today’s world. Starting from that, we can define a new horizon and a new way of looking at how we treat…

Scheduling Google Cloud Functions

Currently, there is no straightforward way to schedule Google Cloud Functions. It is still possible to achieve this by different means, such as (but not limited to): deploying a Compute Engine instance and setting a crontab entry; configuring HTTP/S uptime checks via…

Spark UDF memoization

Memoization is a powerful technique that allows you to improve performance of repeatable computations. Although it would be a pretty handy feature, there is no memoization or result cache for UDFs in Spark as of today. In fact it’s something…
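
To make the idea concrete, here is a minimal sketch of memoization in plain Python. It is not Spark code and not the approach taken in the post; the function `slow_square` and the counter `call_count` are hypothetical names used only for illustration.

```python
from functools import lru_cache

# Counts how many times the function body actually runs (illustration only).
call_count = 0


@lru_cache(maxsize=None)  # cache every distinct argument, with no eviction
def slow_square(x: int) -> int:
    """Stand-in for an expensive, deterministic computation."""
    global call_count
    call_count += 1
    return x * x


# Repeated inputs are served from the cache instead of being recomputed.
results = [slow_square(v) for v in [3, 3, 4, 3]]
```

The body runs once per distinct input, so four calls with two distinct values trigger only two computations. This only works safely when the function is deterministic and free of side effects.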

Spark Scala UDF primitive type bug

I was working on an instrumentation framework for Scala UDFs in Spark when I noticed a subtle difference in the execution plan depending on whether I used wrappers or not. It looked like some code was added or was not…

Why your data hub belongs in the cloud

Whether your core infrastructure is on-prem or in the cloud, there’s only one choice for your data hub
The public cloud effectively provides you with three options:  Infrastructure as a Service, Platform as a Service or Software as a Service. And with IaaS, people often face the choice of “rent vs buy”.
