Scheduling Google Cloud Functions

Currently, there is no straightforward way to schedule Google Cloud Functions. It is still possible to achieve this by different means, such as (but not limited to): deploying a Compute Engine instance and setting a crontab entry; configuring HTTP/S uptime checks via…


Spark UDF memoization

Memoization is a powerful technique that allows you to improve the performance of repeated computations. Although it would be a pretty handy feature, there is no memoization or result cache for UDFs in Spark as of today. In fact it’s something…


Spark Scala UDF primitive type bug

I was working on an instrumentation framework for Scala UDFs in Spark when I noticed a subtle difference in the execution plan depending on whether I used wrappers or not. It looked like some code was added or was not…


Why your data hub belongs in the cloud

Whether your core infrastructure is on-prem or in the cloud, there’s only one choice for your data hub.
The public cloud effectively provides you with three options: Infrastructure as a Service, Platform as a Service or Software as a Service. And with IaaS, people often face the choice of “rent vs buy”.


Spark performance regression with sum aggregations

We found an interesting bug during our latest round of performance tuning for Spark 2.2 (2.3 is also affected). It was a batch Spark job scheduled to be executed hourly and to process about 1 TB worth of…


Why column stores?

I’ve been working with data in many forms for my entire career. During this time, I have occasionally needed to build or query existing databases to get statistical data. Traditional databases are usually designed to query specific data from…


A comparative analysis of Amazon SageMaker and Google Datalab

Amazon SageMaker and Google Datalab have fully managed cloud Jupyter notebooks for designing and developing machine learning and deep learning models by leveraging serverless cloud engines. However, as much as they have in common, there are key differences between the…


Exploring Amazon SageMaker

Amazon SageMaker is another cloud-based, fully managed data analytics/machine learning modeling platform for designing, building and deploying data models. The key selling point of Amazon SageMaker is “zero-setup”. The concept of “zero-setup” means data science teams can entirely focus…


Dipping your toes into building an Analytics Platform on Google Cloud Platform

“We have many disparate data sources and we’re having a hard time getting a global view of all our data across our organization.” “Our data is currently all in <enter data warehouse name here> and we want to migrate it…


Building a custom routing NiFi processor with Scala

In this post we will build a toy example NiFi processor which is still quite efficient and has powerful capabilities. The processor logic is straightforward: it will read incoming files line by line, apply a given function to transform each line into…
