Spark Scala UDF primitive type bug

I was working on an instrumentation framework for Scala UDFs in Spark when I noticed a subtle difference in the execution plan depending on whether I used wrappers or not. It looked like some code was added or was not…

Read More >

Why your data hub belongs in the cloud

Whether your core infrastructure is on-prem or in the cloud, there’s only one choice for your data hub
The public cloud effectively provides you with three options:  Infrastructure as a Service, Platform as a Service or Software as a Service. And with IaaS, people often face the choice of “rent vs buy”.

Read More >

Spark performance regression with sum aggregations

There is an interesting bug that was found during the latest performance tuning we performed for Spark 2.2 (2.3 is also affected). It was a batch Spark job scheduled to be executed hourly and to process about 1Tb worth of…

Read More >

Why column stores?

Introduction I’ve been working with data in many forms for my entire career. During this time, I have occasionally needed to build or query existing databases to get statistical data. Traditional databases are usually designed to query specific data from…

Read More >

A comparative analysis of Amazon SageMaker and Google Datalab

Amazon SageMaker and Google Datalab have fully managed cloud Jupyter notebooks for designing and developing machine learning and deep learning models by leveraging serverless cloud engines. However, as much as they have in common, there are key differences between the…

Read More >

Exploring Amazon SageMaker

Amazon SageMaker is another cloud-based fully managed data analytics/ machine learning modeling platform for designing, building and deploying data models. The key selling point of Amazon SageMaker is “zero-setup”. The concept of “zero-setup” means data science teams can entirely focus…

Read More >

Dipping your toes into building an Analytics Platform on Google Cloud Platform

“We have many disparate data sources and we’re having a hard time getting a global view of all our data across our organization.” “Our data is currently all in <enter data warehouse name here> and we want to migrate it…

Read More >

Building a custom routing NiFi processor with Scala

In this post we will build a toy example NiFi processor which is still quite efficient and has powerful capabilities. Processor logic is straightforward: it will read incoming files line by line, apply given function to transform each line into…

Read More >

Updating Elasticsearch indexes with Spark

With the extensive adoption of Elasticsearch as a search and analytics engine, more often we build data pipelines that interact with Elasticsearch. And apparently, most often the processing framework of choice is Apache Spark. Although reading data from Elasticsearch and…

Read More >

Minimal Twitter to Google Pub/Sub example with Scala

Recently I was looking for a simple Twitter to Pub/Sub streaming pipeline and ended up with own implementation in Scala. I tried to make it as compact as possible. So I chose the dispatch and Google Pub/Sub client libraries for…

Read More >
Page 2 of 1012345...10...Last Page »