Introduction The best definition you are going to find for data is that data is the new oil in today’s world. Starting from that, we can define a new horizon and a new way of looking at how we treat…
Currently, there is no straightforward way to schedule Google Cloud Functions. It is still possible to achieve this by different means, such as (but not limited to): deploying a Compute Engine instance and setting a crontab entry; configuring HTTP/S uptime checks via…
Memoization is a powerful technique that allows you to improve the performance of repeatable computations. Although it would be a pretty handy feature, there is no memoization or result cache for UDFs in Spark as of today. In fact, it’s something…
I was working on an instrumentation framework for Scala UDFs in Spark when I noticed a subtle difference in the execution plan depending on whether I used wrappers or not. It looked like some code was added or was not…
Whether your core infrastructure is on-prem or in the cloud, there’s only one choice for your data hub
The public cloud effectively provides you with three options: Infrastructure as a Service, Platform as a Service or Software as a Service. And with IaaS, people often face the choice of “rent vs buy”.
We found an interesting bug during our latest performance tuning for Spark 2.2 (2.3 is also affected). The job in question was a batch Spark job scheduled to run hourly and to process about 1 TB worth of…
Introduction I’ve been working with data in many forms for my entire career. During this time, I have occasionally needed to build or query existing databases to get statistical data. Traditional databases are usually designed to query specific data from…
Amazon SageMaker and Google Datalab have fully managed cloud Jupyter notebooks for designing and developing machine learning and deep learning models by leveraging serverless cloud engines. However, as much as they have in common, there are key differences between the…
Amazon SageMaker is another cloud-based, fully managed data analytics/machine learning modeling platform for designing, building and deploying data models. The key selling point of Amazon SageMaker is “zero-setup”. The concept of “zero-setup” means data science teams can entirely focus…
In this post we will build a toy example NiFi processor which is still quite efficient and has powerful capabilities. The processor logic is straightforward: it will read incoming files line by line, apply a given function to transform each line into…