While working on a Scala (2.12.4) project, I made a typo in my code and received an interesting output: While it was an absolutely useless construction, it still caused the misleading output seen above: I would say that this…Read More >
I was exploring the immutable RedBlackTree implementation in Scala (2.12.4) when I noticed something that wasn’t clear to me: Comparing keys via compare is followed by an explicit key “not equals” condition. I compared similar parts of the mutable RedBlackTree implementation and…Read More >
Memoization is a powerful technique that allows you to improve the performance of repeated computations. Although it would be a pretty handy feature, there is no memoization or result cache for UDFs in Spark as of today. In fact it’s something…Read More >
I was working on an instrumentation framework for Scala UDFs in Spark when I noticed a subtle difference in the execution plan depending on whether I used wrappers or not. It looked like some code was added or was not…Read More >
There is an interesting bug that was found during the latest performance tuning we performed for Spark 2.2 (2.3 is also affected). It was a batch Spark job scheduled to be executed hourly and to process about 1 TB worth of…Read More >
In this post we will build a toy example NiFi processor which is still quite efficient and has powerful capabilities. The processor logic is straightforward: it will read incoming files line by line, apply a given function to transform each line into…Read More >
Recently I was looking for a simple Twitter to Pub/Sub streaming pipeline and ended up with my own implementation in Scala. I tried to make it as compact as possible, so I chose the dispatch and Google Pub/Sub client libraries for…Read More >
In the third part of the series we will develop a pipeline to transform messages from the “data” Pub/Sub topic using messages from the “control” topic as source code for our data processor. The idea is to utilize the Scala ToolBox. It’s much…Read More >
In the second part of this series we will develop a pipeline to transform messages from the “data” Pub/Sub topic with the ability to control the process via the “control” topic. How to pass effectively non-immutable input into DoFn is not obvious,…Read More >
In this 3-part series I’ll show you how to build and run Apache Beam pipelines using the Java API in Scala. In the first part we will develop the simplest streaming pipeline that reads JSON messages from Google Cloud Pub/Sub, converts them…Read More >
© Copyright 2019 Pythian Group Inc. ® ALL RIGHTS RESERVED.