Tag: Apache Beam

Apache Beam: the Future of Data Processing?

Apache Beam is a unified model for defining both batch and streaming data-parallel processing pipelines. It provides software development kits (SDKs) for defining and constructing data processing pipelines, as well as runners to execute them. Why Apache Beam?…

Read More >

Caching Alternatives in Google Dataflow: Avoiding Quota Limits and Improving Performance

The problem: When building data pipelines, it’s very common to require an external API call to enrich, validate or obfuscate data using external services. This can happen in either a streaming or a batch pipeline. The situation is the same: call external services…

Read More >

Putting ML Prototypes Into Production Using TensorFlow Extended (TFX)

Introduction: Machine learning projects start by building a proof-of-concept (POC) or a prototype. This entails choosing the right dataset (features), the appropriate ML algorithm/model and the hyper-parameters for that algorithm. As a result of a POC, we would have a trained…

Read More >

Datascape Podcast Episode 30 – Learn About Streaming

In this episode of the Datascape Podcast, we welcome back Danil Zburivsky from Pythian to talk about the new streaming technologies that he and his team are working on. The step up from batch processing to streaming seems to be…

Read More >

Apache Beam Pipelines with Scala: Part 3 – Dynamic Processing

In the third part of the series we will develop a pipeline to transform messages from the “data” Pub/Sub topic using messages from the “control” topic as source code for our data processor. The idea is to utilize Scala ToolBox. It’s much…

Read More >

Apache Beam Pipelines with Scala: Part 2 – Side Input

In the second part of this series we will develop a pipeline to transform messages from the “data” Pub/Sub topic with the ability to control the process via the “control” topic. How to effectively pass non-immutable input into a DoFn is not obvious,…

Read More >

Apache Beam Pipelines with Scala: Part 1 – Template

In this 3-part series I’ll show you how to build and run Apache Beam pipelines using the Java API in Scala. In the first part we will develop the simplest streaming pipeline that reads JSON messages from Google Cloud Pub/Sub, converts them…

Read More >