Google Cloud Dataflow is a data processing tool developed by Google that runs in the cloud. Dataflow is an easy to use, flexible tool that delivers completely automated scaling. It is deeply tied to the Google cloud infrastructure, making it a very powerful for projects running in Google Cloud.
Dataflow is an attractive resource management and job monitoring tool because it automatically manages all of the Google Cloud resources, including creating and tearing down Google Compute Engine resources, communicating with Google Cloud Storage, working with Google Cloud Pub/Sub, aggregating logs, etc.
Cloud Dataflow has the following major components:
SDK – The Dataflow SDK provides a programming mode that simplifies/abstracts out the processing of large amounts of data. Dataflow only provides a Java SDK at the moment, which is a barrier for non-Java programmers. More on the programming model later.
Google Cloud Platform Managed Services – This is one of my favourite features in Dataflow. Dataflow manages and ties together components, such as Google Compute Engine, spins up and tears down VMs, manages BigQuery, aggregates logs, etc.
These two components can be used together to create jobs.
Being programmatic, Dataflow is extremely flexible. It works well for both batch and streaming jobs. Dataflow excels at high-volume computations and provides a unified programming model, which is very efficient and rather simple considering how powerful it is.
The Dataflow programming model simplifies the mechanics of large-scale data processing and abstracts out a lot of the lower level tasks, such as cluster management, adding more nodes, etc. It lets you focus on the logical aspect of your pipeline and not worry about how the job will run.
The Dataflow pipeline consists of four major abstractions:
- Pipelines – A pipeline represents a complete process on a dataset or datasets. The data could be brought in from external data sources. It could then have a series of transformation operations, such as filter, joins, aggregation, etc., applied to the data to give it meaning and to achieve its desired form. This data could be then written to a sink. The sink could be within the Google Cloud platform or external. The sink could even be the same as the data source.
- PCollections – PCollections are datasets in the pipeline. PCollections could represent datasets of any size. These datasets could be bounded (fixed size – such as national census data) or unbounded (such as a Twitter feed or data from weather sensors). PCollections are the input and output of every transform operation.
- Transforms – Transforms are the data processing steps in the pipeline. Transforms take one or more PCollections, apply some transform operations to those collections, and then output to a PCollection.
- I/O Sinks and Sources – The Source and Sink APIs provide functions to read data into and out of collections. The sources act as the roots of the pipeline and the sinks are the endpoints of the pipeline. Dataflow has a set of built in sinks/sources, but it is also possible to write sinks sources for custom data sources.
Dataflow is also planning to add integration for Apache Flink and Apache Spark. Adding Spark and Flink integration would be a huge feature since it would open up the possibilities to use MLlib, Spark SQL, and Flink machine-learning capabilities.
One of the use cases we explored was to create a pipeline that ingests streaming data from several POS systems using Dataflow’s streaming APIs. This data can be then joined with customer profile data that is ingested incrementally on a daily basis from a relational database. We can then run some filtering and aggregation operations on this data. Using the sink for BigQuery, we can insert the data into BigQuery and then run queries. What makes this so attractive is that in this whole process of ingesting vast amounts of streaming data, there was no need to set up clusters or networks or install software, etc. We stayed focused on the data processing and the logic that went into it.
To summarize, Dataflow is the only data processing tool that completely manages the lower level infrastructure. This removes several API calls for monitoring the load and spinning up and tearing down VMs, aggregating logs, etc., and lets you focus on the logic of the task at hand. The abstractions are very easy to understand and work with and the Dataflow API also provides a good set of built in transform operations for tasks such as filtering, joining, grouping, and aggregation. Dataflow integrates really well with all components in the Google Cloud Platform, however, Dataflow does not have SDKs in any language besides Java, which is somewhat restrictive.