Google NEXT 2019 was a momentous event for the Pythian team. We celebrated winning the Google Cloud Global Data Analytics Partner of the Year award, had a chance to meet with customers and Googlers, and had many conversations about the future of the cloud. Because building and managing cloud data platforms is something I’m involved in on a daily basis, I wanted to highlight some of the new products and feature announcements made at NEXT in the area of data analytics, excluding the machine learning announcements, which I will leave to my data scientist colleagues.
Cloud Data Fusion
With Cloud Data Fusion, Google now offers a fully managed ETL service. After Google’s acquisition of Cask and its open-source data integration and data application development platform for Hadoop, it was clear that a Google Cloud Platform (GCP)-native solution would be forthcoming. Cloud Data Fusion lets you construct ETL pipelines using a visual interface and then execute those pipelines as either Apache Spark or MapReduce jobs on a GCP-managed Cloud Dataproc cluster. What’s really nice is that generating the Spark/MapReduce job is completely invisible to the end user: Cloud Data Fusion provisions the required Cloud Dataproc cluster, runs the job, and then removes the cluster, so you don’t pay for it to run 24/7.
Cloud Data Fusion also comes with a number of useful data transformation building blocks. For example, a built-in data wrangler lets you perform the most common column-level data manipulations, such as renaming or deleting columns, parsing various date formats, and so on. You can also add a data validation step to make sure that data conforms to certain rules before you save it to BigQuery or another destination.
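To make the wrangle-then-validate pattern concrete, here is a conceptual sketch in plain Python (not Data Fusion’s actual API — the column names and rules are hypothetical): each row is renamed and reformatted by wrangler-style transforms, then a validation step filters out rows that break the rules before they reach the destination.

```python
from datetime import datetime

def wrangle(row):
    """Wrangler-style transforms: rename a column, parse a date
    format, and delete an unused column (all names hypothetical)."""
    out = dict(row)
    out["customer_id"] = out.pop("cust_id")            # rename column
    out["signup_date"] = datetime.strptime(            # parse dd/mm/yyyy
        out["signup_date"], "%d/%m/%Y").date().isoformat()
    out.pop("legacy_flag", None)                       # delete column
    return out

def validate(row):
    """Validation step: only rows passing these rules are saved."""
    return bool(row["customer_id"]) and len(row["signup_date"]) == 10

rows = [{"cust_id": "c-1", "signup_date": "03/04/2019", "legacy_flag": "y"}]
clean = [r for r in (wrangle(r) for r in rows) if validate(r)]
```

In Data Fusion itself, each of these steps is a box you drag onto the pipeline canvas rather than code you write, but the data flow is the same.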
Out of the box, Cloud Data Fusion supports about a dozen data source types and destinations, including the most common RDBMS sources, flat files on FTP/GCS/S3 storage, Kafka, and a few others. Because Data Fusion is based on open source, it also allows you to create your own plugins (in Java) to implement new data sources, sinks, or custom data transformations. This is a distinctive capability among managed cloud ETL services, where you are usually restricted to an existing set of components.
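The plugin model boils down to a few small interfaces: a source that emits records, transforms that rewrite them, and a sink that receives them. Real Data Fusion plugins are written in Java against the underlying platform’s APIs; the following is only a conceptual sketch in Python, with made-up class names, to show why adding a new source or sink is just a matter of implementing one interface.

```python
class Source:
    def read(self):                      # yields records
        raise NotImplementedError

class Transform:
    def apply(self, record):             # returns a modified record
        raise NotImplementedError

class Sink:
    def write(self, record):             # persists a record
        raise NotImplementedError

# A custom "plugin" simply implements one of the interfaces above.
class CsvLineSource(Source):
    def __init__(self, lines):
        self.lines = lines
    def read(self):
        for line in self.lines:
            name, amount = line.split(",")
            yield {"name": name, "amount": int(amount)}

class UppercaseTransform(Transform):
    def apply(self, record):
        return {**record, "name": record["name"].upper()}

class ListSink(Sink):
    def __init__(self):
        self.records = []
    def write(self, record):
        self.records.append(record)

def run_pipeline(source, transforms, sink):
    """The engine only talks to the interfaces, never to concrete plugins."""
    for record in source.read():
        for t in transforms:
            record = t.apply(record)
        sink.write(record)

sink = ListSink()
run_pipeline(CsvLineSource(["alice,10", "bob,20"]), [UppercaseTransform()], sink)
```

Because the pipeline engine depends only on the interfaces, a new connector drops in without any change to the engine — which is essentially what the Java plugin SDK gives you.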
Today, even a small organization may need to integrate dozens of data sources and create and maintain hundreds of data pipelines; in larger enterprises, pipelines can easily number in the thousands. The goal of Cloud Data Fusion as a product is to provide an environment where it is easy not only to create new pipelines, but also to discover and understand existing ones. Pipelines created in Data Fusion register their lineage and metadata in a central repository, and this information can be used to perform an impact analysis – telling you which datasets and columns will be affected by a change to a given pipeline. In our experience, not having this visibility significantly restricts the productivity of data engineers, so this is a very important feature.
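Once lineage is recorded centrally, impact analysis is conceptually just a graph traversal: follow the column-to-column edges downstream from the thing you want to change. A minimal sketch (the table and column names are invented for illustration):

```python
# Lineage metadata: each entry maps a column to the downstream
# columns that pipelines derive from it.
edges = {
    "orders.amount":        ["daily_revenue.total"],
    "daily_revenue.total":  ["exec_dashboard.revenue"],
    "orders.customer_id":   ["daily_revenue.customers"],
}

def impacted(column, edges):
    """Return every downstream column affected by a change to `column`
    by walking the lineage graph transitively."""
    seen, stack = set(), [column]
    while stack:
        for nxt in edges.get(stack.pop(), []):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return sorted(seen)
```

So a change to `orders.amount` flags not just the directly derived `daily_revenue.total` but also the dashboard column built on top of it — exactly the kind of answer you want before touching a pipeline.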
Cloud Data Fusion is currently in beta and, understandably, is still rough around the edges. I tested a number of scenarios, including ingesting data from various sources into BigQuery, which seems to work just fine. You can still run into sporadic issues in more complex cases with multiple destinations and complex transformations. This will definitely improve in the weeks to come, and we look forward to working closely with the Google Cloud Data Fusion team to make this already exciting product even better.
BigQuery BI Engine
Another interesting announcement is the beta release of BigQuery BI Engine — a fast, in-memory caching layer on top of BigQuery. While BigQuery itself is very scalable and achieves remarkable performance when dealing with terabytes and petabytes of data, it has its limitations when it comes to low latency and high concurrency. Like most distributed big data query engines, BigQuery trades low query latency for scalability, which is challenging for use cases with many concurrent queries that need sub-second results. Today’s analytics users expect their reports and dashboards to be interactive, responsive, and up to date, regardless of data volumes or how many concurrent queries are running.
Various BI tools address this problem by implementing their own in-memory data stores to speed up the reporting interface, but then you need to solve the problem of keeping your BI tool cache and warehouse in sync, usually by scheduling periodic data extracts from the warehouse. This, in turn, introduces data freshness issues.
BigQuery BI Engine addresses this by adding an in-memory cache between BigQuery and the BI application, which allows sub-second response times for queries coming from BI tools. Since BI Engine is built into and managed by the BigQuery service, you don’t need to worry about keeping cached data up to date. The cache size is limited to 10GB, but since BI Engine only caches the columns it needs for specific reports, this may be enough for most use cases. Currently, only Google’s Data Studio BI tool supports BigQuery BI Engine, but I’m sure we will see support from other BI vendors soon.
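The key idea — cache only the columns a report actually touches, within a fixed memory budget — can be illustrated with a toy sketch. This is not how BI Engine is implemented (those details aren’t public); it is a conceptual model with invented sizes and column names:

```python
class ColumnCache:
    """Toy model of a column-level cache with a capacity cap:
    small, frequently used columns live in memory; anything that
    doesn't fit is still served, just from the warehouse."""
    def __init__(self, warehouse, capacity_bytes):
        self.warehouse = warehouse        # full table: {column: [values]}
        self.capacity = capacity_bytes
        self.cache = {}
        self.used = 0

    def _size(self, values):
        return sum(len(str(v)) for v in values)   # crude size estimate

    def get_column(self, name):
        if name in self.cache:                    # hit: served from memory
            return self.cache[name]
        values = self.warehouse[name]             # miss: warehouse scan
        cost = self._size(values)
        if self.used + cost <= self.capacity:     # cache only if it fits
            self.cache[name] = values
            self.used += cost
        return values

warehouse = {"region": ["east", "west"],
             "revenue": [100, 200],
             "notes": ["x" * 50, "x" * 50]}
cache = ColumnCache(warehouse, capacity_bytes=40)
cache.get_column("revenue")   # small column: cached for later queries
cache.get_column("notes")     # exceeds the budget: served but not cached
```

The point of the model: because a dashboard typically queries a narrow slice of a wide table, even a hard 10GB cap can cover most interactive workloads.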
BigQuery Data Transfer Service third-party integrations
Getting data reliably into your analytics platform is a difficult problem to solve, especially when you may have dozens of different data sources, from relational databases to the latest and greatest marketing SaaS application. Integrating SaaS data into your platform is especially time-consuming because not only are there hundreds of SaaS applications, they don’t seem to follow any particular standard for how their data can be accessed. You can spend a lot of time just going through the API documentation for a particular service you need to integrate, and then there are inevitable API changes requiring ongoing adjustments to your ingestion process.
Previously, BigQuery Data Transfer Service allowed you to easily ingest data from Google-owned SaaS services such as Google Analytics, YouTube, Google Ad Manager, and other Google services into BigQuery. The Data Transfer Service handled all the complexities of the underlying SaaS APIs, including changes to those APIs, and basically made ingesting data from the supported sources as simple as a few clicks in the GCP portal.
While very useful, until recently BigQuery Data Transfer Service was limited to a handful of Google-supported data sources. At Google NEXT 2019, Google announced the extension of this service to support third-party integrations. Over the last few years, integrating data from various SaaS applications has become an established SaaS business in its own right, and this new feature means you can use a service like Fivetran to ingest data from the most popular SaaS applications, such as Salesforce and Marketo, into BigQuery without having to worry about low-level integration details.
You can provision these third-party integrations directly from the GCP web interface, with unified billing that combines your GCP and third-party SaaS costs. It’s worth mentioning that today only service provisioning and billing are unified: you will still need to use the third party’s own interfaces for any monitoring or troubleshooting needs.
Cloud Dataflow SQL (almost announced)
Finally, according to this blog post, Cloud Dataflow SQL will soon be released into public alpha. There aren’t many details about this new feature yet, but from the looks of it, you will be able to write a SQL query in the BigQuery interface and have it compiled into an Apache Beam/Dataflow job for you. Cloud Dataflow currently has a rather steep learning curve for data analysts and data engineers without a Java background, so I’m excited about any feature that makes creating distributed real-time data pipelines simpler.