Some of the highlights of our discussion were:
- AutoML
- Google Cloud Dataflow
- Master Node HA for Cloud Dataproc
- Access to GPUs on Preemptible VMs
AutoML
Google is making a strong push in the machine learning (ML) space, and it is fair to say that they are among the leaders in this area. As far as I know, they have one of the largest libraries for ML modelling.
To understand AutoML, think of the current state of ML service consumption models as a spectrum, akin to the IaaS, PaaS, SaaS, and FaaS consumption models. On one end, you have ML services that are more like the IaaS consumption model, where everything is very raw and there’s a lot of foundational work that needs to be done to get a working model up and running. On the other end of the spectrum, you have pre-built models that are exposed solely as API endpoints to be consumed on demand (e.g. Cloud Vision API, Cloud Speech API, etc.). Needless to say, the former carries the most overhead, since you need to manage the entire environment, but it also gives you the most flexibility, while the latter gives you the least flexibility.
What Google has done is introduce an intermediate service on this spectrum. This service sits between the pre-built models and a PaaS on which you develop your own models. The idea is that Google provides you with pre-built models that are conditioned for a certain type of data and task (e.g. images for image recognition), which you can then train further with a more specific data set to produce a specialized version of the model, one that is relevant to your specific workloads. A cooking analogy may help here: you can think of AutoML as a half-cooked meal that you buy from the grocery store and finish cooking to your heart’s delight.
One example of that is Google’s image processing API, the Vision API, which can be used to identify a cloud in an image. What AutoML can do is take this one step further by identifying the exact type of cloud in the picture (assuming you have trained the model to distinguish between different types of clouds). This is beneficial for someone who works in the weather industry, for example, and wants to run some analysis on certain cloud types (e.g. stratus, cumulonimbus, etc.) being formed in a certain geography.
Ultimately, Google is democratizing machine learning, and is trying to do so with varying levels of flexibility so that it can match your requirements, whatever they may be. AutoML is a step toward making machine learning models available to the more general user. It enables more organizations to leverage ML without having to go through all the grunt work that data scientists usually have to go through.
If you do not have much knowledge of linear algebra and how it is used to build ML models, or of what it takes to expose an ML model in production via an API, there is no need to worry, because all of this is mostly UI-driven and very accessible. You can train your own ML model for specific image processing tasks, and my guess is that they’re going to start rolling out similar services for areas such as language processing and video processing in the future.
Google Cloud Dataflow
Dataflow is a homegrown Google project. Originally it was an internal project, which Google open sourced a few years ago. Since then, the SDK has been donated to the Apache Software Foundation, where it lives on as Apache Beam. Essentially, that’s their cornerstone for everything ETL-driven. Google positions Beam to handle your batch processing workloads as well as your stream processing workloads, all using a single codebase and service.
Traditionally, people would follow what is often referred to as a lambda architecture, where you’d have to separate your batch and streaming data pipelines and workloads, as they would leverage different storage and processing engines. You’d then be faced with the challenge of integrating these streams at some point in the data pipeline in order to leverage the data in your analytics engine. Dataflow converges those two types of workloads into a single service and codebase, effectively freeing you from the need to deploy parallel pipelines for different workloads.
As for visualization, Google has provided (in partnership with Trifacta) a service called Dataprep, which is a GUI that interacts with Dataflow under the hood for your data processing workloads. Dataprep allows you to develop Dataflow-based pipelines through a visual interface and observe your transformations as they happen in real time. Behind the scenes, Dataprep generates Beam API code, which is then set to run on Cloud Dataflow.
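To make the "single codebase" idea a bit more concrete, here is a minimal sketch in plain Python. It deliberately does not use the Beam SDK; the function names, the 60-second fixed window, and the toy sources are all assumptions made up for illustration. The point is only that one transformation can consume both a bounded (batch) input and an unbounded (streaming) input, provided events arrive in timestamp order.

```python
from collections import Counter
from itertools import islice
from typing import Iterable, Iterator, List, Tuple

Event = Tuple[int, str]  # (timestamp in seconds, word); a made-up event shape


def windowed_counts(
    events: Iterable[Event], window_size: int = 60
) -> Iterator[Tuple[int, Counter]]:
    """Yield (window_start, word counts) for each fixed-size time window.

    Assumes events arrive in timestamp order. The same function accepts a
    finite list (batch) or an endless generator (stream).
    """
    current_window = None
    counts: Counter = Counter()
    for timestamp, word in events:
        window_start = timestamp - timestamp % window_size
        # A new window has started, so emit the one we just finished.
        if current_window is not None and window_start != current_window:
            yield current_window, counts
            counts = Counter()
        current_window = window_start
        counts[word] += 1
    # Only reached for bounded inputs: flush the last open window.
    if current_window is not None:
        yield current_window, counts


def batch_source() -> List[Event]:
    """Bounded input, e.g. records read from a file."""
    return [(0, "a"), (10, "b"), (65, "a"), (70, "a")]


def streaming_source() -> Iterator[Event]:
    """Unbounded input: emits one event every 20 seconds, forever."""
    timestamp = 0
    while True:
        yield timestamp, "tick"
        timestamp += 20


# The same transformation runs over both kinds of source.
batch_result = list(windowed_counts(batch_source()))
stream_result = list(islice(windowed_counts(streaming_source()), 2))
```

In a lambda-style setup, the batch and streaming paths would typically be two separate implementations of `windowed_counts` on two different engines; here there is only one.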
Master Node HA for Cloud Dataproc
This is one of the most recent and notable improvements for those who are heavy users of YARN and Hadoop processing clusters. Google’s Cloud Dataproc service now provides you with the option to deploy multi-master YARN clusters, enabling a higher level of availability for cluster management and job scheduling.
Previously, and with some other cloud providers as well, you could use the provider’s managed YARN cluster service; however, it would only be assigned a single master node to manage the cluster. This can lead to operational issues, since the cluster is centrally managed and jobs are scheduled on the cluster via the master node. If this node goes down, or its performance degrades for any reason, then so does the performance of your whole cluster.
With this new announcement, you can now have up to three master nodes for your managed YARN (Dataproc) cluster for increased availability and reliability of your YARN-based workloads.
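As a rough back-of-the-envelope illustration of why extra masters help, the sketch below compares one master against three. It rests on assumptions that are not from the announcement: node failures are independent, each node has a hypothetical availability of 99%, and the three-master cluster stays manageable as long as a majority (two of three) of the masters are up. Real clusters will not match these numbers exactly.

```python
from math import comb


def prob_at_least(k: int, n: int, p_up: float) -> float:
    """Probability that at least k of n independent nodes are up."""
    return sum(
        comb(n, i) * p_up**i * (1 - p_up) ** (n - i) for i in range(k, n + 1)
    )


P_UP = 0.99  # hypothetical per-node availability, chosen for illustration

# Single master: the cluster is manageable only while that one node is up.
single_master = prob_at_least(1, 1, P_UP)

# Three masters: assume a majority (2 of 3) must be up.
three_masters = prob_at_least(2, 3, P_UP)

print(f"single master: {single_master:.6f}")
print(f"three masters: {three_masters:.6f}")
```

Under these assumptions the chance of losing cluster management drops from about 1 in 100 to about 3 in 10,000.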
Access to GPUs on Preemptible VMs
Last but not least, Google also had an announcement this month regarding the extended availability of high-end hardware for preemptible VMs. Preemptible VMs usually cost approximately 20% of what a regular VM would cost; the trade-off is that they are not dedicated resources. They can be shut down at any moment with just a 30-second notice, and they generally do not run for longer than 24 hours.
Preemptible VMs are typically used for testing purposes, or for massively parallel, short-lived compute jobs that need to run on hundreds or thousands of machines at a time.
In the spirit of making these VMs applicable to more use cases, Google has now made it possible for you to deploy preemptible VMs that have GPUs attached to them. This allows users who require these resources, such as data scientists running tests or even operators of massive video rendering farms, to complete their jobs on preemptible VMs with GPUs attached at a much more economical rate. I personally use preemptible VMs quite frequently when I quickly want to spin up a Kubernetes cluster on GKE, or a large set of VMs for testing purposes.
This was a summary of the Google Cloud Platform topics we discussed during the podcast. Chris also welcomed Greg Baker (Amazon Web Services) and Warner Chaves (Microsoft Azure), who discussed topics related to their areas of expertise.
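To put the pricing above into rough numbers, here is a small sketch of the arithmetic for a short, massively parallel job. The hourly rate is a hypothetical placeholder rather than a real Google Cloud price; the 20% ratio and the 24-hour limit are the approximate figures mentioned above.

```python
REGULAR_HOURLY_RATE = 0.10   # hypothetical USD per VM-hour, not a real price
PREEMPTIBLE_RATIO = 0.20     # preemptible costs roughly 20% of regular
MAX_PREEMPTIBLE_HOURS = 24   # preemptible VMs generally stop within 24 hours


def job_cost(machines: int, hours: float, hourly_rate: float,
             price_ratio: float = 1.0) -> float:
    """Total cost of running `machines` VMs for `hours` at a scaled rate."""
    return machines * hours * hourly_rate * price_ratio


def fits_preemptible(hours: float) -> bool:
    """A job is a candidate only if it finishes within the runtime limit."""
    return hours <= MAX_PREEMPTIBLE_HOURS


machines, hours = 1000, 2  # e.g. a two-hour job fanned out across 1000 VMs

regular_cost = job_cost(machines, hours, REGULAR_HOURLY_RATE)
preemptible_cost = job_cost(machines, hours, REGULAR_HOURLY_RATE,
                            PREEMPTIBLE_RATIO)

print(f"fits on preemptible VMs: {fits_preemptible(hours)}")
print(f"regular: ${regular_cost:.2f}, preemptible: ${preemptible_cost:.2f}")
```

This ignores the cost of rerunning work lost to preemption, which is why the savings tend to be most attractive for jobs that are short-lived or easy to restart.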
Click here to hear the full conversation and be sure to subscribe to the podcast to be notified when a new episode has been released.
Interested in working with John? Schedule a tech call.