Pythian EDP QuickStart Components: Part 4

Scott McCormick

September 14, 2022

Tags: Google Cloud, Enterprise Data Platform (EDP), Technical Track, Business Insights

Google Cloud Storage provides a secure, cost-effective storage layer that serves as the foundation of the platform.

When you ingest raw data into the platform, it is stored in a Cloud Storage bucket in its raw format before any transformations take place. This data is preserved for audit, reconciliation, and playback purposes, ensuring a single source of truth exists for the data. Based on enterprise retention policy and/or regulatory controls, this data may be periodically purged or moved to more cost-effective storage classes, such as Nearline or Coldline storage.
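As a minimal sketch of such a retention policy, the function below builds a bucket lifecycle configuration in the JSON shape used by Cloud Storage lifecycle files. The function name and the 30/90/365-day thresholds are assumptions chosen for illustration, not values from the QuickStart.

```python
import json


def build_lifecycle_config(nearline_after_days: int,
                           coldline_after_days: int,
                           delete_after_days: int) -> dict:
    """Build a lifecycle configuration that ages raw data into cheaper classes.

    The thresholds are object ages in days and must be strictly increasing.
    """
    if not 0 < nearline_after_days < coldline_after_days < delete_after_days:
        raise ValueError("age thresholds must be positive and strictly increasing")
    return {
        "rule": [
            {   # Move objects to Nearline once they are rarely read.
                "action": {"type": "SetStorageClass", "storageClass": "NEARLINE"},
                "condition": {"age": nearline_after_days},
            },
            {   # Move older objects to Coldline.
                "action": {"type": "SetStorageClass", "storageClass": "COLDLINE"},
                "condition": {"age": coldline_after_days},
            },
            {   # Purge objects once the retention period has passed.
                "action": {"type": "Delete"},
                "condition": {"age": delete_after_days},
            },
        ]
    }


if __name__ == "__main__":
    # Hypothetical thresholds; real values come from your retention policy.
    print(json.dumps(build_lifecycle_config(30, 90, 365), indent=2))
```

The resulting JSON can be saved to a file and applied to the raw ingestion bucket with your usual tooling.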

As data is transformed and cleaned through the data processing lifecycle, it is periodically stored in separate Cloud Storage buckets for further analysis. Within the data platform, a best practice is to keep the raw ingestion data in its own bucket, separate from all other data. While Google Cloud has strong IAM and object-level permissions, using a single bucket increases the risk of assigning improper permissions.

As an extension of Cloud Storage, automated triggers are set up on buckets to respond to data events within each bucket. For example, uploading an object into a bucket triggers a cloud function, which can perform transformations or start a downstream job in Dataflow.
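The upload trigger can be sketched as a handler that receives the storage event and decides which downstream job to start. This is a sketch under stated assumptions: it relies only on the `bucket` and `name` fields of the object event, the template names in `TEMPLATE_BY_SUFFIX` are hypothetical, and the actual job launch call is left out.

```python
from pathlib import PurePosixPath
from typing import Optional

# Hypothetical mapping from file extension to a downstream job template.
TEMPLATE_BY_SUFFIX = {".csv": "clean-csv", ".json": "clean-json"}


def on_object_finalized(event: dict, context=None) -> Optional[dict]:
    """Handle an object-upload event and build a downstream job request.

    Returns None when the uploaded file type has no registered template.
    """
    bucket = event["bucket"]
    name = event["name"]
    suffix = PurePosixPath(name).suffix.lower()
    template = TEMPLATE_BY_SUFFIX.get(suffix)
    if template is None:
        return None  # Unknown file type: nothing to launch.
    # A real function would submit this request to the processing engine here.
    return {"template": template, "parameters": {"input": f"gs://{bucket}/{name}"}}


if __name__ == "__main__":
    print(on_object_finalized({"bucket": "raw-landing", "name": "sales/orders.csv"}))
```

Keeping the routing decision in a plain function makes it easy to test locally before deploying it behind a bucket trigger.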

Pub/Sub

Pub/Sub is used as a message queue and for processing control throughout the data platform.

Technical issues during the life of a data platform are inevitable. Building resilient systems helps ensure no data is lost when an issue occurs. The data platform design should account for this by building in queues that can stage the data if a particular piece of the platform begins to have issues. Within Google Cloud, this can be done by using Pub/Sub with a well-considered set of message attributes to assist with data routing.

In addition to this functionality, Pub/Sub enables new systems to plug into the data platform without their requirements having to be fully understood on day one. For example, a data platform that allows data scientists to be notified when data has been cleaned, but before it is transformed, is much more flexible than one that requires all end users to access the data at the same endpoint.
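To illustrate routing by message attributes, here is a small in-memory sketch. It mirrors the Pub/Sub message shape (a bytes payload plus string attributes) but does not call the Pub/Sub client; the attribute names `source` and `stage` and both function names are assumptions made for this example.

```python
import json
from collections import defaultdict


def build_message(payload: dict, source: str, stage: str) -> dict:
    """Build a message with a bytes payload and string routing attributes."""
    return {
        "data": json.dumps(payload).encode("utf-8"),
        "attributes": {"source": source, "stage": stage},
    }


def route(messages: list, key: str = "stage") -> dict:
    """Group messages by one attribute so each consumer sees only its stage."""
    queues = defaultdict(list)
    for message in messages:
        # Messages without the attribute are kept aside rather than dropped.
        queues[message["attributes"].get(key, "unrouted")].append(message)
    return dict(queues)


if __name__ == "__main__":
    messages = [
        build_message({"id": 1}, source="crm", stage="raw"),
        build_message({"id": 2}, source="crm", stage="cleaned"),
    ]
    for stage, queued in route(messages).items():
        print(stage, len(queued))
```

With attributes like these, a data science consumer could subscribe only to messages whose stage is `cleaned`, while the transformation jobs keep consuming the rest.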

Cloud functions

Cloud functions perform the simple, quick, repeatable tasks that all pipelines require, such as extracting parameters from the metadata stores and configuring jobs for runtime. Because a cloud function can have thousands of parallel instances executing at once, there is little chance of bottlenecks and no need to worry about paying for excess compute usage.

Workflows

Workflow management and orchestration is critical to managing complexity as your data platform grows, and it handles overall job coordination within the platform. Google Cloud offers several options to orchestrate your data workflows:

Cloud Scheduler
Workflows
Cloud Composer
A third-party orchestrator running on Google Cloud, such as Argo or Luigi

Pythian’s EDP QuickStart uses Argo Workflows, which run within a GKE cluster and are similar to Composer DAGs. Argo provides an open source, container-native workflow engine for orchestrating parallel jobs on Kubernetes and enables Pythian to architect cloud-agnostic pipeline jobs that can run in any Kubernetes environment.

Argo is used to track task dependencies and batch job status. Because an event-driven cloud function has a maximum execution timeout of 9 minutes, this task tracking must be done outside of the cloud function.
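To make the idea of dependency tracking concrete, the sketch below orders a small task graph using Python's standard `graphlib` module. It is not Argo's API, and the task names are hypothetical; it only shows how an orchestrator can work out which tasks may run in parallel once their upstream tasks have finished.

```python
from graphlib import TopologicalSorter

# Hypothetical batch pipeline: each task maps to the tasks it depends on.
DEPENDENCIES = {
    "ingest": set(),
    "clean": {"ingest"},
    "mask": {"clean"},
    "transform": {"clean"},
    "load": {"mask", "transform"},
}


def run_in_waves(dependencies: dict) -> list:
    """Return tasks grouped into waves; tasks in one wave can run in parallel."""
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()  # Raises graphlib.CycleError if the graph has a cycle.
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # Tasks whose dependencies are done.
        waves.append(ready)
        sorter.done(*ready)  # Mark the wave complete to unblock the next one.
    return waves


if __name__ == "__main__":
    for number, wave in enumerate(run_in_waves(DEPENDENCIES), start=1):
        print(number, wave)
```

A workflow engine adds what this sketch leaves out: persistent job status, retries, and long-running execution that outlives any single function invocation.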

Data loss prevention

Cloud Data Loss Prevention (DLP) provides tools to classify, mask, tokenize, and transform sensitive elements to help you better manage the data you collect, store, or use for business or analytics. All of the big data processing engine code integrates flexibly with DLP so that sensitive data can be obfuscated when needed.
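The difference between masking and tokenizing can be shown with a short standard-library sketch. This is a local stand-in for illustration only and does not call the Cloud DLP API; the function names, the email pattern, and the key are all assumptions.

```python
import hashlib
import hmac
import re

# Simplified email pattern for illustration; real detection is more thorough.
EMAIL_PATTERN = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def mask(value: str, visible: int = 4, mask_char: str = "*") -> str:
    """Hide all but the last `visible` characters of a value."""
    if visible < 0:
        raise ValueError("visible must not be negative")
    hidden = max(len(value) - visible, 0)
    return mask_char * hidden + value[hidden:]


def tokenize(value: str, key: bytes) -> str:
    """Replace a value with a keyed, deterministic token (same input, same token)."""
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()[:16]


def obfuscate_emails(text: str, key: bytes) -> str:
    """Swap every email address in free text for a token."""
    return EMAIL_PATTERN.sub(lambda m: "EMAIL_" + tokenize(m.group(0), key), text)


if __name__ == "__main__":
    demo_key = b"demo-key"  # Hypothetical key; keep real keys in a secret store.
    print(mask("4111111111111111"))
    print(obfuscate_emails("contact jane@example.com today", demo_key))
```

Masking destroys the hidden characters, while deterministic tokens still allow joins and counts on the obfuscated column, which is often the reason to choose one over the other.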

Serverless Spark

Serverless Spark is an autoscaling implementation of Spark that has been integrated with Google-native and open source tools. Serverless Spark is used within the data platform to ingest, clean, and transform the initial raw source data. Spark was chosen because of the wide variety of file formats and processes that can be integrated with it.

Prebuilt Spark pipelines ingest, clean, and transform data and prepare it for loading into BigQuery. They are designed to process data from many disparate sources and can be deployed and updated through simple metadata updates.
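The metadata-driven idea can be sketched without Spark: one generic pipeline function whose behavior is controlled entirely by a metadata record. This example uses Python's `csv` module as a stand-in for the Spark job, and the metadata keys (`delimiter`, `rename`, `drop_if_empty`, `trim`) are hypothetical, not the QuickStart's actual schema.

```python
import csv
import io

# Hypothetical metadata record; changing it changes behavior with no code edits.
PIPELINE_METADATA = {
    "delimiter": ",",
    "rename": {"cust_id": "customer_id"},
    "drop_if_empty": ["customer_id"],
    "trim": True,
}


def run_pipeline(raw_text: str, metadata: dict) -> list:
    """Parse, clean, and rename delimited text as directed by the metadata.

    Assumes well-formed rows (no more fields than the header declares).
    """
    reader = csv.DictReader(io.StringIO(raw_text), delimiter=metadata["delimiter"])
    rows = []
    for record in reader:
        row = {}
        for column, value in record.items():
            value = value or ""  # Short rows yield None for missing fields.
            if metadata.get("trim"):
                value = value.strip()
            row[metadata["rename"].get(column, column)] = value
        # Skip records that are missing any required column.
        if any(not row.get(column) for column in metadata["drop_if_empty"]):
            continue
        rows.append(row)
    return rows


if __name__ == "__main__":
    sample = "cust_id,name\n 1 , Ann \n,Bob\n"
    print(run_pipeline(sample, PIPELINE_METADATA))
```

Onboarding a new source then means adding a metadata record rather than writing and deploying a new pipeline.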

Dataflow

Dataflow is the main processing engine for all non-BigQuery processes after the data is ingested. Like the pipelines written for Serverless Spark, the Dataflow pipelines are templated and allow for code-free deployments and updates as requirements change. In addition to standard Dataflow, we can also leverage Dataflow Prime, which brings new user benefits with innovations in resource utilization and distributed diagnostics. These new capabilities in Dataflow significantly reduce the time spent on infrastructure sizing and tuning tasks, and time spent diagnosing data freshness problems.

Dataform/dbt

Dataform or dbt is used to generate SQL-native pipelines for data manipulation directly within BigQuery. In addition, both tools have unit testing capabilities to ensure data quality, and both can display data lineage.

Dataform is a platform for data analysts to manage data workflows in cloud data warehouses such as Google BigQuery. It provides tools for analysts and data engineers to build workflows that transform raw data into reliable datasets ready for analysis. dbt is a development framework that combines modular SQL with software engineering best practices to make data transformation reliable, fast, and fun.

BigQuery

An enterprise-scale data warehouse, BigQuery is used to create data marts and as the main repository of structured reporting data for the data platform.

Using BigQuery and GCS, Pythian implements a Lakehouse and Data Mesh architecture for its customers. Data Mesh empowers people across the data stack to avoid being bottlenecked by a single team. It breaks silos into smaller organizational units in an architecture that provides federated data access. Lakehouse brings the data warehouse and data lake together, allowing different types and higher volumes of data. This effectively leads to schema-on-read instead of schema-on-write, a feature of data lakes that was thought to close some performance gaps in enterprise data warehouses. As an added benefit, this architecture also borrows the more rigorous data governance of data warehouses, which data lakes typically lack.

Looker

Looker enables you to build data-rich experiences that empower users and reduce reliance on data teams to extract value from your data. Users can analyze, visualize, and act on insights from the data in BigQuery or Cloud Storage buckets, knowing that the data is up to date. As part of Pythian’s EDP QuickStart solution, you can have prebuilt Looker dashboards deployed out of the box, with the option to implement custom dashboards specific to your use case.

Delivering business value with Pythian’s Enterprise Data Platform QuickStart

Now that you understand the components of Pythian’s EDP QuickStart, it’s natural to wonder what you can accomplish by applying it. What are the potential benefits? To learn about the value our EDP QuickStart can uncover, read the final installment of our blog series.
