Enterprises recognize that becoming truly data-driven will propel innovation, transformation, and differentiation. However, the modern analytics journey can be complicated and take many years, making swift results challenging.
Balancing the desire for quick wins with the skills needed to “do it right” can be difficult for most organizations. While they understand that significant business value can be unlocked from within their data, few organizations are able to realize this value.
Co-authored by Google Cloud and Pythian, an award-winning Google Premier, MSP, and Google Cloud Data Analytics Specialization Partner, this post walks you through the foundational components and best practices of an enterprise-scale data platform, and shows how Pythian accelerates this journey for customers with its Enterprise Data Platform (EDP) QuickStart.
Designing a data platform for the enterprise
Modern analytics demand self-service, clean, and integrated data of many types, as well as the ability to scale while ensuring strong security controls and data governance. A cloud-based system built on Google Cloud is the fastest way to realize these benefits.
The design of a secure enterprise-scale data platform requires several key capabilities:
Data ingestion layer compatible with multiple formats
A highly secured data ingestion layer can process many different file formats from disparate data sources, including Oracle and non-Oracle databases and both SQL and NoSQL data stores. The ingestion layer serves as a flexible landing zone that is strictly monitored, and the raw data is often processed by various Google Cloud or third-party tools, each chosen as the "best fit" for the specific task.
After landing and initial processing, the data should be clean and in a standardized format (Avro or Parquet), which simplifies the remaining processing logic. The data ingestion layer should also be able to ingest via database log-based change data capture (CDC) or normal batch processing, which is an important requirement for enterprise data warehouse modernization.
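To make the idea of a format-agnostic landing zone more concrete, here is a minimal sketch that routes a landed file to a format-specific parser and returns records in one common shape. The parser registry, the file extensions, and the `normalize` function are illustrative assumptions rather than part of any Google Cloud API, and a real pipeline would go on to write the result as Avro or Parquet instead of stopping at in-memory records.

```python
import csv
import io
import json
from typing import Callable, Dict, Iterator, List

Record = Dict[str, str]


def parse_csv(text: str) -> Iterator[Record]:
    """Yield one dict per CSV row, keyed by the header line."""
    yield from csv.DictReader(io.StringIO(text))


def parse_json_lines(text: str) -> Iterator[Record]:
    """Yield one dict per non-empty line of newline-delimited JSON."""
    for line in text.splitlines():
        if line.strip():
            yield {key: str(value) for key, value in json.loads(line).items()}


# Hypothetical registry mapping a file extension to its parser.
PARSERS: Dict[str, Callable[[str], Iterator[Record]]] = {
    "csv": parse_csv,
    "jsonl": parse_json_lines,
}


def normalize(filename: str, text: str) -> List[Record]:
    """Parse a landed file into records with trimmed, lower-cased keys.

    Assumes well-formed input: every row has a value for every column.
    """
    extension = filename.rsplit(".", 1)[-1].lower()
    if extension not in PARSERS:
        raise ValueError(f"unsupported format: {extension}")
    return [
        {key.strip().lower(): value.strip() for key, value in row.items()}
        for row in PARSERS[extension](text)
    ]
```

Because every source ends up in the same record shape, downstream steps only need to understand one format, which is the simplification the standardized Avro or Parquet layer is meant to provide.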
Data processing with disparate engines
Different data use cases may have very different data processing requirements, and those requirements will not always fit the same big data processing engine. Google Cloud has a wide variety of big data processing tools such as Dataproc, Dataflow, Dataform, and others.
A data platform should be flexible enough to run each process in the best-fit environment for that particular job. For example, a process that converts data to Avro format for use in a machine learning (ML) environment may use Spark because of its strong pandas and DataFrame libraries. A streaming process, on the other hand, may use Dataflow because streaming development is simpler there. Data that is already inside BigQuery can use Dataform for native SQL pipelines, such as aggregations or KPI calculations.
A flexible data platform will enable customers to choose the processing engine, source data formats, and target based upon the workload to fully exploit the rich capabilities of Google Cloud both today and tomorrow.
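The best-fit choice described above can be sketched as a small routing function. The `Workload` fields and the decision order are assumptions made for illustration; an actual platform would weigh more factors (cost, team skills, latency targets) before picking an engine.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Workload:
    """Hypothetical description of a pipeline, used only for routing."""

    name: str
    streaming: bool           # unbounded, continuously arriving input
    source_in_bigquery: bool  # the data already lives in BigQuery
    sql_only: bool            # the transformation is expressible in SQL


def choose_engine(workload: Workload) -> str:
    """Return the engine this sketch treats as the best fit."""
    if workload.streaming:
        return "Dataflow"
    if workload.source_in_bigquery and workload.sql_only:
        return "Dataform"
    # Everything else falls back to Spark on Dataproc in this sketch.
    return "Dataproc"
```

Keeping the routing rule in one place means a new engine can be added later without touching the pipelines themselves.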
Data segmentation
Data segmentation is the process of grouping your data based on use cases and types of information but also based on the sensitivity of that data and the level of authority needed to access that type of information. Once data is segmented, different security parameters and authentication rules should be established, depending on the data segment at hand.
Within the context of a Google Cloud EDP, there are several options:
- Ensure the raw data is stored in a separate project and use multiple Google Cloud Storage (GCS) buckets for different data throughout the process.
- Make use of the Data Catalog for advanced security options within both GCS Buckets and BigQuery.
- Use Dataplex as a single-pane-of-glass view for providing access to the various data stores in your GCP projects. Using Dataplex, you can perform data governance across your data lake, Delta Lakehouse, and/or data warehouse architectures.
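As a rough illustration of segmenting by sensitivity, the sketch below classifies a dataset by its most sensitive column and maps it to a separate bucket. The column tags, bucket names, and the choice to treat untagged columns as internal are all hypothetical; in practice the tags would come from a catalog rather than a hard-coded dictionary.

```python
from enum import IntEnum
from typing import Dict, Iterable


class Sensitivity(IntEnum):
    PUBLIC = 1
    INTERNAL = 2
    RESTRICTED = 3


# Hypothetical column tags; a real platform would read these from a catalog.
COLUMN_SENSITIVITY: Dict[str, Sensitivity] = {
    "product_name": Sensitivity.PUBLIC,
    "cost_center": Sensitivity.INTERNAL,
    "customer_email": Sensitivity.RESTRICTED,
}

# Hypothetical bucket names, one per segment.
SEGMENT_BUCKETS: Dict[Sensitivity, str] = {
    Sensitivity.PUBLIC: "edp-example-public",
    Sensitivity.INTERNAL: "edp-example-internal",
    Sensitivity.RESTRICTED: "edp-example-restricted",
}


def classify(columns: Iterable[str]) -> Sensitivity:
    """A dataset is as sensitive as its most sensitive column.

    Untagged columns are treated as INTERNAL so nothing is public by default.
    """
    return max(
        (COLUMN_SENSITIVITY.get(column, Sensitivity.INTERNAL) for column in columns),
        default=Sensitivity.INTERNAL,
    )


def storage_path(dataset: str, columns: Iterable[str]) -> str:
    """Return the object storage prefix for the dataset's segment."""
    return f"gs://{SEGMENT_BUCKETS[classify(columns)]}/{dataset}/"
```

Once each segment has its own location, security parameters and authentication rules can be attached per segment instead of per file.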
Data lineage
Data lineage is the ability to trace how data flows from source to destination, including all transformations performed along the way, such as cleaning, aggregations, and calculations. A complete data lineage process includes three things: a business glossary that defines each column, such as what's provided by Looker's LookML models; an understanding of the data owners, to ensure changes to the data reflect the business reality; and a (generally) automated lineage gathering process, to avoid missing any pipelines.
Data lineage can be a challenging area to implement in an enterprise context, and holistic coverage is not always possible. A more effective strategy is to focus on key lineage flows and build coverage incrementally across the platform rather than aiming for perfection from day one. For further guidance on data lineage, please refer to this guide or reach out to a skilled data lineage partner such as Pythian.
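To show what tracing a flow from destination back to source can look like, here is a minimal in-memory lineage graph. The `LineageGraph` class and the table names in the usage note are hypothetical and are not tied to any Google Cloud lineage service; automated tooling would populate the edges from pipeline runs rather than by hand.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple


class LineageGraph:
    """Stores, for each target, the sources and transformation that feed it."""

    def __init__(self) -> None:
        self._inputs: Dict[str, List[Tuple[str, str]]] = defaultdict(list)

    def record(self, source: str, target: str, transformation: str) -> None:
        """Register that `target` is derived from `source`."""
        self._inputs[target].append((source, transformation))

    def upstream(self, target: str) -> Set[str]:
        """Return every dataset that directly or indirectly feeds `target`."""
        seen: Set[str] = set()
        stack = [target]
        while stack:
            node = stack.pop()
            for source, _transformation in self._inputs.get(node, []):
                if source not in seen:
                    seen.add(source)
                    stack.append(source)
        return seen
```

For example, after recording `raw.orders -> clean.orders -> mart.daily_revenue`, calling `upstream("mart.daily_revenue")` returns both upstream tables, which is the kind of question an impact analysis needs answered before a schema change.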
Role-based access control
Role-based access control is a mechanism that enables you to configure fine-grained, specific sets of permissions that define how a given user, or group of users, can interact with any object in your environment. In the context of an EDP, role-based access controls allow you to store data in a unified central location such as a data lake while maintaining fine-grained access controls based on user needs. Rather than granting access to a complete datastore or dataset, you can grant access to only the data that a user needs for their role or analysis, ensuring that access to other data is restricted.
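The following sketch illustrates the principle of granting access to only the columns a role needs. The roles, users, and column names are invented for the example, and the check is done in plain Python; on Google Cloud the equivalent rules would be expressed through IAM and the data store's own access controls rather than application code.

```python
from typing import Dict, FrozenSet, List

# Hypothetical grants: role -> table -> readable columns.
ROLE_GRANTS: Dict[str, Dict[str, FrozenSet[str]]] = {
    "finance_analyst": {"orders": frozenset({"order_id", "amount", "order_date"})},
    "support_agent": {"orders": frozenset({"order_id", "customer_email"})},
}

# Hypothetical role assignments: user -> roles.
USER_ROLES: Dict[str, List[str]] = {
    "ana": ["finance_analyst"],
    "sam": ["support_agent", "finance_analyst"],
}


def allowed_columns(user: str, table: str) -> FrozenSet[str]:
    """Union of the columns granted by every role the user holds."""
    columns: set = set()
    for role in USER_ROLES.get(user, []):
        columns |= ROLE_GRANTS.get(role, {}).get(table, frozenset())
    return frozenset(columns)


def authorize(user: str, table: str, requested: List[str]) -> List[str]:
    """Return the requested columns, or raise if any of them is not granted."""
    denied = sorted(set(requested) - allowed_columns(user, table))
    if denied:
        raise PermissionError(f"{user} may not read {table}: {', '.join(denied)}")
    return requested
```

Unknown users and unknown tables resolve to an empty set, so the default outcome in this sketch is denial.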
Decouple compute from storage
The separation of storage and compute is a key component of a scalable and cost-effective data platform. In cloud pricing models, storage is significantly cheaper than compute, and as data volumes grow exponentially in an enterprise, it's not feasible to store data long term in a compute-tied storage system such as the Hadoop Distributed File System (HDFS).
By separating storage from compute, you can leverage object storage, which is inexpensive, virtually unlimited, scalable, and highly available by default.
You can then optimize the compute layers that process your data based on data volumes, data type, and use case requirements. For example, if you require big data processing, you can quickly provision a Dataproc cluster, paying compute costs only for the time the job runs. Alternatively, for small data jobs, you can leverage functions as a service via Cloud Functions, which provide further cost optimization and minimal operational overhead.
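The cost argument can be worked through with simple arithmetic. The function below is a sketch only: the rates passed to it in the usage note are placeholder numbers chosen for easy math, not Google Cloud prices, and real bills include items this ignores (network, requests, minimum billing increments).

```python
HOURS_PER_MONTH = 730  # average hours in a month (8760 / 12)


def monthly_cost(
    storage_tb: float,
    storage_rate_per_tb: float,
    compute_hours: float,
    compute_rate_per_hour: float,
) -> float:
    """Storage is billed for everything kept; compute only for hours used."""
    return storage_tb * storage_rate_per_tb + compute_hours * compute_rate_per_hour


def savings_from_decoupling(
    storage_tb: float,
    storage_rate_per_tb: float,
    job_hours: float,
    compute_rate_per_hour: float,
) -> float:
    """Compare an always-on cluster with one that runs only while jobs run."""
    always_on = monthly_cost(
        storage_tb, storage_rate_per_tb, HOURS_PER_MONTH, compute_rate_per_hour
    )
    on_demand = monthly_cost(
        storage_tb, storage_rate_per_tb, job_hours, compute_rate_per_hour
    )
    return always_on - on_demand
```

With placeholder inputs of 10 TB at 20 per TB and compute at 2 per hour, a cluster that runs 60 hours a month instead of all 730 differs only in the compute term, since the storage term is the same in both cases.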
Complete metadata tracking
Metadata tracking enables you to track data as it moves through the data platform. Metadata is critical to the long-term support of a data platform, as it is often the only window into how a given pipeline is currently running or has previously performed. Metadata is used heavily by a DataOps team to track, alert on, and respond to issues quickly and, often, automatically.
Some commonly tracked metadata include:
- Current and historic pipeline configuration values
- Pipeline runtime statuses
- Pipeline errors, warnings, and other messages
- DataOps data metrics
  - Completeness: Ensure all relevant data is stored
  - Consistency: Ensure values are consistent across data sets
  - Uniqueness: No duplicate data
  - Validity: Data conforms to the syntax of its definition (range, type, format)
  - Timeliness: All data is stored within any required timeframe
  - Accuracy: Data correctly describes an object or event
- Current and historic data schemas
- Data lineage
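Several of the data metrics listed above can be computed as simple ratios per pipeline run and stored alongside the other metadata. The sketch below covers completeness, uniqueness, and validity; the function names, the row shape, and the `is_decimal` validator are assumptions for illustration, and the exact definitions of each metric vary between teams.

```python
from typing import Callable, Dict, List, Optional, Sequence

Row = Dict[str, Optional[str]]


def _present(value: Optional[str]) -> bool:
    """Treat None and the empty string as missing."""
    return value not in (None, "")


def completeness(rows: List[Row], required: Sequence[str]) -> float:
    """Share of required cells that hold a value."""
    total = len(rows) * len(required)
    if total == 0:
        return 1.0
    filled = sum(1 for row in rows for column in required if _present(row.get(column)))
    return filled / total


def uniqueness(rows: List[Row], key: str) -> float:
    """Share of rows that carry a distinct value in the key column."""
    if not rows:
        return 1.0
    return len({row.get(key) for row in rows}) / len(rows)


def validity(rows: List[Row], column: str, is_valid: Callable[[str], bool]) -> float:
    """Share of present values in a column that pass the validator."""
    values = [row[column] for row in rows if _present(row.get(column))]
    if not values:
        return 1.0
    return sum(1 for value in values if is_valid(value)) / len(values)


def is_decimal(value: str) -> bool:
    """Example validator: the value parses as a number."""
    try:
        float(value)
    except ValueError:
        return False
    return True
```

Recording these ratios on every run gives a DataOps team a baseline, so an alert can fire when a metric drops below its usual level.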
Building an EDP: Best practices and considerations
If you’re interested in how best to build an EDP, stay tuned for Part 2 of the series, where we’ll explore the practices you should apply when going in this direction.