This is the third in a series of four posts on the importance of breaking down data silos for organizations. In this post, we look at which types of data platforms work best for a centralized, integrated data program.
In our previous two posts, we discussed the importance of centralizing your overall data program to ensure actionable, repeatable and accurate data and insights across your organization. As we learned in the last post, a strong master data management strategy must be centralized to follow data governance best practices and allow for consistency.
But a centralized data program also requires strong data integration tools to clean, deduplicate, integrate and transfer data from different sources and departments into one version of the truth across an organization.
This raises the question: how do the various types of data platforms – from on-premises to Hadoop to cloud-native – compare when it comes to effective big data integration?
On-premises, or “on-prem”, data warehouses are the most traditional of the three types. Historically, most organizations have held their data in a server room on the same premises as the office. To most employees, this was just a mysterious room that stayed locked at all times, except when IT staff shuffled in and out.
The overall pros and cons of on-premises data warehouses have been well documented: they can be expensive to scale, but can also offer additional security, localized speed, and control over connectivity, for example.
In terms of data integration, however, traditional on-prem data warehouses can be somewhat lacking. That’s because they usually weren’t built to handle immense volumes of data from a wide variety of sources and types, and retrofitting that capability often requires significant investments of money, time and staff.
Since they weren’t originally built with modern data integration in mind, many on-prem data warehouses, such as Oracle Exadata, also require the purchase of additional products – or even an additional suite of products – to ensure the minimum extract, transform and load (ETL) capabilities necessary for true data integration.
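The ETL pattern itself is simple; what many on-prem warehouses lack is built-in tooling for it. As a rough illustration – plain Python, with invented record layouts and field names that don’t come from any real product – an extract-transform-load pass over two departmental sources might look like:

```python
# Minimal ETL sketch: extract records from two hypothetical departmental
# sources, transform them into one schema (deduplicating on email), and
# "load" the result into a target list standing in for the warehouse.

def extract():
    # Two sources describing the same customer in different shapes.
    crm = [{"Name": "Ada Lovelace", "EMAIL": "ada@example.com"}]
    billing = [{"customer": "ada lovelace", "email": "ADA@EXAMPLE.COM"}]
    return crm, billing

def transform(crm, billing):
    # Normalize both sources into one schema, keyed on lowercased email.
    unified = {}
    for rec in crm:
        key = rec["EMAIL"].lower()
        unified[key] = {"name": rec["Name"], "email": key}
    for rec in billing:
        key = rec["email"].lower()
        # setdefault keeps the CRM record if the customer was already seen.
        unified.setdefault(key, {"name": rec["customer"].title(), "email": key})
    return list(unified.values())

def load(records, target):
    target.extend(records)

warehouse = []
crm, billing = extract()
load(transform(crm, billing), warehouse)
# Both source rows describe the same customer, so one record lands in the target.
print(warehouse)
```

The point isn’t the code – it’s that each of these steps (extraction, normalization, deduplication, loading) is something you must buy or build separately on a traditional on-prem warehouse.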
Similarly, products like Teradata Data Warehouse also typically require multiple platforms to meet even the most common use cases that require integrated data from disparate sources across the enterprise. Achieving full integration can be such a challenge that the company offers a data integration service to help clients who are struggling to figure out which platforms to use.
Because of the complexity and expense involved in achieving true data integration, on-prem data warehouses can also create optimal conditions for data siloing within an organization.
Hadoop is a strong option when it comes to big data storage and processing. But as with on-prem data warehouses, big data integration or cloud data integration with Hadoop can be challenging. This is usually one of the first big challenges organizations face after they’ve initially set up their Hadoop cluster.
For starters, simply using Hadoop requires specialized skills. Querying the system usually means writing MapReduce jobs in Java, or using tools like Hive or Pig. These technical complexities can make Hadoop difficult for non-specialized users. And data integration for Hadoop can be next to impossible for typical database administrators, unless – as in the case of on-prem data warehouses – additional tools are purchased.
That’s because on its own, Hadoop doesn’t deal with data ingestion, verification, transformation, or unification. Custom scripts can be written for ingestion into Hadoop depending on your organization’s environment, but they’re not very scalable and can break, potentially leaving users frustrated.
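To make that fragility concrete: a typical hand-rolled ingestion script hard-codes assumptions about the source layout, and it breaks the moment an upstream team adds or reorders a column. A toy sketch in plain Python (the column layout is hypothetical, and the actual HDFS write is omitted):

```python
import csv
import io

# Hypothetical hand-rolled ingestion script for loading CSV exports into a
# Hadoop cluster. It assumes a fixed column layout -- exactly the kind of
# assumption that breaks silently when the source schema changes.

EXPECTED_HEADER = ["id", "name", "amount"]

def ingest(raw_csv):
    reader = csv.reader(io.StringIO(raw_csv))
    header = next(reader)
    if header != EXPECTED_HEADER:
        # In a real deployment, a failure like this may go unnoticed until
        # downstream users find stale or missing data.
        raise ValueError(f"schema drift detected: {header}")
    return [dict(zip(header, row)) for row in reader]

rows = ingest("id,name,amount\n1,widget,9.99\n")
print(rows)

# An upstream change as small as one new column breaks the script:
try:
    ingest("id,name,currency,amount\n1,widget,USD,9.99\n")
except ValueError as err:
    print("ingestion failed:", err)
```

Multiply that by every source system feeding the cluster, and the maintenance burden becomes clear.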
And while data ingestion tools built for Hadoop – such as Flume, Sqoop and Attunity Replicate – can help fill the gaps, they rarely catch everything, and achieving full data integration still requires deep technical knowledge of the platform.
Cloud-native is quickly becoming one of the most common data platform types, and for good reason. These platforms are easy and inexpensive to scale as compute or storage demands change, and because data can be served from regions close to end users, they offer fast access for organizations with a global reach.
And strong cloud-native platforms are tailor-made for cloud data integration, by providing automated processes for ingesting, cleaning, and unifying information from a range of different sources such as RDBMS, SaaS, CSV, and JSON, along with semi-structured or unstructured data.
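As a small illustration of what such a platform automates behind the scenes, here is the format-unification step sketched in plain Python: one CSV source and one JSON source parsed into a single record shape. All source layouts and field names here are invented for illustration – a cloud-native platform would perform this mapping automatically at much larger scale.

```python
import csv
import io
import json

# Sketch of multi-format ingestion: a CSV export and a JSON API payload
# are parsed into one common record shape ({"order_id": str, "total": float}).

csv_source = "order_id,total\nA-1,19.90\n"
json_source = '[{"orderId": "A-2", "orderTotal": 25.00}]'

def from_csv(text):
    return [{"order_id": r["order_id"], "total": float(r["total"])}
            for r in csv.DictReader(io.StringIO(text))]

def from_json(text):
    return [{"order_id": r["orderId"], "total": float(r["orderTotal"])}
            for r in json.loads(text)]

orders = from_csv(csv_source) + from_json(json_source)
print(orders)  # two orders, one unified schema
```

The value of a cloud-native platform is doing this continuously and reliably across dozens of source types, rather than as a one-off script.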
Indeed, strong cloud-native data platforms offer data integration, data quality, and data governance solutions all under one roof. That means no holes to patch or gaps to fill. This kind of automated data curation ensures repeatable and consistent management of data throughout its lifecycle, establishing the ongoing verification, transformation and unification of complex, multi-source data on a single platform.
These platforms also apply data modifications at the source, rather than in the data hub, to ensure consistency of data across systems and departments.
And that means your users – be they business users, analytics users, or data scientists – always have valid, curated data ready to go.
For more in-depth information on cloud-based data platforms and how they can help you break down data silos, read our book, The Data Warehouse is Dead, Long Live the Data Platform.
Want to talk with a technical expert? Schedule a tech call with our team to get the conversation started.