If you’re a technical person trying to explain a big data platform like Apache or Cloudera Hadoop, or a Hadoop-based cloud database as a service (DBaaS) like Azure’s HDInsight, Amazon’s Elastic MapReduce (EMR), or Google’s Cloud Dataproc, to the business, you’ve probably been stymied by a communication barrier. Somehow they just don’t get how awesome it is to have schema-on-read instead of schema-on-write, or why being able to store unstructured data matters so much, or why faster transforms make such a difference.
Here’s how to explain the benefits of big data systems, which for simplicity we’ll call Hadoop, to people who have only ever known traditional data warehouses.
It all starts with what the business wants.
The business wants to be a truly data-driven organization, and to become one, it must be able to use ALL types of data to drive business improvements: easily, fully, quickly, and cost-effectively.
Relative to traditional data warehouse systems, Hadoop brings easy access to more and different kinds of data, quickly and cost-effectively, in a way that wasn’t possible until these new technologies were invented. Here’s how:
All types of data
Unlike traditional databases, Hadoop-based big data systems are very flexible: they accept large amounts of data of any type, regardless of structure. Until Hadoop, it was expensive and difficult to store unstructured data. Now unstructured data like clickstream data, social media data, and audio files such as recorded calls from call centers can be stored and analyzed. Hadoop’s ability to store lots of data cost-effectively means that you can now get access to granular or raw data, not just the aggregated data that is most commonly stored in traditional data warehouses.
Because Hadoop is inexpensive relative to RDBMSs and can handle all kinds of data, it is becoming the single place where analysis and programming can be done across multiple sources of data. No more silos of data and “cutting and pasting” from different data sources to get a single integrated view.
With Hadoop-based DBaaS, compute and storage can scale independently of each other, which means that variations in demand can be met easily and quickly. To the business user this simply means that however they want to use the system at any time, even as that need changes, the system can adapt much faster than an RDBMS could.
Hadoop/DBaaS isn’t restricted to queries defined by predefined schemas. The traditional RDBMS schema-on-write model is good at answering the “known unknown” questions: those that we could model the schema for ahead of time. But there is a very large class of exploratory questions that fall under the category of “unknown unknowns”: questions that we didn’t know about or expect ahead of time. Hadoop’s schema-on-read approach means that Hadoop is better suited for exploring data, which is really important for advanced analytics where a lot of the time you “don’t know what you don’t know.”
Hadoop’s transform efficiency means even huge amounts of data can be transformed quickly. To the business user this means less waiting for their data, or more timely access to data for more real-time decision making.
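If someone asks why the transform is fast, one commonly cited reason is the MapReduce model: the data is split into chunks, each chunk is transformed independently, and the partial results are merged at the end. The sketch below is only an illustration of that idea in plain Python, not Hadoop code. The clickstream rows, the function names, and the use of threads to stand in for cluster nodes are all assumptions made for the example.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


def map_chunk(lines):
    """Map step: count page views in one chunk, independently of other chunks."""
    return Counter(line.split(",")[1] for line in lines)


def merge_counts(left, right):
    """Reduce step: merge two partial counts into one."""
    return left + right


def count_page_views(chunks, workers=4):
    """Run the map step on every chunk concurrently, then reduce the partial results."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(map_chunk, chunks))
    return reduce(merge_counts, partials, Counter())


# Hypothetical clickstream rows in "user,page" form, already split into chunks
# (on a real cluster each chunk would live on a different node).
chunks = [
    ["u1,/home", "u2,/pricing"],
    ["u1,/pricing", "u3,/home"],
    ["u2,/home"],
]

totals = count_page_views(chunks)
print(dict(totals))
```

Because no chunk depends on any other, adding more workers (or, on a cluster, more nodes) lets more chunks be processed at the same time, which is the intuition behind less waiting for data.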
A schema-on-read system, or ETL on the fly, means new data can start flowing into the system in any shape or form without having to set up schemas first. Months later, you can change your schema parser to immediately expose the new data elements without going through an extensive database reload or column re-creation. Once again, this means less waiting to get access to incoming data.
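For a technical audience, a tiny example can make schema-on-read concrete. The sketch below is a minimal illustration in plain Python, not Hadoop code: the JSON events, the field names, and the read_with_schema helper are all hypothetical. The point it shows is that the raw records are stored once, untouched, and the schema is just a list of fields applied when the data is read.

```python
import json

# Hypothetical raw events, stored exactly as they arrived.
# Later events carry a "device" field nobody planned for at the start.
RAW_EVENTS = [
    '{"user": "u1", "page": "/home"}',
    '{"user": "u2", "page": "/pricing", "device": "mobile"}',
    '{"user": "u3", "page": "/home", "device": "desktop"}',
]


def read_with_schema(raw_lines, fields):
    """Apply a schema at read time: project each raw record onto the requested fields."""
    rows = []
    for line in raw_lines:
        record = json.loads(line)
        # Fields missing from a record come back as None instead of failing the load.
        rows.append({name: record.get(name) for name in fields})
    return rows


# The schema the team started with.
schema_v1 = ["user", "page"]
# Months later: expose the new element by changing only the read-time schema.
schema_v2 = ["user", "page", "device"]

rows_v1 = read_with_schema(RAW_EVENTS, schema_v1)
rows_v2 = read_with_schema(RAW_EVENTS, schema_v2)
print(rows_v2[1])
```

Nothing in RAW_EVENTS was reloaded or altered between the two reads; only the field list changed. In a schema-on-write system, the equivalent change would typically mean altering the table and reloading the data.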
Hadoop-based systems can process real-time streams in memory, which to the business means you can do real-time analytics for up-to-the-minute information.
Traditional archival systems are an inexpensive way to store data (especially when compared to RDBMSs), but since you can’t run processing or queries in archival systems, you have to restore the data into an RDBMS to use it again, which makes it very expensive. Hadoop/DBaaS allows for cost-effective storage of detailed historical data that can be queried immediately, at any time (Active Archive). To the business this means they can get access to archived data more quickly.
Unlike RDBMSs, which scale vertically, Hadoop scales horizontally using commodity, and therefore less expensive, hardware. This means more data can be stored for a lower price than with traditional databases.
The bottom line is that for a businessperson who wants to engage with data, a Hadoop-based system is like a dream come true on several fronts, and now you can help them understand why – in words they’ll understand.