We don’t pay enough attention to Hadoop.
By “we” I mean DBAs; the rest of the world is paying plenty of attention to Hadoop. Recently, I started asking my customers and fellow DBAs about Hadoop adoption in their companies. It turns out that many of them have Hadoop. Hadoop shows up in large companies and small ones, in established industries and in startups. It’s everywhere.
The way Hadoop shows up in all companies, and the way DBAs don’t pay Hadoop much attention, reminds me a lot of how MySQL started showing up in the enterprise. It didn’t start by DBAs showing up one morning and telling their managers:
“There’s this new open source database. It’s not as stable as Oracle and it doesn’t have all the features we need, but man – it’s going to save us tons of money, and it’s pretty simple to manage.”
Nope, this never happened. What happened instead is that developers learned about MySQL, and it seemed to them like an excellent way to go around this whole DBA thing. They could install it themselves, learn how to use it in a week and become happy and productive. Without ever having to discuss their schema, data model, requirements, capacity planning, availability, backups and all the other things that DBAs want to talk about.
By the time the application came out of development and had to be deployed in production, MySQL was a done deal. No one is going to re-write the app just because the DBAs don’t know MySQL. Sometimes the Oracle DBAs were forced to learn and admin MySQL, but more often it was considered “not a database” and left for the sysadmins to manage, while the DBAs continued to pretend that the entire world is written by Oracle.
So that’s what Hadoop adoption looks like now: it’s usually introduced by the developers and administered by sysadmins, while DBAs continue to pretend it doesn’t exist or doesn’t matter. When pressed, some DBAs will even insist that all this “big data” thing can and should be done in a database, but the developers are too ignorant or lazy to work with a proper RDBMS.
I think the day has arrived when, just as DBAs can no longer ignore MySQL, we can no longer ignore Hadoop either. So let’s talk about it.
What is Hadoop?
First, Hadoop is not a database. It’s infrastructure, almost an operating system.
Hadoop was developed to ease the management of “big data”. In this context big data is too much data to fit on the hard drive of a single machine, so the data and the analysis of the data have to be distributed over a large cluster.
The idea is that you install Hadoop on a large number of normal servers, use their hard drives to store the data, so the data is distributed across many separate machines and disks, and then use their CPUs to process the data. It’s a shared-nothing architecture built with commodity hardware.
Hadoop consists of two parts: a distributed file system (HDFS) and a programming model with a job scheduling system (Map Reduce). Hadoop’s file system is different from your mother’s file system in two important aspects:
- It was built to support large files, so the default block size is 64M. This makes the disk seek time a small percentage of the time it takes to retrieve the data. You can store smaller files in HDFS, but you can’t store too many small files: one of the HDFS servers has to keep the entire file list in memory, and too many files will result in this server running out of memory. You can configure a larger block size, but you need to be careful, because data is processed in blocks. If you have fewer blocks than processing machines, you use fewer CPUs than you could to process the data and miss out on some of the performance benefits.
- Each block is replicated on several servers, so if any single server fails, the data is not lost and processing can continue. You can configure the number of servers each block is replicated on.
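To make the block-size and replication trade-offs a bit more concrete, here is a rough back-of-the-envelope sketch in Python. Nothing here is a Hadoop API: the function names, the 1GB file, the 10 worker machines and the replication factor of 3 are all assumptions I picked for illustration. Only the 64M default block size comes from the description above.

```python
MB = 1024 * 1024

def block_count(file_size_bytes, block_size_bytes=64 * MB):
    """Number of blocks needed to hold one file (ceiling division)."""
    return -(-file_size_bytes // block_size_bytes)

def max_parallel_tasks(file_size_bytes, block_size_bytes, workers):
    """Data is processed block by block, so the number of blocks
    caps how many worker machines can be kept busy at once."""
    return min(block_count(file_size_bytes, block_size_bytes), workers)

def raw_storage_bytes(file_size_bytes, replication):
    """Disk space used across the cluster once every block is replicated."""
    return file_size_bytes * replication

file_size = 1024 * MB   # assumed: a 1GB input file
workers = 10            # assumed: 10 processing machines

for block_mb in (64, 128, 512):
    blocks = block_count(file_size, block_mb * MB)
    busy = max_parallel_tasks(file_size, block_mb * MB, workers)
    print(f"block size {block_mb}M: {blocks} blocks, {busy} of {workers} workers busy")

# assumed replication factor of 3 for illustration
print(raw_storage_bytes(file_size, replication=3) // MB, "MB of raw disk used")
```

With these assumed numbers, the 512M block size leaves most of the machines idle for this file, which is exactly the lost performance benefit described in the first bullet.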
Map-Reduce is a parallel job-processing framework.
Each map-reduce job splits the data into independent chunks (usually block-sized), and each chunk is processed by a map task in parallel with all other tasks. Map tasks usually do independent transformations and filtering of the data. The output of the map tasks is the input of reduce tasks: reduce tasks aggregate the data and generate the final output. The results of the tasks are stored in the HDFS filesystem, and the map-reduce framework keeps track of all the jobs.
This includes placing the tasks on the server that contains the data each task processes, to reduce network utilization, and tracking the tasks so if a task hangs or stalls on one server, it can be started on an additional server to speed up processing.
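If the map and reduce phases still feel abstract, the following toy sketch simulates them on a single machine in plain Python. It is not Hadoop code and uses no Hadoop API: the function names and the sample chunks are made up for illustration, and the explicit grouping step between the two phases is something the real framework does for you.

```python
from collections import defaultdict

def map_task(chunk):
    """Map: transform one chunk independently of all other chunks.
    Here every word becomes a (word, 1) pair."""
    pairs = []
    for line in chunk:
        for word in line.lower().split():
            pairs.append((word, 1))
    return pairs

def shuffle(mapped_outputs):
    """Group the intermediate values by key between the two phases."""
    groups = defaultdict(list)
    for pairs in mapped_outputs:
        for key, value in pairs:
            groups[key].append(value)
    return groups

def reduce_task(key, values):
    """Reduce: aggregate all the values collected for one key."""
    return key, sum(values)

def run_job(chunks):
    # On a real cluster each map task would run in parallel, near its data.
    mapped = [map_task(chunk) for chunk in chunks]
    grouped = shuffle(mapped)
    return dict(reduce_task(key, values) for key, values in grouped.items())

# Two made-up chunks standing in for two blocks of one file.
chunks = [
    ["hadoop is not a database", "hadoop is infrastructure"],
    ["a database is not hadoop"],
]
print(run_job(chunks))
```

The point of the structure is that `map_task` never looks outside its own chunk, which is what lets the framework run many of them at once and restart one elsewhere if it stalls.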
Using Hadoop consists of loading data files into the HDFS filesystem, and then writing map-reduce jobs to analyse the data. One of the major drawbacks of using Hadoop is that Map-Reduce, while it makes developing distributed software easier, is still much more difficult to use than SQL. There are tools that make developing ad-hoc queries for Hadoop easier. I describe a few of them below.
Where would I want to use Hadoop?
Hadoop was developed for analyses that load data once, rarely modify it, and run batch operations that scan most of the data set. Note that this is very different from normal use of a database where traditionally we want to access a small fraction of the data (using indexes to locate precisely the data we want) and to constantly modify the data.
By far the most common Hadoop use-case is ETL: transforming various logs collected throughout the organization into information that can be added to the corporate data-warehouse and analysed by traditional BI tools.
This can mean mining web server logs to discover usage patterns of websites and web applications, or an ISP analysing mail server logs to find the location of users and decide which locations require additional mail servers. Another common use is analysing load and failure data to improve internal IT operations.
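As a small illustration of the kind of transformation involved, here is a single-machine Python sketch that turns raw web server log lines into rows a data-warehouse could load. The log lines are invented sample data, the regular expression assumes the common access-log layout, and the output columns (`path`, `hits`) are my own choice; on a real cluster this logic would live inside map and reduce tasks rather than one script.

```python
import csv
import io
import re
from collections import Counter

# Assumed layout: host ident user [time] "METHOD path protocol" status size
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) (?P<size>\S+)'
)

def parse_line(line):
    """Extract the fields of one log line, or None if it does not match."""
    match = LOG_PATTERN.match(line)
    return match.groupdict() if match else None

def hits_per_path(lines):
    """Transform step: count requests per URL path, skipping bad lines."""
    counts = Counter()
    for line in lines:
        record = parse_line(line)
        if record is not None:
            counts[record["path"]] += 1
    return counts

def to_csv(counts):
    """Load step stand-in: render the counts as CSV text, sorted by path."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["path", "hits"])
    for path, hits in sorted(counts.items()):
        writer.writerow([path, hits])
    return buffer.getvalue()

# Invented sample data for illustration only.
sample = [
    '10.0.0.1 - - [10/Oct/2011:13:55:36 -0700] "GET /index.html HTTP/1.0" 200 2326',
    '10.0.0.2 - - [10/Oct/2011:13:56:01 -0700] "GET /index.html HTTP/1.0" 200 2326',
    '10.0.0.1 - - [10/Oct/2011:13:57:12 -0700] "GET /cart HTTP/1.0" 500 512',
    'this line is malformed and is skipped',
]
print(to_csv(hits_per_path(sample)))
```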
There are other exciting use cases for Hadoop:
- The New York Times famously used a 100-server Hadoop cluster hosted by Amazon to transform 4TB of old images into 11 million PDF files. They did it in 24 hours and for a total cost of $240.
- Yahoo uses Hadoop to create web indexes and power its search engine.
- Autodesk uses Hadoop to track the most popular products in product catalogs and sell this information back to their customers.
- eBay uses Hadoop to optimize its product search.
- AOL-Advertising uses Hadoop to optimize its ad-placement. Facebook are doing the same. Facebook are also using Hadoop to mine user behavior data and use this information to make product marketing decisions.
The list goes on and on. Almost every company has a lot of data, a lot of it outside the relational database. Almost every company can optimize its business operations or even drive completely new products by analysing and mining this data. Hadoop is a tool to mine large amounts of non-relational data.
Our business analysts won’t write map-reduce jobs
There are two solutions to this problem and most companies use both:
- Load the results of Hadoop processing into the data-warehouse that is already in use by the business analysts (usually through their BI tools). This can be done only when there are definite requirements on how the data will be used.
- Use tools such as HBase, Hive or HUE as a front-end to Hadoop. Hive, for example, provides a language similar to SQL that will be more familiar to business analysts and will allow them to learn how to use Hadoop for ad-hoc queries. In addition, Pentaho has a BI product that can integrate directly with Hadoop.
Who is offering support and products for Hadoop?
This is definitely an area that showed large and unexpected growth in the last year. As more large companies adopt Hadoop, more vendors rush to support it, and as enterprise support for Hadoop grows, more companies are ready to adopt it. This growth spiral was very exciting to watch.
Hadoop is an open-source product. If you need support, training and all kinds of enterprise services, you’ll need to find a company to support you.
The most well known company in this space is Cloudera, who deserve tons of credit for making Hadoop what it is today. The founders of Cloudera are the early Hadoop developers from Yahoo, so they definitely have the technical chops to support it. They also hired a top-notch training team from MySQL after the Oracle acquisition.
While Cloudera sells Hadoop professional services, it appears that they do not sell 24/7 production support services or integration services. I’ve heard from my customers that even a small 24-node cluster requires a full time employee to support it. Figuring out the fastest way to load terabytes of data into HDFS also remains the problem of the developing teams.
In addition to support and services, Cloudera also sell their own Hadoop distribution with some enterprise-ready extensions such as a management suite.
EMC, the storage giant, created their own Hadoop distribution, which they support. It is called “Greenplum HD Enterprise Edition”. Their distribution includes snapshots, WAN replication and cluster management capabilities.
EMC also have a Hadoop data appliance that is claimed to run Greenplum database and Hadoop in the same device, all with hardware optimized for Hadoop processing and a unified interface of some kind. It sounds nice. I’m still waiting to run into one of those “in the wild”.
The device was announced in May, and I have kind of expected Oracle to announce their own Hadoop Exadata ever since. The story of unified structured and unstructured data in the same device sounds like something that Oracle won’t be able to ignore. Maybe this year at OpenWorld?
Netapp announced their own Hadoopler around the same time that EMC announced their device. The Hadoopler is not a complete Hadoop stack; it’s just high performance storage running HDFS. It’s not a NAS/SAN system: the computation nodes (which Netapp does not provide) are expected to connect directly to the disks on Netapp shelves. This entire thing is based on the Netapp E-series (AKA Engenio). It is supposed to improve disk-failure recovery and high availability of HDFS.
Netapp has a partnership with a Cloudera competitor, Hortonworks, to provide Hadoop support.
May was a busy month indeed, because at the same time IBM announced its own Hadoop distribution, “InfoSphere BigInsights”, which is once again Hadoop with enterprise features. It seems to be software-only. Support will be provided by IBM.
So, IBM, EMC and Netapp are joining the Hadoop fun. You’d better believe that this is not just a toy for web startups, but a tool that is expected to have significant use in the enterprise.