After my “Building Integrated DWH with Oracle and Hadoop” webinar for IOUG Big Data SIG on Tuesday, I got a bunch of excellent follow-up questions. The most frequently asked questions were: “What is the minimum I need to do to get started with Hadoop?” and “How do I load data into Hadoop?”
Since so many people are interested in the same questions, it makes more sense to put my answers on a blog than to copy and paste them to everyone personally. Also, in the grand open source tradition, there are many ways to get started with Hadoop and load data into Hadoop.
Let’s go over a few options for getting started with Hadoop:
Hadoop can run in few different modes.
- Local mode – Here, you can run your map-reduce code over files in your local filesystem. In this mode there is no cluster or HDFS. It’s mostly used to test if your fancy map-reduce jar files will run at all.
- Pseudo-distributed mode – In this mode, you are running all Hadoop processes as you would in a real cluster, but they are all running from a single server. My test “cluster” is a pseudo-distributed Hadoop running in a VM, and for many tests this is enough. For example, I ran all the Hadoop-Oracle integration scenarios in this setup.
- Fully-distributed mode – Here the sky is the limit. All production clusters run in this mode, and they can be huge and complex. But for starters, we are looking at just two or three machines, either VMs or in the cloud and running just one of each of the basic processes. For some tests, you need a fully distributed cluster – for example there is a blog post coming up about configuring HA HDFS, and I couldn’t do this with just a single node.
Getting started in Local Mode is probably easiest. You download a release of Apache Hadoop and install it. Then configure /etc/hadoop/conf/hadoop-env.sh with your JAVA_HOME (You need Java 1.6 to run Hadoop). At this point you can run /bin/hadoop jar <your job> and presto! You are running Hadoop!
This may be good enough for developers, but DBAs usually don’t feel they really experimented with a new data store if they don’t have some place to store the data. Local file system isn’t what we are looking for. So we usually get started with the Pseudo-distributed Mode.
In Pseudo-distributed mode, you will be running all basic Hadoop processes:
- HDFS name node – master process for HDFS. There is always just one of those, and it is responsible for keeping track of the filesystem meta data, such as file names and directories.
- HDFS data node – slave process for HDFS. In a real cluster there are many of those. They are responsible for communicating with the client, storing the data, and replicating it.
- Map-Reduce Job Tracker – master process for Map-Reduce. This process keeps track of job progress, failures, and scheduling.
- Map-Reduce Task-Tracker – slave process for Map-Reduce. Responsible for running specific maps as well as reducing processes and reporting their progress.
- (Optional) Secondary name-node – Offloads checkpointing activity off the Name-node.
To get started with Pseudo-Distributed mode, you have two easy options:
- Install Cloudera’s distribution from a package built for your flavor of Linux. It’s relatively easy if you are familiar with your version of Linux.
- Download Cloudera’s demo VM and run it. Can’t get any easier than that!
I used the first method when I wanted to run Hadoop on a system that already had Oracle installed. It’s easier to install Hadoop on Oracle server than vice versa. The rest of the time, I used the VM for my tests.
Full cluster Mode is required when you run a more serious POC or want to test some of the HA features.
There are no easy ways to get started with a full cluster. Both require a deeper understanding of Hadoop, but they are not very difficult either:
- Take the VM from the Pseudo-distributed step, and run two of it. Make sure they can communicate with each other. Stop all services Hadoop on both VMs. Configure /etc/hadoop/conf/core-site.conf , /etc/hadoop/conf/hdfs-site.conf, and /etc/hadoop/conf/mapred-site.conf with the appropriate IP addresses so the services will be able to find each other. Start name-node and job-tracker on one node and data-node and task-tracker on the other.
- As an alternative, you can create the cluster in the Amazon cloud with Whirr. Note that this can get costly if you forget to take the cluster down like I did last week (ouch!). If you are familiar with Amazon AWS services, you can follow the instructions on the Cloudera website, and if you are a cloud newbie, this blog post offers more handholding.
O.K. In one way or another, you now have Hadoop. Now lets see how we get data on HDFS. I’m going to assume a Pseudo-distributed cluster was used here, since this is what I mostly use.
There are literally endless ways to get data on Hadoop, so let’s review some of my favorites. Note that none of these methods require any special Java or even programming knowledge. They can be used by any simple DBA:
- Copy a file: hadoop fs -put localfile /user/hadoop/hadoopfile
- If you have large number of files (very common in my experience), a shell script that will run multiple “put” commands in parallel will greatly speed up the process. File copying is easy to put in parallel without any need to write fancy MR code.
- You can also have a cron job scanning a directory for new files and “put” them in HDFS as they show up.
- Mount HDFS as a file system and simply copy files or write files there. The instructions for mounting are in my presentation.
- Use Sqoop to get data from a database to Hadoop. This was also covered in my presentation.
- Use Flume to continuously load data from logs into Hadoop. There are some catches there – you want to make sure you get relatively fresh data since you don’t want Flume to do too much buffering, but you also don’t want many small files because that’s bad for HDFS. I’ll probably blog about this in the future.
In my presentation, I mentioned Perl as a method of ETLing data from MySQL to Hadoop. It was mentioned because I use Perl whenever possible, not necessarily because it was recommended. Most of the time, using Sqoop will be a better idea. Pre-processing the data in the DB can also be a good idea, and if you need custom code (like I did) and can program in Java (like I hate doing), you should probably write proper MR code.
In any case, I used an interface called Streaming which allows running any shell command that reads and writes key-value pairs as mappers and reducers. I wrote my own perl scripts to get data from the DB and did some pre-processing before dumping it into Hadoop.
If you have other tips on getting started with Hadoop or for loading data, the comments are all yours! :)
Discover more about our expertise in Hadoop.
I have MAC OS X10.6.8.(64 bit). Looks like the distribution of Cloudera has only Red HAT and CentOS flavors of Linux. Waht is your suggestions to start with Hadoop in this case? Thanks for help