I was peacefully minding my own business on Friday when suddenly the heavens parted and, lo and behold, the archangel Gabriel, all wings and flaming sword of God, descended unto the Earth right in front of me. He said in a voice that sounded like a thousand mating bullfrogs: “Yo, dude: Hadoop. Learn it.”
“Uh, sure,” was my answer. “Any particular reason why?”
“Final Battle’s coming in. Humanity’ll need you,” was the terse answer. And then he vanished in a blast of prismatic colors and cherubic high notes.
So, there you have it in exclusivity. Armageddon’s officially coming, and it’s safe to assume that it won’t be exactly as described in the Book of Revelations.
Oh, yeah, and I’m now playing around with Hadoop.
What’s a Doop?
Hadoop is a framework to deal with the processing of massive amounts of data. And by massive, I really mean more than just big. I mean leviathanesque in scope, the kind of humongous size that can only be put in context by a Yo mama joke.
As far as I can make out, Hadoop's foundation rests on not one, but two special sauces. The first is the Hadoop Distributed File System (HDFS), which provides high-throughput access to the data; the second is Hadoop MapReduce, the software framework that does the actual processing of those masses of data. After a quick first peek at it, the latter feels to me like Perl's map function re-imagined by an engineer of the Death Star. Sounds promising to me.
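To make that analogy a bit more concrete, here's a toy, single-process illustration of the general shape of a MapReduce job (word counting, which, as we'll see below, is the field's canonical Hello World). This is just my own sketch of the model; there is nothing Hadoop-specific in it:
[perl]#!/usr/bin/env perl

use strict;
use warnings;

my @lines = ( 'the quick brown fox', 'the lazy dog' );

# "map" phase: turn every input record into ( key => value ) pairs
my @pairs = map {
    my $line = $_;
    map { [ $_ => 1 ] } split ' ', $line;
} @lines;

# "reduce" phase: gather the pairs by key and combine their values
my %count;
$count{ $_->[0] } += $_->[1] for @pairs;

print "$_: $count{$_}\n" for sort keys %count;
[/perl]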
First Poke: Hadoop’s Streaming Interface
Before I dig into the mechanics under the hood of the Hadoop beastie (which is the part, I assume, that is going to be heady as hell), I thought it would be a good idea to play a little bit with some of its applications to give me a feel for the lay of the land.
The first thing I wanted to try was the streaming interface of MapReduce. Basically, it's an API that allows Hadoop to use any script/program to perform the mapping/combining/reducing parts of the job. Yes, any program, in any language. I know what you're thinking, and I'm in fact one step ahead of you: I found Hadoop::Streaming, which is going to make the exercise one sweet, smooth walk in the park.
The classic Hello World application for MapReduce seems to be counting instances of words in a document. It's nice, but let's try to spruce it up an (itsy-bitsy) notch. How about figuring out in which blog entry I first mentioned each module?
The first, non-Hadoop step we need to take is to stuff all the blog entries into a single document, each line tagged with its filename and creation date. That ain't too hard to do:
[perl]#!/usr/bin/env perl

use 5.10.0;

use strict;
use warnings;

use File::Find::Rule;
use Path::Class qw/ file /;

File::Find::Rule
    ->file()
    ->name( 'entry', '*.pl', '*.pm' )
    ->exec( sub {
        my $file = $_[2];

        # prefix every line of the entry with its path and its
        # age in days (that's what -M gives us)
        print join "\t", $file, -M $file, $_
            for file($file)->slurp;
    } )
    ->in( '/home/yanick/work/blog_entries' );
[/perl]
With that, we can create our master file and drop it in Hadoop’s cavernous file system:
[bash]$ ./gather_entries.pl > all_blog_entries
$ hadoop fs -put all_blog_entries all_blog_entries[/bash]
Now, we'll need two scripts for our data munging. A mapper.pl, to extract all module name / filename and creation time pairs:
[perl]#!/usr/bin/env perl

package ModuleCounter::Mapper;

use strict;
use warnings;

use Moose;
use Method::Signatures;

with 'Hadoop::Streaming::Mapper';

method map($line) {
    my ( $file, $creation, $content ) = split "\t", $line;

    # emit one ( module => "age filename" ) pair per module-looking
    # string found in the line; the regex is not perfect, but for
    # the example, it'll do
    $self->emit( $& => join " ", $creation, $file )
        while $content =~ /\w+(::\w+)+/g;
}

__PACKAGE__->meta->make_immutable;

__PACKAGE__->run;

1;
[/perl]
And a reducer.pl, to produce the earliest blog entry for each module:
[perl]#!/usr/bin/env perl

package ModuleCounter::Reducer;

use strict;
use warnings;

use Moose;
use Method::Signatures;

with 'Hadoop::Streaming::Reducer';

method reduce($key, $it) {
    # -M gives an age in days, so the biggest value is the oldest entry
    my( $oldest_time, $file ) = ( 0, '' );

    while( $it->has_next ) {
        my @e = split ' ', $it->next;
        ($oldest_time, $file) = @e if $e[0] > $oldest_time;
    }

    $self->emit( $key => "$file ($oldest_time)" );
}

__PACKAGE__->meta->make_immutable;

__PACKAGE__->run;

1;
[/perl]
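Incidentally, since those streaming scripts do nothing more than read lines from STDIN and print tab-separated key/value lines to STDOUT, we can sanity-check the pair locally, without Hadoop at all (assuming both scripts are executable), with something along the lines of:
[bash]$ cat all_blog_entries | ./mapper.pl | sort | ./reducer.pl | head[/bash]
The sort in the middle plays the part of Hadoop's shuffle/sort phase, which is what guarantees the reducer sees all the values of a given key together.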
With that, we are ready to unleash the might of Hadoop:
[bash]$ hadoop jar /path/to/hadoop-streaming-1.0.3.jar \
-D mapred.job.name="Module Mentions" \
-input all_blog_entries \
-output oldest_blog_entries \
-mapper `pwd`/mapper.pl \
-reducer `pwd`/reducer.pl
packageJobJar: [/tmp/hadoop-root/hadoop-unjar1363307178676659245/] [] /tmp/streamjob451754843805661903.jar tmpDir=null
12/07/08 20:23:36 INFO util.NativeCodeLoader: Loaded the native-hadoop library
12/07/08 20:23:36 WARN snappy.LoadSnappy: Snappy native library not loaded
12/07/08 20:23:36 INFO mapred.FileInputFormat: Total input paths to process : 1
12/07/08 20:23:36 INFO streaming.StreamJob: getLocalDirs(): [/tmp/hadoop-root/mapred/local]
12/07/08 20:23:36 INFO streaming.StreamJob: Running job: job_201207071321_0014
12/07/08 20:23:36 INFO streaming.StreamJob: To kill this job, run:
12/07/08 20:23:36 INFO streaming.StreamJob: /usr/libexec/../bin/hadoop job -Dmapred.job.tracker=localhost:9001 -kill job_201207071321_0014
12/07/08 20:23:36 INFO streaming.StreamJob: Tracking URL: https://localhost:50030/jobdetails.jsp?jobid=job_201207071321_0014
12/07/08 20:23:37 INFO streaming.StreamJob: map 0% reduce 0%
12/07/08 20:23:51 INFO streaming.StreamJob: map 100% reduce 0%
12/07/08 20:24:03 INFO streaming.StreamJob: map 100% reduce 100%
12/07/08 20:24:09 INFO streaming.StreamJob: Job complete: job_201207071321_0014
12/07/08 20:24:09 INFO streaming.StreamJob: Output: oldest_blog_entries
$ hadoop fs -cat oldest_blog_entries/part-00000 | head
A::B template-declare-import/lib/A/B.pm (322.057893518519)
Acme::Bleach pythian/dpanneur-your-friendly-darkpancpan-proxy-corner-store/entry (654.046655092593)
Acme::Buffy use.perl.org/31870.d/entry (574.224444444444)
Acme::Dahut use.perl.org/dahuts/entry (575.121782407407)
Acme::EyeDrops pythian/dpanneur-your-friendly-darkpancpan-proxy-corner-store/entry (654.046655092593)
Algorithm::Combinatorics nocoug-2012-sane/files/party_solver.pl (27.0165972222222)
Apache::AxKit::Language::XPathScript shuck-and-awe/issue-11/entry (654.046643518519)
Apache::XPointer::XPath shuck-and-awe/issue-11/entry (654.046643518519)
[/bash]
Yeah, looks about right. And, just like that, we now have our first application using Hadoop’s MapReduce. Wasn’t too painful, now, was it?
Graduating to the Real Good Stuff
Edit: As some people pointed out in the comments, I’m full of hooey. Cassandra is not built on top of Hadoop but is a different beast altogether. I was led astray by the Hadoop site listing Cassandra as one of the Hadoop-related projects, which erroneously made me assume that they were, err, related instead of simply acquainted. Oh well. It’s not like I’m not used to having wiggling toes tickling my uvula, and I wanted to play with everything dealing with the Hadoop ecosystem anyway, so it’s all good.
One of the yummy things about Hadoop is the collection of applications already built on top of it. For example, there is Cassandra, a scalable multi-master database with no single point of failure. (Translation: Even after a thermonuclear war, your data will probably still linger around, no doubt to the everlasting glee of the lone surviving mutant cockroaches.)
There are a few Cassandra modules on CPAN. But to start easy, I went with Cassandra::Lite, which makes the basics very easy indeed:
[perl]#!/usr/bin/env perl

use strict;
use warnings;

use Cassandra::Lite;

my $c = Cassandra::Lite->new( keyspace => 'galuga' );

$c->put( 'blog_entries' => 'hadoop' => {
    title => 'First Foray in Hadoop territory',
    body  => '...',
} );

# .. then, later on

my $entry = $c->get( 'blog_entries' => 'hadoop' );
[/perl]
Now, if you remember my post of a few weeks back where I used SQLite as a NoSQL database, you might realize that altering that code to use Cassandra instead of SQLite should be fairly easy. With just an itsy-bitsy little bit of work, I could web-scale my blog writing.
Which, okay, might be a tad overkill.
But still. I could.
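Just to give a rough idea of what I mean (and this is a made-up sketch, not the actual code from that older post; the save_entry() and load_entry() helpers and the 'blog_entries' column family are names I'm inventing for illustration), the storage layer would boil down to something like:
[perl]use strict;
use warnings;

use Cassandra::Lite;

my $cassandra = Cassandra::Lite->new( keyspace => 'galuga' );

# save (or overwrite) an entry, keyed by its url
sub save_entry {
    my ( $url, $title, $body ) = @_;

    $cassandra->put( 'blog_entries' => $url => {
        title => $title,
        body  => $body,
    } );
}

# ... and fetch it back later
sub load_entry {
    my $url = shift;

    return $cassandra->get( 'blog_entries' => $url );
}
[/perl]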
But first, I have other applications to try…
To be continued
2 Comments
Perhaps I misunderstood what you meant, but… Cassandra isn’t actually built on top of Hadoop. There is some integration between the two products, but they are completely different beasts.
Hadoop is for massively parallel processing, crunching huge amounts of data in batch mode, while Cassandra is for highly scalable transactional workloads and real-time data analytics. I’m oversimplifying a bit, but that’s the essence of it.
Ooops. My bad. Seeing Cassandra on the main Hadoop page under the category “Other Hadoop-related projects”, I made the wrong assumption. I stand corrected.