Recently, I’ve been playing with the workflow managers of the Hadoop world. Namely, Azkaban and Oozie.
While Azkaban offers a cute graph-oriented display of your running workflows, it is a little bit limited in the workflow logic department. No conditional branching? No error state? Meh. Lame.
Oozie, on the other hand, has more logic horsepower, but it comes with a certain complexity tax. And the graphical view provided by Hue is not as visual as the one we have in Azkaban.
But while Oozie doesn’t come with the shiny, it does come with a REST API. So, potentially, we have the technology… How hard could it be to build a visual interface to its workflows, the way we like ’em? Well. Let’s see.
Step 1: Get That Workflow
First thing we need to do is to get to the workflow.xml master file (which we’ll assume is already on HDFS), and generate our graph edges off it. As I already tackled the munging of the workflow file in a previous entry, I can gleefully steal the transformation logic from there. All that remains, really, is fetching the file from HDFS.
[perl]
use Net::Hadoop::WebHDFS;
use Path::Tiny;
use Web::Query;
use Data::Printer;

my $hadoop_host    = '192.168.0.203';
my $workspace_root = '/user/hue/oozie/workspaces/managed';

p workflow_to_graph( get_graph_from_hdfs( shift ) );

sub get_graph_from_hdfs {
    my $workflow = shift;

    return Net::Hadoop::WebHDFS->new( host => $hadoop_host )
        ->read( path( $workspace_root, $workflow, 'workflow.xml' ) )
            || die "'workflow.xml' not found";
}
sub workflow_to_graph {
    my $q = Web::Query->new_from_html( shift );

    my %graph;

    $q->find( 'start' )->each(sub{
        push @{ $graph{START} }, $_[1]->attr('to');
    });

    $q->find( 'end' )->each(sub{
        $graph{ $_[1]->attr('name') } = [];
    });

    $q->find('action')->each(sub{
        for my $next (qw/ ok error /) {
            my $next_node = $_[1]->find($next)->attr('to') or next;
            push @{ $graph{ $_[1]->attr('name') } }, $next_node;
        }
    });

    $q->find('fork')->each(sub{
        my $name = $_[1]->attr('name');
        $_[1]->find('path')->each(sub{
            push @{ $graph{$name} }, $_[1]->attr('start');
        });
    });

    $q->find('join')->each(sub{
        push @{ $graph{ $_[1]->attr('name') } }, $_[1]->attr('to');
    });

    $q->find('decision')->each(sub{
        my $name = $_[1]->attr('name');
        $_[1]->find('case,default')->each(sub{
            my $next_node = $_[1]->attr('to') or return;
            # in this inner callback $_[1] is the case/default element,
            # so we use the decision's name captured above
            push @{ $graph{$name} }, $next_node;
        });
    });

    # just make sure all nodes are present as keys
    $graph{$_} ||= [] for map { @$_ } values %graph;

    return \%graph;
}
[/perl]
And with that, we get:
[bash]
$ perl get_graph.pl sleepfork
{
end [],
fork-34 [
[0] "Sleep-1",
[1] "Sleep-5"
],
fork-38 [
[0] "Sleep-3",
[1] "Sleep-4"
],
join-35 [
[0] "end"
],
join-39 [
[0] "join-35"
],
kill [],
Sleep-1 [
[0] "Sleep-10",
[1] "kill"
],
Sleep-3 [
[0] "join-39",
[1] "kill"
],
Sleep-4 [
[0] "join-39",
[1] "kill"
],
Sleep-5 [
[0] "fork-38",
[1] "kill"
],
Sleep-10 [
[0] "join-35",
[1] "kill"
],
START [
[0] "fork-34"
] }
[/bash]
So far, so good.
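Before handing that structure to anything else, a quick sanity check doesn’t hurt: every node should be reachable from START. Here is a minimal sketch in JavaScript, using the sleepfork graph above (the `reachableFrom` helper name is mine, not part of any library):

```javascript
// Adjacency list in the same shape workflow_to_graph produces
// (this is the sleepfork output from above)
var graph = {
    "START":    [ "fork-34" ],
    "fork-34":  [ "Sleep-1", "Sleep-5" ],
    "fork-38":  [ "Sleep-3", "Sleep-4" ],
    "Sleep-1":  [ "Sleep-10", "kill" ],
    "Sleep-3":  [ "join-39", "kill" ],
    "Sleep-4":  [ "join-39", "kill" ],
    "Sleep-5":  [ "fork-38", "kill" ],
    "Sleep-10": [ "join-35", "kill" ],
    "join-39":  [ "join-35" ],
    "join-35":  [ "end" ],
    "end":      [],
    "kill":     []
};

// breadth-first walk from a starting node
function reachableFrom(start, graph) {
    var seen = {}, queue = [ start ];
    while ( queue.length ) {
        var node = queue.shift();
        if ( seen[node] ) continue;
        seen[node] = true;
        ( graph[node] || [] ).forEach(function (next) {
            queue.push(next);
        });
    }
    return Object.keys(seen);
}

var reached = reachableFrom("START", graph);
console.log( reached.length + " of "
    + Object.keys(graph).length + " nodes reachable" );
```

In the sleepfork graph all twelve nodes come back reachable; an orphaned node here would usually point at a typo in the workflow XML.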
Step 2: From Data Structure To The Graph
Now the fun stuff: turning the raw data structure into a purty graph.
For this, I decided to leverage the dagre-d3 javascript library, which has a sane API and produces nice-looking graphs. Since we already have the data structure at hand, all we have to do is to create a CSS stylesheet, drop a placeholder in our HTML page (both not shown here, because very boring — see the final GitHub repo below for the full monty), and generate our graph.
[javascript]
var nodes = {};
var g = new dagreD3.Digraph();

for ( var source in graph ) {
    if ( nodes[source] == null ) {
        g.addNode( source, { label: source } );
    }
    nodes[source] = 1;

    for ( var i = 0; i < graph[source].length; i++ ) {
        var dest = graph[source][i];
        if ( nodes[dest] == null ) {
            g.addNode( dest, { label: dest } );
            nodes[dest] = 1;
        }
        g.addEdge( null, source, dest );
    }
}

var renderer = new dagreD3.Renderer();

// give an 'id' to all nodes
var oldDrawNode = renderer.drawNode();
renderer.drawNode(function(graph, u, svg) {
    oldDrawNode(graph, u, svg);
    svg.attr("id", "node-" + u);
});

renderer.run(g, d3.select("svg g"));
[/javascript]

And with that, we can see!
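Stripped of the dagre-d3 calls, that loop is really just deduplicating nodes and collecting edges. The same logic as a standalone sketch (the `buildGraph` name is my own, for illustration):

```javascript
// Collect unique nodes and (source, dest) edge pairs from an
// adjacency list shaped like workflow_to_graph's output
function buildGraph(adjacency) {
    var seen = {}, nodes = [], edges = [];
    for ( var source in adjacency ) {
        if ( !seen[source] ) { seen[source] = true; nodes.push(source); }
        adjacency[source].forEach(function (dest) {
            if ( !seen[dest] ) { seen[dest] = true; nodes.push(dest); }
            edges.push([ source, dest ]);
        });
    }
    return { nodes: nodes, edges: edges };
}

var g = buildGraph({ "START":   [ "fork-34" ],
                     "fork-34": [ "Sleep-1", "Sleep-5" ] });
console.log( g.nodes.length + " nodes, " + g.edges.length + " edges" );
// 4 nodes, 3 edges
```

Each entry of `nodes` then becomes an `addNode` call, and each pair in `edges` an `addEdge` call, whatever rendering library sits on the other end.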
Step 3: Launch And Monitor
So we have a static view of a workflow. Let’s give it life. First, we need to launch the job. There is currently no Hadoop::Oozie::REST-like module on CPAN (a terrible hole I intend to fill at some point), but that’s okay, REST::Client will do in a pinch:
[perl]
my $client = REST::Client->new;

my $host = config->{hadoop_host};
my $path = config->{workspace_root} . '/' . $workflow . '/';

$client->setHost( 'https://' . $host . ':11000' );
$client->addHeader( 'Content-Type' => 'application/xml;charset=UTF-8' );
$client->POST( '/oozie/v1/jobs?action=start', <<"END"
<?xml version="1.0" encoding="UTF-8"?>
<configuration>
    <property>
        <name>user.name</name>
        <value>hue</value>
    </property>
    <property>
        <name>oozie.wf.application.path</name>
        <value>hdfs://$path</value>
    </property>
</configuration>
END
);

print $client->responseContent;
[/perl]
Monitoring it isn’t going to be much harder. All we need to do is to query the mothership for updates,
[perl]
my $client = REST::Client->new;
$client->setHost( 'https://' . config->{hadoop_host} . ':11000' );

$client->GET( '/oozie/v1/job/' . param('id') . '?show=info' );

print $client->responseContent;
[/perl]
and then use those updates to refresh the graph with colors that illustrate the different states of the nodes,
[javascript]
var state_color = {
    "OK":     "green",
    "PREP":   "blue",
    "FAILED": "red",
    "KILLED": "pink",
    "DONE":   "lightgreen"
};

function update() {
    $.get( '/job/' + job_id ).done(function(data){
        data = JSON.parse(data);

        for ( var i = 0; i < data.actions.length; i++ ) {
            var action = data.actions[i];
            console.log( action["name"] + " : " + action["status"] );

            // Oozie reports the start node as ':start:'
            if ( action["name"] == ':start:' ) {
                action["name"] = 'START';
            }

            $('#node-' + action["name"])
                .attr('fill', state_color[action["status"]]);
        }

        // update every 2 seconds
        setTimeout( update, 2000 );
    });
}
[/javascript]
And that’s pretty much it. We can now pull all those parts into a small Dancer application, and we have a very minimal workflow launcher and visualizer:
Comments
Nice… but you do realize that Hue already supports workflow visualization?
I do realize that. :-) But:
A. The visualization is a bit meh.
B. I wanted an excuse to play with the REST service and the graph library. :-)
Also, Hue is not supported on MapR Hadoop, so this may be the best option for visualization, or the only one!