Today we’re having a quick one.
Earlier during the day, I had to peruse an Oozie log for the first time. And it looked like:
2014-02-11 20:13:14,211 INFO ActionStartXCommand:539 - USER[running_user] GROUP[-] TOKEN[] APP[some-big-job-workflow] JOB[0004636-140111040403753-oozie-W] ACTION[[email protected]:start:] Start action [[email protected]:start:] with user-retry state : userRetryCount [0], userRetryMax [0], userRetryInterval [10]
2014-02-11 20:13:14,212 WARN ActionStartXCommand:542 - USER[running_user] GROUP[-] TOKEN[] APP[some-big-job-workflow] JOB[0004636-140111040403753-oozie-W] ACTION[[email protected]:start:] [***[email protected]:start:***]Action status=DONE
2014-02-11 20:13:14,212 WARN ActionStartXCommand:542 - USER[running_user] GROUP[-] TOKEN[] APP[some-big-job-workflow] JOB[0004636-140111040403753-oozie-W] ACTION[[email protected]:start:] [***[email protected]:start:***]Action updated in DB!
2014-02-11 20:13:14,271 INFO ActionStartXCommand:539 - USER[running_user] GROUP[-] TOKEN[] APP[some-big-job-workflow] JOB[0004636-140111040403753-oozie-W] ACTION[[email protected]] Start action [[email protected]] with user-retry state : userRetryCount [0], userRetryMax [0], userRetryInterval [10]
2014-02-11 20:13:15,079 WARN HiveActionExecutor:542 - USER[running_user] GROUP[-] TOKEN[] APP[some-big-job-workflow] JOB[0004636-140111040403753-oozie-W] ACTION[[email protected]] credentials is null for the action
2014-02-11 20:13:18,306 INFO HiveActionExecutor:539 - USER[running_user] GROUP[-] TOKEN[] APP[some-big-job-workflow] JOB[0004636-140111040403753-oozie-W] ACTION[0004636-140111040403[email protected]] checking action, external ID [job_201401070500_217582] status [RUNNING]
2014-02-11 20:13:18,408 WARN ActionStartXCommand:542 - USER[running_user] GROUP[-] TOKEN[] APP[some-big-job-workflow] JOB[0004636-140111040403753-oozie-W] ACTION[[email protected]] [***[email protected]***]Action status=RUNNING
2014-02-11 20:13:18,409 WARN ActionStartXCommand:542 - USER[running_user] GROUP[-] TOKEN[] APP[some-big-job-workflow] JOB[0004636-140111040403753-oozie-W] ACTION[[email protected]] [***[email protected]***]Action updated in DB!
2014-02-11 20:13:34,367 INFO CallbackServlet:539 - USER[-] GROUP[-] TOKEN[-] APP[-] JOB[0004636-140111040403753-oozie-W] ACTION[[email protected]] callback for action [[email protected]]
2014-02-11 20:13:34,424 INFO HiveActionExecutor:539 - USER[running_user] GROUP[-] TOKEN[] APP[some-big-job-workflow] JOB[0004636-140111040403753-oozie-W] ACTION[[email protected]] action completed, external ID [job_201401070500_217582]
2014-02-11 20:13:34,443 INFO HiveActionExecutor:539 - USER[running_user] GROUP[-] TOKEN[] APP[some-big-job-workflow] JOB[0004636-140111040403753-oozie-W] ACTION[[email protected]] action produced output
2014-02-11 20:13:34,653 INFO ActionStartXCommand:539 - USER[running_user] GROUP[-] TOKEN[] APP[some-big-job-workflow] JOB[0004636-140111040403753-oozie-W] ACTION[[email protected]] Start action [[email protected]] with user-retry state : userRetryCount [0], userRetryMax [0], userRetryInterval [10]
2014-02-11 20:13:35,418 WARN HiveActionExecutor:542 - USER[running_user] GROUP[-] TOKEN[] APP[some-big-job-workflow] JOB[0004636-140111040403753-oozie-W] ACTION[[email protected]] credentials is null for the action
2014-02-11 20:13:38,628 INFO HiveActionExecutor:539 - USER[running_user] GROUP[-] TOKEN[] APP[some-big-job-workflow] JOB[0004636-140111040403753-oozie-W] ACTION[[email protected]] checking action, external ID [job_201401070500_217583] status [RUNNING]
2014-02-11 20:13:38,731 WARN ActionStartXCommand:542 - USER[running_user] GROUP[-] TOKEN[] APP[some-big-job-workflow] JOB[0004636-140111040403753-oozie-W] ACTION[[email protected]] [***[email protected]***]Action status=RUNNING
2014-02-11 20:13:38,731 WARN ActionStartXCommand:542 - USER[running_user] GROUP[-] TOKEN[] APP[some-big-job-workflow] JOB[0004636-140111040403753-oozie-W] ACTION[[email protected]] [***[email protected]***]Action updated in DB!
2014-02-11 20:13:57,659 INFO CallbackServlet:539 - USER[-] GROUP[-] TOKEN[-] APP[-] JOB[0004636-140111040403753-oozie-W] ACTION[[email protected]] callback for action [[email protected]]
2014-02-11 20:13:57,712 INFO HiveActionExecutor:539 - USER[running_user] GROUP[-] TOKEN[] APP[some-big-job-workflow] JOB[0004636-140111040403753-oozie-W] ACTION[[email protected]] action completed, external ID [job_201401070500_217583]
2014-02-11 20:13:57,729 WARN HiveActionExecutor:542 - USER[running_user] GROUP[-] TOKEN[] APP[some-big-job-workflow] JOB[0004636-140111040403753-oozie-W] ACTION[[email protected]] Launcher ERROR, reason: Main class [org.apache.oozie.action.hadoop.HiveMain], exit code [10044]
2014-02-11 20:13:57,895 INFO ActionEndXCommand:539 - USER[running_user] GROUP[-] TOKEN[] APP[some-big-job-workflow] JOB[0004636-140111040403753-oozie-W] ACTION[[email protected]] ERROR is considered as FAILED for SLA
2014-02-11 20:13:57,964 INFO ActionStartXCommand:539 - USER[running_user] GROUP[-] TOKEN[] APP[some-big-job-workflow] JOB[0004636-140111040403753-oozie-W] ACTION[0004636-140111[email protected]] Start action [[email protected]] with user-retry state : userRetryCount [0], userRetryMax [0], userRetryInterval [10]
2014-02-11 20:13:57,965 WARN ActionStartXCommand:542 - USER[running_user] GROUP[-] TOKEN[] APP[some-big-job-workflow] JOB[0004636-140111040403753-oozie-W] ACTION[[email protected]] [***[email protected]***]Action status=DONE
2014-02-11 20:13:57,965 WARN ActionStartXCommand:542 - USER[running_user] GROUP[-] TOKEN[] APP[some-big-job-workflow] JOB[0004636-140111040403753-oozie-W] ACTION[[email protected]] [***[email protected]***]Action updated in DB!
2014-02-11 20:13:58,036 WARN CoordActionUpdateXCommand:542 - USER[running_user] GROUP[-] TOKEN[] APP[some-big-job-workflow] JOB[0004636-140111040403753-oozie-W] ACTION[-] E1100: Command precondition does not hold before execution, [, coord action is null], Error Code: E1100
Finished: FAILURE
wut?
Okay, Java peeps have a predilection for verbose logs, but surely there is a way to make the whole thing a little more readable for poor, poor human eyes… So I quickly hacked the following:
use 5.10.0;
use strict;
my @lines = map { parse_line($_) } <>;
# the job id is big, and doesn't give us much, remove
for my $useless ( map { $_->{JOB} } @lines ) {
for ( @lines ) {
$_->{msg} =~ s/\Q$useless//g;
$_->{ACTION} =~ s/^\Q$useless//;
}
}
my %previous;
for my $l ( @lines ) {
# we'll only print metadata that changed
my @changes = grep { $l->{$_} ne $previous{$_} }
qw/ USER GROUP TOKEN APP JOB ACTION /;
say join ' ', map { $_ . "[" . $l->{$_} . "] " } @changes if @changes;
say "\t", $l->{time}, " ", $l->{msg};
%previous = %$l;
}
sub parse_line {
my $line = shift;
# try to parse the line as a typical log line
my( $time, $info ) = $line =~
/^\d{4}-\d{2}-\d{2}\s* # the date. Don't care
(\d+:\d\d:\d\d) # the time, More interesting
,\d+\s*.*?- # log level and stuff. Meh
(.*) # the message itself
/x
or return ();
my %data = ( time => $time );
# capture some repeated metadata
for my $k ( qw/ USER GROUP TOKEN APP JOB ACTION / ) {
$data{$k} = $1 if $info =~ s/$k\[(.*?)\]\s*//;
}
# useless and long, scrap it
$info =~ s/\[\*{3}.*?\*{3}\]//;
$data{msg} = $info;
return \%data;
}
And there we go, a log trace that is a mite easier on the eyes…
$ perl filter oozie_mess.log
USER[running_user] GROUP[-] APP[some-big-job-workflow] JOB[0004636-140111040403753-oozie-W] ACTION[@:start:]
20:13:14 Start action [@:start:] with user-retry state : userRetryCount [0], userRetryMax [0], userRetryInterval [10]
20:13:14 Action status=DONE
20:13:14 Action updated in DB!
ACTION[@a-first-action]
20:13:14 Start action [@a-first-action] with user-retry state : userRetryCount [0], userRetryMax [0], userRetryInterval [10]
20:13:15 credentials is null for the action
20:13:18 checking action, external ID [job_201401070500_217582] status [RUNNING]
20:13:18 Action status=RUNNING
20:13:18 Action updated in DB!
USER[-] TOKEN[-] APP[-]
20:13:34 callback for action [@a-first-action]
USER[running_user] TOKEN[] APP[some-big-job-workflow]
20:13:34 action completed, external ID [job_201401070500_217582]
20:13:34 action produced output
ACTION[@some-action]
20:13:34 Start action [@some-action] with user-retry state : userRetryCount [0], userRetryMax [0], userRetryInterval [10]
20:13:35 credentials is null for the action
20:13:38 checking action, external ID [job_201401070500_217583] status [RUNNING]
20:13:38 Action status=RUNNING
20:13:38 Action updated in DB!
USER[-] TOKEN[-] APP[-]
20:13:57 callback for action [@some-action]
USER[running_user] TOKEN[] APP[some-big-job-workflow]
20:13:57 action completed, external ID [job_201401070500_217583]
20:13:57 Launcher ERROR, reason: Main class [org.apache.oozie.action.hadoop.HiveMain], exit code [10044]
20:13:57 ERROR is considered as FAILED for SLA
ACTION[@fail]
20:13:57 Start action [@fail] with user-retry state : userRetryCount [0], userRetryMax [0], userRetryInterval [10]
20:13:57 Action status=DONE
20:13:57 Action updated in DB!
ACTION[-]
20:13:58 E1100: Command precondition does not hold before execution, [, coord action is null], Error Code: E1100
4 Comments. Leave new
None of the “wtf” is Java’s fault. This is how Oozie is logging.
Please open a Jira ticket with how you’d like things to change so we can fix it! A patch would be even nicer :)
[Not Java’s fault] Maybe not technically, but I think it’s hard to argue the fact that the Java sphere and its logging underworld are inclined to, ahem, let’s call it prolixity. ;-)
Also, I have to be fair: I just looked at that specific log. I’m pretty sure there are options and knobs one can adjust to make the logs lean more toward human consumption. … right?
[patches are welcome] … you realize how dangerous it is to say something like this to a guy like me, right? 8-D
Looks great – but when I ran it against your example oozie.log, it output nothing.
Did a bit of debugging and found that this snippet:
my( $time, $info ) = /^d{4}-d{2}-d{2}s* # the date. Don’t care
(d+:dd:dd) # the time, More interesting
,d+s*.*?- # log level and stuff. Meh
(.*) # the message itself
/x
or return ();
was always doing the “return ();”
Any advice ?
Thanks
Andrew
Gah! I probably went overboard when I cleaned the code for the blog entry. That line you spotted is indeed the culprit, and it’s missing only one tiny thing: the variable on which we do the regex.:
my( $time, $info ) = $line =~ /^d{4}-d{2}-d{2}s*
(d+:dd:dd)
,d+s*.*?-
(.*)
/x
or return ();
That should already work muuuch better.
I’m currently at YAPC (the Perl conference), so my keyboard time is limited, but I’ll try to fix the example in the blog entry as soon as I’m back.
And mucho thanks for the bug report! :-)