Fun with Hadoop

Posted in: Big Data, Hadoop, Site Reliability Engineering

I previously deployed an OpenStack private cloud at home.

The OpenStack series can be found on the blog at https://blog.pythian.com/author/bhagat/

I have been playing with big data and Hadoop for some time now, and decided to make use of my cloud infrastructure. So I deployed a 7-node Hadoop cluster and had some fun with it.

The first problem I faced, however, was computing power.

I had only one compute node and was short on computing power to run the cluster. To overcome this I added two additional compute nodes to the cloud infrastructure, but that is for another series.

I want to talk about Hadoop, so let's get started.

I decided to use the Hortonworks Hadoop distribution with the Ambari server, as most of the documentation and resources available to me were for Hortonworks.

I first played with it on AWS's free tier, but quickly realized that it might be more expensive than it was worth just to play around and learn Hadoop.

That made me decide to deploy my own Hadoop cluster at home.


I am not going to explain how to install and configure Hadoop, as there are plenty of guides available for that on the internet.

Instead I am going to tell you what I did with my Hadoop setup.

The first thing I did after creating the Hadoop cluster was to start playing with HDFS.

I decided to run the MapReduce word count example, as it's included in the Hadoop examples jar.

Around that time I got engaged in a server issue, where I had to analyze the log files to understand what was happening on the server.

After a while I decided to crunch them and, out of curiosity, uploaded a log file to the Hadoop cluster to run MapReduce jobs with various tools. I also loaded the log file into a Hive database for further analysis.

Let's just say the results were really helpful in narrowing down and troubleshooting the issue.

I am going to demonstrate the same process with a log file from one of my personal systems and show how I used it.

I took the /var/log/messages file from my test system and uploaded it to HDFS.
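The upload itself is a single HDFS command; something like this (the target path matches the one used in the commands that follow):

# copy the local syslog file into my HDFS home directory
hdfs dfs -put /var/log/messages /user/centos/messages
# confirm the file landed in HDFS
hdfs dfs -ls /user/centos/messages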

I used the example jar provided with Hadoop for word count, using the command below.

yarn jar /usr/hdp/2.3.6.0-3796/hadoop-mapreduce/hadoop-mapreduce-examples.jar wordcount /user/centos/messages /user/centos/msg_analysis
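The job writes its results to the directory given as the last argument; the reducer output files (typically named part-r-*) can be pulled back with something along these lines:

# print the word counts, most frequent tokens first
hdfs dfs -cat /user/centos/msg_analysis/part-r-* | sort -k2 -nr | head -20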

I got the results, but they were just the keys and values crunched by the MapReduce job. It was helpful, but I still was not satisfied and decided to use Hive for further analysis.

I first created a Hive table using the command below in the Hive shell.

hive> create table msgs (month STRING, dt INT, tm STRING, host STRING, command STRING, messages STRING)
 > ROW FORMAT DELIMITED
 > FIELDS TERMINATED BY ' ';
OK
Time taken: 1.151 seconds
hive>

Now that I had the table, I loaded the data into it using the command below.
hive> load data inpath '/user/centos/messages' into table msgs;
Loading data to table default.msgs
Table default.msgs stats: [numFiles=1, totalSize=588264]
OK
Time taken: 1.519 seconds
hive>

Now that I had the data where I wanted it for analysis, I started running the queries below to get insight into the services/commands and the patterns in the logs.

I identified the different services and commands which generated logs using the query below.

hive> select distinct command from msgs;
Query ID = centos_20160823100556_ffe796f6-5548-4bba-9c12-467315aca7dd
Total jobs = 1
Launching Job 1 out of 1


Status: Running (Executing on YARN cluster with App id application_1471922082248_0003)

--------------------------------------------------------------------------------
        VERTICES      STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED
--------------------------------------------------------------------------------
Map 1 ..........   SUCCEEDED      1          1        0        0       0       0
Reducer 2 ......   SUCCEEDED      1          1        0        0       0       0
--------------------------------------------------------------------------------
VERTICES: 02/02  [==========================>>] 100%  ELAPSED TIME: 6.99 s
--------------------------------------------------------------------------------
OK
NetworkManager[2309]:
NetworkManager[758]:
NetworkManager[776]:
NetworkManager[789]:
NetworkManager[790]:
ambari-agent:
audispd:
auditd[727]:
auditd[728]:
auditd[742]:
auditd[743]:
augenrules:
chronyd[762]:
chronyd[765]:
chronyd[791]:
chronyd[798]:
dbus-daemon:
dbus[760]:
dbus[763]:
dbus[777]:
dbus[780]:
dhclient[800]:
dhclient[816]:
dhclient[837]:
dracut:
jexec:
journal:
kdumpctl:
kernel:
lvm:
network:
nm-dispatcher:
polkitd[799]:
polkitd[801]:
polkitd[815]:
polkitd[839]:
rsyslogd:
sm-notify[1000]:
sm-notify[1002]:
sm-notify[1025]:
sm-notify[1046]:
snmpd[1021]:
snmpd[998]:
sshd-keygen:
su:
systemd-fsck:
systemd-journald[230]:
systemd-journald[233]:
systemd-journald[235]:
systemd-logind:
systemd-tmpfiles:
systemd-udevd:
systemd:
systemd[1]:
yum[10423]:
yum[11321]:
yum[2714]:
yum[3407]:
yum[8774]:
Time taken: 11.742 seconds, Fetched: 59 row(s)
hive>

After that I performed a count analysis for each command/service to see how many log entries each one generated.
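The count itself is a simple group-by in Hive; something along these lines (a sketch, not necessarily the exact query from the original run):

-- count log lines per command/service, largest contributors first
select command, count(*) as cnt from msgs group by command order by cnt desc;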

Once I gathered the data for each service, I used Google Docs to create charts from the results of the analysis.

[Chart: Service contribution in log generation]

As we can see from the chart, the major contributors to log generation are kernel and systemd.

I also analyzed the dates and created the chart below.
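The per-date numbers come from the same kind of aggregation, just grouped on the dt column; roughly:

-- count log lines per day of the month
select dt, count(*) as cnt from msgs group by dt order by dt;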

[Chart: Date contribution in log generation]

The maximum number of logs was generated on the 17th, while on the 18th, 20th, and 21st no logs were created because the system was not powered on.

There are many more things we can do with Hive and Hadoop, but that is for another time.

Find out more about how Pythian can help your organization optimize and maintain Hadoop.

