Configure high availability – load balancing for Hiveserver2

We are seeing steady adoption of Hive among our customers. As workloads grow, we want to make sure that the performance and availability of Hive are not compromised. The following steps help keep Hive running smoothly under increased workloads.

Configuring High Availability for Hive requires the following components to be fault tolerant:

1. The RDBMS underlying the Hive Metastore
2. ZooKeeper
3. Hive Metastore Server
4. HiveServer2

For the sake of simplicity, this blog focuses on enabling HA for the Hive Metastore Server and HiveServer2. We recommend that the RDBMS underlying the Hive Metastore also be configured for High Availability. We have already configured multiple ZooKeeper instances on the current cluster.

Enabling High Availability for Hive Metastore Server
1. Log on to Cloudera Manager
2. Click on HIVE > Hive Metastore Server. Locate the host for the Hive Metastore Server.
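3. SSH to the Hive Metastore Server host and open the Hive configuration file:
# vi /etc/hive/conf.cloudera.hive/hive-site.xml
Note the current value of the hive.metastore.uris property, which should list a single Hive Metastore Server at this point.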
4. On the Cloudera Manager Console click Hive > Configuration

Select Scope > Hive Metastore Server.
Select Category > Advanced.
Locate the Hive Metastore Delegation Token Store property.
Choose org.apache.hadoop.hive.thrift.DBTokenStore
Click Save Changes.

5. On the Cloudera Manager Console click Hive > Instances. Click on Add Role Instances.

Click on Select Hosts for Hive Metastore Server.

6. Choose multiple Hosts (at least 2 more to make a total of 3) to configure Hive Metastore Server on.

Click OK and Continue.

7. Click Finish. You should now see new hosts added as the Hive Metastore Server.
Click on Restart the service (or the instance) for the changes to take effect.
8. Notice that hive.metastore.uris now has multiple instances of Hive Metastore Server.

Click on Restart Stale Service.

9. Click Restart Now.
10. Review Restart Messages.
11. Notice that you now have multiple instances of Hive Metastore Server.
12. SSH to the Hive Metastore Server host again and reopen the same file:
# vi /etc/hive/conf.cloudera.hive/hive-site.xml
The hive.metastore.uris property should now list the newly added instances.
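If you would rather not scan the file by eye, the check in steps 3 and 12 can be scripted. The following is a minimal sketch, not part of the Cloudera tooling: the function name metastore_uris is our own, and it assumes the <name> and <value> elements each sit on their own line, as they normally do in a generated hive-site.xml. It is a plain text match rather than a real XML parser.

```shell
# metastore_uris FILE: print one metastore URI per line from a hive-site.xml.
# Hypothetical helper; assumes <name> and <value> are on separate lines.
metastore_uris() {
    awk '
        # Remember that we have seen the property name, then move on.
        /<name>hive\.metastore\.uris<\/name>/ { found = 1; next }
        # The next <value> line holds the comma-separated URI list.
        found && /<value>/ {
            sub(/.*<value>/, "")
            sub(/<\/value>.*/, "")
            n = split($0, uris, ",")
            for (i = 1; i <= n; i++) print uris[i]
            exit
        }
    ' "$1"
}

# Example (run on the Metastore host):
#   metastore_uris /etc/hive/conf.cloudera.hive/hive-site.xml
```

After step 12 the function should print one line per Hive Metastore Server instance, so piping it through "wc -l" gives a quick count.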
So how do you know the settings are working? The following is the recommended plan for testing the High Availability of Hive MetaStore.
1. SSH to any DataNode. Connect to Hiveserver2 using Beeline.

# beeline -u "jdbc:hive2://ip-10-7-176-204.ec2.internal:10000"

2. On the Cloudera Manager Console click Hive > Hive MetaStore Server.
Stop the first Hive MetaStore Server in the list.
Issue the "show databases;" command in the Beeline shell from step 1. The command should work normally.
3. Stop the second Hive Metastore Server in the list. Issue the "show databases;" command in the Beeline shell from step 1.
The command should still work normally.
4. Stop the third Hive Metastore Server in the list. Issue the "show databases;" command in the Beeline shell from step 1.
This command should fail, which is expected because no Hive Metastore Server is running.
5. Now start any one of the Hive Metastore Servers in the list. Issue the "show databases;" command in the Beeline shell from step 1.
This command should start working normally again.
6. After testing is complete, make sure you start all Hive Metastore Servers in the list.
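The test plan above can also be watched from a second terminal while you stop and start instances in Cloudera Manager. The sketch below is one way to do that, under two assumptions: the helper name poll_check is ours, and your Beeline version exits with a non-zero status when the statement fails (worth confirming before relying on it). Unlike the plan above, each attempt opens a new Beeline session instead of reusing the shell from step 1.

```shell
# poll_check COUNT INTERVAL CMD...: run CMD COUNT times, INTERVAL seconds
# apart, printing a timestamped OK/FAIL line for each attempt.
# Hypothetical helper; OK/FAIL is decided purely by CMD's exit status.
poll_check() (
    count=$1; interval=$2; shift 2
    i=1
    while [ "$i" -le "$count" ]; do
        if "$@" >/dev/null 2>&1; then status=OK; else status=FAIL; fi
        echo "$(date '+%H:%M:%S') attempt $i: $status"
        # Sleep between attempts, but not after the last one.
        if [ "$i" -lt "$count" ]; then sleep "$interval"; fi
        i=$((i + 1))
    done
)

# Example (not executed here): 30 attempts, 2 seconds apart.
#   poll_check 30 2 beeline -u "jdbc:hive2://ip-10-7-176-204.ec2.internal:10000" -e "show databases;"
```

With all three Hive Metastore Servers stopped you would expect a run of FAIL lines, switching back to OK shortly after one instance is started again.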
Enabling Load Balancing and High Availability for Hiveserver2
To provide high availability and load balancing for HiveServer2, Hive provides a feature called dynamic service discovery, in which multiple HiveServer2 instances register themselves with ZooKeeper. Instead of connecting to a specific HiveServer2 directly, clients connect to ZooKeeper and are directed to a randomly selected registered HiveServer2 instance.
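To make the idea concrete, here is a small simulation of the selection step. It does not talk to ZooKeeper. It assumes each registration entry looks like "serverUri=HOST:PORT;version=X;sequence=N", which is the format we understand HiveServer2 to use but which you should verify on your own cluster. The function names and the hostnames in the example are hypothetical.

```shell
# server_uri ENTRY: print the HOST:PORT part of a registration entry.
# Assumed entry format: serverUri=HOST:PORT;version=X;sequence=N
server_uri() {
    printf '%s\n' "$1" | sed -n 's/^serverUri=\([^;]*\);.*$/\1/p'
}

# pick_random: read candidate lines on stdin and print one chosen at random.
# Some awk implementations seed srand() from the clock in whole seconds, so
# calls made within the same second may return the same choice.
pick_random() {
    awk 'BEGIN { srand() }
         { line[NR] = $0 }
         END { if (NR) print line[int(rand() * NR) + 1] }'
}

# Example with made-up entries:
#   printf '%s\n' \
#     'serverUri=hs2-a.example:10000;version=1.1.0;sequence=0000000001' \
#     'serverUri=hs2-b.example:10000;version=1.1.0;sequence=0000000002' |
#     pick_random
```

Because the choice is random rather than strictly alternating, a handful of connections may land on the same instance; the spread evens out as the number of connections grows.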
1. Log on to Cloudera Manager. Click Hive > Instances. Click on Add Role Instances.

Click on Select Hosts for HiveServer2.

2. Choose multiple Hosts (at least 2 more to make a total of 3) to configure HiveServer2 on.

Click OK and Continue.

3. You should now see new hosts added as HiveServer2.

Select the newly added instances and choose Start.

4. Click on Close. The newly added HiveServer2 instances are now ready for use.
5. Open Hive > Configuration > Category > Advanced.
Find “HiveServer2 Advanced Configuration Snippet (Safety Valve) for hive-site.xml”
Add a new property as below:

Name: hive.server2.support.dynamic.service.discovery
Value: true

6. Go to the Cloudera Manager Home Page and Restart Hive Service.
7. You should now have multiple instances of HiveServer2.
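For reference, the property from step 5 corresponds to a standard hive-site.xml property element. The sketch below prints that element so it can be pasted in as XML. We are assuming here that your Cloudera Manager version lets you enter the safety valve as XML; if it only offers Name and Value fields, use those as shown in step 5. The helper name hive_site_property is our own.

```shell
# hive_site_property NAME VALUE: print a <property> element for hive-site.xml.
# Hypothetical helper; it does no escaping, so keep NAME and VALUE plain text.
hive_site_property() {
    printf '<property>\n  <name>%s</name>\n  <value>%s</value>\n</property>\n' "$1" "$2"
}

hive_site_property hive.server2.support.dynamic.service.discovery true
# Prints:
# <property>
#   <name>hive.server2.support.dynamic.service.discovery</name>
#   <value>true</value>
# </property>
```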
So how do you know the settings are working? The following is the recommended plan for testing Load Balancing for HiveServer2.
1. As mentioned before, HiveServer2 High Availability is managed through ZooKeeper.

Clients connecting to HiveServer2 now go through ZooKeeper. An example JDBC connect string is as follows. Notice that the JDBC URL now points to a list of nodes that run ZooKeeper.

beeline -u "jdbc:hive2://ip-10-7-176-204.ec2.internal:2181,ip-10-229-16-131.ec2.internal:2181,ip-10-179-159-209.ec2.internal:2181/;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=hiveserver2"

2. SSH to any data node. Connect to Hiveserver2 using Beeline.
# beeline -u "jdbc:hive2://ip-10-7-176-204.ec2.internal:2181,ip-10-229-16-131.ec2.internal:2181,ip-10-179-159-209.ec2.internal:2181/;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=hiveserver2"
3. Each new connection is routed to one of the registered HiveServer2 instances, chosen at random, so connections spread across the instances over time.

Issue the following command on each HiveServer2 node to watch incoming connections. The log file name includes the node's hostname, so adjust it for each node.

# tail -f /var/log/hive/hadoop-cmf-hive-HIVESERVER2-ip-10-7-176-204.ec2.internal.log.out

4. You may issue the beeline command from multiple sources and monitor the HiveServer2 logs.
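The connect string used in this test is long and easy to mistype, so it can help to assemble it from the list of ZooKeeper hosts. This is a minimal sketch: the function name hs2_zk_url is ours, and it assumes the ZooKeeper port (2181) and namespace (hiveserver2) shown in this post, which may differ on your cluster.

```shell
# hs2_zk_url NAMESPACE HOST[:PORT]...: print a ZooKeeper discovery JDBC URL.
# Hypothetical helper; hosts given without a port get :2181 appended.
hs2_zk_url() (
    namespace=$1; shift
    quorum=""
    for host in "$@"; do
        case "$host" in
            *:*) ;;                      # port already present
            *)   host="$host:2181" ;;    # assume the port used in this post
        esac
        # Join hosts with commas, without a leading comma.
        quorum="${quorum:+$quorum,}$host"
    done
    printf 'jdbc:hive2://%s/;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=%s\n' \
        "$quorum" "$namespace"
)

# Example (not executed here):
#   beeline -u "$(hs2_zk_url hiveserver2 ip-10-7-176-204.ec2.internal ip-10-229-16-131.ec2.internal ip-10-179-159-209.ec2.internal)"
```

Keep the double quotes around the URL when passing it to Beeline, since the semicolons would otherwise be treated by the shell as command separators.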
So how do you know the settings are working? The following is the recommended plan for testing High Availability for HiveServer2.
1. On the Cloudera Manager Console click Hive > HiveServer2.
Stop the first HiveServer2 in the list.

Connecting with Beeline using the command below should work normally.

# beeline -u "jdbc:hive2://ip-10-7-176-204.ec2.internal:2181,ip-10-229-16-131.ec2.internal:2181,ip-10-179-159-209.ec2.internal:2181/;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=hiveserver2"
2. Stop the second HiveServer2 in the list.

Connecting with Beeline using the same command should still work normally.

3. Stop the third HiveServer2 in the list.

Connecting with Beeline using the same command should fail, because no HiveServer2 instance is left running.

4. Start the third HiveServer2 in the list.

Connecting with Beeline using the same command should work normally again.

5. After the testing completes, make sure you start all HiveServer2 instances in the list.
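Each stage of this plan has an expected outcome (work, work, fail, work), so the checks can be written down as explicit expectations. The sketch below is one way to do that. The helper name expect_result is ours, and it assumes that Beeline exits with a non-zero status when it cannot connect, which you should confirm on your version.

```shell
# expect_result WANT CMD...: WANT is "ok" or "fail". Runs CMD, prints PASS if
# the outcome matches WANT, otherwise prints MISMATCH and returns non-zero.
# Hypothetical helper; the outcome is taken from CMD's exit status only.
expect_result() (
    want=$1; shift
    if "$@" >/dev/null 2>&1; then got=ok; else got=fail; fi
    if [ "$got" = "$want" ]; then
        echo "PASS (expected $want, got $got)"
    else
        echo "MISMATCH (expected $want, got $got)"
        exit 1
    fi
)

# Example (not executed here), with URL set to the connect string from step 1:
#   expect_result ok   beeline -u "$URL" -e "show databases;"   # after steps 1, 2 and 4
#   expect_result fail beeline -u "$URL" -e "show databases;"   # after step 3
```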

 

Author

Interested in working with Manoj? Schedule a tech call.

8 Comments

Have you tried this with MapR

Manoj Kukreja
March 17, 2016 1:11 pm

I have not tried this with MapR but it should work the same way.

I liked your post and the way it is organized. Thanks!
My question:
Beeline works fine when one of the 2 HiveServer2 instances is down.
How do you get this working in Hue? Hue always points to the default HiveServer2. If that service is down, it doesn't switch to the second HiveServer2, and Hive queries fail with a connection error.

Doesn’t this still leave the actual datastore as a single point of failure, whether it be Oracle, MySQL, Derby, etc.? Have you tried to use sqlproxy on each metastore server pointing to a Galera based cluster?

The idea for this blog is to present the HA solution for HiveServer2 only. For the underlying metastore we use either simple master-slave replication or a Galera/PXDB type solution.

arullaldivakar
November 16, 2016 3:40 am

How to configure the new hiveserver connection in Hue. It looks like after enabling HA, Hive is not connecting through Hue.

This article is awesome.. saved lots of my efforts and i was able to fix the issue

Can we use nginx to load balance the metastore?
