Configure High Availability and Load Balancing for HiveServer2

Manoj Kukreja

January 12, 2016

We are seeing steady adoption of Hive among our customers. With the increased workload, we want to make sure that the performance and availability of Hive are not compromised. The following steps help keep Hive running smoothly under heavier workloads. Configuring High Availability for Hive requires the following components to be fault tolerant:

1. Hive Metastore underlying RDBMS
2. ZooKeeper
3. Hive Metastore Server
4. HiveServer2

For the sake of simplicity, this blog focuses on enabling HA for the Hive Metastore Server and HiveServer2. We recommend that the RDBMS underlying the Hive Metastore also be configured for High Availability. Multiple ZooKeeper instances are already configured on the cluster used here.

Enabling High Availability for Hive Metastore Server

1. Log on to Cloudera Manager

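2. Click Hive > Hive Metastore Server. Locate the host for the Hive Metastore Server.

3. SSH to the Hive Metastore Server host and open hive-site.xml.

# vi /etc/hive/conf.cloudera.hive/hive-site.xml

At this point the hive.metastore.uris property should list only the existing Hive Metastore Server.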

4. On the Cloudera Manager Console click Hive > Configuration

Select Scope > Hive Metastore Server, then Category > Advanced. Locate the Hive Metastore Delegation Token Store property and choose org.apache.hadoop.hive.thrift.DBTokenStore. Click Save Changes.

5. On the Cloudera Manager Console click Hive > Instances. Click on Add Role Instances.

Click on Select Hosts for Hive Metastore Server.

6. Choose multiple Hosts (at least 2 more to make a total of 3) to configure Hive Metastore Server on.

Click OK and Continue.

7. Click Finish. You should now see new hosts added as the Hive Metastore Server.

Click on Restart the service (or the instance) for the changes to take effect.

8. Notice that hive.metastore.uris now has multiple instances of Hive Metastore Server.

Click on Restart Stale Service.

9. Click Restart Now.

10. Review Restart Messages.

11. Notice that you now have multiple instances of Hive Metastore Server.

12. SSH again to Hive Metastore Server.

# vi /etc/hive/conf.cloudera.hive/hive-site.xml

The hive.metastore.uris property should now list the newly added instances alongside the original one.
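Instead of reading the file by eye, the check can be scripted. The following is a minimal sketch, not part of the original procedure: site_value and metastore_uris are hypothetical helper names, and the sketch assumes the usual hive-site.xml layout in which each property has a name element followed by a value element.

```shell
# Print the value of one property from a Hadoop-style *-site.xml file.
# Assumption: each <property> holds <name>...</name> followed by
# <value>...</value>, on the same line or on a later line.
site_value() {
  # $1 = path to the XML file, $2 = property name
  awk -v prop="$2" '
    index($0, "<name>" prop "</name>") { found = 1 }
    found && /<value>/ {
      sub(/.*<value>/, "")
      sub(/<\/value>.*/, "")
      print
      exit
    }
    found && /<\/property>/ { exit }
  ' "$1"
}

# Print each configured metastore URI on its own line.
metastore_uris() {
  site_value "$1" hive.metastore.uris | tr ',' '\n'
}
```

Running metastore_uris /etc/hive/conf.cloudera.hive/hive-site.xml on the host should print one line per Hive Metastore Server, which means three lines after the steps above.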

So how do you know the settings are working? The following is the recommended plan for testing the High Availability of the Hive Metastore.

1. SSH to any DataNode. Connect to HiveServer2 using Beeline.

# beeline -u "jdbc:hive2://ip-10-7-176-204.ec2.internal:10000"

2. On the Cloudera Manager Console click Hive > Hive Metastore Server.

Stop the first Hive Metastore Server in the list.

Issue the "show databases" command in the beeline shell from step 1. The command should work normally.

3. Stop the second Hive Metastore Server in the list. Issue the "show databases" command in the beeline shell from step 1.

The command should still work normally.

4. Stop the third Hive Metastore Server in the list. Issue the "show databases" command in the beeline shell from step 1.

This command should fail, which is expected because no Hive Metastore Server is running.

5. Now start any one of the Hive Metastore Servers in the list. Issue the "show databases" command in the beeline shell from step 1.

This command should start working normally again.

6. After testing is complete, make sure all Hive Metastore Servers in the list are started.
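The manual test above can also be watched from a second terminal with a small probe loop. This is a sketch under stated assumptions rather than part of the original procedure: probe_hive and watch_hive are hypothetical names, BEELINE is a hypothetical override variable, and the sketch assumes that beeline exits with a non-zero status when the connection or the query fails, which is worth confirming on your version. Unlike the long-lived beeline shell in step 1, each probe opens a fresh connection.

```shell
# Run "show databases" once through HiveServer2 and report the result.
# BEELINE is a hypothetical override, e.g. a full path to the beeline script.
# Assumption: beeline exits non-zero when the connection or the query fails.
probe_hive() {
  # $1 = JDBC URL
  if "${BEELINE:-beeline}" -u "$1" -e 'show databases;' >/dev/null 2>&1; then
    echo "OK"
  else
    echo "FAIL"
  fi
}

# Probe repeatedly, printing a timestamped result for each attempt.
# $1 = JDBC URL, $2 = number of probes, $3 = seconds between probes
watch_hive() {
  i=0
  while [ "$i" -lt "$2" ]; do
    echo "$(date '+%H:%M:%S') $(probe_hive "$1")"
    i=$((i + 1))
    if [ "$i" -lt "$2" ]; then
      sleep "$3"
    fi
  done
}
```

For example, watch_hive "jdbc:hive2://ip-10-7-176-204.ec2.internal:10000" 60 5 probes every five seconds for about five minutes. While the Hive Metastore Servers are stopped and started in Cloudera Manager, the lines should read OK until the last one is stopped and switch back to OK once any one of them is running again.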

Enabling Load Balancing and High Availability for HiveServer2

To provide high availability and load balancing for HiveServer2, Hive provides a feature called dynamic service discovery, in which multiple HiveServer2 instances register themselves with ZooKeeper. Instead of connecting to a specific HiveServer2 directly, clients connect to ZooKeeper, which returns a randomly selected registered HiveServer2 instance.

1. Log on to Cloudera Manager. Click Hive > Instances. Click on Add Role Instances.

Click on Select Hosts for HiveServer2.

2. Choose multiple Hosts (at least 2 more to make a total of 3) to configure HiveServer2 on.

Click OK and Continue.

3. You should now see new hosts added as HiveServer2.

Select the newly added instances and choose Start.

4. Click on Close. The newly added HiveServer2 instances are now ready for use.

5. Open Hive > Configuration > Category > Advanced.

Find "HiveServer2 Advanced Configuration Snippet (Safety Valve) for hive-site.xml". Add a new property as below:

Name: hive.server2.support.dynamic.service.discovery
Value: true

6. Go to the Cloudera Manager Home Page and Restart Hive Service.

7. You should now have multiple instances of HiveServer2.
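To confirm that the instances actually registered, the children of the HiveServer2 znode in ZooKeeper can be listed and reduced to plain host:port pairs. The sketch below is an illustration only. It assumes that each instance registers a child znode whose name starts with serverUri=host:port; and that the ZooKeeper command-line client prints the children as a bracketed, comma-separated list; registered_hs2 is a hypothetical helper name.

```shell
# Turn the child list printed by a ZooKeeper "ls" into one host:port per line.
# Assumed input shape (a single line; version and sequence are illustrative):
#   [serverUri=host1:10000;version=1.1.0;sequence=0000000000, serverUri=host2:10000;...]
# Lines that do not contain a serverUri entry (client log output) are dropped.
registered_hs2() {
  sed 's/[][]//g' | tr ',' '\n' | sed -n 's/^ *serverUri=\([^;]*\);.*/\1/p'
}
```

Assuming the zookeeper-client wrapper is on the path, a call such as zookeeper-client -server ip-10-7-176-204.ec2.internal:2181 ls /hiveserver2 | registered_hs2 should print one line per running HiveServer2. The /hiveserver2 path corresponds to the zooKeeperNamespace=hiveserver2 setting used in the connect strings later in this post.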

So how do you know the settings are working? The following is the recommended plan for testing Load Balancing for HiveServer2.

1. As mentioned before, HiveServer2 High Availability is managed through ZooKeeper.

Clients connecting to HiveServer2 now go through ZooKeeper. An example JDBC connect string follows. Notice that the JDBC URL now points to a list of nodes that run ZooKeeper.

beeline -u "jdbc:hive2://ip-10-7-176-204.ec2.internal:2181,ip-10-229-16-131.ec2.internal:2181,ip-10-179-159-209.ec2.internal:2181/;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=hiveserver2"

2. SSH to any DataNode. Connect to HiveServer2 using Beeline.

# beeline -u "jdbc:hive2://ip-10-7-176-204.ec2.internal:2181,ip-10-229-16-131.ec2.internal:2181,ip-10-179-159-209.ec2.internal:2181/;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=hiveserver2"

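3. Each new connection is routed to one of the registered HiveServer2 instances. The instance is picked at random, so connections spread across the instances over time rather than in strict round robin order.

Issue the following command on each HiveServer2 node to watch its log for incoming connections. The log file name contains the node's hostname, so adjust it on each node.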

# tail -f /var/log/hive/hadoop-cmf-hive-HIVESERVER2-ip-10-7-176-204.ec2.internal.log.out

4. You may issue the beeline command from multiple sources and monitor the HiveServer2 logs to see which instance serves each connection.
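Because the same long connect string is typed on every client, it can help to assemble it from the list of ZooKeeper hosts. The following is a small sketch: zk_jdbc_url is a hypothetical helper, and ZK_PORT and ZK_NAMESPACE are hypothetical override variables whose defaults mirror the values used in this post.

```shell
# Build a ZooKeeper-discovery JDBC URL from a list of ZooKeeper hosts.
# ZK_PORT and ZK_NAMESPACE are hypothetical knobs; the defaults (2181 and
# hiveserver2) are the values used in the connect strings of this post.
zk_jdbc_url() {
  quorum=""
  for host in "$@"; do
    # Append "host:port", inserting a comma before every entry but the first.
    quorum="${quorum:+$quorum,}${host}:${ZK_PORT:-2181}"
  done
  printf '%s\n' "jdbc:hive2://${quorum}/;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=${ZK_NAMESPACE:-hiveserver2}"
}
```

With this in place, the connection from step 2 becomes beeline -u "$(zk_jdbc_url ip-10-7-176-204.ec2.internal ip-10-229-16-131.ec2.internal ip-10-179-159-209.ec2.internal)", which yields the same URL as the one written out by hand.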

So how do you know the settings are working? The following is the recommended plan for testing High Availability for HiveServer2.

1. On the Cloudera Manager Console click Hive > HiveServer2.

Stop the first HiveServer2 in the list.

Connecting with Beeline using the command below should work normally.

# beeline -u "jdbc:hive2://ip-10-7-176-204.ec2.internal:2181,ip-10-229-16-131.ec2.internal:2181,ip-10-179-159-209.ec2.internal:2181/;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=hiveserver2"

2. Stop the second HiveServer2 in the list.

Connecting with Beeline using the same command should still work normally.

3. Stop the third HiveServer2 in the list.

Connecting with Beeline using the same command should fail, because no HiveServer2 instance is left to serve the connection.

4. Start the third HiveServer2 in the list.

Connecting with Beeline using the same command should work normally again.

5. After the testing completes, make sure all HiveServer2 instances in the list are started.
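In step 4, a freshly started HiveServer2 may need a little while before it accepts connections, so a first attempt can still fail. One way to handle this is to poll until a connection succeeds. The sketch below is illustrative only: wait_for_hs2 is a hypothetical helper, BEELINE is a hypothetical override variable, and the sketch assumes that beeline exits with a non-zero status when it cannot connect or run the query.

```shell
# Poll until a Beeline connection succeeds or the attempts run out.
# Assumption: beeline exits non-zero when it cannot connect or run the query.
# $1 = JDBC URL, $2 = maximum attempts, $3 = seconds between attempts
wait_for_hs2() {
  attempt=1
  while [ "$attempt" -le "$2" ]; do
    if "${BEELINE:-beeline}" -u "$1" -e 'show databases;' >/dev/null 2>&1; then
      echo "connected on attempt $attempt"
      return 0
    fi
    attempt=$((attempt + 1))
    sleep "$3"
  done
  echo "no HiveServer2 reachable after $2 attempts"
  return 1
}
```

Called with the ZooKeeper connect string from step 1, 30 attempts, and a 10 second pause, it gives a restarted instance up to five minutes to come back before reporting a failure.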
