Recently I was tasked with setting up an HBase cluster to benchmark against Amazon’s DynamoDB offering. The client had already verified that DynamoDB handled up to 10k updates/sec well, but was concerned about the cost. I set up a seven-node HBase cluster in the following configuration:
- Node1: NameNode, HMaster, Zookeeper
- Nodes 2-7: DataNodes, RegionServers
Every node was configured with the following hardware:
- 32GB RAM
- 4x Intel Xeon E7320 2.13GHz
- 4x SAS SEAGATE ST3300657SS drives (JBOD configuration, no RAID)
- 4x 1Gb Ethernet NICs in 2×2 bonded interfaces (only one of the bonds used by HBase)
HBase was configured (almost) as follows:
- hbase-env:
- export HBASE_HEAPSIZE=8000
- export HBASE_OPTS="-ea -XX:+UseConcMarkSweepGC -XX:+CMSIncrementalMode"
- export HBASE_OPTS="$HBASE_OPTS -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps"
- export HBASE_JMX_BASE="-Dcom.sun.management.jmxremote.ssl=false -Dcom.sun.management.jmxremote.authenticate=false"
- export HBASE_MASTER_OPTS="$HBASE_JMX_BASE -Dcom.sun.management.jmxremote.port=10101"
- export HBASE_REGIONSERVER_OPTS="$HBASE_JMX_BASE -Dcom.sun.management.jmxremote.port=10102"
- export HBASE_THRIFT_OPTS="$HBASE_JMX_BASE -Dcom.sun.management.jmxremote.port=10103"
- export HBASE_ZOOKEEPER_OPTS="$HBASE_JMX_BASE -Dcom.sun.management.jmxremote.port=10104"
- hbase-site:
- dfs.support.append = true
- hbase.cluster.distributed = true
- hbase.tmp.dir = /var/lib/hbase/tmp
- hbase.client.write.buffer = 8388608
- hbase.regionserver.handler.count = 20
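For anyone reproducing this, those settings translate directly into an hbase-site.xml fragment like the following (values exactly as listed above):

    <configuration>
      <property>
        <name>dfs.support.append</name>
        <value>true</value>
      </property>
      <property>
        <name>hbase.cluster.distributed</name>
        <value>true</value>
      </property>
      <property>
        <name>hbase.tmp.dir</name>
        <value>/var/lib/hbase/tmp</value>
      </property>
      <property>
        <name>hbase.client.write.buffer</name>
        <value>8388608</value>
      </property>
      <property>
        <name>hbase.regionserver.handler.count</name>
        <value>20</value>
      </property>
    </configuration>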
The Thrift server was started on the HMaster node so client programs could connect and do work.
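From memory, starting it was just the stock daemon script (the exact invocation may have differed slightly):

    $HBASE_HOME/bin/hbase-daemon.sh start thrift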
A single table was created with (almost) the following definition:
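In the HBase shell it looked something like this (the column family name and VERSIONS setting here are illustrative; I’m sketching the shape rather than quoting the exact statement):

    create 'timelesstest', {NAME => 't', VERSIONS => 1}

Counters in HBase need no special schema; they are ordinary cells that the server increments atomically.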
A simple Python program using the happybase HBase client was written to increment a single counter in that table as fast as possible. At first we could achieve only around 700 increments per second, no matter how many client programs we ran. Looking at the HBase console, it was apparent that only one node was doing any work; that much is expected, since a single counter lives in a single row and therefore in a single region, but 700 increments per second was still far below what we hoped for.
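The client was essentially the following; the hostname, row key, and column below are placeholders rather than the originals:

    import time

    import happybase

    # Connect to the Thrift server running on the HMaster
    # ("hmaster" is a placeholder hostname; 9090 is happybase's default port).
    conn = happybase.Connection('hmaster')
    table = conn.table('timelesstest')

    count = 0
    start = time.time()
    while True:
        # Atomic increment, routed through the Thrift gateway to the one
        # RegionServer hosting this row.
        table.counter_inc('counter-row', 't:value')
        count += 1
        if count % 1000 == 0:
            elapsed = time.time() - start
            print('%.0f increments/sec' % (count / elapsed))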
I did some analysis of the cluster. The CPU, disk, and RAM footprints were all very low; the only indication that the HBase cluster was doing work at all was that the three RegionServers hosting the table timelesstest were handling about 10x the interrupts/second of the others. Then I looked at the network: about 150KB/sec was being transferred in and out of the HMaster and the primary RegionServer for timelesstest, and about 380KB/sec for the replica RegionServers. At this point I guessed that HBase was syncing the write-ahead log for every single increment coming through. The documentation pointed out a setting I could change, hbase.regionserver.optionallogflushinterval, but tweaks to that value did nothing.
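For reference, that tweak amounts to an hbase-site.xml entry like the one below (the 1000ms value is just an example); as it turned out, the setting is inert until deferred log flush is enabled on the table itself:

    <property>
      <name>hbase.regionserver.optionallogflushinterval</name>
      <value>1000</value>
    </property>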
While speaking with Ryan Rawson, one of the original HBase committers, I learned that the setting only applies to a given table if DEFERRED_LOG_FLUSH is set at the table level, which the documentation didn’t make very clear[1]. So I simply ALTERed timelesstest to enable deferred log flushing, et voilà! The number of increments per second a single client could achieve skyrocketed to 2k/sec. We added six more clients to the mix and achieved a sustained 10.5k increments/sec, spiking at times up to 11.5k increments/sec.
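From memory, the shell incantation was something like this (on HBase of this vintage, the table has to be disabled before it can be altered):

    disable 'timelesstest'
    alter 'timelesstest', METHOD => 'table_att', DEFERRED_LOG_FLUSH => 'true'
    enable 'timelesstest'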
Since we were incrementing only a single counter, we were using roughly 1/6th of the cluster’s capacity: there were six RegionServers, and a single row is served by exactly one of them. We will soon be running more tests, with thousands of counters and perhaps dozens of clients. I suspect we will quickly hit a bottleneck at the single Thrift server and may need to open that up to approach the theoretical max of about 60k increments/sec on this cluster (6 RegionServers × ~10k/sec each), if that is even achievable given the 3x replication factor on the table.
Stay tuned!
[1] I let the HBase documentation guys know about the confusion. Hopefully future HBase administrators won’t have to worry about this gotcha anymore.