Data comes in different shapes. One of the these shapes is called a time series. Time series is basically a sequence of data points recorded over time. If, for example, you measure the height of the tide every hour for 24 hours, then you will end up with a time series of 24 data points. Each data point will consist of tide height in meters and the hour it was recorded at.
Time series are very powerful data abstractions. There are a lot of processes around us that can be described by a simple measurement and a point in time this measurement was taken at. You can discover patterns in your website users behavior by measuring the number of unique visitors every couple of minutes. This time series will help you discover trends that depend on the time of day, day of the week, seasonal trends, etc. Monitoring a server’s health by recording metrics like CPU utilization, memory usage and active transactions in a database at a frequent interval is an approach that all DBAs and sysadmins are very familiar with. The real power of time series is in providing a simple mechanism for different types of aggregations and analytics. It is easy to find, for example, minimum and maximum values over a given period of time, or calculate average, sums and other statistics.
Building a scalable and reliable database for time series data has been a goal of companies and engineers out there for quite some time. With ever increasing volumes of both human and machine generated data the need for such systems is becoming more and more apparent.
OpenTSDB and HBase
There are different database systems that support time series data. Some of them (like Oracle) provide functionality to work with time series that is built on top of their existing relational storage. There are also some specialized solutions like InfluxDB.
OpenTSDB is somewhere in between these two approaches: it relies on HBase to provide scalable and reliable storage, but implements it’s own logic layer for storing and retrieving data on top of it.
OpenTSDB consists of a
tsd process that handles all read/write requests to HBase and several protocols to interact with tsd. OpenTSDB can accept requests over Telnet or HTTP APIs, or you can use existing tools like
tcollector to publish metrics to OpenTSDB.
OpenTSDB relies on scalability and performance properties of HBase to be able to handle high volumes of incoming metrics. Some of the largest OpenTSDB/HBase installations span over dozens of servers and process ~280k writes per second (numbers from https://www.slideshare.net/HBaseCon/ecosystem-session-6)
There exist a lot of different tools that complete OpenTSDB ecosystem from various metrics collectors to GUIs. This makes OpenTSDB one of the most popular ways to handle large volumes of time series information and one of the major HBase use cases as well. The main challenge with this configuration is that you will need to host your own (potentially very large) HBase cluster and deal with all related issues from hardware procurement to resource management, dealing with Java garbage collection, etc.
OpenTSDB and Google Cloud Bigtable
If you trace HBase ancestry you will soon find out that it all started when Google published a paper on a scalable data storage called Bigtable. Google has been using Bigtable internally for more than a decade as a back end for web index, Google Earth and other projects. The publication of the paper initiated creation of Apache HBase and Apache Cassandra, both very successful open source projects.
Latest release of Bigtable as a publicly available Google Cloud service gives you instant access to all the engineering effort that was put into Bigtable at Google over the years. Essentially, you are getting a flexible, robust HBase-like database that lacks some of the inherited HBase issues, like Java GC stalls. And it’s completely managed, meaning you don’t have to worry about provisioning hardware, handling failures, software installs, etc.
What does it mean for OpenTSDB and time series databases in general? Well, since HBase is built on Bigtable foundation it is actually API compatible with Google Cloud Bigtable. This means that your applications that work with HBase could be switched to work with Bigtable with minimal effort. Be aware of some of the existing limitations though. Pythian engineers are working on integrating OpenTSDB to work with Google Cloud Bigtable instead of HBase and we hope to be able to share results with the community shortly. Having Bigtable as a back end for OpenTSDB opens a lot of opportunities. It will provide you with a managed cloud-based time-series database, which can be scaled on demand and doesn’t require much maintenance effort.
There are some challenges that we have to deal with, especially around a client that OpenTSDB uses to connect to HBase. OpenTSDB uses it’s own implementation of HBase client called AsyncHBase. It is compatible on a wire protocol level with HBase 0.98, but uses a custom async Java library to allow for asynchronous interaction with HBase. This custom implementation allows OpenTSDB to perform HBase operations much faster than using standard HBase client.
While HBase API 1.0.0 introduced some asynchronous behavior using BufferedMutator it is not a trivial task to replace AsyncHBase with a standard HBase client, because it is tightly coupled with the rest of OpenTSDB code. Pythian engineers are working on trying out several ideas on how to make the transition to standard client look seamless from an OpenTSDB perspective. Once we have a standard HBase client working, connecting OpenTSDB to Bigtable should be simple.