An effective approach to migrate dynamic thrift data to CQL: part 2


Note that this post is Part 2 of a 3-part blog series. If you haven’t read Part 1 of this series, please do so before continuing. Part 1 gives background on the problem we’re trying to solve, as well as some fundamental concepts used in the following discussion. Section numbering continues from Part 1. (Note: Part 3 can be found here.)

4. Static and Dynamic Cassandra Table Definition

In Thrift, Cassandra tables can be defined either statically or dynamically. When a table is defined dynamically, an application can add new columns on the fly, and the column definitions don’t necessarily need to be the same for every storage row. This is flexible, but it can also be problematic because the dynamic column definitions exist only in the application and are invisible to the outside world.

In CQL, however, tables are always statically defined, which means that any column and its type information has to be defined in advance before it can be used. Each row in a CQL table has exactly the same columns and type definitions.

In CQL, the dynamism of a Thrift table definition can be achieved through clustering columns and/or more advanced column types such as collections and user-defined types (UDTs).
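
As a rough illustration of these two options, the Python sketch below models one Thrift storage row as a plain dictionary and reshapes it in two ways: one CQL row per dynamic column (the clustering-column approach), and a single CQL row that keeps the dynamic columns in a map. It is a toy in-memory model, not driver code. The function names, the "attrs" map column, and the sample column names are all hypothetical.

```python
def to_clustered_rows(row_key, dynamic_columns):
    """Clustering-column style: one CQL row per dynamic column,
    sorted by column name the way a clustering column would order them."""
    return [(row_key, name, value) for name, value in sorted(dynamic_columns.items())]


def to_map_row(row_key, dynamic_columns):
    """Collection style: one CQL row whose map column holds every dynamic column."""
    return {"key": row_key, "attrs": dict(dynamic_columns)}


# Hypothetical Thrift storage row: row key 1 with two columns added on the fly.
thrift_columns = {"temperature": 21, "humidity": 40}

clustered = to_clustered_rows(1, thrift_columns)
as_map = to_map_row(1, thrift_columns)
print(clustered)
print(as_map)
```

Which shape fits better depends on how the application queries the data; the sketch only shows that both can carry the same information.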

4.1 Static Table Definition

In Thrift, a statically defined table has column_metadata information in the table definition statement, as follows:

create column family avg_grade
    with key_validation_class = Int32Type
     and comparator = UTF8Type
     and column_metadata = [
       {column_name: class_id, validation_class: Int32Type},
       {column_name: grade, validation_class: FloatType}
     ];

A strictly equivalent CQL table definition is like this (note the “WITH COMPACT STORAGE” table property):

create table avg_grade (
    key int primary key,
    class_id int,
    grade float
) with compact storage

A statically defined table created in either a Thrift utility (cassandra-cli) or a CQL utility (cqlsh) can be accessed from the other without problems. One difference between the two definitions is that the row key has a name in the CQL definition but not in the Thrift definition. When accessing a table defined in Thrift, CQL therefore uses a default name (“key”) for the row key.
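
The default key name can be pictured with a small Python sketch. It models a Thrift storage row as a row key plus a dictionary of columns, and returns the row as CQL would present it, with the row key exposed under the name "key". The helper name cql_view and the sample values are hypothetical; this is an in-memory illustration, not a call into Cassandra.

```python
DEFAULT_KEY_NAME = "key"  # name CQL falls back to for an unnamed Thrift row key


def cql_view(row_key, thrift_columns, key_name=DEFAULT_KEY_NAME):
    """Return the row as CQL would present it: the row key becomes a named column."""
    row = {key_name: row_key}
    row.update(thrift_columns)
    return row


# Hypothetical sample values for the avg_grade table defined above.
row = cql_view(1, {"class_id": 101, "grade": 88.5})
print(row)
```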

4.2 Dynamic Table Definition

In Thrift, a dynamically defined table does NOT have column_metadata information in the table definition statement. Time-series applications typically adopt dynamic table definitions. For example, in a sensor monitoring application, we may use the sensor ID as the storage row key and, for each sensor, record the different event values it detects over a period of time. An example table definition is as follows:

create column family sensor_data
   with key_validation_class = Int32Type
    and comparator = TimeUUIDType
    and default_validation_class = DecimalType;

Suppose that, for this table, 2 events are recorded for sensor 1 and 1 event is recorded for sensor 2. The output from the cassandra-cli utility looks like this:

RowKey: 1
=> (name=1171d7e0-14d2-11e6-858b-5f3b22f4f11c, value=21.5, timestamp=1462680314974000)
=> (name=23371940-14d2-11e6-858b-5f3b22f4f11c, value=32.1, timestamp=1462680344792000)
RowKey: 2
=> (name=7537fcf0-14d2-11e6-858b-5f3b22f4f11c, value=11.0, timestamp=1462680482367000)

The output above shows that the columns for each row are generated dynamically by the application and can differ between rows. In CQL, a strictly equivalent table definition and output for the above dynamic Thrift table are as follows:

CREATE TABLE sensor_data (
    key int,
    column1 timeuuid,
    value decimal,
    PRIMARY KEY (key, column1)
) WITH COMPACT STORAGE;

 key | column1                              | value
   1 | 1171d7e0-14d2-11e6-858b-5f3b22f4f11c | 0E-1077248000
   1 | 23371940-14d2-11e6-858b-5f3b22f4f11c | -8.58993459E-1077939396
   2 | 7537fcf0-14d2-11e6-858b-5f3b22f4f11c | 0E-1076232192
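
The mapping from Thrift storage rows to CQL rows can be sketched in Python as follows. The sketch is an in-memory model only: it copies the column names and values from the cassandra-cli output above and flattens each Thrift column into one (key, column1, value) tuple. The flatten helper is hypothetical, and sorting by row key is done purely for readable output; it does not reflect how Cassandra orders partitions.

```python
from decimal import Decimal
from uuid import UUID

# Thrift storage rows copied from the cassandra-cli output above:
# row key -> {column name (TimeUUID): value (decimal)}
thrift_rows = {
    1: {
        UUID("1171d7e0-14d2-11e6-858b-5f3b22f4f11c"): Decimal("21.5"),
        UUID("23371940-14d2-11e6-858b-5f3b22f4f11c"): Decimal("32.1"),
    },
    2: {
        UUID("7537fcf0-14d2-11e6-858b-5f3b22f4f11c"): Decimal("11.0"),
    },
}


def flatten(rows):
    """Yield one (key, column1, value) tuple per Thrift column."""
    for key in sorted(rows):  # sorted only to keep the printout stable
        # Order TimeUUID column names by their embedded timestamp.
        for column1 in sorted(rows[key], key=lambda u: u.time):
            yield key, column1, rows[key][column1]


cql_rows = list(flatten(thrift_rows))
for key, column1, value in cql_rows:
    print(f"{key:>4} | {column1} | {value}")
```

Each storage row with N dynamic columns becomes N CQL rows that share the same key, which is why the row key is the partition key and column1 is the clustering column in the definition above.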

Since the columns are generated on the fly by the application, CQL doesn’t know the column names in advance. It therefore uses the default column name “column1” (and the default key name “key”) in its definition. Functionally, the table can be transformed into an equivalent, more descriptive definition by using the “ALTER TABLE” CQL command to rename the columns (e.g. “key” to “sensor_id”, “column1” to “event_time”), as below:

CREATE TABLE sensor_data (
    sensor_id int,
    event_time timeuuid,
    value decimal,
    PRIMARY KEY (sensor_id, event_time)
) WITH COMPACT STORAGE;
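
For reference, the rename could be scripted along the following lines. This Python sketch only builds the statement text; it does not connect to a cluster. It assumes the RENAME ... TO ... AND ... form of ALTER TABLE, which should be checked against the CQL documentation for the Cassandra version in use. The helper name rename_statement is hypothetical.

```python
def rename_statement(table, renames):
    """Build one CQL ALTER TABLE ... RENAME statement from {old_name: new_name}."""
    clauses = " AND ".join(f"{old} TO {new}" for old, new in renames.items())
    return f"ALTER TABLE {table} RENAME {clauses};"


stmt = rename_statement("sensor_data", {"key": "sensor_id", "column1": "event_time"})
print(stmt)
# ALTER TABLE sensor_data RENAME key TO sensor_id AND column1 TO event_time;
```

The generated statement can then be pasted into cqlsh or passed to whichever client the migration uses.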

4.3 Mixed Table Definition

In Thrift, a table can also be in a mixed mode: when the table is created, part of its column information is defined in column_metadata, just like a static table, but at runtime a Thrift-based application can add more columns on the fly.

The table below, blogs, is such an example. It has one statically defined column, “author”. RowKey 1 also has 3 more columns (tags:category, tags:rating, and tags:recommend) defined on the fly.

create column family blogs
    with key_validation_class = Int32Type
     and comparator = UTF8Type
     and column_metadata = [
       {column_name: author, validation_class: UTF8Type}
     ];

RowKey: 1
=> (name=author, value=Donald, timestamp=1462720696717000)
=> (name=tags:category, value=music, timestamp=1462720526053000)
=> (name=tags:rating, value=A, timestamp=1462720564590000)
=> (name=tags:recommend, value=Yes, timestamp=1462720640817000)
RowKey: 2
=> (name=author, value=John, timestamp=1462720690911000)

When examined in CQL, both the table schema and the data output show only the statically defined columns. The dynamic columns are NOT displayed.

CREATE TABLE blogs (
    key int PRIMARY KEY,
    author text
) WITH COMPACT STORAGE;

key  | author
1    | Donald
2    | John
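
The effect can be mimicked with a short Python sketch that uses the blogs data shown above. It is an in-memory model under the assumption that the CQL view simply filters each storage row down to the columns declared in column_metadata; the helper names cql_visible and hidden_columns are hypothetical.

```python
STATIC_COLUMNS = {"author"}  # columns declared in column_metadata

# Thrift storage rows copied from the cassandra-cli output above.
thrift_rows = {
    1: {
        "author": "Donald",
        "tags:category": "music",
        "tags:rating": "A",
        "tags:recommend": "Yes",
    },
    2: {"author": "John"},
}


def cql_visible(rows, static_columns):
    """Keep only statically declared columns, mimicking what cqlsh displays."""
    return [
        {"key": key, **{n: v for n, v in cols.items() if n in static_columns}}
        for key, cols in sorted(rows.items())
    ]


def hidden_columns(rows, static_columns):
    """List, per row key, the dynamic column names the CQL view leaves out."""
    return {
        key: sorted(n for n in cols if n not in static_columns)
        for key, cols in rows.items()
    }


visible = cql_visible(thrift_rows, STATIC_COLUMNS)
hidden = hidden_columns(thrift_rows, STATIC_COLUMNS)
print(visible)
print(hidden)
```

Listing the hidden columns this way makes it easy to see which data a CQL-only reader of the table would miss.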


