The traditional NoCOUG SQL challenge was launched this year with a twist: in the wake of the “BigData” trend/buzz, it has been upgraded to a “SQL and/vs. NoSQL” challenge. I took on the challenge, starting from my SQL comfort zone and thinking of ways I could bend relational algebra to solve the wicked puzzle posed this year.
And I guess I would’ve been happy to stick to my guns and get cozy in my es-cu-el corner, had I not had colleagues who specialize in being inspirational and in getting you to “up your game”, pull your socks up, take on new challenges, and learn new skills. I’m lucky to have a number of them, and I’m grateful to them!
As a result, I was lured into thinking outside my SQL box to answer the NoCOUG call. All I knew about NoSQL was exactly that: that it was “No SQL”. So off I went to find out more about it.
WARNING: If you decide to continue reading, you’re doing so at your own risk!! Please don’t take any of the text below as advice. On the other hand, if you have any constructive comments to correct/improve/build upon my approach, I’d love to hear from you and learn from the discussion! So please don’t be shy to help. :)
What’s NoSQL?
I had little idea where to start, so I began looking for information about NoSQL. I soon realized that the subject is very broad and that “NoSQL” is just an umbrella term covering a myriad of different databases, languages, technologies, and so on. It’s hard to define what NoSQL actually is, which I guess is why it’s usually defined in terms of what it’s not.
In general terms, NoSQL databases are management systems that (a) do not use the SQL language, (b) do not fully implement the ACID properties, and (c) generally use a distributed, fault-tolerant architecture. These databases are optimized for efficient data retrieval and append operations, but they lack the features and richness of the well-known RDBMSs we’re used to. They are designed to scale well and transparently, not to offer the wealth of tools and flexibility of a traditional RDBMS.
If you delve into NoSQL land, you can stay there forever and never run out of new things to learn; at least that’s my impression. The number of different technologies and implementations in use is mesmerizing, and you need to focus on something to avoid getting lost in an avalanche of information. And so I did: after some reading, I chose to learn more about Cassandra, a distributed key-value store, and tried to use it to solve the problem at hand.
The reason for the choice wasn’t that Cassandra is the best tool for this particular problem (and, as I learned, it indeed doesn’t seem to be). Rather, I had read about Cassandra before and knew a few people who had used it. I had been interested in it in the past, and this was an excuse to learn more about it. There are other NoSQL technologies, particularly graph databases, that seem better suited to problems like this year’s NoCOUG challenge. That, however, is out of my league for now, so I decided to concentrate on a simpler task.
Cassandra basics
I did some reading on Cassandra concepts and internals to understand how it works and why it is suited to handling large amounts of data. This is a very interesting, and also complex, subject, of which I have merely scratched the surface so far. As a relational database expert, I found it amazing how different it actually is from the techniques used in ACID-compliant databases.
Cassandra is a shared-nothing database that relies on distributed storage to implement fault tolerance. The Cassandra nodes are organized in a token-ring topology, and a replication strategy determines how replicas of the data are distributed across the cluster. Requests from clients are received by a random node of the cluster – the storage proxy. This node determines which nodes of the cluster contain the replicas of the data requested by the client and coordinates reading from and writing to those nodes.
All the data stored in Cassandra is in the form of key-value pairs (think of a row, to use a relational database analogy). The key uniquely identifies the data being stored and allows for fast retrieval. The value is composed of columns containing the actual data; I’ll talk more about this in the Data Model section later in this post. Unlike in an RDBMS, it’s not unusual for a Cassandra row to have hundreds, thousands, or even more columns.
When writing to the database, the client session just needs to wait for two things to happen:
- The change is appended to the CommitLog on disk; and
- The changed data is written to an in-memory table, sorted by key value (MemTable).
These two operations are fast, and that’s one of the reasons Cassandra can handle large append rates. There are no indexes to be updated and no random writes to be performed.
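To make the write path concrete, here is a minimal Python sketch of those two steps. This is purely illustrative – it is not Cassandra code, and the class and file names are made up:

    import json

    class MiniWriteStore:
        """Toy version of Cassandra's write path: append-only commit log plus an in-memory table."""

        def __init__(self, log_path):
            self.log = open(log_path, 'a')    # append-only commit log on disk (sequential writes only)
            self.memtable = {}                # in-memory map of key -> columns

        def write(self, key, columns):
            # Step 1: append the change to the commit log; no random I/O involved.
            self.log.write(json.dumps({'key': key, 'columns': columns}) + '\n')
            self.log.flush()
            # Step 2: apply the change to the memtable; sorting by key happens at flush time here.
            self.memtable.setdefault(key, {}).update(columns)

    store = MiniWriteStore('/tmp/commitlog.txt')
    store.write('217836', {'title': "It's a long way to the top", 'artist': 'AC/DC'})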
Cassandra does not guarantee consistency across all the replicas of the data. What it offers is called “eventual consistency”, meaning that the data is replicated asynchronously and will “eventually” be consistent across all the replicas. The latency between replicas is of the order of milliseconds, and data retrieval usually returns the latest version of the data.
MemTables are effectively a “write-back” cache; when they are “full”, they are written to disk in the form of SSTables (Sorted String Tables – the sorting is done by key). SSTables are immutable: a future update of the data is written to a new SSTable. An SSTable encapsulates three structures: the actual data, a row index containing key-offset pairs, and a bloom filter of the keys. Periodically, Cassandra reads SSTables asynchronously and merges them into bigger SSTables, eliminating old versions of the data and purging the old SSTables.
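Here is a rough sketch of what a memtable flush could look like, producing the sorted data file and the key-offset row index (the bloom filter is left out; there’s a toy version of it below). Again, this is illustrative Python, not Cassandra’s actual file format:

    import json

    def flush_memtable(memtable, data_path):
        # Write the rows sorted by key and record a key -> file offset pair for each one.
        index = {}
        with open(data_path, 'w') as data_file:
            for key in sorted(memtable):          # SSTables are sorted by key
                index[key] = data_file.tell()     # row index entry: key -> offset in the data file
                data_file.write(json.dumps({key: memtable[key]}) + '\n')
        return index  # a real SSTable persists this too, along with a bloom filter of the keys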
Because different versions of the same data can be stored in different SSTables on disk, all the SSTables must be checked when data is retrieved so that the latest version can be identified. When a client requests the data for a given key, Cassandra first looks in the existing MemTables. If the data is not found in memory, it uses the bloom filter of each SSTable to quickly discard the SSTables that can’t contain the wanted data. For the ones that pass the bloom filter test, the row index is retrieved, and if the key is found in the index, the row data is read from the SSTable at the recorded offset. The latest version of the data is then identified and returned to the client.
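The bloom filter is what makes discarding SSTables cheap: it can answer “definitely not here” with certainty and “maybe here” with a small false-positive rate. A toy version in Python, just to show the mechanics:

    import hashlib

    class BloomFilter:
        """Tiny bloom filter: no false negatives, occasional false positives."""

        def __init__(self, size_bits=1024, num_hashes=3):
            self.size = size_bits
            self.num_hashes = num_hashes
            self.bits = [False] * size_bits

        def _positions(self, key):
            # Derive num_hashes bit positions from the key.
            for i in range(self.num_hashes):
                digest = hashlib.md5((str(i) + key).encode()).hexdigest()
                yield int(digest, 16) % self.size

        def add(self, key):
            for pos in self._positions(key):
                self.bits[pos] = True

        def might_contain(self, key):
            # False means the SSTable definitely doesn't hold the key and can be skipped.
            return all(self.bits[pos] for pos in self._positions(key))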
Cassandra keeps keys sorted within each SSTable. The main advantage of this is not for data retrieval (requested by the client), but for the background merging of the SSTables; since each SSTable is sorted, merging two or more of them is very cheap.
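Since every SSTable is already sorted, compaction boils down to merging sorted runs. A sketch, assuming each run is a sorted list of (key, timestamp, columns) tuples where a higher timestamp means a newer version (real compaction reconciles at the column level; this simplifies to whole rows):

    import heapq

    def compact(*sstables):
        # Merge the sorted runs and keep only the newest version of each key.
        merged = {}
        for key, timestamp, columns in heapq.merge(*sstables):
            current = merged.get(key)
            if current is None or timestamp > current[0]:
                merged[key] = (timestamp, columns)
        return [(key, ts, cols) for key, (ts, cols) in sorted(merged.items())]

    old = [('a', 1, {'x': 1}), ('b', 1, {'y': 2})]
    new = [('a', 2, {'x': 9})]                     # a newer version of key 'a'
    print(compact(old, new))                       # 'a' keeps only the timestamp-2 version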
When a key-value pair is written to the database, a partitioning algorithm determines which node of the cluster the first replica will be written to. By default, the RandomPartitioner is used. This partitioner applies a uniform hashing function to the key to choose the node where the data will first be stored, which guarantees a balanced distribution of the data across the cluster nodes. However, key-range queries cannot be performed with this partitioning scheme, since consecutive keys may be stored on different nodes.
To allow range scans, Cassandra can be configured with a different partitioner: the OrderPreservingPartitioner. When using this partitioner, though, distributing the data evenly across the nodes can become a challenge and needs to be addressed with care.
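To illustrate the difference between the two schemes, here is a sketch using a hypothetical four-node ring. The node names and token layout are made up, although Cassandra’s RandomPartitioner really does use an MD5-based token space:

    import bisect
    import hashlib

    # Hypothetical ring: each node owns the token range that ends at its token.
    ring = [(2**126, 'node1'), (2**127, 'node2'), (3 * 2**126, 'node3'), (2**128, 'node4')]
    tokens = [token for token, node in ring]

    def random_partitioner(key):
        # Hash the key uniformly onto the ring: balanced load, but no range scans,
        # since consecutive keys land on unrelated nodes.
        token = int(hashlib.md5(key.encode()).hexdigest(), 16)
        return ring[bisect.bisect_left(tokens, token)][1]

    def order_preserving_partitioner(key):
        # Use the key bytes themselves as the token: range scans work,
        # but skewed key distributions skew the data across the nodes.
        token = int.from_bytes(key.encode()[:16].ljust(16, b'\x00'), 'big')
        return ring[bisect.bisect_left(tokens, token)][1]

    print(random_partitioner('217836'), order_preserving_partitioner('217836'))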
This is just a glimpse of what Cassandra does behind the scenes. More details can be found on the Internet, and I’ve added a few of the links I liked as references at the end of this post.
Data model
The next thing to understand was how data is modeled in Cassandra. I had a very basic idea of what a key-value store was and needed more information before trying to build anything with it.
Initially, the concepts seemed analogous to those of relational databases. Cassandra stores key-value pairs in keyspaces, which are similar to schemas in a relational database. Within each keyspace, the key-value pairs are organised into column families. Using the relational database analogy, a column family can be compared to a relational table and a key-value pair to a row in the table. The key uniquely identifies the row within the column family and can be up to 64KB in size.
column_family = {column_family_name: {row [, row...]}}
row = {key: value}
The value part of the key-value pair contains the actual data, stored in the form of columns. Columns in Cassandra are represented by a pair {column name: column value}. The type that defines how column names are sorted is known as the comparator in Cassandra’s terminology.
value = {column [, column...]}
column = {column_name: column_value}
The example below shows the column family Songs, which contains one song per row. A random key has been chosen for each song; the key is arbitrary, though, and the data designer can choose to use meaningful values instead of random ones.
{Songs:
    {217836: {title: 'It\'s a long way to the top'}
             {artist: 'AC/DC'}
             {release: 1975}}
    {941723: {title: 'Down under'}
             {artist: 'Men At Work'}
             {release: 1982}}
    ...
}
Column families don’t need to have a static structure like tables in most RDBMSs do. Also, different rows in a column family don’t need to have the same columns. Client applications can dynamically add columns to a row at runtime.
Column names can also be up to 64KB in size. Since the names are arbitrary and columns can vary from row to row, a valid modelling technique in Cassandra is to use column names to store actual data. Column values can hold up to 2GB. A single row in a column family can have up to 2 billion columns, even though far fewer than that are usually used in practice.
The example below shows rows with different sets of columns:
{Songs:
    {217836: {title: 'It\'s a long way to the top'}
             {release: 1975}
             {genre: 'hard rock, blues rock'}}
    {941723: {artist: 'Men At Work'}
             {label: 'Columbia'}
             {title: 'Down under'}}
    ...
}
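As an illustration of how a client application might do this, here is a hedged sketch using the pycassa Python library. The 'music' keyspace, the 'Songs' column family, and the server address are my assumptions; both keyspace and column family would need to exist already:

    import pycassa

    pool = pycassa.ConnectionPool('music', server_list=['localhost:9160'])
    songs = pycassa.ColumnFamily(pool, 'Songs')

    # Rows don't need to share the same set of columns...
    songs.insert('217836', {'title': "It's a long way to the top", 'release': '1975'})
    songs.insert('941723', {'artist': 'Men At Work', 'label': 'Columbia'})

    # ...and new columns can be added to an existing row at any time.
    songs.insert('941723', {'title': 'Down under'})

    print(songs.get('941723'))   # {'artist': 'Men At Work', 'label': 'Columbia', 'title': 'Down under'}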
The Cassandra model also has the concept of Super Columns. Super columns are named collections of columns and could be represented like this:
super_column = {super_column_name: {column [, column...]}}
Column families can be of two types: Standard and Super. The Standard column families are the ones described previously, which are collections of columns. The Super column families are similar, but contain super columns instead of columns. We can represent them as:
super_column_family = {super_column_family_name: {super_column [, super_column...]}}
We could choose to redesign our Songs column family as the following Super column family:
{Songs:
    {217836: {title: {english: 'It\'s a long way to the top'}}
             {release: {year: 1975}
                       {month: 'October'}}
             {genre: {'hard rock', }
                     {'blues rock', }}
             {artist: {lead_guitar: 'Angus Young'}
                      {rhythm_guitar: 'Malcolm Young'}
                      {bass: 'Cliff Williams'}
                      {lead_vocals: 'Brian Johnson'}
                      {drums: 'Phil Rudd'}}}
    ...
}
The super columns in the example above are title, release, genre, and artist. Each of them contains a set of one or more columns. Notice that the columns in the genre super column use the genre description as the column name and don’t have any value associated with them. As explained earlier, this is a valid modelling technique when using Cassandra. For more information about super columns and their limitations, see the links in the References section.
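If I understand the client side correctly, a Super column family is manipulated with nested dictionaries. Continuing the hedged pycassa sketch (same assumed keyspace, and 'Songs' would have to be defined as a Super column family for this to work):

    import pycassa

    pool = pycassa.ConnectionPool('music', server_list=['localhost:9160'])
    songs = pycassa.ColumnFamily(pool, 'Songs')   # assumed to be a Super column family

    # Inserts take {super_column: {column: value}} dictionaries.
    songs.insert('217836', {
        'release': {'year': '1975', 'month': 'October'},
        'artist': {'lead_guitar': 'Angus Young', 'lead_vocals': 'Brian Johnson'},
    })

    # Retrieve just one super column of the row.
    print(songs.get('217836', super_column='release'))   # {'month': 'October', 'year': '1975'}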
Cassandra also has some other special types of columns, like Composite Columns and Counters. I won’t talk about those now, but the links in the References section will give you good insight into them, along with important information to consider when modelling data for Cassandra.
To be continued…
I know that I haven’t even started to talk about the title of this post: my actual practical experience using Cassandra. I’ll get there, but I had to cover some basics first for the benefit of beginners like me.
In the next post I’ll start to cover my attempt to model a problem using the Cassandra model and store and query data using this model.
Many thanks to Gwen Shapira for some great feedback and example ideas!
References
Please find below some of the articles or documentation I found most useful:
Cassandra Wiki – This is a good source for a lot of reference information.
SSTable and Log Structured Storage: LevelDB (by Ilya Grigorik)
Cassandra Internals – Reading (by Mike Perham)
Cassandra Internals – Writing (by Mike Perham)
DataStax Cassandra Documentation