I’m giving a NoSQL overview presentation at MOTS in two days. The presentation is somewhat misleadingly titled “Deep Dive” – it is deep enough in the sense that I go into specific requirements, algorithms and use-cases. It is not nearly deep enough for my tastes, so I’ll have to come up with a follow up.
The conference organizers didn’t ask me for a white paper, so I thought I could get away without writing one. But it turns out that I don’t really feel that I can speak confidently about a topic if I didn’t write the paper. I wonder if other presenters have this addiction too :)
So I’m posting the paper here for your enjoyment. Feedback is very welcome.
To discuss NoSQL as a category, one needs to define the category. Which databases are considered NoSQL and which are not. Unfortunately, NoSQL is a bit like Cloud Computing – its a marketing term. Carefully constructed to avoid saying anything too concrete.
NoSQL normally refer to data stores that avoid the relational model, they use other data models that will be discussed in more details later. Lacking relations, join operations will not be possible in these databases. In fact, many architects will say something like “We couldn’t use join because it was too slow, so we decided to migrate to NoSQL database”.
NoSQL databases are mostly distributed, and much of the design of NoSQL databases revolves around the various difficulties involved in managing a distributed database. The data in NoSQL databases is usually replicated between the nodes to provide high availability and reliability.
ACID is usually not supported – the usual claim is that it is the required trade-off when working with distributed databases, but even non-distributed NoSQL servers don’t support ACID. It is just assumed not to be a requirement.
How do you know whether you should or shouldn’t consider using NoSQL database?
NoSQL seems to be a better match for some companies than to others – Many of the problems that NoSQL databases attempt to solve arise from the lack of money. If you can afford an Exadata server, there is usually less reason to consider a NoSQL solution. Other problems are a matter of scale – if your business model demands that you’ll have to scale to 250M users with little budget, you should consider NoSQL.
Since NoSQL databases don’t support transactions, joins or consistency, it should be clear that the application developers must do much more infrastructure developement than those working with RDBMS. If you plan to use NoSQL, your developers must be capable and willing to do the extra work. Generally, you want to be sure that they are as good as those working for Oracle.
Aside from the company traits, there are also a class of problems that don’t require RDBMS. If you always access the data by primary key only (Shopping Cart), if you have no reason to join data, if you write often and continuously but the selects always run as offline batch processes (Monitoring, people you may know, search), if the data model is just a single set of items (word completion) – All those are problems that don’t necessarily require a relational database and other data models and solutions can be considered.
The main thing to keep in mind when thinking about NoSQL databases is that these databases were designed to provide high-end performance and availability for companies that can only afford slow and unreliable hardware.
The data sizes are normally very large. 30TB is the low end and up to several PB. Lacking funds for SAN, the data is usually kept on commodity disks – now selling for 50$ per TB at Fry’s. Since the data size is usually too large to be stored completely on a single server, the NoSQL databases use partitioning to divide the data between the servers.
High availability is usually the most important requirement. Amazon built their NoSQL database Dynamo demanding that “add to cart” operation must never ever fail – because this represents a lost sale – a specific number of dollars they did not get. For the same reason these systems don’t tolerate any data loss. The assumption is that the cheap commodity hardware – servers, disks, network components will all fail. Since there will be many of them, some components will be down at any given time. In order to keep the system functioning during those failures, data will be replicated across servers, racks and data centres – so it will be available in at least one location at any given time.
Performance is also a big consideration. Most web companies worry about latency – if a single transaction takes more than X seconds, the user may go away and the sale will be lost. So the performance requirement is usually to have 99% of the transactions take less than X seconds. Since this has to be done on cheap disks with high seek times, the strategy is to avoid the disk as much as possible and to avoid random IO at all costs.
How is this done?
I collected some of the more interesting algorithms used in NoSQL databases to support the requirements mentioned above.
Traditional hashing solutions have a limitation of requiring a complete rehash of the entire data set when servers are added or removed. This is obviously a big limitation with large data sets. Consistent hashing is an algorithm that only requires rehashing of very small portion of the keys, and is flexible enough to support other requirements such as replication and load balancing.
The idea is that you map all the keys to points on the edge of a circle. You also map your servers to the same space. Given a key, you find the server holding the key by scanning the circle clockwise. This is usually implemented with a mapping table, so its rather efficient.
When a server is added, it “takes over” a portion of the circle between itself and the nearest neighbor – only these keys need to be rehashed and copied. The opposite happens when a server is removed.
If a server is overly loaded, it can gradually move closer to its neighbor, shedding keys in the process until the load is balanced again.
If replication is required, each server holds a copy not only of the keys in its own slice, but also of the keys in the N slices past it. This also means that we can place servers in different racks next to each other, making replication more meaningful.
When keys are replicated N times, we normally require that at least R nodes will participate in each read and W nodes in each write. This allows us to balance consistency, reliability and latency. If R+W are larger than N, our read will always include the last value written.
Since we have a cluster of distributed servers, we need to manage membership – which servers are currently in the cluster. This can be done via a “master node” – but then this node will be a point of contention and a single point of failure.
Alternatively – the gossip model can be used. Nodes that receive updated information about membership will communicate the new information to other nodes – communication is always between pairs, is done in discrete steps and the recipient is selected at random. This is surprisingly efficient, provided that the frequency of communication is low enough to avoid overhead and within very few communication steps the entire cluster will be notified.
The CAP theorem says – you can only have 2 out of 3: Consistency, Availability and Partition Tolerance.
Consistency in this context is different than what the C in ACID refers to. In ACID terms, consistency means “Not breaking constraints”. NoSQL databases don’t have constraints and they use consistency to mean “concurrent selects see the same data” and “consecutive selects see data in sequentially correct order”.
Availability means not failing transactions and Partition Tolerance means giving usable results even when not all servers can communicate with each other. Due to strict availability requirements and the unreliable networks involved, many NoSQL databases decided to forgo consistency and instead support Eventual Consistency.
Eventual Consistency means that all nodes will receive the latest updates eventually. In other words, if you insert data, you’ll eventually see it again. This is the system used when registering new website. You register with a specific DNS – all other DNSs will be updated with the new information within 24 hours (eventually) but not immediately.
Having eventually consistent database means that when selecting data, the application will occasionally receive multiple versions of the data it requested. The data will arrive with “vector clocks” which showing the update history of the data. Every time a node updates the data, it adds its identifier to the vector clock. By comparing the vector clocks of the different versions, the latest version can be detected and also the existence of conflicting versions.
The application should know how to merge conflicting versions – usually by combining the values in some way to avoid losing data. In some applications such as shopping basket and monitoring systems, merging data is relatively trivial. Other applications, such as source control, don’t allow for automatic conflict resolution.
If the problem you are solving does allow for automated conflict resolution – you shouldn’t use an eventually consistent data store.
BigTable Data Model:
This is Google’s version of semi-structured table. Several other NoSQL databases are using a similar model, so understanding how BigTable models data will help you understand other semi-structured models.
The BigTable has rows, each identified by a row key (think primary key) and the data is stored sorted by this key. The columns are arranged by “column family”. The column families in the table are defined when defining the table, but the number or type of columns in the family is unlimited. The intersection of each row and column is called a cell. If you have a row for each web page, you can have a column family of all pages linking to that page, and a column for each specific page. This way different rows will have different columns in this column family and the data stored in each cell will be the text used when linking from the column page to the row page.
In addition to all this, each cell can store multiple versions of its data, each with a timestamp. A monitoring system will have a server as the row identifier, and in the “cpu” column it will have the cpu utilization measured for that server at specific times. When defining the table you need to specific how long you will keep data in each cell, and there is a background process cleaning the data up.
Because of the multi-version model, BigTable has only inserts, no updates. To avoid random IO, inserts are written to a “redo log” and are also applied to a version of the table in memory. Every once in a while the contents of the memory table are written to disk in a format called SSTable. There are background processes responsible for organizing the data once it is on disk in a process called compaction. The goal is to keep the data sorted by row key and column families together. The data model and physical models are optimized for writes, so each select has to go through multiple tables in memory and on disk to retrieve the contents of the cell. Bloom filters are used to detect which SSTable contains data for which row.
MapReduce is a method for aggregating data in a distributed system. It can work with any data source (text files, images, XML, Oracle), and many NoSQL databases use it for off line reporting, ETL or index creation.
The aggregation operation is split into a Map part, which extracts the needed information from the data source (for example number of comments from a blog post) and Reduce part which aggregates the information (for example – summing the data). The map and reduce run on each node in parallel, and then data is sent to “super nodes” for farther aggregation.
It works in theory, but will it work in practice?
When evaluating competing NoSQL solutions for use in production, there are several points to consider:
Most NoSQL solutions are 2-4 years old. This means that many of them are not as feature complete as we’d like them to be, they have many bugs, and often they don’t scale as well as their developers would like to believe.
Most of these databases are open-source and don’t have a solid support model except to ask for help in the mailing list. While I see this as a great opportunity for Pythian, this is still a problem when you attempt to deploy these databases in production.
Because NoSQL databases are so new, have no support and are built using well known algorithms, many companies choose to build their own NoSQL databases.
NoSQL databases don’t support SQL as a method for retrieving data from the data store. Instead they offer APIs. These APIs usually arrive with Java libraries and some more exotic languages (Ruby, Erlang, Python). You need to be sure that your favourite language is supported. This usually reduces the number of ad-hoc queries running in the system.
Some of the NoSQL databases offer an ad-hoc CLI where you can call built-in procedures and get your data in a somewhat readable format. Depending on the data model, you can get a value by its key or you can perform range scans. Joins, filters and aggregation should be done in the application code. Some NoSQL databases support exports of the data, and some allow reporting and ETL using MapReduce.
NoSQL databases make amazing performance claims as a rule. They sometimes use benchmarks to prove their claims. Most of these benchmarks are poorly planned, executed and reported. Mistakes such as testing insignificant loads, non-representative load distributions and forgetting to report about the specific configuration tested (did the database write to disk during the benchmark?) are common. If you seriously considering use of NoSQL databases – write your own benchmark.
You want to inquire what are the usual bottlenecks of the database – IO, memory, CPU, network? So you’ll know if it matches your requirement. You want to check if it was tuned for reads or writes, latency or throughput. And you definitely want to check how node failures affect the performance – some NoSQL databases have excellent latency, but will take hours to regain usual performance following node failure.
If you need high availability, make sure the system you chose doesn’t have a single point of failure, usually in the form of a “master node”.
Make sure that data is re-replicated to maintain replication levels following a crash.
Check how you can monitor the system to detect failures and to know when the system doesn’t recover as expected.
Be sure you can get data in and out of the database in a way that satisfies your needs. That you understand the data model and can model your data to use it in an efficient way. That you know how to develop your application to support its consistency model. That you know how to lock data if needed, and how to avoid locking data when you’d rather not.
You need to know how to backup and recover the system. Check if it supports hot backups and point in time recovery. Figure out what to do when you discover that it does not. Check that you can add and remove nodes transparently without disrupting operations. If required – make sure its consistency and replication models allow it to be deployed across data centers and make sure the replication is rack and data center aware.
NoSQL databases are a good solution for a specific set of problems. Before choosing NoSQL database make sure you understand the application you are writing and its requirements. Be sure that NoSQL model will work, that it provides value and that you can live with the trade-offs.
Make sure you understand the specific NoSQL database that you chose and that it fits your requirements. Make sure you know how to model your data in a way that uses this database in the best way and that you can deploy it in operations without losing sleep. Do your own tests – don’t believe blog posts and benchmarks.
Deployed correctly, NoSQL databases are an efficient and reliable solution for a set of non-relational applications. Chosen for the wrong problems or the wrong environment, they will make your life very difficult indeed.