“Dynamo: Amazon’s Highly Available Key-value Store” is a high level description of a data store, written by Amazon to solve the problem of a system where updates must never ever fail and must take less than a specific amout of time in 99.99% of the cases. No matter what happens to the servers or the network, updates to the system must continue as usual, and they emphasize that they deal with hardware and network failures nearly constantly.
The paper has one of the best descriptions on the trade-offs involved in eventual consistency, and when it makes sense. But even more interesting is the implicit decision that disks, commits and synchronous writes to redo logs are not really needed for durability.
To allow for simple design, robustness and high performance, Amazon limited the reporting capabilities – data is accessed by primary key only. To implement the high availability requirement, Dynamo replicates each key-value pair between multiple machines in different data stores. And to make sure that replication doesn’t impact performance, Amazon decided they can live without consistency – requests from the system can sometimes return old data or several contradicting versions of data – for Amazon’s requirements, this is much better than failure or delays.
The consistency trade-off is very explicitly dealt with in the paper – there will be multiple versions of the data, a request may get any number of replies and may not get the most recent updates. The application developers have to build applications that can handle this situation and merge several versions of the data into the information they want.
The trade-off of durability is done implicitly. The paper simply does not mention writing anything to any disk. Nothing like redo logs is mentioned either (except in the context of tracking changes for a missing node). This is obviously very fast, but can we really trust this system not to lose any changes?
To avoid losing data, the system uses a quorum-like system. There are 3 important parameters to tune for the replication: How many times each key-value pair should be replicated (N), how many nodes should participate in a successful write (W) and how many nodes should participate in successful read (R).
When W=3, at least 3 nodes should signal that the write was successful before the application gets an “ack” from the data store. A coordinator node will attempt to write to N nodes, but once W nodes approve the write, it is considered permanent. In order to lose an update we need at least W nodes to crash simultaneously after an update and probably more because W is just the lower bound. As long as at least one of those nodes is up, the data is not lost.
The idea is to set the replication level to a number high enough to satisfy durability requirements, but low enough to prevent too much latency. You also want to split those nodes between at least two data centers to reduce the chance of losing all nodes at once. Considering that this system doesn’t allow for backups, I would set the number of replicated nodes really high. I would expect about 50 nodes covering each piece of data before I’d consider it as durable as disk bases system (considering how often servers crash vs. how often disks crash). Amazon, on the other hand, mentioned in the paper that they use replication level of 3, and they keep stressing that not losing a single update is vital to their business (because each update can be a sell of an item), so maybe I’m under-estimating the durability of this system.
From the description, it also appears that it is impossible to dump the contents of the system. This means that backups are impossible, no point-in-time-recovery, and ETL into a DW system also sounds impossible.
I like the idea of databases without disks – disks are notorious for just slowing everything down. I’m just not sure I would want my bank to use a replication-only system, even if it does seem durable in theory. So, if we can’t use Dynamo for banks, what is it good for?
Amazon’s paper gives Amazon shopping cart as an example – The data is temporary anyway, there are many more writes to the system than reads (you keep puting items in the cart, but only look at the cart once before checking out) and there is a clear rule on how to merge conflicting versions (union). Social Networks is another obvious use-case, with tons of updates and trivial way to merge different version, I’m not sure if access is limited to just primary key though – you want to see your own data and that of your friends.
I think that an interesting possible use case is for system monitoring data – it is also update heavy, can be mostly primary-key access (metric, such as CPU on specific machine, being the key), and merging versions is trivial (union). Monitoring data does need to be reported, so I would expect my data store to allow dumping the data into a more traditional store in specific intervals.
The Dynamo paper was written four years ago, I’m sure the system improved since. I’m also sure that the paper gives only a partial description of the features of the system (and I suspect that they do write data to disks after all). Still, it paved way for a future of special purpose databases , ones that solve a specific use-case better than a general purpose RDBMS.
If you ask yourself “What are my requirements?” and “What is the best data store for my requirements?” instead of saying “We have Oracle, lets throw everything there and it will take care of everything magically”, your application will be better for it. As Cary Millsap recently wrote – it is better to think clearly than to be correct.