FourSquare, the location-based social network, suffered an extended outage yesterday. They explained the causes in a blog post, which sparked much discussion around the web.
Here’s the gist of the analysis: FourSquare uses MongoDB, a sharded database. Data is split between nodes based on a shard key, usually the user ID or something similar. One of the shards became overloaded. After failing to resolve the issue in other ways, FourSquare decided to add another shard to share the load. This caused the entire cluster to fail.
It is easy, and many have done it, to point the finger and blame MongoDB for the mess – after all, it is likely their bug that caused the entire cluster to fail when adding a shard.
I think the root cause is that sharding is inherently difficult. Although many NoSQL databases market themselves as the ultimate solution, having a silver bullet you don’t completely understand is actually more dangerous than solving the problem yourself. This is why many companies that succeed with a NoSQL solution do so by employing one of the solution’s developers.
I’ve seen a lot of shards in the last 10 years. I’ve worked for a SaaS provider, I have a social network provider as a customer, and my husband works for another social network provider. Here are a few lessons that may help you avoid the more obvious mistakes:
- Once the system is overloaded, it is too late to add shards, because re-balancing the load between the shards creates more load, and usually lots of locking is involved. Your planning and design must include another way of reducing load on a database server, and of course a way of recognizing load growth and adding shards before it is too late.
- It is better to design the system with a lot of queues, connection pools and caches between the app servers and the databases. Contrary to popular belief, app servers can be better than databases at handling excessive load.
A database is a shared resource, with lots of locks involved, which can cause even mild contention to escalate into a complete hang. A database also can’t prioritize requests or serve partial data.
If you can control the load on the app server – using queues, connection pools and caches – you can prioritize the right requests and decide what feedback to give your users. Graceful degradation is all about handling excess requests in a smart way that only the application is capable of doing.
Doing this also means you can add database shards, because the database does not get too loaded.
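As a minimal sketch of the idea (all names here are illustrative, not from any particular framework): the app caps the number of queries in flight against a shard, and anything beyond the cap immediately takes a degraded path instead of queueing up on the database.

```python
import threading

class DbGate:
    """Bounded admission gate in front of a database shard (illustrative).

    At most `max_in_flight` queries run at once; anything beyond that is
    rejected immediately so the app can serve a degraded response instead
    of letting the database drown in lock contention.
    """
    def __init__(self, max_in_flight):
        self._slots = threading.BoundedSemaphore(max_in_flight)

    def try_acquire(self):
        # Non-blocking: returns False instead of waiting when the shard is busy.
        return self._slots.acquire(blocking=False)

    def release(self):
        self._slots.release()

def handle_request(gate, run_query, fallback):
    """Run the query if a slot is free, otherwise degrade gracefully."""
    if not gate.try_acquire():
        return fallback()  # e.g. cached or partial data, or a friendly error
    try:
        return run_query()
    finally:
        gate.release()
```

The point is that the rejection decision happens in the application, which knows which requests matter and what a reasonable fallback looks like – the database can do neither.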
- Capacity planning is probably the biggest problem in sharding. The planning is necessary because once a shard is overloaded, it’s too late. I’ve seen two ways of splitting data between shards, and each has its own planning problems:
You can add shards sequentially: start with one server; when it hits 70% load, add another and start creating new users/customers/whatevers on the new server. When that one gets to 70% load, add another. The upside is that very little planning is needed. The downside is that you can easily get burned by various usage patterns. Are new users more or less active than existing users? What if one of the existing users becomes much more active all of a sudden?
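The sequential approach is simple enough to sketch in a few lines (the shard names and the `load_of` callback are made up for the example):

```python
class SequentialSharder:
    """Create new users on the newest shard; add a shard past a load threshold.

    `load_of` is a callable reporting current load (0.0-1.0) for a shard name.
    Note the weakness: existing assignments never change, so a shard whose
    users become more active has no relief valve here.
    """
    def __init__(self, load_of, threshold=0.70):
        self.shards = ["shard-0"]
        self.load_of = load_of
        self.threshold = threshold
        self.assignments = {}  # user_id -> shard, fixed at creation time

    def assign(self, user_id):
        newest = self.shards[-1]
        if self.load_of(newest) >= self.threshold:
            # The newest shard is filling up: provision another one.
            newest = "shard-%d" % len(self.shards)
            self.shards.append(newest)
        self.assignments[user_id] = newest
        return newest
```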
Another way is to use consistent hashing – you generate a random key for each user and use it to distribute users between your existing servers. When adding a new server, you move a few users from each existing server to the new one to rebalance the load. The upside is that you can take load off existing servers; the downside is that adding a server causes anywhere from a few minutes to a few hours of high response-time variability on all existing servers. Also, taking an equal portion of the load from each server is more difficult than it sounds.
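A bare-bones consistent-hash ring, to make the rebalancing property concrete (virtual-node count and naming are arbitrary choices for the sketch): when a server is added, only the keys that fall into its slices move, and every existing server gives up roughly an equal share.

```python
import hashlib
from bisect import bisect, insort

def _hash(key):
    # Any stable hash works; sha256 keeps the sketch deterministic.
    return int(hashlib.sha256(key.encode()).hexdigest(), 16)

class HashRing:
    """Minimal consistent-hash ring with virtual nodes."""
    def __init__(self, servers=(), vnodes=100):
        self.vnodes = vnodes
        self._points = []   # sorted hash points on the ring
        self._owner = {}    # point -> server name
        for s in servers:
            self.add(s)

    def add(self, server):
        # Each server gets `vnodes` points so load spreads roughly evenly.
        for i in range(self.vnodes):
            p = _hash("%s#%d" % (server, i))
            insort(self._points, p)
            self._owner[p] = server

    def lookup(self, key):
        # A key belongs to the first server point clockwise from its hash.
        i = bisect(self._points, _hash(key)) % len(self._points)
        return self._owner[self._points[i]]
```

The consistency property – that adding a server only moves keys *to* the new server, never between old ones – is exactly what limits the amount of data that has to be copied during a rebalance.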
- You sharded the DB, but what about your app servers?
You can decide that all app servers will serve all traffic, connecting to the different shards as needed. This is usually the right thing to do in terms of app load balancing and redundancy.
But your app will need to know how to handle one failing shard without impacting other customers. The other downside is that I’ve seen the network traffic this generates overwhelm the LAN, causing performance issues all over the cluster.
The alternative is to have a small group of app servers dedicated to each shard. This is suboptimal in terms of resource usage, but can be easier to manage. Issues with one shard will only impact a very well-defined subset of your product.
Regardless, the app must know how to handle users that move from one shard to another.
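One common way to make moves possible is a per-user directory that the routing logic consults before falling back to the default placement. A sketch, with invented names – a real system would back the directory with a replicated store and cache lookups aggressively:

```python
class ShardDirectory:
    """Per-user shard directory so individual users can be moved.

    `default` is a callable giving a user's normally computed shard
    (sequential assignment, hash ring, etc.); `overrides` records users
    that have been migrated away from it.
    """
    def __init__(self, default):
        self.default = default
        self.overrides = {}  # user_id -> shard, set on migration

    def shard_for(self, user_id):
        return self.overrides.get(user_id, self.default(user_id))

    def move_user(self, user_id, new_shard):
        # After the user's data has been copied, flip the pointer;
        # from then on all routing for this user hits the new shard.
        self.overrides[user_id] = new_shard
```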
- If your app does much more reading than writing, having a few read-only replicas for each shard is a good way to further control load.
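The routing rule for this is tiny – writes always go to the shard’s primary, reads spread across its replicas. A sketch with placeholder connection labels standing in for real client handles:

```python
import random

class ShardConnections:
    """Route reads to replicas and writes to the primary for one shard.

    The strings here stand in for real connection handles; a real pool
    would also health-check replicas and handle replication lag.
    """
    def __init__(self, primary, replicas):
        self.primary = primary
        self.replicas = list(replicas)

    def for_query(self, is_write):
        if is_write or not self.replicas:
            return self.primary              # writes always hit the primary
        return random.choice(self.replicas)  # spread reads across replicas
```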
- Regardless of how you do the sharding, you will want a way to manually rebalance servers. Sometimes you *know* that a load issue will be resolved by moving a specific group of users to a new server. It will certainly happen. Make sure you have tools to do it.
Good luck sharding and don’t believe anyone who tells you their tool makes it easy.
One of the most useful things we ever did at WebTV was require all customer-facing applications to support database-read-only mode with the ability to fail over to a secondary server. Not only did this significantly reduce the impact of outages, it reduced the visibility of planned maintenance work to the point that our customers often didn’t notice major back-end service upgrades.
For the big Christmas Day load, we could partition the service so that existing customers were pointed to several different read-only replicas, and the main database was reserved for new subscribers.
Mandatory thinking for everyone who tries to design (too-)large systems.
Your Item nr 2 is key. The (ACID, shared) database should be the best protected part of the system. And the applications, the queues and the connection-pools should all be designed/deployed to protect that database from overload.
BTW: Re-sharding (in RAC: re-mastering) is essentially what causes the freeze and brownout periods when cluster nodes join/leave. And every clustered (sharded) system will display this behaviour in some form. Which is probably why the number of shared (sharded) resources in a cluster should be kept to a minimum.
I think that you can’t do range queries if the database uses consistent hashing, can you?
MongoDB supports range queries on the shard key.
Comment on item 2 – some databases actually do support workload prioritization to some extent (like Resource Manager in Oracle). The problem is that it’s usually very difficult to pass workload classification information from the application down to the database, and it’s far from being a purely technical challenge.
[…] is lunacy. This kind of lunacy used to be too-common with MongoDB adopters. I wrote two blog posts (No silver bullet and Difficulty of migrations) just ranting about this kind of optimism in MongoDB […]