My thoughts on the resilience of Cassandra

Posted in: Cassandra, Technical Track
Cassandra

This blog is a part 1 of a 2 in a series. This will be different from my previous blogs, as this is more about some decisions you can make with Cassandra regarding the resilience of your system. I will talk deeply about this topic in the upcoming Datastax Days in London, this is more of an introduction!

TL;DR: Cassandra is tough!

DataStax describes Cassandra as delivering “…continuous availability, linear scalability , and operational simplicity across many commodity servers with no single point of failure, along with a powerful data model designed for maximum flexibility and fast response times“. In a production system, having your persistence layer failure tolerant is a big thing. Even more so when you can make it resilient to full locations failure through geographic replication (and easily).

As in any production system you need to plan for failure. Should we blindly trust in Cassandra resilience and forget about the plan because “Cassandra can handle it”? By reading the documentation, some may think that by having several data centers and a high enough replication factor we are covered. In part this is true. Cassandra will handle servers down, even a full DC (or several!) down. But, anyway, you should always prepare for chaos! Failure will increase pressure on your remaining servers, latency will increase, etc. And when things get up again, will it just work? Getting all data in sync, are you ready for that? Did you forgot about gc_grace_seconds? There are lots of variables and small details that can be forgotten if you don’t plan ahead. And then in the middle of a problem, it will not help having those details forgotten!

My experience tells me that you must take Cassandra failures seriously, and plan for them! Having a B plan is never a bad thing, and a C even. Also, make sure those plans work! So for this short introduction I will leave a couple of recommendations:

  • Test your system against Cassandra delivering a bad service (timeouts, high latency, etc).
  • Set a “bare minimum” for your system to work (how low can we go on consistency, for example).
  • Test not only your system going down, but also prepare for the coming up!
  • Keep calm! Cassandra will help you!

Overall, Cassandra is a tough and robust system. I’ve had major problems with network, storage, Cassandra itself, etc. And in the end Cassandra not only survived, it gave me no downtime. But with every problem I had, it increased my knowledge and awareness of what I could expect. This lead to planning for major problems (which did happen) and this combined with the natural resilience of Cassandra made me go through those events without downtime.

Fell free to comment/discuss about it, in the comment section below! Juicy details will be left for London!

email

Author

Want to talk with an expert? Schedule a call with our team to get the conversation started.

About the Author

Carlos Rolo is a Datastax Certified Cassandra Architect, and has deep expertise with distributed architecture technologies. Carlos is driven by challenge, and enjoys the opportunities to discover new things and new ways of learning that come with working at Pythian. He has become known and trusted by customers and colleagues for his ability to understand complex problems, and to work well under pressure. He prides himself on being a tenacious problem solver, while remaining a calm and positive presence on any team. When Carlos isn’t working he can be found playing water polo or enjoying the his local community. Carlos holds a Bachelor of Electro-technical Engineering, and a Master of Control Systems and Automation.

No comments

Leave a Reply

Your email address will not be published. Required fields are marked *