Keeping Cassandra alive

Troubleshooting Cassandra Under Pressure

This is the second blog post in the series, and it is a bit more technical than the first one. I will explain some of the things that can be done to keep a cluster/server running when you are having problems in that cluster.

There have been a lot of changes in Cassandra over the last 4 years (from 0.8 to 2.1), so I will refrain from discussing troubleshooting problems that affect only specific versions. Also, this is the kind of troubleshooting you need when you can’t simply “add a node”.

Why can’t I just add a node? Well, if you aren’t on vnodes and you didn’t pre-calculate the token ranges, adding a node is a big effort. Other constraints may also apply, like budget or delivery time for hardware (if you are on bare metal), plus rack capacity, power constraints, etc…

Now you may say:

“Ok, we can’t add a node! What should we do?! We have a storm coming!”

So, I have navigated through that kind of storm, and it’s not an easy task, but it is doable! First, you have to know what you have; that is critical! You also need to know where you can afford to take more damage.

Let’s assume you have the following situation, and go through what I recommend for it:

  • Heavy Write Cluster, Low Read

Now let’s define “storm”: a storm is not when Cassandra fails; it’s an unanticipated load increase or a disaster. What happens is that you have more load than your planned capacity (either because of node failures or because of a sudden traffic increase). This will increase your resource usage to a point where your machines will start to die.

Let’s understand what can cause a Cassandra process to die, and probably the machine with it (if you OOM and you didn’t configure swap… I warned you!), in the scenario described above.

  1. More data to the commitlog = more I/O pressure (disregard this if you have the commitlog on a different HDD)
  2. Data is written to memtables = memory is used
  3. Memtables reach their thresholds faster and get flushed to disk = I/O pressure
  4. Compaction starts sooner and more frequently = I/O pressure, CPU pressure
  5. With too much I/O, compaction can’t compact fast enough and the memtables aren’t flushing fast enough = memory is not being released
  6. Too much memory usage, so the JVM triggers GC more frequently = CPU pressure
  7. The JVM can’t release memory = OOM
  8. OOM = PUM! The node dies (if you are “lucky”, the kernel will just kill Cassandra)
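
If you want to see how far down that cascade a node already is, the standard nodetool commands give you most of the picture. A minimal sketch (the log path is the usual package-install default and may differ on your machines):

    # thread pool stats: pending/blocked flushes and mutations
    nodetool tpstats

    # how far behind compaction is (pending tasks and bytes)
    nodetool compactionstats

    # GC pressure is logged by the GCInspector; the path varies by install
    grep -i GCInspector /var/log/cassandra/system.log | tail -20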

And I didn’t even go through the hints that would be stored as nodes became unresponsive, and then sent out once they come back online.
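
If hint replay itself becomes part of the problem, hinted handoff can be switched off per node and switched back on once things calm down. A hedged sketch with the standard nodetool subcommands:

    # stop this node from storing new hints for unreachable peers
    nodetool disablehandoff

    # re-enable hinted handoff once the cluster has recovered
    nodetool enablehandoff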

So now we know where our pain points are. Let’s understand them and see what we can do about each:

  • Commitlog – Let’s just assume you have this on a separate HDD and leave it alone (after all, it’s your safeguard).
  • Memtables – We can control how often they are flushed, so this is a possible tweak point, although it requires a Cassandra restart for the changes to take effect.
  • Compaction – This we can control via nodetool, and in later versions we can even disable it entirely (see the nodetool sketch after this list).
  • JVM GC – We can change the settings, but they are difficult to tweak and a restart is needed.
  • Swap – We can play a bit here if we do have a swap partition (see the sysctl sketch after this list).
  • Dirty_ratio – Controls how much written data the kernel keeps in the page cache before forcing it out to the HDD. Raising it can put your data at risk, but it can also help (again, see the sysctl sketch after this list).
  • Replication Factor – This can be changed on the fly and helps by putting less pressure on the nodes (see the CQL sketch after this list).
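
For the compaction bullet, a sketch of the nodetool knobs I have in mind (subcommand names as they exist in the 1.x/2.x line; check nodetool help on your version):

    # throttle compaction I/O on this node (value in MB/s; 0 = unthrottled)
    nodetool setcompactionthroughput 8

    # in later versions, switch automatic compaction off entirely
    nodetool disableautocompaction

    # and abort compactions that are already running
    nodetool stop COMPACTION

    # when the storm passes, let it catch up again
    nodetool enableautocompaction
    nodetool setcompactionthroughput 0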
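
The swap and dirty_ratio bullets are plain Linux kernel settings rather than Cassandra ones. A sketch assuming root on a typical Linux box; the values are illustrative, not recommendations:

    # let more dirty page cache accumulate before writeback is forced,
    # buying the disks headroom at the cost of more data at risk on a crash
    sysctl -w vm.dirty_background_ratio=10
    sysctl -w vm.dirty_ratio=60

    # with a swap partition in place, keep swapping as a last resort so the
    # kernel swaps the JVM under extreme pressure instead of OOM-killing it
    sysctl -w vm.swappiness=1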
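
And the replication factor really can be changed on the fly, with a single CQL statement per keyspace (CQL3 syntax, so 1.2 or later; the keyspace and data centre names below are placeholders). Keep in mind that lowering RF means fewer copies of each write, and raising it back later calls for a repair:

    -- run from any cqlsh session; "my_ks" and "DC1" are placeholders
    ALTER KEYSPACE my_ks
      WITH replication = {'class': 'NetworkTopologyStrategy', 'DC1': 2};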

So, what to do? Where to start? It depends, case by case. I would probably let my read performance suffer to keep the writes coming in. The easiest way to allow that should be dropping reads to CL = ONE. That sometimes looks like the fast and easy option, but if your writes are not using QUORUM and/or you have read_repair (and RF > 1), you will end up spreading more writes around. So I would pick compaction as my first target: you can always bring it back up to pace later (re-enable it, increase the compaction throughput). Another option is to increase dirty_ratio and risk losing data (trusting the commitlogs plus RF > 1 to not lose data), but this will give your HDDs more room until the cluster recovers.
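
Dropping reads to CL = ONE is normally a change in the application/driver configuration, but for ad-hoc queries you can try it straight from cqlsh. A minimal sketch of a cqlsh session (the level set this way only applies to that session):

    -- set the consistency level for the current cqlsh session only
    CONSISTENCY ONE;

    -- with no argument, CONSISTENCY shows the current level
    CONSISTENCY;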

But every case is different. I will talk about my own case, problems and resolutions this Wednesday at the Datastax Day in London! Feel free to join me!



About the Author

Carlos Rolo is a Datastax Certified Cassandra Architect, and has deep expertise with distributed architecture technologies. Carlos is driven by challenge, and enjoys the opportunities to discover new things and new ways of learning that come with working at Pythian. He has become known and trusted by customers and colleagues for his ability to understand complex problems, and to work well under pressure. He prides himself on being a tenacious problem solver, while remaining a calm and positive presence on any team. When Carlos isn’t working he can be found playing water polo or enjoying his local community. Carlos holds a Bachelor of Electro-technical Engineering, and a Master of Control Systems and Automation.

3 Comments

Nice article. I like the way you analyzed how a Cassandra node can die.


Hi Carlos,

thanks for the post and explanation – I see it’s still current.
Can you please link the first article somewhere in the post?

Thanks!


I will look for it and do so!

