I have a database server that has these features:
- Highly available by design.
- Can be globally distributed.
- Allows applications to write to any node anywhere, anytime.
- Linearly scalable by simply adding more nodes to the cluster.
- Automatic workload and data balancing.
- A query language that looks a lot like SQL.
With the list of features above, why don’t we all use Cassandra for all our database needs? That is the hype I hear at conferences and from some of the commercial entities pushing their version of Cassandra. Unfortunately, some people believe it, especially now that many users of proprietary database technologies like Oracle and SQL Server are looking to get out of massive license fees. The (apparent) low cost of open source, combined with the list of features above, makes Cassandra very attractive to many corporate CTOs and CFOs. What they are missing are the core features they assume every database has but that Cassandra lacks.
I am a database architect and consultant. I have been working with Cassandra since version 0.7 came out in 2010.
I like, and often promote, Cassandra to my customers—for the right use cases.
Unfortunately, I often find myself being asked to help after the choice has already been made, and either the use case turned out to be a poor fit for Cassandra or the data model was poorly designed.
In this blog post I am going to discuss some of the pitfalls to avoid, suggest a few good use cases for Cassandra and offer just a bit of data modeling advice.
Where Cassandra users go wrong
Cassandra projects tend to fail as a result of one or more of these reasons:
- The wrong Cassandra features were used.
- The use case was totally wrong for Cassandra.
- The data modeling was not done properly.
To be honest, it doesn’t help that Cassandra has a bunch of features that probably shouldn’t be there. These features lead you to believe you can do some of the things everyone expects a relational database to do:
- Secondary indexes: They have their uses but not as an alternative access path into a table.
- Counters: They work most of the time, but they are very expensive and should not be used very often.
- Lightweight transactions: They are neither transactions nor lightweight.
- Batches: Sending a bunch of operations to the server at one time is usually good, saves network time, right? Well, in the case of Cassandra, not so much.
- Materialized views: I got taken in on this one. It looked like it made so much sense. Because of course it does. But then you look at how it has to work, and you go…Oh no!
- CQL: Looks like SQL which confuses people into thinking it is SQL.
Using any of the above features the way you would expect them to work in a traditional database is certain to result in serious performance problems and in some cases a broken database.
Get your data model right
Another major mistake developers make in building a Cassandra database is making a poor choice for partition keys.
Cassandra is distributed. This means you need a way to distribute the data across multiple nodes. Cassandra does this by hashing a part of every table’s primary key, called the partition key, and assigning the hashed values (called tokens) to specific nodes in the cluster. It is important to consider the following rules when choosing your partition keys:
- There should be enough partition key values to spread the data for each table evenly across all the nodes in the cluster.
- Keep data you want to retrieve in a single read within a single partition.
- Don’t let partitions get too big. Cassandra can handle large partitions (over 100 megabytes), but it’s not very efficient at it. Besides, if your partitions are getting that large, it’s unlikely your data distribution is even.
- Ideally, all partitions would be roughly the same size. In practice, that almost never happens.
Typical real-world partition keys are user ID, device ID, account number, etc. To manage partition size, a time component such as year, or year and month, is often added to the partition key.
If you get this wrong, you will suffer greatly. I should probably point out that this is true in one way or another of all distributed databases. The key word here is distributed.
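To make the hashing idea concrete, here is a minimal Python sketch of how partition keys map to nodes. The node names, the MD5 stand-in hash, and the simple modulo placement are all illustrative: real Cassandra uses the Murmur3 partitioner and assigns token ranges with replication.

```python
import hashlib

NODES = ["node1", "node2", "node3", "node4"]

def token(partition_key: str) -> int:
    """Hash a partition key to a token.
    (Cassandra uses Murmur3; MD5 stands in here for illustration.)"""
    return int.from_bytes(hashlib.md5(partition_key.encode()).digest()[:8], "big")

def owner(partition_key: str) -> str:
    """Map a token onto a node.
    (Real clusters use token ranges and a replication factor.)"""
    return NODES[token(partition_key) % len(NODES)]

# Time-bucketed partition keys keep any one partition from growing forever:
# the same device's data lands in a fresh partition each month.
keys = [f"device42:2023-{month:02d}" for month in range(1, 13)]
placement = {k: owner(k) for k in keys}
```

Because the token is a hash, consecutive months for the same device scatter across the cluster rather than piling onto one node, which is exactly the even-distribution property the rules above are after.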
Wrong Use Cases for Cassandra
If your application depends on any of the following things, Cassandra is wrong for your use case. Please don’t even consider it. You will be unhappy.
- Tables have multiple access paths. Example: lots of secondary indexes.
- The application depends on identifying rows with sequential values, such as MySQL auto-increment columns or Oracle sequences.
- ACID: Cassandra does not do ACID. LSD, sulphuric or any other kind. If you think you need it, go elsewhere. Many times people think they need it when they don’t.
- Aggregates: Cassandra does not support aggregates; if you need to do a lot of them, think about another database.
- Joins: You may be able to data-model yourself out of this one, but take care.
- Locks: Honestly, Cassandra does not support locking. There is a good reason for this. Don’t try to implement them yourself. I have seen the end result of people trying to do locks using Cassandra and the results were not pretty.
- Updates: Cassandra is very good at writes, okay with reads. Updates and deletes are implemented as special cases of writes and that has consequences that are not immediately obvious.
- Transactions: CQL has no begin/commit transaction syntax. If you think you need it then Cassandra is a poor choice for you. Don’t try to simulate it. The results won’t be pretty.
If you are thinking about using Cassandra with any of the above requirements, you likely don’t have an appropriate use case. Please think about using another database technology that might better meet your needs.
When you should think about using Cassandra
Every database server ever designed was built to meet specific design criteria. Those design criteria define the use cases where the database will fit well and the use cases where it will not.
Cassandra’s design criteria are the following:
- Distributed: Runs on more than one server node.
- Scale linearly: By adding nodes, not more hardware on existing nodes.
- Work globally: A cluster may be geographically distributed.
- Favor writes over reads: Writes are an order of magnitude faster than reads.
- Democratic peer to peer architecture: No master/slave.
- Favor partition tolerance and availability over consistency: Eventually consistent (see the CAP theorem: https://en.wikipedia.org/wiki/CAP_theorem).
- Support fast targeted reads by primary key: Reads by primary key are the focus; alternative access paths are very suboptimal.
- Support data with a defined lifetime: All data in a Cassandra database can have a defined lifetime. There is no need to delete it; when the lifetime expires, the data goes away.
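The “defined lifetime” point refers to Cassandra’s per-write TTL (in CQL, something like `INSERT ... USING TTL 86400`). A toy Python model of the visible behavior, with illustrative names; real Cassandra tracks expiry per cell and later purges expired data via tombstones:

```python
import time

class TTLStore:
    """Toy model of Cassandra-style TTL: expired cells simply stop
    being returned (Cassandra purges them later as tombstones)."""
    def __init__(self):
        self._cells = {}  # key -> (value, expires_at or None)

    def insert(self, key, value, ttl_seconds=None, now=None):
        now = time.time() if now is None else now
        expires = now + ttl_seconds if ttl_seconds else None
        self._cells[key] = (value, expires)

    def get(self, key, now=None):
        now = time.time() if now is None else now
        value, expires = self._cells.get(key, (None, None))
        if expires is not None and now >= expires:
            return None  # lifetime expired: the data "goes away"
        return value

store = TTLStore()
store.insert("sensor:1", 22.5, ttl_seconds=60, now=1000.0)
assert store.get("sensor:1", now=1030.0) == 22.5   # still alive
assert store.get("sensor:1", now=1061.0) is None   # expired
```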
There is nothing in the list about ACID, support for relational operations or aggregates. At this point you might well say, “what is it going to be good for?” ACID, relational operations and aggregates are critical to the use of most databases. No ACID means no atomicity, and without atomic operations, how do you make sure anything ever happens correctly—meaning consistently? The answer is: you don’t. If you were thinking of using Cassandra to keep track of account balances at a bank, you should probably look at alternatives.
Ideal Cassandra Use Cases
It turns out that Cassandra is really very good for some applications.
The ideal Cassandra application has the following characteristics:
- Writes exceed reads by a large margin.
- Data is rarely updated and when updates are made they are idempotent.
- Read access is by a known primary key.
- Data can be partitioned via a key that allows the database to be spread evenly across multiple nodes.
- There is no need for joins or aggregates.
Some of my favorite examples of good use cases for Cassandra are:
- Transaction logging: Purchases, test scores, movies watched and movie latest location.
- Storing time series data (as long as you do your own aggregates).
- Tracking pretty much anything including order status, packages etc.
- Storing health tracker data.
- Weather service history.
- Internet of things status and event history.
- Telematics: IoT for cars and trucks.
- Email envelopes—not the contents.
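“Do your own aggregates” (from the time-series item above) means reading the raw rows back and rolling them up in application code, since Cassandra won’t do it for you. A minimal Python sketch, with an illustrative schema of rows from a table partitioned by (sensor, day) and clustered by timestamp:

```python
from collections import defaultdict
from statistics import mean

# Rows as they might come back from a time-series table -- schema is illustrative.
rows = [
    {"sensor": "s1", "ts": "2023-05-01T10:05", "temp": 20.0},
    {"sensor": "s1", "ts": "2023-05-01T10:45", "temp": 22.0},
    {"sensor": "s1", "ts": "2023-05-01T11:10", "temp": 24.0},
]

def hourly_average(rows):
    """Group readings by hour and average them client-side,
    since the database itself won't run the aggregate."""
    buckets = defaultdict(list)
    for row in rows:
        hour = row["ts"][:13]  # "YYYY-MM-DDTHH"
        buckets[hour].append(row["temp"])
    return {hour: mean(temps) for hour, temps in sorted(buckets.items())}

print(hourly_average(rows))  # {'2023-05-01T10': 21.0, '2023-05-01T11': 24.0}
```

In practice, many teams write these pre-computed roll-ups back into another Cassandra table so later reads stay cheap.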
Frequently, executives and developers look at the feature set of a technology without understanding the underlying design criteria and the methods used to implement those features. When dealing with distributed databases, it’s also very important to recognize how the data and workload will be distributed. Without understanding the design criteria, implementation and distribution plan, any attempt to use a distributed database like Cassandra is going to fail. Usually in a spectacular fashion.
Whether you’re considering an open source or commercial Cassandra deployment, planning to implement it, or already have it in production, Pythian’s certified experts can work with your team to ensure the success of your project at every phase. Learn more about Pythian Services for Cassandra.
Thanks for the valuable information.
Great post John. I was left wondering what a “Light weight transaction” actually is, and also, what a good use case for secondary indexes were.
Care to explain?
A lightweight transaction is a construct that allows an atomic operation to take place on a partition. It uses the Paxos algorithm (https://en.wikipedia.org/wiki/Paxos_(computer_science)) to manage consensus between a group of Cassandra nodes. The syntax is a CQL UPDATE or INSERT operation with an IF clause. If the operation succeeded, the transaction was successful; if not, it was not. https://docs.datastax.com/en/cql/3.3/cql/cql_using/useInsertLWT.html
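The IF-clause semantics amount to a compare-and-set. This toy Python model skips the Paxos consensus entirely and just shows the visible behavior of `INSERT ... IF NOT EXISTS` and its `[applied]` result; the class and key names are illustrative:

```python
class ToyLWT:
    """Toy model of a lightweight transaction's compare-and-set semantics.
    Real Cassandra reaches consensus via Paxos across replicas; here a
    single dict stands in for the partition."""
    def __init__(self):
        self.partition = {}

    def insert_if_not_exists(self, key, value):
        # CQL analogue: INSERT INTO t (k, v) VALUES (...) IF NOT EXISTS
        if key in self.partition:
            return False  # [applied] = false; existing row is untouched
        self.partition[key] = value
        return True       # [applied] = true

t = ToyLWT()
assert t.insert_if_not_exists("user1", "alice") is True
assert t.insert_if_not_exists("user1", "bob") is False   # lost the race
assert t.partition["user1"] == "alice"
```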
Secondary indexes can be very useful in improving performance when querying a large partition (one with a significant number of rows in it) on non-primary-key columns. Secondary indexes should not be used to provide an alternate access path into a table. Used as an alternate access path, they limit the scalability of the cluster.
Thank you, John.
Your post is so good and it’s soooo short.
Why don’t you think about writing a book?
I understand it’s an enormous effort and not much money at all
but for the sake of humanity :)
Thank you again
I have a question regarding consistency. What happens when two threads add a new record to a clustering-column-ordered table at the same time, especially when the slightly faster thread writes a record that contains a younger timestamp than the slower one?
Does it result in a wrongly ordered table, or is there some kind of wait or correcting mechanism?
Thank you very much.
Sorry for the delay in responding.
I’m not quite sure what you are asking here. It looks like you are asking what order rows will show up in on retrieval when two rows with a timestamp clustering key are written to a table. If that is the case, then the answer is that the rows will be returned in timestamp order. That is also how they will be stored: since Cassandra doesn’t write data to disk right away, it can re-order requests like this into the correct order before the data gets written to the table.
On the other hand, if you are asking which of two rows with identical keys will end up in the table?
Then the timestamp is king.
It doesn’t matter which thread writes first; the one with the larger (later) timestamp wins. Most of the time, the timestamp used is supplied by the Cassandra coordinator node receiving the request, but the application can supply its own.
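That last-write-wins rule is easy to sketch; the field names here are illustrative:

```python
def merge_cell(a, b):
    """Cassandra-style last-write-wins: the cell with the larger write
    timestamp wins, regardless of which write arrived first."""
    return a if a["ts"] >= b["ts"] else b

fast_thread = {"value": "v1", "ts": 1000}  # arrived first, older timestamp
slow_thread = {"value": "v2", "ts": 1005}  # arrived later, newer timestamp

winner = merge_cell(fast_thread, slow_thread)
assert winner["value"] == "v2"  # the later timestamp wins, not the faster writer
```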
In “Counters: They work most of the time, but they are very expensive and should not be used very often.”
What do you mean by “very often”? And how expensive?
The counter algorithm in Cassandra is based on PAXOS (https://en.wikipedia.org/wiki/Paxos_(computer_science)).
Cassandra is designed from the top down to avoid doing updates. There are no atomic or lock operations in Cassandra. If you were to try to do a simple update in Cassandra without counters or lightweight transactions you would read a column value and then write back the updated version. Because there are no locks or atomic operations there would be no guarantee it would work correctly. But it would be fairly fast.
With PAXOS, which is used in both lightweight transactions (which are not transactions at all) and counters, Cassandra has to do several round trips to all available nodes containing replicas to complete the operation. It’s best to think of an LWT or counter operation as being about six times as expensive as the basic update I mentioned above.
It is okay to use counters or LWT operations when they are a relatively small part of your workload. It is also important to read about how both work in detail before using them, to avoid surprises.
This link does a decent job of explaining how lightweight transactions work in Cassandra. Counters are essentially a special case of LWT. https://www.beyondthelines.net/databases/cassandra-lightweight-transactions/
Such a great, concise writeup of everything I’d want people to know when considering using Cassandra. Thanks for writing it!
Thanks, Anthony! Please do share with your peers!
Say you want to store a lot of events, but you also want to filter by a special type and date, search by multiple columns, and also order by multiple columns. Would you choose Cassandra?
Excellent write up, thank you!
I think the main Cassandra problem comes when we go to the cloud and have to deal with its configuration. There is a lack of documentation on the official site, so when you need to go to a production environment, you have to spend a lot of time trying to find a proper cloud deployment.
Hey John – can you elaborate in a little more detail what aggregations on Cassandra data look like? For instance, you mention a fitting use-case for Cassandra is storing time-series data (because the write rate is much higher than the read), then you caveat this by saying you should implement your own aggregations. What do you mean by this exactly?
I would like to know your opinion on using Cassandra for managing a school’s data. In my use case, there would be more reads than writes and also this is my first experience working with Non Relational DBMS. Do you think it is a good idea? Please suggest
Cassandra is optimized for writes, so we usually don’t recommend using it for workloads where over 90% of the client operations are reads. That being said, there are a few use cases where Cassandra can still perform well under read heavy workloads.
One very relevant factor to decide whether Cassandra fits, is by knowing what query patterns you intend to run against the database. In Cassandra, for each read query we design a table in the data model. This is because Cassandra doesn’t support joins or aggregates.
For example, if we want to read “students by class”, we design a table where class is the partition key and student is a clustering column. If we also want to know what classes a student is enrolled in (“classes by student”), we have another table where student is the partition key and class is a clustering column. Keeping those two tables in sync is up to the application logic, not Cassandra.
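That “one table per query, synced by the application” pattern can be sketched in Python, with dicts standing in for the two tables (the class and table names are illustrative):

```python
class Enrollment:
    """Toy model of Cassandra-style denormalization: one 'table' per
    query pattern, kept in sync by application code."""
    def __init__(self):
        self.students_by_class = {}   # partition key: class
        self.classes_by_student = {}  # partition key: student

    def enroll(self, student, klass):
        # The application writes to BOTH tables; Cassandra won't do it for you.
        self.students_by_class.setdefault(klass, set()).add(student)
        self.classes_by_student.setdefault(student, set()).add(klass)

e = Enrollment()
e.enroll("ana", "math")
e.enroll("ana", "physics")
e.enroll("bo", "math")
assert e.students_by_class["math"] == {"ana", "bo"}
assert e.classes_by_student["ana"] == {"math", "physics"}
```

Each read query then hits exactly one partition in one table, which is the access pattern Cassandra is built for.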
Another factor to have in mind is availability vs. consistency. Cassandra is highly available due to data replication, meaning you can have a full rack of Cassandra hosts down and still satisfy all requests with no downtime. On the other hand, data consistency is eventual, meaning that your reads can have a small chance of not returning the most up-to-date results (depending on the consistency level of your queries), but eventually all replicas become consistent. There are many decision points besides the ones mentioned above, such as scaling and geographical distribution. For a more detailed answer, please contact us at https://pythian.com/contact/
Excellent article! When I ask myself if I should use a library, framework, language, tool, etc… it’s not how good is it, but does it match my use case?