When you have multiple database servers working together as nodes in a cluster, it’s beneficial to understand how data consistency is established. In this post, we’re going to take a look at the various cluster consistency levels that you can choose from within InnoDB Group Replication and see how they impact data flow and failure recovery.
Let’s start with a quick refresher on what data consistency is. In a nutshell, consistency is just a way of thinking about how data is duplicated across multiple nodes in a cluster. If you write a row of data to one node, that data is not considered to be consistent within the cluster until it has been written to all of the other participating nodes. Without consistency, it’s possible to write data to one node in the cluster but not see it immediately when reading from another node in the cluster. In some cases, the aforementioned scenario, better known as eventual consistency, is acceptable but sometimes you need to ensure the data you wrote is available on all your nodes before any subsequent reads occur. For that, we need to look at the system server variables group_replication_consistency, which allows you to control consistency levels globally or per transaction.
Let’s have a look at each option and see what they mean for consistency and your workload.
The value of ‘EVENTUAL’ implies that you are okay with having eventual consistency in your cluster. This means that as data gets written on one node, it’s possible that immediate reads on other nodes may not see that data.
This is the default value for group_replication_consistency and this is going to be the cluster consistency level that most of us are familiar with if you’ve been working with Galera or earlier versions of InnoDB Group Replication. When you get an ‘OK’ return status on your transaction, you are only getting confirmation that the write you’re performing doesn’t conflict with anything pending on the other participating nodes and that the other participating nodes have the change logged so that it can eventually be replayed. This is part of why we consider Galera to be ‘virtually synchronous’ replication.
The advantage of using this cluster consistency level is speed. You lose overhead of consistency verification as part of your transaction which allows you to commit data changes much faster.
But what if we want to introduce a bit more consistency?
We may want to consider using the ‘BEFORE’ option for group_replication_consistency. When set to this value, any transaction or request will wait for the node to complete any pending transactions in its queue before allowing the request to occur. This ensures that that request is looking at the most up-to-date version of the data, which in turn assures consistency for the transaction or request.
The advantage here is obviously the fact that we get greater consistency, but the disadvantage is greater overhead and potential delays. Depending on the number of data changes pending in the node’s queue, you could notice delays in getting the results for your request.
Keep in mind there is no data checking on any of the other nodes in order to ensure consistency like you might see in a similar database shared-nothing clustering products like Cassandra. It simply checks the log and assumes that if the queue is up to date with the time the request was submitted, that data is consistent. However, the fact that it has to wait for transactions to process prior to the execution of the request will still add overhead.
If you have a write-heavy workload and want to ensure that your reads are consistent, you can consider working with this option.
Another option for consideration is ‘AFTER’. When set to this value, any transaction will wait to complete until the associated changes are applied to all the other participating nodes in the cluster. This takes things one step further than the basic conflict checking that occurs with eventual consistency.
The advantage, once again, is that you have greater assured cluster consistency, but it comes at the cost of performance as you’ll have to wait for all participating nodes to write the request before getting an ‘OK’. If you have a node with a large queue, this can cause further delays.
Another limitation to consider is that if you use this option, there will be an impact on other concurrently running transactions as they have to be committed in the order in which they were received. For example, if you have one transaction executing with the ‘AFTER’ option enabled, any other transaction that commits after will have to wait, even if the other transactions are using the ‘EVENTUAL’ consistency option.
If you have a read-heavy workload and want to ensure that your writes are consistent, you can consider working with this option.
Before and After
The option ‘before_and_after’ is simply a combination of the logic found in options ‘BEFORE’ and ‘AFTER’. Before any request occurs, it will make sure the node is up to date with transactions waiting in its queue. Once the transaction is complete it will verify that the write is done on all nodes prior to giving an ‘OK’.
This will offer the highest level of consistency and may sometimes be applicable when you are not operating in single-primary mode, but it obviously comes with the highest overhead cost.
If you are using multi-primary, have a read-heavy workload, and want to ensure that your writes with read dependencies are consistent, you can consider working with this option.
Before on Primary Failover
If you’ve checked the documentation for this option, you will notice that there is a fifth option, called ‘BEFORE_ON_PRIMARY_FAILOVER’. This does have an impact on cluster level consistency but mainly for when a failover occurs when running in single-primary mode.
Single primary mode in InnoDB Group Replication means that you have one node in your cluster designated as the primary. All write requests should be directed to this node, whereas read requests can be evenly distributed over the remaining nodes in the cluster. Should the single primary node fail, a promotion will need to occur and the value of ‘BEFORE_ON_PRIMARY_FAILOVER’ denotes how soon traffic should be directed to the promoted node.
If you use the option ‘EVENTUAL’, the newly promoted node will start taking write traffic immediately after the promotion is complete. If you use ‘BEFORE_ON_PRIMARY_FAILOVER’, then it will wait for all pending transactions in its queue to complete before accepting any new read or write traffic.
The big trade-off here is availability vs consistency. You’ll have to determine what is best for your use case.
Recommended Variable Configuration
Given that the value of this variable impacts cluster level recovery and data flow, I would be inclined to set the global value of this variable to either ‘EVENTUAL’ or ‘BEFORE_ON_PRIMARY_FAILOVER’ so you have a default configuration on how write requests are handled during a promotion. Beyond that, I would consider setting the variable to other values like ‘BEFORE’, ‘AFTER’, or ‘BEFORE_AND_AFTER’ at the session level in accordance with the consistency requirements of my specific transaction.
InnoDB Group Replication takes us a step forward when it comes to providing options for cluster level data consistency in MySQL-based virtually synchronous clustering solutions, but we have to be aware of the cost of overhead that comes with them. Ensure that you run adequate performance tests before increasing your cluster consistency level!
Thank you to Nuno Carvalho for clarifying points regarding BEFORE_ON_PRIMARY_FAILOVER. You can see his comment below for full details.