Cluster Level Consistency in InnoDB Group Replication


When you have multiple database servers working together as nodes in a cluster, it’s beneficial to understand how data consistency is established. In this post, we’re going to take a look at the various cluster consistency levels that you can choose from within InnoDB Group Replication and see how they impact data flow and failure recovery.

Let’s start with a quick refresher on what data consistency is. In a nutshell, consistency is a way of thinking about how data is duplicated across multiple nodes in a cluster. If you write a row of data to one node, that data is not considered consistent within the cluster until it has been written to all of the other participating nodes. Without consistency guarantees, it’s possible to write data to one node in the cluster but not see it immediately when reading from another node in the cluster. In some cases this scenario, better known as eventual consistency, is acceptable, but sometimes you need to ensure the data you wrote is available on all your nodes before any subsequent reads occur. For that, we need to look at the server system variable group_replication_consistency, which allows you to control the consistency level globally or per transaction.
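As a quick illustration (assuming MySQL 8.0.14 or later, where this variable was introduced), you can inspect and change the level like this:

```sql
-- Check the effective consistency level for this session
SELECT @@group_replication_consistency;

-- Set a default for all new sessions on this node
SET GLOBAL group_replication_consistency = 'EVENTUAL';

-- Or override it for the current session only
SET SESSION group_replication_consistency = 'BEFORE';
```

Setting the session value just before issuing a transaction is how you get per-transaction control.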

Let’s have a look at each option and see what they mean for consistency and your workload.

Eventual

The value of ‘EVENTUAL’ implies that you are okay with having eventual consistency in your cluster. This means that as data gets written on one node, it’s possible that immediate reads on other nodes may not see that data.

This is the default value for group_replication_consistency, and this is the cluster consistency level that most of us are familiar with if we’ve been working with Galera or earlier versions of InnoDB Group Replication. When you get an ‘OK’ return status on your transaction, you are only getting confirmation that the write you’re performing doesn’t conflict with anything pending on the other participating nodes and that the other participating nodes have the change logged so that it can eventually be replayed. This is part of why we consider Galera to be ‘virtually synchronous’ replication.

The advantage of using this cluster consistency level is speed. You avoid the overhead of consistency verification as part of your transaction, which allows you to commit data changes much faster.

But what if we want to introduce a bit more consistency?

Before

We may want to consider using the ‘BEFORE’ option for group_replication_consistency. When set to this value, any transaction or request will wait for the node to complete any pending transactions in its queue before allowing the request to occur. This ensures that the request is looking at the most up-to-date version of the data, which in turn provides consistency for the transaction or request.

The advantage here is obviously the fact that we get greater consistency, but the disadvantage is greater overhead and potential delays. Depending on the number of data changes pending in the node’s queue, you could notice delays in getting the results for your request.

Keep in mind there is no data checking on any of the other nodes in order to ensure consistency, like you might see in similar shared-nothing database clustering products such as Cassandra. The node simply checks its log and assumes that if its queue is up to date as of the time the request was submitted, the data is consistent. However, the fact that it has to wait for pending transactions to be applied before executing the request still adds overhead.

If you have a write-heavy workload and want to ensure that your reads are consistent, you can consider working with this option.
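As a sketch of how you might use this per session (the accounts table here is just a hypothetical example):

```sql
-- Require this session's reads to wait for the local apply queue
SET SESSION group_replication_consistency = 'BEFORE';

-- This statement will not execute until the node has applied every
-- transaction that was already queued when the statement arrived
SELECT balance FROM accounts WHERE id = 42;
```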

After

Another option for consideration is ‘AFTER’. When set to this value, any transaction will wait to complete until the associated changes are applied to all the other participating nodes in the cluster. This takes things one step further than the basic conflict checking that occurs with eventual consistency.

The advantage, once again, is that you have greater assured cluster consistency, but it comes at the cost of performance as you’ll have to wait for all participating nodes to write the request before getting an ‘OK’. If you have a node with a large queue, this can cause further delays.

Another limitation to consider is that if you use this option, there will be an impact on other concurrently running transactions, as they have to be committed in the order in which they were received. For example, if you have one transaction executing with the ‘AFTER’ option enabled, any other transaction that commits after it will have to wait, even if those transactions are using the ‘EVENTUAL’ consistency option.

If you have a read-heavy workload and want to ensure that your writes are consistent, you can consider working with this option.
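If only certain writes are critical, you can pay the ‘AFTER’ synchronization cost selectively at the session level. A minimal sketch (again using a hypothetical accounts table):

```sql
-- Make this session's commits wait for all members to apply them
SET SESSION group_replication_consistency = 'AFTER';

START TRANSACTION;
UPDATE accounts SET balance = balance - 100 WHERE id = 42;
-- COMMIT returns 'OK' only once every participating node has
-- applied this change, so subsequent reads anywhere will see it
COMMIT;
```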

Before and After

The option ‘BEFORE_AND_AFTER’ is simply a combination of the logic found in the ‘BEFORE’ and ‘AFTER’ options. Before any request occurs, it makes sure the node is up to date with the transactions waiting in its queue. Once the transaction is complete, it verifies that the write has been applied on all nodes prior to returning an ‘OK’.

This will offer the highest level of consistency and may sometimes be applicable when you are not operating in single-primary mode, but it obviously comes with the highest overhead cost.

If you are using multi-primary, have a read-heavy workload, and want to ensure that your writes with read dependencies are consistent, you can consider working with this option.
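A sketch of a read-then-write transaction under this level (the accounts table is hypothetical):

```sql
SET SESSION group_replication_consistency = 'BEFORE_AND_AFTER';

START TRANSACTION;
-- Waits for the local apply queue before reading...
SELECT balance FROM accounts WHERE id = 42 FOR UPDATE;
UPDATE accounts SET balance = balance - 100 WHERE id = 42;
-- ...and waits for all members to apply the write before returning 'OK'
COMMIT;
```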

Before on Primary Failover

If you’ve checked the documentation for this variable, you will notice that there is a fifth option, called ‘BEFORE_ON_PRIMARY_FAILOVER’. This also has an impact on cluster-level consistency, but mainly when a failover occurs while running in single-primary mode.

Single primary mode in InnoDB Group Replication means that you have one node in your cluster designated as the primary. All write requests should be directed to this node, whereas read requests can be evenly distributed over the remaining nodes in the cluster. Should the single primary node fail, a promotion will need to occur and the value of ‘BEFORE_ON_PRIMARY_FAILOVER’ denotes how soon traffic should be directed to the promoted node.

If you use the option ‘EVENTUAL’, the newly promoted node will start taking write traffic immediately after the promotion is complete. If you use ‘BEFORE_ON_PRIMARY_FAILOVER’, then it will wait for all pending transactions in its queue to complete before accepting any new read or write traffic.

The big trade-off here is availability vs consistency. You’ll have to determine what is best for your use case.

Recommended Variable Configuration

Given that the value of this variable impacts cluster-level recovery and data flow, I would be inclined to set the global value of this variable to either ‘EVENTUAL’ or ‘BEFORE_ON_PRIMARY_FAILOVER’ so you have a default for how requests are handled during a promotion. Beyond that, I would consider setting the variable to ‘BEFORE’, ‘AFTER’, or ‘BEFORE_AND_AFTER’ at the session level in accordance with the consistency requirements of the specific transaction.
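Putting that recommendation together, a minimal sketch (using SET PERSIST, available in MySQL 8.0, so the default survives restarts):

```sql
-- Node-wide default: eventual consistency for normal traffic, but
-- hold new transactions on a freshly promoted primary until it has
-- applied its backlog
SET PERSIST group_replication_consistency = 'BEFORE_ON_PRIMARY_FAILOVER';

-- Then tighten the level per session only where a transaction needs it
SET SESSION group_replication_consistency = 'AFTER';
```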

Conclusion

InnoDB Group Replication takes us a step forward when it comes to providing options for cluster level data consistency in MySQL-based virtually synchronous clustering solutions, but we have to be aware of the cost of overhead that comes with them. Ensure that you run adequate performance tests before increasing your cluster consistency level!

Thank you to Nuno Carvalho for clarifying points regarding BEFORE_ON_PRIMARY_FAILOVER. You can see his comment below for full details.


About the Author

Internal Principal Consultant
Peter Sylvester is one of the Internal Principal Consultants in the Open Source Database Consulting Group at Pythian. He has been with Pythian since January of 2015 and has been working with MySQL since 2008. Apart from work, Peter is an avid ice hockey player, in keeping with the stereotypical Canadian lifestyle, typically playing no less than twice a week!

2 Comments

Hi Peter,

Thank you for writing this blog post about the Group Replication consistency options, though there are two incorrect sentences in it. Please allow me to correct them.

Apart from implementing most of this feature, I wrote a four-part blog series to explain Group Replication consistency:
https://mysqlhighavailability.com/group-replication-consistency-levels/
https://mysqlhighavailability.com/group-replication-preventing-stale-reads-on-primary-fail-over/
https://mysqlhighavailability.com/group-replication-consistent-reads/
https://mysqlhighavailability.com/group-replication-consistent-reads-deep-dive/
We also have documentation dedicated to this:
https://dev.mysql.com/doc/refman/8.0/en/group-replication-consistency-guarantees.html

“””
Keep in mind there is no data checking on any of the other nodes in order to ensure consistency like you might see in a similar database shared-nothing clustering products like Cassandra. It simply checks the log and assumes that if the queue is up to date with the time the request was submitted, that data is consistent.
“””

When BEFORE is used, the member on which it is used will get the global order and wait for it locally; that is, it does guarantee that the transaction will see the most up-to-date snapshot of the data.
This is explained in detail at https://mysqlhighavailability.com/group-replication-consistent-reads-deep-dive/

“””
If you use ‘BEFORE_ON_PRIMARY_FAILOVER’, then it will wait for all pending transactions in its queue to complete before accepting any new write traffic.
“””
BEFORE_ON_PRIMARY_FAILOVER applies to all transactions, read or write, that is, the new primary blocks both reads and writes until after the backlog is fully applied. This ensures:
* No stale reads for read-only and read-write transactions. This prevents stale reads from being externalized to the application by the new primary.
* No spurious rollbacks of read-write transactions, due to write-write conflicts with replicated read-write transactions still in the backlog to be applied.
* No read skew on read-write transactions.
Detailed explanation at https://mysqlhighavailability.com/group-replication-preventing-stale-reads-on-primary-fail-over/

Best regards,
Nuno Carvalho

Peter Sylvester
December 30, 2019 9:07 am

Nuno,

Thank you very much for your reply.

In regard to the comment “…Keep in mind there is no data checking on any of the other nodes in order to ensure consistency…”. Please note that the comment here was to state that the node where the data is being accessed will not actually check the values of data on other nodes. There is still consistency assurance based on row based binary logging and log processing prior to execution, which is less overhead. Ultimately I think we agree on this point. I apologize if that wasn’t clear.

In regard to the comment “…If you use ‘BEFORE_ON_PRIMARY_FAILOVER’, then it will wait for all pending transactions in its queue to complete before accepting any new write traffic…”. I appreciate the clarification and I will update the text accordingly. Thank you very much for pointing that out.

