High availability for MySQL has become increasingly relevant given the ever increasing rate of adoption and implementation. It’s no secret to anyone in the community that the popularity of MySQL has become noteworthy. I still remember my start with MySQL in the early 5.0 days and people told me that I may not want to consider wasting my time training on a database that didn’t have a large industry adoption, but look at where we are now! One of my favorite pages to cite when trying to exhibit this fact is the db-engines.com ranking trend page where we can see that MySQL is right up there and contending with enterprise products such as Microsoft SQL Server and Oracle.
MySQL has gone from being part of the ever famous LAMP stack for users looking to set up their first website to seeing adoption from major technical players such as Twitter, Wikipedia, and Facebook. With these type of implementations we’re seeing MySQL used in places where failure can have massive associated cost, as such high availability for MySQL has become critical, and that’s what I’m here to talk to you about today.
Up until a few years ago we were looking at situations where there may have been at least some reliance on storage or general resource high availability tools that worked for the most part, but weren’t exactly designed with MySQL in mind. I’ve heard many horror stories of the infamous DRBD split brain which was common in the typical DRBD heartbeat / corosync solution. Even today we’re still seeing clients that are leveraging generalized HA solutions like RHEL Cluster where the people driving the solution may actually not even be your DBA, but your Linux sys admin.
These days the power is coming back to the MySQL DBA in a big way and I think this is clearly evident by looking at tools like MHA, Orchestrator, and most importantly SQL interpreting proxies like ProxySQL. These tools have been designed by those that exist within the MySQL community to serve MySQL specifically.
This started off in 2011 with the initial public release of Yoshinori Matsunobu’s MHA, which became wildly popular as a way to programmatically approach monitoring and failover management within standard mysql replication topologies. Despite never having reached a 1.0 release (still at 0.57 at the time of this writing), we still see wide adoption of this solution. Given that it included features like VIP management and admin initiated failover and promotion, it greatly eased the process of creating a viable high availability solution for MySQL administrators.
That’s not to say that it didn’t have its faults. There is really no great way to set up a fully atomic VIP failover script, the command set for using it was pretty far from intuitive but has improved, and active development for the solution declined sharply at the end of 2012 with the last release being in May of 2015. But despite these shortcomings you won’t find many in the community that would be hesitant to praise this product as a great contribution to solving the MySQL HA challenge.
The next big solution to emerge was Shlomi Noach’s Orchestrator, a fairly recent addition to the menu of available options for your replication management needs. In order to make things a little more intuitive it included a graphical web interface. The installation and configuration are both fairly simple and it even comes with various ways to make Orchestrator highly available, so you don’t have to worry about single point of failure when it comes to this layer in your HA stack.
Orchestrator does have a requirements for GTID, pseudo GTID, or binlog servers for topology recovery. We know that GTID isn’t really new to MySQL anymore, it was released with MySQL 5.6 which means it’s been around since 2013 but not everyone is willing or ready to make the leap from standard binary log and position replication to GTID. For those users, pseudo GTID is an available option that can be used to help bridge the gap and allow you to work with Orchestrator.
Furthermore, if you want to do a simple master failover you’ll find this is very simple with a tool like MHA where you can specify what server should be the new master and if the old master should become a slave. However with Orchestrator you need to set all of your slaves to replicate in a multi tiered fashion under the slave you want to promote, then you have to initiate the failover to the new master, and at that point the old master will become isolated from the rest of the group, forcing you to manually rejoin it to the topology.
Even with its shortcomings we are starting to see organizations adopt this as part of their HA solution, but typically only as part. This is an important distinction. While you can have a VIP failover process similar to what is done in MHA using hooks, we have yet to use this as all implementation projects have elected to have Orchestrator simply manipulate the read_only variable and let a MySQL monitoring proxy like ProxySQL, Maxscale, or ScaleArc handle traffic redirection.
The most commonly discussed proxy of the three listed above has been ProxySQL, created by René Cannaò. There are a number use cases for a ProxySQL such as query read write splitting via the use of query rules, but in this case we want to remain focused on the high availability features.
Using a fairly simple installation and configuration process you can group MySQL servers together in replication hostgroups. Database servers can then be monitored by checking the read_only global variable to determine whether or not they are a server that can be written to, and traffic can be directed to the appropriate server leveraging replication hostgroups without modification of the application. Additionally, if you are concerned about network partitioning level failures you can implement a solution such as Consul for service discovery, or you can rely on STONITH logic that would be initiated by the replication manager, such as Orchestrator, to ensure proper node fencing. You can read more about this here.
It should also be noted that ProxySQL is attempting to implement a cluster solution called ProxySQL Cluster that, much like Orchestrator, will attempt to remove single point of failure from its layer in the stack. This is already available but is considered to be experimental, is not GA, and thus should not be implemented in production environments at this time.
Virtually Synchronous Replication
Another technology to note is virtually synchronous replication methods such as Galera, as incorporated as part of Percona XtraDB Cluster or MariaDB Galera Cluster, or with MySQL Group Replication. Each of these solutions are very similar in the sense that they ensure that a transaction is repeatable on a majority of other nodes in the cluster prior the commit of a transaction, then pass along the full transaction details to all nodes in the cluster for replay. This leaves the cluster in a “mostly fully synchronous state”. The big drawback here is that there can be network related delays at transaction commit, or even full stops placed on incoming transaction if flow control detects that the other nodes in the cluster aren’t keeping up.
This has been a pretty commonly implemented solution for organizations with moderate queries per second and where there is a strong need for replication level consistency and high availability. While it’s still fairly common to see Galera base products grouped with HAproxy, there are options to implement SQL interpreting proxies like ProxySQL with Galera, but it’s a little more complicated given that Galera node that’s “up” can be in several states where traffic may or may not be permitted and the proxy needs to be aware of this.
Public Cloud DBaaS
All of the aforementioned technologies are important for consideration when working with on-prem or Infrastructure as a Service (IaaS) solutions, but doesn’t really cover what’s going on with with HA in Database as a Service (DBaaS) solutions from popular public cloud providers.
When using Cloud SQL as part of the Google Cloud Platform, you’re covered so long as you’ve provisioned a failover node, but there are a few things to take into consideration. Firstly, we need keep in mind that failover nodes in Cloud SQL are reliant on MySQL semi synchronous replication which requires the semi sync node in question to confirm that the replication events have been received prior to transaction commit, which can slow down processing of transactions on the master. Secondly, if a failover is initiated on a failover node where replication lag is excessive, the failover will not compete and traffic will not be served until replication has completely caught up, so it’s highly advised to monitor replication lag on your failover node in Cloud SQL.
The advantage of the Cloud SQL solution is that you can connect to your failover instance and read from it, unlike Amazon RDS where the instance isn’t available to you until the event of a failover.
Relational Database Services (RDS), as offered by Amazon Web Services (AWS), has a similar solution in place where you can setup high availability for your RDS instance by specifying is as a multi availability zone (Multi-AZ) instance. You can specify this when you create your RDS instance, or after the fact by modifying it. With this enabled, Amazon will use its internal failover technology to keep a standby replica available in a different availability zone in the case that a failover is required. The drawback here is that you can’t connect to your standby replica, however replication lag doesn’t have a factor in how long it will take to fail over to the standby.
Aurora, as also offered by AWS, automatically partitions your data into 10G segments replicated six ways over three availability zone, making it very fault tolerant. By default it is also Multi-AZ and self healing as data blocks are scanned and any errors that are found are automatically repaired. It should also be noted that failovers will occur regardless of whether or not you have provisioned any Aurora replicas. There are a lot of advantages of high availability in Aurora that you can read about in the FAQ, but I think one of the biggest things to note is the recent announcement of Aurora Multi Master and Aurora Serverless, which very well may change not just how we think about high availability in MySQL, but how we approach implementation and scaling all together.
There are a lot of great solutions out there to help you keep your MySQL environment up and running in the event of a failure. If you haven’t implemented GTID and aren’t quite willing to use it or pseudo GTID just yet, consider MHA. If you have a system running GTID, you can consider Orchestrator, but you should certainly be familiar with the gotchas and thoroughly proof of concept the solution in QA before implementing in production.
Virtually synchronous products that leverage Galera or MySQL Group Replication can be used to keep your data set mostly synchronous across all the servers in your topology so long as you can accept the requirements of network activity at transaction commit.
If SQL interpreting proxies like ProxySQL aren’t already on your radar, now is a good of a time as any to get familiar and see how it can assist you, not just will high availability, but for scale out.
Finally, public cloud providers are making it easier than ever to help remove the anxiety and cost of management aspects of making your MySQL solution highly available. Make sure you’re familiar with their offerings and that HA is taken into consideration when provisioning your DBaaS instances.
Learn more about Pythian’s end-to-end MySQL support, from consulting and implementation to 24×7 management.