MySQL Replication Failures

Posted in: MySQL, Technical Track

Over the weekend, I worked on two of a client’s servers, trying to get a slave back in sync with its master. It was during this time that I began thinking about:

a) how this never should have happened in the first place.

b) how “slave drift” could be kept from happening.

c) how this is probably keeping some businesses from using MySQL.

d) how MySQL DBAs must collectively waste thousands of hours a year fixing replication issues.

I’ll be the first person to tell you that replication under MySQL is pretty much dead-simple to set up. My only complaint is that it is annoying to type in the two-line “CHANGE MASTER” command to set up a new slave. Even so, it makes sense.
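For reference, the slave setup in question boils down to something like this. All of the values below are placeholders; the log file and position come from SHOW MASTER STATUS on the master.

```sql
-- Hypothetical host, credentials, and binary log coordinates.
CHANGE MASTER TO
    MASTER_HOST = 'master.example.com',
    MASTER_USER = 'repl',
    MASTER_PASSWORD = 'secret',
    MASTER_LOG_FILE = 'mysql-bin.000001',
    MASTER_LOG_POS = 4;
START SLAVE;
```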

It is also very easy, however, for a slave to end up with different data than the master server has. This can be caused by replication bugs, hardware problems, or the use of non-deterministic functions. And without proper permissions in place, a user, developer, or DBA can log into the slave server and change the data directly. That last one is an administration problem rather than a replication bug, but it still affects replication. There are probably other causes that astute readers will point out.
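One cheap guard against the accidental-write scenario is to run slaves with the read_only flag set, which blocks writes from any user who lacks the SUPER privilege while leaving the replication threads unaffected:

```sql
-- On the slave. The replication SQL thread is exempt from read_only,
-- but ordinary users without SUPER cannot modify data.
SET GLOBAL read_only = 1;
```

It is not a complete answer (anyone with SUPER can still write), but it closes off the most common path to slave drift from stray sessions.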

I would like to point out one common issue that would probably be categorized as a replication bug. If the master crashes for whatever reason (say, a hosting company accidentally punches the power button on a master server), the crash will often corrupt the binary log. When the master comes back up, the slave cries about a non-existent binary log position.

Possible solutions:

  • Checksums on the replicated data stream. A simple CRC checksum would work, and if the checksum doesn’t match, the data should be resent, and the error logged. If the checksum continues to fail on retries, the slave should stop so the database does not become corrupted.
  • Internal checksums on the tables, and configurable periodic comparisons of the checksums on master and slave(s), with the ability to have the servers resolve issues if needed. Interestingly, Sheeri pointed out to me that these are available on MyISAM tables. I didn’t know that, but it is only part of the answer, anyway.
  • Moving to row-based replication, which will resolve the issues caused by non-deterministic functions. This is one area for which MySQL has a fix in place: the soon-to-be-GA 5.1 release allows you to do row-based instead of statement-based replication.
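The CRC idea in the first bullet can be sketched in a few lines. This is purely an illustration of the verify-and-retry logic, not actual server code:

```python
import zlib

def event_checksum(payload: bytes) -> int:
    """CRC-32 of a replication event's payload, masked to 32 bits."""
    return zlib.crc32(payload) & 0xFFFFFFFF

def accept_event(payload: bytes, sent_crc: int) -> bool:
    """Return True if the event arrived intact. A False result would
    trigger a resend and, after repeated failures, stop the slave."""
    return event_checksum(payload) == sent_crc

event = b"INSERT INTO t VALUES (1)"
crc = event_checksum(event)
assert accept_event(event, crc)                 # clean transmission
assert not accept_event(event + b"\x00", crc)   # corrupted in flight
```

For the second bullet, note that MySQL already exposes a manual version of this: running CHECKSUM TABLE on the same table on master and slave and comparing the results by hand.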

To me, as a DBA, these replication issues are very important. I would say that secure, “fail-safe” replication ranks right under being able to do proper backups and restore (which should be number one in every DBA’s book).

I put fail-safe in quotes because I realize that true fail-safe replication isn’t really possible. But when a problem occurs, the server should alert the DBA through error logging. Currently, the server often doesn’t even realize there is a problem. If you are lucky, you find out only because replication gags on something. If you aren’t lucky, you don’t know there is a problem; the master server dies and the slave is promoted to master. Then you find out when someone stumbles across the errant data.

One of the reasons MySQL is moving to row-based replication in server 5.1 is that it should fix the problem of non-deterministic functions causing slave drift. But it will be quite some time before the issues with row-based replication are resolved and the feature is extensively used in production. In addition, it fixes only one part of the problem.
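To see why the row-based approach helps, consider a statement containing a non-deterministic function. Under statement-based replication the slave re-executes the statement and gets a different value; row-based replication ships the resulting row instead. (The table here is hypothetical; binlog_format is the 5.1 server variable.)

```sql
-- Statement-based replication makes the slave re-run UUID() and store
-- a different value than the master did: instant slave drift.
INSERT INTO tokens (id, token) VALUES (1, UUID());

-- 5.1 lets you log the actual row changes instead of the statement:
SET GLOBAL binlog_format = 'ROW';
```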

How can MySQL’s replication issues be fixed? Can we depend on Sun/MySQL to resolve these problems? Ummm . . .  no.  I’m afraid that, for whatever reason, this isn’t very big on their radar. The only realistic way to resolve this problem is for someone, or some group, to develop a patch for the server that provides some or all of this functionality. (I would attempt this myself, but the MySQL server code is so far above my skill level it isn’t even funny. The only thing I can volunteer to do is coordinate things. Probably not very useful.)

I am sure there are more and possibly better solutions to these problems. I would love to hear from the readers about what they think. It would be cool for a company to step forward and sponsor work on this.

In the meantime—if you do not wish to lose all your hair or have it turn grey—when you are setting up a server for MySQL on Linux you should absolutely use LVM. If you are using Solaris, you should use ZFS as your filesystem. Although I don’t have time to go into this, you can create point-in-time partition snapshots with these, and much more easily re-create a slave if it does get out of sync. If you are using Windows or some other system, I don’t know what to tell you.
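A minimal sketch of the LVM approach, assuming the datadir lives on a logical volume at /dev/vg0/mysql (every name and path here is a placeholder, and in practice you would quiesce MySQL, e.g. with FLUSH TABLES WITH READ LOCK, around taking the snapshot):

```shell
# Take a point-in-time snapshot of the MySQL volume,
# with 1G of copy-on-write space for changes.
lvcreate --snapshot --size 1G --name mysql-snap /dev/vg0/mysql

# Mount it read-only and copy the datadir to the new slave.
mount -o ro /dev/vg0/mysql-snap /mnt/mysql-snap
rsync -a /mnt/mysql-snap/ newslave:/var/lib/mysql/

# Clean up once the copy is done.
umount /mnt/mysql-snap
lvremove -f /dev/vg0/mysql-snap
```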

Note: I would like to thank Baron Schwartz for being the first person, to my knowledge, to spend time showing how replication can get out of sync and demonstrating how fragile it can be. Here’s one of his blog posts on the subject. Complaints about this have been around for years, but Baron spent the time and effort to demonstrate how it happens and to provide tools for potential fixes. His “mk-table-sync” utility is a good tool, although I find it is sometimes difficult to get working. More importantly, it is a “fix-it-after-the-fact” tool, not a prevention of the problem. Ever heard the saying, an ounce of prevention is worth a pound of cure?
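For the curious, a typical mk-table-sync invocation looks roughly like the sketch below (host, database, and table names are placeholders; check the tool’s own documentation, since the DSN syntax is one of the things people trip over):

```shell
# Print the statements that would bring the slave's copy of the table
# back in line with its master, without executing them.
mk-table-sync --print --sync-to-master h=slave.example.com,D=mydb,t=mytable
```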


About the Author

I am a MySQL DBA at The Pythian Group (https://www.pythian.com).

8 Comments

The Continuent folks recently released their “Tungsten Stack” that is supposed to make replication more robust:

https://www.continuent.com/index.php?option=com_content&task=view&id=417&Itemid=88

Excerpt:

The initial feature set for Continuent Tungsten Replicator includes:

* Simple set-up procedure
* Proper handling of master failover in presence of multiple slaves (for high availability)
* Master/slave replication of one, some, or all databases
* MySQL statement replication
* Checksums on replication events
* Table consistency check mechanism
* Runs on Linux, Solaris and Windows.

Cheers

Gerry Narvaja
October 2, 2008 1:19 pm

The replication ‘checksum’ patch has been in the works since last UC. Not sure when it will make it out and why it hasn’t happened yet.


Gerry, I’m not sure that’s accurate. People have been talking about the need for it, yeah, and MySQL agreed, but that doesn’t mean it’s in the works.

David Holoboff
October 2, 2008 5:09 pm

I also did not want to wonder about the status of tables and their data in our replicated setup, so I made a little interface to help me along… It did the following:

1) compared the number of rows in tables on master and slave (after a flush, I selected from information_schema.TABLES). This is easily done with MyISAM tables, and so was performed for those tables nightly. If the values did not match, I then performed a COUNT(*), and this resolved more than half of them (an interesting anomaly in and of itself);
2) if the number of rows was off by more than one (sometimes replication would be mid-flight, sending over a record for insertion on the slave), then I had a utility that checked for matching values, as long as the table had an auto_increment column for the primary key (there are other solutions if it does not). This would insert the missing row(s) on the slave or remove (and log) extra rows from the slave;
3) if there was time, I would optimize the table; most often, I did not have time until a planned outage.

Note that I did not do this for InnoDB tables (although possible) and I did not have to worry about ACID compliance with this setup. This was meant for website data that did not need ACID compliance.

I also would not perform a full checksum except in the most extreme of circumstances, as it took a very long time.
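Step 1 of this comparison boils down to a query like the following, run on both servers (the schema name is a placeholder). Note that TABLE_ROWS is exact only for MyISAM; for InnoDB it is an estimate, which may explain the anomalies that a COUNT(*) cleared up:

```sql
SELECT TABLE_NAME, TABLE_ROWS
  FROM information_schema.TABLES
 WHERE TABLE_SCHEMA = 'mydb'
 ORDER BY TABLE_NAME;
```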


Hey everyone,
thanks for the input. Jason, I am aware of Continuent’s work. I haven’t had a chance to work with it so I don’t know if it really resolves any issues. And to me, this is something that should be resolved “in the database”.

Keith


Is there any update on these replication issues? Are the toolkits the best way to go, or should you roll your own to check the status of tables? Does anyone know what would cause a slave to get different data? And if you had two slaves, would one of them drift differently than the other, or does it depend on the commands sent to it?

I liked the points raised in this article, but there hasn’t been an update in two years… There must be something out there now to cover some of these issues.

