MySQL Replication is a powerful tool and it's hard to find a production system not using it. On the other hand, debugging replication issues can be very hard and time-consuming, especially if your replication setup is not straightforward and you use filtering of some kind.
Recently we got an alert from our monitoring system that replication had stopped on a production slave with the following error:
Can't find record in 'some_table', Error_code: 1032; handler error HA_ERR_KEY_NOT_FOUND; the event's master log binlog.000011, end_log_pos 735766642
This means that a ROW-based replication event was about to be applied on the slave, but the row it was supposed to modify could not be found. This is something I like about ROW format: it lets you catch such data synchronization issues right away. In this particular case MIXED format was used, but if this event had been written in STATEMENT format, the slave would have applied it with no error. The event was just an UPDATE, and in STATEMENT format, when an update doesn't affect any rows on the slave, replication simply proceeds to the next event.
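To illustrate the difference (the table name is from the error above, but the column and key value are hypothetical), this is the kind of event that stops ROW-based replication yet slips through silently under STATEMENT format:

```sql
-- Hypothetical example: row id=42 exists on the master but is missing on the slave.
-- The master executes:
UPDATE some_table SET status = 'done' WHERE id = 42;

-- STATEMENT format: the slave re-runs the statement text; it matches 0 rows,
-- reports "0 rows affected", and replication silently moves on.
-- ROW format: the event carries the before-image of row id=42; the slave
-- cannot locate that row and stops with error 1032 (HA_ERR_KEY_NOT_FOUND).
```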
This error was very concerning because this was a new production machine that had been built from the latest hot backup only several hours earlier. Two ideas come to mind in such situations: either something went wrong during the switchover to the new master server, or something was left running on the old production master (which had now become a slave) and was changing data there.
The first idea, an initial replication misconfiguration, was ruled out when we checked the timestamp of the row that existed on the master but was missing on the slave. The row had been added several hours after the switchover happened, so it was very unlikely that the new slave had simply been pointed at the wrong binary log position.
To check the second idea, that something was still writing to the slave, we scanned the slave's binary logs for statements with a server_id equal to the slave's server ID, but there was nothing there. There were also no connections other than replication in the slave's SHOW PROCESSLIST. Further investigation showed that many more rows were missing on the slave, with several different tables affected.
So when all the simple potential root causes have been checked and rejected, it's time to start digging into the binary logs. Unfortunately, there are not many tools available to help you with this. All you have is the mysqlbinlog utility and Linux text-processing tools like sed and grep. To start with something, I took the first row that caused issues on the slave; I had its primary key value and created_at timestamp.
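A typical way to make ROW events grep-able is to have mysqlbinlog decode them into pseudo-SQL. A sketch of the workflow (the binlog file name comes from the error above; the primary key value 42 is hypothetical, and in decoded ROW events columns appear by position as @1, @2, and so on):

```shell
# Decode ROW events into readable pseudo-SQL comments
mysqlbinlog --verbose --base64-output=decode-rows binlog.000011 > decoded.sql

# Search for events touching the missing row (hypothetical PK value 42,
# assumed to be the first column), with some context around each match
grep -B 5 -A 10 '@1=42' decoded.sql
```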
After some time spent debugging the master's binary logs, the following picture emerged: there were several UPDATE statements, but I couldn't find the original INSERT statement that had initially added the row. Checking several other missing rows gave similar results: there were some UPDATEs in the master's binary log, but no initial INSERT. Another interesting fact was that all the missing rows had been created on the master around the same time, within a 5-10 minute interval.
At this point it became clear that some rows had been added on the master but were never written to the binary log. How could this happen? I had another set of possibilities: someone was running a manual session on the master with SQL_LOG_BIN=0, we were hitting some exotic MySQL bug (the new master was built with MySQL 5.5), or something else had happened and I simply didn't have enough facts yet.
I tried to find a correlation between the missing rows, but with no luck, and I knew the customer had a pretty strict access policy for these servers, so most people didn't have write access to the database. Checking the MySQL bug database also didn't reveal anything relevant, and I would expect a bug that prevents some events from being written to the binary log to be found very quickly. And then I remembered something, something that had been implemented on this system several months ago.
There are some stored procedures that are executed on a daily basis. These stored procedures create several temporary tables, fill them with data, massage the data in some way, and then return a result set back to the client. While these procedures do not modify production data in any way, all operations related to temporary table creation are written to the binary log and thus replicated to the slaves, creating absolutely unnecessary additional load there.
To avoid this, a decision was made to disable binary logging inside the stored procedures; it was considered safe because no tables other than temporary ones were updated. Binary logging was disabled at the very beginning of each procedure and re-enabled when the procedure ended:
CREATE DEFINER = 'user'@'localhost' PROCEDURE proc1()
BEGIN
    SET SESSION SQL_LOG_BIN = 0;
    .............
    .............
    .............
    SET SESSION SQL_LOG_BIN = 1;
END;
This approach worked well, until something went wrong.
We did have logging enabled for stored procedures, using our own framework to log every stored procedure step and the overall execution status in a separate logging table. When I checked this log table, I found that one of the stored procedures had failed around the time we started seeing missing rows on the slave. All the pieces came together and it became clear what had actually happened.
When the stored procedure failed in the middle (due to a space issue in the MySQL temp directory), its connection was left in a state where the binary log was disabled and never re-enabled. Then application connection pooling came into play, and this "altered" connection was reused. This means that for some time one of the application threads was using a connection with binary logging disabled. I didn't have any facts pointing exactly to this, and I'm not sure it's even possible to obtain such evidence, but all the other facts indicated this was the root cause.
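In hindsight, the procedure could have protected itself with an exit handler that restores the session variable no matter how the procedure terminates. A sketch of what that might look like (the procedure name and body are placeholders from the earlier listing; RESIGNAL requires MySQL 5.5 or later, which matches the server version in this story):

```sql
CREATE DEFINER = 'user'@'localhost' PROCEDURE proc1()
BEGIN
    -- On any SQL error: re-enable binary logging, then re-raise the error,
    -- so a pooled connection is never handed back with SQL_LOG_BIN = 0.
    DECLARE EXIT HANDLER FOR SQLEXCEPTION
    BEGIN
        SET SESSION SQL_LOG_BIN = 1;
        RESIGNAL;
    END;

    SET SESSION SQL_LOG_BIN = 0;
    -- ... work on temporary tables only ...
    SET SESSION SQL_LOG_BIN = 1;
END;
```

This still would not cover a connection being killed mid-procedure, which is part of why slave-side filtering (discussed in the comments below) is the safer approach.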
Conclusion. I would like to say "do not use custom replication filtering and this solves potential issues," but there are many cases where replication filters are very useful, and as with any more or less complex incident, several factors combined in a particular way to trigger the issue. The MySQL temp directory was placed on a smaller volume on the new master, and this fact was overlooked. When the decision was made to disable binary logging in the stored procedures, not all test cases were considered, such as what happens if a procedure fails. Finally, I would expect a connection pool to be able to save a connection's session variables and restore them when the connection is recycled.
This incident had a serious impact on the customer's production system, but it was also a good lesson in how even very unlikely scenarios can happen, and how looking at a problem from different angles can help reveal the real root cause.
A much better solution to suppress replication of certain temporary tables is to use either a separate schema for such tables or a unique table-name prefix, and then have matching replicate-wild-ignore-table option(s) on the slave(s).
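Assuming the temporary tables were moved into a dedicated schema (the schema name here is hypothetical), the slave-side filtering this comment suggests could look like this in my.cnf:

```
# my.cnf on the slave: skip replication for anything in the scratch schema
# ("tmp_work" is a hypothetical schema name)
replicate-wild-ignore-table = tmp_work.%
```

Because the filter lives on the slave, a failed procedure on the master can no longer leave binary logging disabled for other sessions.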
Yep, this is a good suggestion. It requires more work, but it should be more error-safe. Thanks!
Nice troubleshooting. It's scary that the application could reuse a connection from the pool without resetting its parameters/settings to a well-known default. This could cause all sorts of problems, not only with SQL_LOG_BIN but with many other variables. Imagine if a session changed SQL_MODE or another important variable; that could cause a big mess, couldn't it?
Thanks for posting!
Interesting example. I would never let an application user have the SUPER privilege; too many dangerous things can happen, and SET SQL_LOG_BIN is only one of them. A different schema (already referenced) is the simplest way to do replication filtering.
I would also ask why there is no exception handling in the SP. If your SP changes thread-specific settings, there is an entire code path that goes untested on error.
Agreed, proper exception handling would have prevented this issue, but as far as I know you need to explicitly declare a handler for an error condition, so catching all cases can be tricky.
Anyway, one of the points of the post is that filtering replication like this is a bad idea in general.