I’ve seen many a good DBA make the master of starting slaves from the position in the master.info file, most recently this week, that I want to bring it to everyone’s attention. Of course I mean the underlying issue and not the names of the DBA because that would be cruel.
In the typical scenario where this is an issue, the sequence of events is roughly the same with some small variation. A cold backup or a snapshot is restored onto a new server to build out a new slave. The binary log position from the master.info file, which is part of the backup, is used to start replication. Eventually after a short while, someone notices data discrepancies on the new slave compared to the master or replication stops due to an error.
The problem can be best looked by looking the slave status output in MySQL like below:
mysql> show slave status\G *************************** 1. row *************************** Slave_IO_State: Waiting for master to send event Master_Host: 127.0.0.1 Master_User: msandbox Master_Port: 26768 Connect_Retry: 60 Master_Log_File: mysql-bin.000001 Read_Master_Log_Pos: 4723 Relay_Log_File: mysql_sandbox26769-relay-bin.000002 Relay_Log_Pos: 874 Relay_Master_Log_File: mysql-bin.000001 Slave_IO_Running: Yes Slave_SQL_Running: Yes Replicate_Do_DB: Replicate_Ignore_DB: Replicate_Do_Table: Replicate_Ignore_Table: Replicate_Wild_Do_Table: Replicate_Wild_Ignore_Table: Last_Errno: 0 Last_Error: Skip_Counter: 0 Exec_Master_Log_Pos: 729 Relay_Log_Space: 1042 Until_Condition: None Until_Log_File: Until_Log_Pos: 0 Master_SSL_Allowed: No Master_SSL_CA_File: Master_SSL_CA_Path: Master_SSL_Cert: Master_SSL_Cipher: Master_SSL_Key: Seconds_Behind_Master: 0 Master_SSL_Verify_Server_Cert: No Last_IO_Errno: 0 Last_IO_Error: Last_SQL_Errno: 0 Last_SQL_Error: 1 row in set (0.00 sec) mysql>
There are two sets of coordinates for replication in consideration here, the Execute Master Position and the Read Master Position. The Execute Master Position, denoted by the ‘Relay_Master_Log_File’ and Exec_Master_Log_Pos’ in the output above, is the position in the master’s binary log of the current statement being executed by the slave from the relay log. The Read Master Position, denoted by the ‘Master_Log_File’ and Exec_Master_Log_Pos’ in the output above, is the position in the master’s binary logs which the slave is reading from and writing to the relay log. So if the slave is lagged for some reason, which can happen often in the MySQL world, the Read Master Position will be different and in fact newer then the Execute Master Position. On the file system, the Read Master Position is stored in the master.info file and the Execute Master Position is stored instead in the relay-log.info file.
So if the slave was lagged when it was shut down for the cold backup or when the snapshot was taken, then the two positions would be different. Consider the following example from a server which was lagged behind.
The master.info file shows
--(Wed:20110112:1527)-(0:$)-- more master.info 15 mysql-bin.000001 4723 127.0.0.1 msandbox msandbox 26768 60 0 0 --(wang@polygon)-(/home/wang/sandboxes/rsandbox_5_1_53/node1/data)--
The relay-log.info file shows
--(Wed:20110112:1527)-(0:$)-- more relay-log.info ./mysql_sandbox26769-relay-bin.000002 874 mysql-bin.000001 729 --(wang@polygon)-(/home/wang/sandboxes/rsandbox_5_1_53/node1/data)--
When the replication is started on the new server from the position in the master.info file, the statements in the binary logs between the Execute Master Position and the Read Master Position get skipped and cause data corruption.
In the end, the solution is to always start a new slave using the position in the relay-log.info file.