Over-the-Top Tales from the Trenches.
Motto: Bringing order to the chaos of every day DBA life.
So you have got your nice MySQL Master-Slave replication pair setup. Everything is sweet, then the master dies/restarts or you have a slightly extended network outage.
Your monitoring software (in our case Avail) fires off a page and you are rudely interrupted from reading the Pythian blog.
These real world interruptions, what can I say… it pays the bills.
Anyway being the rounded DBA or enlightened DBA as Babette would say, you are capable of handling any type of database. You log into the machine and check out why the slave threw an error or if your monitoring is slow, why the slave is lagging by 2 hours.
You run SHOW SLAVE STATUS\G
mysql> show slave status \G *************************** 1. row *************************** Slave_IO_State: Waiting for master to send event Master_Host: x.x.x.x Master_User: rep Master_Port: 3306 Connect_Retry: 60
Master_Log_File: mysql-bin.000480 Read_Master_Log_Pos: 690470773 Relay_Log_File: db2-relay-bin.000028 Relay_Log_Pos: 683977007 Relay_Master_Log_File: mysql-bin.000480
Slave_IO_Running: Yes Slave_SQL_Running: No Replicate_Do_DB: Replicate_Ignore_DB: avail,avail Replicate_Do_Table: Replicate_Ignore_Table: Replicate_Wild_Do_Table: Replicate_Wild_Ignore_Table: Last_Errno: 0 Last_Error: Could not parse relay log event entry. The possible reasons are: the master's binary log is corrupted (you can check this by running 'mysqlbinlog' on the binary log), the slave's relay log is corrupted (you can check this by running 'mysqlbinlog' on the relay log), a network problem, or a bug in the master's or slave's MySQL code. If you want to check the master's binary log or slave's relay log, you will be able to know their names by issuing 'SHOW SLAVE STATUS' on this slave. Skip_Counter: 0
Relay_Log_Space: 690471192 Until_Condition: None Until_Log_File: Until_Log_Pos: 0 Master_SSL_Allowed: No Master_SSL_CA_File: Master_SSL_CA_Path: Master_SSL_Cert: Master_SSL_Cipher: Master_SSL_Key: Seconds_Behind_Master: NULL 1 row in set (0.00 sec)
The bits which are important are in red.
Now the normal procedure for this kind of error is to get the Exec_Master_Log_Pos and tell the slave to restart from there. Using a command like:
change master to Master_Log_File='mysql-bin.000480', Master_Log_Pos=126; start slave;
In this case you can kiss goodbye to your slave as a consistent copy of your master database. SHOW SLAVE STATUS has basically lied through its teeth. The master_log_pos is not 126.
To quote my mate Yannick:
this is the error log file entries when the problem happened:
080206 16:03:48 [ERROR] Error reading packet from server: Lost connection to MySQL server during query ( server_errno=2013) 080206 16:03:48 [Note] Slave I/O thread: Failed reading log event, reconnecting to retry, log '
mysql-bin.000480′ position 683976897
080206 16:03:48 [Note] Slave: connected to master 'firstname.lastname@example.org:3306',replication resumed in log 'mysql-bin.000480' at position 683976897 080206 16:04:31 [ERROR] Error in Log_event::read_log_event(): 'read error', data_len: 1966107, event_type: 16 080206 16:04:31 [ERROR] Error reading relay log event: slave SQL thread aborted because of I/O error 080206 16:04:31 [ERROR] Slave: Could not parse relay log event entry. The possible reasons are: the master's binary log is corrupted (you can check this by running 'mysqlbinlog' on the binary log), the slave's relay log is corrupted (you can check this by running 'mysqlbinlog' on the relay log), a network problem, or a bug in the master's or slave's MySQL code. If you want to check the master' s binary log or slave's relay log, you will be able to know their names by issuing 'SHOW SLAVE STATUS' on this slave. Error_code: 0 080206 16:04:31 [ERROR] Error running query, slave SQL thread aborted. Fix the problem, and restart the slave SQL thread with "SLAVE START". We stopped at log
‘mysql-bin.000480’ position 126
Comparing both output, “show slave status” tells us that “I’m stuck at …”
Master_Log_File: mysql-bin.000480 Exec_Master_Log_Pos: 126
while the error log file says at first that it is stuck at:
log 'mysql-bin.000480' position 683976897
then later in the error log is says it is stuck at:
mysql-bin.000480' position 126
So If I had to choose I would choose to restart the slave at log ‘mysql-bin.000480’ position 683976897.
If not you are going to try to re-run a full binlog that has been already applied.
Yannick’s Rules for dodgy master log positions when a slave gets stuck on the relay log corruption:
- If the position is less then let says 500, suspect that the position is wrong (it might be right but it needs to be checked out).
- Always go and check the MySQL error log to see what it has to say about position.
- If the position in the error log as well as in the “show slave status” agree more or less (both are a big number), then normally I would pick the one from “show slave status”.
- If not, then I would pick the one in the error log.
NOTE: you must understand that it is possible for the slave to be stuck at a small position.
This would happen if the master did change binlog and the slave got stuck just after the switch of the master binlog, so you have to be careful.