Over-the-Top Tales from the Trenches.
Motto: Bringing order to the chaos of every day DBA life.
So you have got your nice MySQL Master-Slave replication pair setup. Everything is sweet, then the master dies/restarts or you have a slightly extended network outage.
Your monitoring software (in our case Avail) fires off a page and you are rudely interrupted from reading the Pythian blog.
These real world interruptions, what can I say… it pays the bills.
Anyway being the rounded DBA or enlightened DBA as Babette would say, you are capable of handling any type of database. You log into the machine and check out why the slave threw an error or if your monitoring is slow, why the slave is lagging by 2 hours.
You run SHOW SLAVE STATUS\G
mysql> show slave status \G *************************** 1. row *************************** Slave_IO_State: Waiting for master to send event Master_Host: x.x.x.x Master_User: rep Master_Port: 3306 Connect_Retry: 60
Master_Log_File: mysql-bin.000480 Read_Master_Log_Pos: 690470773 Relay_Log_File: db2-relay-bin.000028 Relay_Log_Pos: 683977007 Relay_Master_Log_File: mysql-bin.000480
Slave_IO_Running: Yes Slave_SQL_Running: No Replicate_Do_DB: Replicate_Ignore_DB: avail,avail Replicate_Do_Table: Replicate_Ignore_Table: Replicate_Wild_Do_Table: Replicate_Wild_Ignore_Table: Last_Errno: 0 Last_Error: Could not parse relay log event entry. The possible reasons are: the master's binary log is corrupted (you can check this by running 'mysqlbinlog' on the binary log), the slave's relay log is corrupted (you can check this by running 'mysqlbinlog' on the relay log), a network problem, or a bug in the master's or slave's MySQL code. If you want to check the master's binary log or slave's relay log, you will be able to know their names by issuing 'SHOW SLAVE STATUS' on this slave. Skip_Counter: 0
Exec_Master_Log_Pos: 126
Relay_Log_Space: 690471192 Until_Condition: None Until_Log_File: Until_Log_Pos: 0 Master_SSL_Allowed: No Master_SSL_CA_File: Master_SSL_CA_Path: Master_SSL_Cert: Master_SSL_Cipher: Master_SSL_Key: Seconds_Behind_Master: NULL 1 row in set (0.00 sec)
The bits which are important are in red.
Now the normal procedure for this kind of error is to get the Exec_Master_Log_Pos and tell the slave to restart from there. Using a command like:
change master to Master_Log_File='mysql-bin.000480', Master_Log_Pos=126; start slave;
In this case you can kiss goodbye to your slave as a consistent copy of your master database. SHOW SLAVE STATUS has basically lied through its teeth. The master_log_pos is not 126.
To quote my mate Yannick:
this is the error log file entries when the problem happened:
080206 16:03:48 [ERROR] Error reading packet from server: Lost connection to MySQL server during query ( server_errno=2013) 080206 16:03:48 [Note] Slave I/O thread: Failed reading log event, reconnecting to retry, log '
mysql-bin.000480′ position 683976897
080206 16:03:48 [Note] Slave: connected to master '[email protected]:3306',replication resumed in log 'mysql-bin.000480' at position 683976897 080206 16:04:31 [ERROR] Error in Log_event::read_log_event(): 'read error', data_len: 1966107, event_type: 16 080206 16:04:31 [ERROR] Error reading relay log event: slave SQL thread aborted because of I/O error 080206 16:04:31 [ERROR] Slave: Could not parse relay log event entry. The possible reasons are: the master's binary log is corrupted (you can check this by running 'mysqlbinlog' on the binary log), the slave's relay log is corrupted (you can check this by running 'mysqlbinlog' on the relay log), a network problem, or a bug in the master's or slave's MySQL code. If you want to check the master' s binary log or slave's relay log, you will be able to know their names by issuing 'SHOW SLAVE STATUS' on this slave. Error_code: 0 080206 16:04:31 [ERROR] Error running query, slave SQL thread aborted. Fix the problem, and restart the slave SQL thread with "SLAVE START". We stopped at log
‘mysql-bin.000480’ position 126
Comparing both output, “show slave status” tells us that “I’m stuck at …”
Master_Log_File: mysql-bin.000480 Exec_Master_Log_Pos: 126
while the error log file says at first that it is stuck at:
log 'mysql-bin.000480' position 683976897
then later in the error log is says it is stuck at:
mysql-bin.000480' position 126
So If I had to choose I would choose to restart the slave at log ‘mysql-bin.000480’ position 683976897.
If not you are going to try to re-run a full binlog that has been already applied.
Yannick’s Rules for dodgy master log positions when a slave gets stuck on the relay log corruption:
- If the position is less then let says 500, suspect that the position is wrong (it might be right but it needs to be checked out).
- Always go and check the MySQL error log to see what it has to say about position.
- If the position in the error log as well as in the “show slave status” agree more or less (both are a big number), then normally I would pick the one from “show slave status”.
- If not, then I would pick the one in the error log.
NOTE: you must understand that it is possible for the slave to be stuck at a small position.
This would happen if the master did change binlog and the slave got stuck just after the switch of the master binlog, so you have to be careful.
Have Fun
Paul
8 Comments. Leave new
I think these are tow different errors, one inside I/O thread and another in sql thread. error related to I/O thread occurred at :
‘mysql-bin.000480′ position 683976897
and error related to SQL thread at :
‘mysql-bin.000480′ position 126
What version was this on?
I recently worked with the replication team to find and fix bugs of this nature. The resulting fixes were pushed into 5.0.54 and 5.1.23, and I haven’t seen one case of SHOW SLAVE STATUS being wrong in this manner since then.
The bugs in question were:
https://bugs.mysql.com/bug.php?id=31190
https://bugs.mysql.com/bug.php?id=12691
Thanks!
Hi Harrison,
Thanks for the update. This was MySQL 5.0.41
[…] Here on the Pythian Group Blog, Paul Moen posted about a situation in which SHOW SLAVE STATUS lies. […]
A similar issue, but equally nasty:
https://bugs.mysql.com/bug.php?id=26540
If the slave server crashes (not mysqld process) then the syncronisation point will be lost and can not be recoved. This effectively requires the slave to be rebuilt. Very nasty. MySQL SHOULD be flushing the .info files on every commit to ensure consistency.
If not done the difference between the “.info” files and the real committed last transaction can be as long as 30 seconds. When the slave starts up it will try to reapply transactions from 30 seconds previously…
Tell your slaves not to crash and don’t pull the power on them. If they do, you’ll need to rebuild them.
Hi Paul,
I tried to replication of master-slave of mysql server. 2 problems are there
1. After setting up in master when I run
show master status;
It shows
mysql> show master status;
+——————+———-+————–+——————+
| File | Position | Binlog_Do_DB | Binlog_Ignore_DB |
+——————+———-+————–+——————+
| mysql-bin.000001 | 98 | test,test | |
+——————+———-+————–+——————+
1 row in set (0.00 sec)
Why 2 same database names will come in Binlog_Do_DB col, where I was configured only under [mysqld]
log-bin=mysql-bin
binlog-do-db=test
server-id=1
2. In slave server when I call show slave status\G
it shows
**************************
Slave_IO_State:
Master_Host: 192.168.0.50
Master_User: slave_user
Master_Port: 3306
Connect_Retry: 60
Master_Log_File:
Read_Master_Log_Pos: 4
Relay_Log_File: mysqld-relay-bin.000001
Relay_Log_Pos: 98
Relay_Master_Log_File:
Slave_IO_Running: No
Slave_SQL_Running: Yes
Replicate_Do_DB:
Replicate_Ignore_DB:
Replicate_Do_Table:
Replicate_Ignore_Table:
Replicate_Wild_Do_Table:
Replicate_Wild_Ignore_Table:
Last_Errno: 0
Last_Error:
Skip_Counter: 0
Exec_Master_Log_Pos: 0
Relay_Log_Space: 98
Until_Condition: None
Until_Log_File:
Until_Log_Pos: 0
Master_SSL_Allowed: No
Master_SSL_CA_File:
Master_SSL_CA_Path:
Master_SSL_Cert:
Master_SSL_Cipher:
Master_SSL_Key:
Seconds_Behind_Master: NULL
1 row in set (0.00 sec)
Slave_IO_State value is blank. can u please let me know why this happened ? and this is not working at all.
Please revert back with u’r expert comments
You would need to post the error log after trying to start the slave, not the whole thing. As it will have more detail as to why the slave SQL thread is running and the slave IO thread has stopped.
Paul, I am using 5.0.45, and have been hunting quite a while for something which properly explained this. You have completely saved me from messing up replication. I followed your example, read my error log carefully, and got everything back up and running. Thank you so very, very much for taking the time to post this.