Tonight a client restarted some instances on EC2 and they got different IPs.
This broke replication because the slaves were trying to connect to the masters using IPs that no longer existed.
The fix is easy: just run CHANGE MASTER TO with the new IP (or the hostname).
There is a small gotcha, though, which you probably already know but which is easy to forget.
From https://dev.mysql.com/doc/refman/5.1/en/change-master-to.html :
If you specify the MASTER_HOST or MASTER_PORT option, the slave assumes that the master server is different from before (even if the option value is the same as its current value.) In this case, the old values for the master binary log file name and position are considered no longer applicable, so if you do not specify MASTER_LOG_FILE and MASTER_LOG_POS in the statement, MASTER_LOG_FILE='' and MASTER_LOG_POS=4 are silently appended to it.
That is: CHANGE MASTER TO MASTER_HOST='new_ip_address' is NOT enough. It will actually break replication, because the slave will restart replication from the first binlog.
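You need to pass the replication coordinates explicitly: master_log_file and master_log_pos. Other options, such as master_user and master_password, keep their previous values when omitted, although restating them makes the statement self-explanatory.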
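As a rough sketch, the complete sequence might look like the one below. The host, user, password, binlog file name and position are all placeholders invented for illustration, not values from this incident, so substitute your own.

```sql
-- The replication threads must be stopped before the master definition changes.
STOP SLAVE;

-- Every value below is a placeholder (assumption for illustration only).
-- MASTER_LOG_FILE and MASTER_LOG_POS are given explicitly so that
-- MASTER_LOG_FILE='' and MASTER_LOG_POS=4 are not silently appended.
CHANGE MASTER TO
  MASTER_HOST     = 'new_ip_address',
  MASTER_USER     = 'repl_user',
  MASTER_PASSWORD = 'repl_password',
  MASTER_LOG_FILE = 'mysql-bin.000123',
  MASTER_LOG_POS  = 4567;

-- Resume replication from the coordinates given above.
START SLAVE;
```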
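To get the replication position (where replication stopped), check either the output of SHOW SLAVE STATUS or the error log. In SHOW SLAVE STATUS, Master_Log_File and Read_Master_Log_Pos show how far the slave has read from the master. If the slave may not have applied everything it read, use Relay_Master_Log_File and Exec_Master_Log_Pos instead, since they point at the last event actually executed.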
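A minimal way to collect those coordinates, assuming you are connected to the slave with the mysql command-line client (the \G terminator is a client feature that prints one field per line):

```sql
-- Run on the slave BEFORE issuing CHANGE MASTER TO, and note the values down.
SHOW SLAVE STATUS\G

-- Fields to look for in the output:
--   Master_Log_File       / Read_Master_Log_Pos : how far the I/O thread has read
--   Relay_Master_Log_File / Exec_Master_Log_Pos : last event executed by the SQL thread
-- When the slave has applied everything it read, both pairs are identical.
```

Recording the values before changing the master matters because, once CHANGE MASTER TO has run with a new host, the old coordinates are discarded.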
A side note for the curious ones: there was no downtime, as the master/slave are really a master/master pair. The “slave” failing to connect to the “master” was the write master trying to connect to the stand-by master.