This is just a silly article about a couple of recent instances where my colleague and I have both been duped by scheduled jobs. The obvious lesson here is to be aware of your surroundings. I’m guilty of berating my 3-year-old for spilling every glass of water she is handed because she’s unaware of her surroundings. She puts it down, forgets that it’s there, and moments later I’m soaking up tears and a beaker’s worth of water. If she knew that her Daddy can be so absent-minded too, she’d probably think that it’s in the genes.
Recently, I performed an upgrade on a Master-Slave pair at a customer site. The requirement was minimal downtime, so we took the traditional ‘upgrade the slave, promote it to master, rinse and repeat’ approach to make sure that the application was only interrupted for the duration of the flush and failover.
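For the curious, the bare bones of that dance look something like the sketch below. The host names (db1 for the old master, db2 for the freshly upgraded slave), the replication user, and the binlog coordinates are all placeholders rather than anything from the actual site.

    # 1. Stop application writes on the old master (note: read_only does not block SUPER users).
    mysql -h db1 -e "SET GLOBAL read_only = ON"

    # 2. Wait for db2 to apply everything left in its relay logs, then promote it.
    mysql -h db2 -e "SHOW SLAVE STATUS\G"      # wait for Seconds_Behind_Master: 0
    mysql -h db2 -e "STOP SLAVE"
    mysql -h db2 -e "SET GLOBAL read_only = OFF"
    mysql -h db2 -e "SHOW MASTER STATUS\G"     # note File / Position for step 3

    # 3. Repoint the application at db2, then make db1 a slave of db2 and
    #    repeat the upgrade there ("rinse and repeat").
    mysql -h db1 -e "CHANGE MASTER TO MASTER_HOST='db2', MASTER_USER='repl',
        MASTER_PASSWORD='secret', MASTER_LOG_FILE='mysql-bin.000001',
        MASTER_LOG_POS=107; START SLAVE"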
Once completed, I returned to the scene to check that things were as expected. I logged in, checked the processlist, the server status, and tailed the error log. Nothing unusual to be seen except that the slave’s SQL Thread had been stopped. I checked with the customer and there was no reason for this to be stopped. There were no errors or claims of manual intervention, and we don’t monitor these machines, so there were no alerts when it had happened. I proceeded to start the SQL Thread and hung around for a while. I reviewed the slave after a few minutes and it was fine. Curious, I thought. I continued with some other tasks and then flicked back to check. It had stopped again! I checked again with the customer but, ‘not us’ was the answer. It must have been mischievous goblins.
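For reference, this is roughly what the symptom looked like: the SQL thread down, yet nothing at all in the error fields (output trimmed to the interesting lines).

    mysql -e "SHOW SLAVE STATUS\G" | grep -E 'Slave_(IO|SQL)_Running|Last_SQL_Err'
         Slave_IO_Running: Yes
        Slave_SQL_Running: No
           Last_SQL_Errno: 0
           Last_SQL_Error: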
I started the SQL thread again, and after some time, the thread was down again. Odd. A few more rounds of stoppages without errors, and I began to suspect something crazy was happening. Whilst I investigated, I thought it harmless to deploy pt-slave-restart to monitor the slave and restart it whenever it detected a stop. After all, there was no error, and if worst came to worst and the standby slave ended up inconsistent, we could take action then. I had a brief scan of the process list at OS level, and this dedicated MySQL machine was just that: running nothing other than MySQL. A quick scan of the MySQL user’s crontab showed nothing of interest.
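In practice those checks amounted to something like the following; the pt-slave-restart flag is from memory, so treat it as indicative rather than gospel.

    # Kick the SQL thread back into life and leave a babysitter watching it
    mysql -e "START SLAVE SQL_THREAD"
    pt-slave-restart --daemonize

    # Nothing but MySQL running on the box, and nothing suspicious in the
    # mysql user's crontab
    ps aux | grep '[m]ysqld'
    crontab -l -u mysql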
I began to review the bugs db and old support tickets for ‘sql thread stopped, no error’ or similar. Understandably, there was nothing that matched my issue. Google produced some results, but nothing relevant. It was getting late in my day, so I handed the troubleshooting off to my colleague on the other side of the world. Maybe he’d have a different perspective on it (wahey!). Indeed, he found the culprit. He reviewed the issue, went straight for the crontabs, and located a job in root’s crontab that initiates a ZFS snapshot every 30 minutes. Within the script logic, it checks for the presence of ‘skip-slave-start’ in the my.cnf file, and if it detects it, it is unwilling to start the slave again. After kicking my own shins, I promptly made a promise to ALWAYS look through crontab entries (and the scripts they call) for these potential trip-hazards.
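I never kept a copy of the customer’s script, but in spirit it boiled down to something like this; the schedule, script name, and ZFS dataset are made up.

    # root's crontab entry:
    #   */30 * * * * /usr/local/bin/mysql-zfs-snapshot.sh

    # mysql-zfs-snapshot.sh, roughly:
    mysql -e "STOP SLAVE SQL_THREAD"                  # quiesce before the snapshot
    zfs snapshot tank/mysql@$(date +%Y%m%d-%H%M)

    # The trip-hazard: replication only gets restarted if skip-slave-start
    # is absent from my.cnf
    if ! grep -q 'skip[-_]slave[-_]start' /etc/my.cnf; then
        mysql -e "START SLAVE SQL_THREAD"
    fi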
Another one! I was told another story recently by a different colleague. As part of some maintenance, a slave was rebuilt, but once it had started up, it kept showing errors and killing the replication threads. Time was spent trying to find the root cause, but this was a read-only slave, and the processlist didn’t show evidence of any other connections being made from outside the host. Nothing running from the crontab(s) was making writes on the machine. After plenty of head scratching and some recon on the master, it transpired that the slave had been built from a binary copy of the master, and so had taken on all server objects in the same state as the master. This included an archive job running under the event scheduler: the master was replicating delete and update statements for rows that the slave’s own copy of the archive job had already removed. Oh how we laugh in retrospect!
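If you suspect the same kind of stowaway after a binary copy, a quick way to spot and defuse it looks something like this (the event name archive_db.purge_old_rows is made up for illustration):

    # What events came across with the binary copy?
    mysql -e "SELECT EVENT_SCHEMA, EVENT_NAME, STATUS FROM information_schema.EVENTS"

    # Stop them firing on the slave -- either globally...
    mysql -e "SET GLOBAL event_scheduler = OFF"

    # ...or per event, so it only wakes up again if this host is ever promoted
    mysql -e "ALTER EVENT archive_db.purge_old_rows DISABLE ON SLAVE"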
Another important lesson here is that if we had known about or documented these jobs and factored them into our execution plans, maybe we wouldn’t have spent precious time fault-finding them.
We are currently doing a big rework of our wiki pages internally, and this is exactly the kind of information that should be recorded for the next engineer who works with this client, just in case. Do you maintain documentation on your environments? What practices of this nature have paid off for you?
1 Comment
I sympathize: had a similar case with “pt-slave-delay” which I forgot about, running in the background. And the slave would periodically start, then stop again…
Also daily cleanup jobs, or sphinx read-it-all queries coming in from other servers… Need to stop the cron job on other servers as well.
This is definitely best left to an automated deployment script like “cease/resume all maintenance and major queries on server.sh”