Understanding MySQL Fabric faulty server detection


A while ago, I found myself analyzing a MySQL Fabric installation to understand why a group member was occasionally being marked as FAULTY even though the server was up and running and no failures had been observed.

                         server_uuid     address  status       mode weight
------------------------------------ ----------- ------- ---------- ------
ab0b0653-6121-11c5-55a0-007543445454 mysql1:3306 PRIMARY READ_WRITE    1.0
f34dd331-2432-11f4-a2d3-006754678533 mysql2:3306 FAULTY  READ_ONLY     1.0


Upon reviewing mysqlfabric logs, I found the following warnings were being logged from time to time:

[WARNING] 1442221217.920115 - FailureDetector(xc_grp_1) - Server (f34dd331-2432-11f4-a2d3-006754678533) in group (xc_grp_1) is unreachable


Since it was not clear to me under which circumstances a server is marked as FAULTY, I decided to review the MySQL Fabric code (Python) to better understand the process.

The module responsible for printing this message is failure_detection.py, more specifically the _run method of the FailureDetector class. This method loops through every server in a group and attempts a connection to the MySQL instance running on that node. The MySQLServer.is_alive method (mysql/fabric/server.py) is called for this purpose.
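To make the idea of a connection attempt bounded by a timeout concrete, here is a minimal sketch. It is not Fabric's is_alive implementation: the is_reachable helper is hypothetical and only checks that a TCP connection can be opened, whereas Fabric connects to the MySQL instance itself.

```python
import socket


def is_reachable(host, port, timeout):
    """Hypothetical helper: True if a TCP connection opens within `timeout` seconds.

    This only tests the TCP handshake; it does not speak the MySQL protocol.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Refused connections, unreachable hosts and timeouts all end up here.
        return False
```

For the group shown earlier, a call such as is_reachable("mysql2", 3306, timeout=1) would return False whenever the connection is not established within one second, even if the server is otherwise healthy.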

Before reviewing the failure detection process, we first need to know that there are four MySQL Fabric parameters that affect when a server is considered unstable or faulty: DETECTION_INTERVAL, DETECTIONS, DETECTION_TIMEOUT and FAILOVER_INTERVAL.



Based on the above variables, the logic followed by FailureDetector._run() to mark a server as FAULTY is the following:

1) Every {DETECTION_INTERVAL/DETECTIONS} seconds, a connection to each server in the group is attempted with a timeout equal to DETECTION_TIMEOUT

2) If DETECTION_TIMEOUT is exceeded, the warning shown above is logged and a counter is incremented

3) When this counter reaches DETECTIONS, the server is considered “unstable” and, if more than FAILOVER_INTERVAL seconds have passed since the master last changed, the server is marked as FAULTY
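The three steps above can be sketched as a small simulation. This is a simplified model, not the actual FailureDetector._run code: the function name and arguments are hypothetical, the time since the last master change is held constant, and the counter is assumed to reset after a successful probe.

```python
def first_faulty_probe(probe_latencies, detections, detection_timeout,
                       failover_interval, seconds_since_master_change):
    """Return the index of the probe that marks the server FAULTY, or None.

    probe_latencies: response time in seconds for each probe (None = no reply).
    """
    failed_attempts = 0
    for index, latency in enumerate(probe_latencies):
        timed_out = latency is None or latency > detection_timeout
        if not timed_out:
            # Assumption: a successful probe clears the failure counter.
            failed_attempts = 0
            continue
        failed_attempts += 1
        unstable = failed_attempts >= detections
        # Assumption: held constant for the whole simulation.
        failover_allowed = seconds_since_master_change > failover_interval
        if unstable and failover_allowed:
            return index
    return None


# A loaded but healthy server answering slowly three times in a row.
slow_replies = [0.2, 1.5, 1.7, 2.0]
print(first_faulty_probe(slow_replies, 3, 1.0, 0, 10))     # index of third timeout
print(first_faulty_probe(slow_replies, 3, 1.0, 3600, 10))  # blocked by FAILOVER_INTERVAL
```

Note that none of the probes in slow_replies represents a server that is actually down; each reply simply arrived after the timeout.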

With a better understanding of the logic followed by MySQL Fabric to detect faulty nodes, I went to the configuration file to check the existing values for each of the parameters:


From the values above, we can see that each group is polled every 2 seconds (DETECTION_INTERVAL/DETECTIONS) and that the monitored server must respond within one second for the check to be considered successful.
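The polling interval is plain division. The sketch below uses illustrative values (6 and 3) chosen only because they reproduce the 2-second figure; they are an assumption, not necessarily the values from the configuration file discussed here.

```python
def polling_interval(detection_interval, detections):
    """Seconds between polls of a group: DETECTION_INTERVAL / DETECTIONS."""
    return detection_interval / detections


# Illustrative values only (assumed, not taken from the original config).
print(polling_interval(6, 3))
```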

On high-concurrency nodes, or nodes under heavy load, a high polling frequency combined with tight timeouts could cause servers to be marked as FAULTY simply because the connection attempt is not completed or processed (in the case of high connection rates or saturated network interfaces) before the timeout expires.

Also, setting FAILOVER_INTERVAL to 0 causes the server to be marked as FAULTY even if a failover has just occurred.

A less aggressive configuration would be more appropriate for a heavily loaded environment:



As with any other database clustering solution that relies on a database connection to test node status, situations where the database server takes longer to respond should also be considered. The polling frequency should be adjusted so that the detection window stays within an acceptable range while the rate of monitoring connections is kept to a minimum. Check timeouts should also be adjusted to avoid false positives caused by the server not being able to respond in a timely manner.
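One rough way to compare candidate settings is to estimate, for each one, the detection window and the number of monitoring connections generated. The sketch below does this under simplifying assumptions: both sets of values are hypothetical, every failed probe is assumed to take the full timeout, and the probes-per-hour figure ignores probe duration, so it is an upper bound.

```python
def summarize(detection_interval, detections, detection_timeout):
    """Estimate polling cost and detection window for one candidate setting."""
    poll = detection_interval / detections
    # Time from the start of the first failed probe until the DETECTIONS-th
    # probe times out, assuming each failed probe lasts the full timeout.
    window = (detections - 1) * (detection_timeout + poll) + detection_timeout
    return {
        "poll_interval": poll,
        "detection_window": window,
        "probes_per_hour": 3600 / poll,  # upper bound, per monitored server
    }


# Hypothetical values for illustration only.
tight = summarize(detection_interval=6, detections=3, detection_timeout=1)
relaxed = summarize(detection_interval=30, detections=3, detection_timeout=5)
print(tight)
print(relaxed)
```

With these example numbers, the relaxed setting generates far fewer monitoring connections and tolerates slower responses, at the cost of a longer delay before a real failure is detected.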




2 Comments

I am getting the same issue: replication is working fine and the “mysqlfabric group lookup_servers mycluster” command shows everything is fine, but the “mysqlfabric group health mycluster” command shows is_alive 0 and status FAULTY for both servers. When I checked the file “/usr/lib/python2.7/site-packages/mysql/fabric/server.py” as you suggested, I only found “_FAILOVER_INTERVAL = _DEFAULT_FAILOVER_INTERVAL = 3600” but none of the other parameters such as “DETECTIONS”, “DETECTIONS_TIMEOUT” or “DETECTIONS_INTERVAL”. Could you help me with this?

Gabriel Ciciliani
September 9, 2016 7:15 am

Each of the Python variables mentioned corresponds to a parameter with the same name (in lower case) in the [failure_tracking] section of fabric.cnf.
Check these links for configuration file details
The link below has a parameter reference

