Understanding MySQL fabric faulty server detection

Posted in: MySQL, Open Source, Technical Track

Awhile ago I found myself analyzing a MySQL fabric installation to understand why a group member was occasionally being marked as FAULTY even when the server was up and running and no failures were observed.  

         
                         server_uuid     address  status       mode weight
------------------------------------ ----------- ------- ---------- ------
ab0b0653-6121-11c5-55a0-007543445454 mysql1:3306 PRIMARY READ_WRITE    1.0
f34dd331-2432-11f4-a2d3-006754678533 mysql2:3306 FAULTY  READ_ONLY     1.0

 

Upon reviewing mysqlfabric logs, I found the following warnings were being logged from time to time:

[WARNING] 1442221217.920115 - FailureDetector(xc_grp_1) - Server (f34dd331-2432-11f4-a2d3-006754678533) in group (xc_grp_1) is unreachable

 

Since I was not clear under which circumstances a server is marked as FAULTY, I decided to review MySQL Fabric code (Python) to better understand the process.

The module responsible for printing this message is failure_detection.py and more specifically, the _run method belonging to FailureDetector class. This method will loop through every server in a group, and attempt a connection to the MySQL instance running on that node. MySQLServer.Is_alive (mysql/fabric/server.py) method is called for this purpose.

Before reviewing the failure detection process, we first need to know that there are four MySQL fabric parameters that will affect when a server is considered unstable or faulty:

DETECTION_TIMEOUT
DETECTION_INTERVAL
DETECTIONS
FAILOVER_INTERVAL

 

Based on the above variables, the logic followed by FailureDetector._run() to mark a server as FAULTY is the following:

1) Every {DETECTION_INTERVAL/DETECTIONS} seconds, a connection against each server in the group is attempted with a timeout equal to DETECTION_TIMEOUT

2) If DETECTION_TIMEOUT is exceeded, the observed message is logged and a counter incremented

3) When this counter reaches DETECTIONS, the server is marked as “unstable” and if the last time the master changed was greater than FAILOVER_INTERVAL ago, the server is marked as FAULTY

With a better understanding of the logic followed by MySQL fabric to detect faulty nodes, I went to the configuration file to check the existing values for each of the parameters:

DETECTIONS=3
DETECTION_TIMEOUT=1
FAILOVER_INTERVAL=0
DETECTION_INTERVAL=6

From the values above we can notice that each group will be polled every 2 seconds (DETECTION_INTERVAL/DETECTIONS) and that the monitored server should respond within a second for the test to be considered successful.

On high concurrency nodes, or nodes under heavy load, a high polling frequency combined with tight timeouts could cause the servers to be marked as FAULTY just because the connection attempt would not be completed or processed (in the case of high connection rates or saturated network interfaces) before the timeout occurs.

Also, having FAILOVER_INTERVAL reduced to 0, will cause the server to be marked as FAULTY even if a failover had just occurred.

A less aggressive configuration would be more appropriated for heavy loaded environment:

DETECTIONS=3 -> It's Ok
DETECTION_TIMEOUT -> 5 TO 8
FAILOVER_INTERVAL -> 1200
DETECTION_INTERVAL -> 45 TO 60

Conclusion

As with any other database clustering solution that relies on a database connection to test node status, situations where the database server would take longer to respond should also be considered. The polling frequency should be adjusted so the detection window is within an acceptable range, but the rate of monitoring connections generated is also kept to the minimum. Check timeouts should also be adjusted to avoid false positives caused by the server not being able to respond in a timely manner.

 

email

Interested in working with Gabriel? Schedule a tech call.

Internal Principal Consultant

2 Comments. Leave new

I am getting same issue as replication is working fine, “mysqlfabric group lookup_servers mycluster” command is showing everything fine but “mysqlfabric group health mycluster” command is showing is_alive 0 for both servers and also showing status as FAULTY for both servers. when I tried to check file “/usr/lib/python2.7/site-packages/mysql/fabric/server.py” as you suggested then only getting ” _FAILOVER_INTERVAL = _DEFAULT_FAILOVER_INTERVAL = 3600″ but not getting other parameters like “DETECTIONS”, “DETECTIONS_TIMEOUT”, “DETECTIONS_INTERVAL”. May you help me in this.

Reply
Gabriel Ciciliani
September 9, 2016 7:15 am

Each of the python variables mentioned corresponds with a parameter with the same name (lower case) in the [failure_tracking] section, within the fabric.cnf.
Check this links for configuration file details
https://dev.mysql.com/doc/mysql-utilities/1.5/en/fabric-configuration-file.html
The link below has a parameter reference
https://dev.mysql.com/doc/mysql-utilities/1.5/en/fabric-cfgref.html

Reply

Leave a Reply

Your email address will not be published. Required fields are marked *