Monitoring your 5.7 InnoDB cluster status


Recently we had a customer who came to us for help with monitoring their InnoDB cluster setup. They run a 5.7 InnoDB cluster and suffered a network outage that split up their cluster, leaving it with no writeable primaries. As part of their incident follow-up, they asked us for methods to monitor this cluster.

I had previously tested InnoDB clusters (both the 8.0 and 5.7 variants) and was confident that we could parse the cluster node “role” (read-write aka primary vs. read-only aka secondary) from the performance_schema tables. As it turned out, this feature exists only in 8.0, not in 5.7. The 5.7 docs are wrong on this point, as they suggest that the performance_schema.replication_group_members table shows the PRIMARY or SECONDARY role of each cluster node. I have submitted a bug report to Oracle requesting that the MEMBER_ROLE information be back-ported from 8.0 (or at least that the docs be updated).

Initial monitoring attempts

Our initial monitoring approach was to check the current number of nodes in the cluster. To have a writeable cluster, we need at least two nodes, and preferably three for fault tolerance.

I started exploring the options. My test cluster looked like this at the start:

MySQL  node1:3306 ssl  JS > cluster.status();
{
   "clusterName": "innodb_cluster",
   "defaultReplicaSet": {
       "name": "default",
       "primary": "node1:3306",
       "ssl": "REQUIRED",
       "status": "OK",
       "statusText": "Cluster is ONLINE and can tolerate up to ONE failure.",
       "topology": {
           "node1:3306": {
               "address": "node1:3306",
               "mode": "R/W",
               "readReplicas": {},
               "role": "HA",
               "status": "ONLINE"
           },
           "node2:3306": {
               "address": "node2:3306",
               "mode": "R/O",
               "readReplicas": {},
               "role": "HA",
               "status": "ONLINE"
           },
           "node3:3306": {
               "address": "node3:3306",
               "mode": "R/O",
               "readReplicas": {},
               "role": "HA",
               "status": "ONLINE"
           }
       },
       "topologyMode": "Single-Primary"
   },
   "groupInformationSourceMember": "node1:3306"
}

This is how it looks in performance_schema:

mysql> SELECT * FROM performance_schema.replication_group_members;
+---------------------------+--------------------------------------+-------------+-------------+--------------+
| CHANNEL_NAME              | MEMBER_ID                            | MEMBER_HOST | MEMBER_PORT | MEMBER_STATE |
+---------------------------+--------------------------------------+-------------+-------------+--------------+
| group_replication_applier | 3811bcf1-1e85-11e9-bcae-5254008481d5 | node2       | 3306        | ONLINE       |
| group_replication_applier | 381400ec-1e85-11e9-b322-5254008481d5 | node1       | 3306        | ONLINE       |
| group_replication_applier | 39380c64-1e85-11e9-b2d2-5254008481d5 | node3       | 3306        | ONLINE       |
+---------------------------+--------------------------------------+-------------+-------------+--------------+
3 rows in set (0.00 sec)

The client mentioned that their cluster issues were triggered by a network problem, so I simulated one by blocking all traffic on my eth1 adapter (the adapter the cluster communicates over) using this iptables command:

iptables -A INPUT -p tcp -i eth1 -j DROP

My error log shows the node splitting off from the cluster:

# tail -3 /var/log/mysqld.log
2019-01-23T09:36:11.223618Z 0 [Warning] Plugin group_replication reported: 'Member with address node3:3306 has become unreachable.'
2019-01-23T09:36:12.226226Z 0 [Warning] Plugin group_replication reported: 'Members removed from the group: node3:3306'
2019-01-23T09:36:12.226309Z 0 [Note] Plugin group_replication reported: 'Group membership changed to node2:3306, node1:3306 on view 15481913107442210:14.'

And in performance_schema it now looks like this:

mysql> SELECT * FROM performance_schema.replication_group_members;
+---------------------------+--------------------------------------+-------------+-------------+--------------+
| CHANNEL_NAME              | MEMBER_ID                            | MEMBER_HOST | MEMBER_PORT | MEMBER_STATE |
+---------------------------+--------------------------------------+-------------+-------------+--------------+
| group_replication_applier | 3811bcf1-1e85-11e9-bcae-5254008481d5 | node2       | 3306        | ONLINE       |
| group_replication_applier | 381400ec-1e85-11e9-b322-5254008481d5 | node1       | 3306        | ONLINE       |
+---------------------------+--------------------------------------+-------------+-------------+--------------+
2 rows in set (0.00 sec)

To get the number of online nodes, you could run a query like this:

mysql> SELECT COUNT(*) FROM performance_schema.replication_group_members WHERE MEMBER_STATE = 'ONLINE';
+----------+
| COUNT(*) |
+----------+
|        2 |
+----------+
1 row in set (0.00 sec)

Here you need to make sure the member count is greater than one, as a single node would be a partitioned node without a quorum and thus not writeable by design. In my three-node cluster, a value of 2 could trigger a WARNING-level alert: the cluster is still operational but no longer tolerates node failures. A value of 3 or higher would be considered “good.” If your cluster is larger than three nodes, the appropriate thresholds are documented in the MySQL Group Replication fault-tolerance table.
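As a sketch, here is how that threshold logic could look in a shell check. The `mysql` invocation is commented out since it needs a live cluster, and `alert_for_count` is my own helper name, not anything official:

```shell
#!/bin/sh
# Map the ONLINE member count to an alert level for a three-node cluster.
# On a live node you would obtain the count with something like:
#   count=$(mysql -N -s -e "SELECT COUNT(*) FROM performance_schema.replication_group_members WHERE MEMBER_STATE = 'ONLINE'")
alert_for_count() {
  if [ "$1" -le 1 ]; then
    echo CRITICAL   # partitioned node, no quorum, not writeable
  elif [ "$1" -eq 2 ]; then
    echo WARNING    # still operational, but no fault tolerance left
  else
    echo OK
  fi
}

alert_for_count 2   # prints WARNING
```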

Alternative approach

What we really want to monitor is that there is at least one PRIMARY node in the cluster. InnoDB Cluster is designed to allow multiple primaries, but that is not the default configuration. Therefore we check for “at least one” rather than “exactly one” primary.

In this approach, we could attempt to find the current primary and write a check to see if this node is “ONLINE”. This could be done like this:

mysql> SELECT COUNT(*) FROM performance_schema.replication_group_members
    ->  WHERE member_id = (SELECT variable_value FROM performance_schema.global_status
    ->                      WHERE variable_name = 'group_replication_primary_member')
    ->    AND MEMBER_STATE = 'ONLINE';
+----------+
| COUNT(*) |
+----------+
|        1 |
+----------+
1 row in set (0.00 sec)

Note: I have not tested how this looks on a multi-primary cluster, so at this point I would check for a value of 1.
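A minimal sketch of that check in shell, assuming a single-primary cluster (the query is commented out since it needs a live cluster; `check_primary` is a hypothetical helper of my own):

```shell
#!/bin/sh
# Alert unless exactly one ONLINE primary is found.
# On a live node the count would come from the query above, e.g.:
#   primary_online=$(mysql -N -s -e "SELECT COUNT(*) FROM performance_schema.replication_group_members WHERE member_id = (SELECT variable_value FROM performance_schema.global_status WHERE variable_name = 'group_replication_primary_member') AND MEMBER_STATE = 'ONLINE'")
check_primary() {
  if [ "$1" -eq 1 ]; then
    echo OK
  else
    echo CRITICAL   # no ONLINE primary reachable: cluster is not writeable
  fi
}

check_primary 1   # prints OK on a healthy single-primary cluster
```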

Let’s go JSON

The cluster is designed to always promote a new primary as long as there is a quorum. So what you really want to monitor is the cluster status from the JSON output of the `cluster.status()` command. I’ve been testing with the `mysqlsh` tool to see if you can script that out. I’ve been using MySQL Shell 8.0 (this is the recommended version to run even on a 5.7 cluster, so my suggestion would be to upgrade the mysql-shell package).

To script this out you can use the command `mysqlsh root@127.0.0.1 -- cluster status`, which provides JSON output that you can easily parse using the command-line `jq` tool:

With three active nodes:

# mysqlsh root@127.0.0.1 -- cluster status | jq .defaultReplicaSet.status
"OK"

With two active nodes:

# mysqlsh root@127.0.0.1 -- cluster status | jq .defaultReplicaSet.status
"OK_NO_TOLERANCE"

When you are in a NO_QUORUM state, there are some messages that prevent the output from being valid JSON:

# mysqlsh root@127.0.0.1 -- cluster status
WARNING: Cluster has no quorum and cannot process write transactions: 2 out of 3 members of the InnoDB cluster are unreachable from the member we’re connected to, which is not sufficient for a quorum to be reached.
WARNING: You are connected to an instance in state 'Read Only'
Write operations on the InnoDB cluster will not be allowed.
...

However, we can grep these messages out so the output becomes valid JSON:

# mysqlsh root@127.0.0.1 -- cluster status | grep -v WARNING | grep -v 'Write operations on the InnoDB cluster will not be allowed.' | jq .defaultReplicaSet.status
"NO_QUORUM"
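Putting this together, a small wrapper could translate the cluster status into Nagios-style exit codes. The `mysqlsh` pipeline is commented out since it needs a live cluster, and the status-to-exit-code mapping below is my own suggestion, not an official convention:

```shell
#!/bin/sh
# Translate the cluster status string into Nagios-style exit codes.
# On a live node the status would come from the pipeline above, e.g.:
#   status=$(mysqlsh root@127.0.0.1 -- cluster status 2>/dev/null \
#     | grep -v WARNING \
#     | grep -v 'Write operations on the InnoDB cluster will not be allowed.' \
#     | jq -r .defaultReplicaSet.status)
status_exit_code() {
  case "$1" in
    OK)              echo 0 ;;  # healthy and fault tolerant
    OK_NO_TOLERANCE) echo 1 ;;  # WARNING: writeable, but no tolerance left
    *)               echo 2 ;;  # CRITICAL: NO_QUORUM or unknown status
  esac
}

status_exit_code OK_NO_TOLERANCE   # prints 1
```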

Conclusion

All three of these options can be implemented in almost any monitoring solution and will give you some insight into the status of your InnoDB cluster.


About the Author

Lead Database Consultant
Matthias has been passionate about computers since the age of 10. He has been working with them ever since. Currently he's a Lead Database Consultant in one of the MySQL teams at Pythian where he's the technical lead for his team. Together with his team he works to provide the best possible service to the customers.
