MariaDB version upgrade to 10.1.31 breaks Galera cluster


Recently we faced an issue where the configuration management software had automatically upgraded the mariadb-server-10.1 package to the latest version, 10.1.31. The upgrade broke the Galera cluster on that installation.

I set out to recreate the issue in my local lab and managed to reproduce it.

I created a 3-node Galera setup: galera1 (192.168.55.100), galera2 (192.168.55.101) and galera3 (192.168.55.102). All 3 servers run MariaDB 10.1.30, and Galera replication works fine.
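
One way to confirm that replication is healthy before touching anything is to check that every node reports the full cluster size. The sketch below assumes the `mysql` client can log in on the node without extra options; both function names are made up for this post.

```shell
# Read "name<TAB>value" rows on stdin, as printed by `mysql -N -B`,
# and print the value of wsrep_cluster_size.
parse_cluster_size() {
    awk -F'\t' '$1 == "wsrep_cluster_size" { print $2 }'
}

# Hypothetical wrapper: succeed only if the local node sees the expected
# number of cluster members (3 in the lab setup described above).
check_cluster_size() {
    local expected actual
    expected=$1
    actual=$(mysql -N -B -e "SHOW GLOBAL STATUS LIKE 'wsrep_cluster_size'" | parse_cluster_size)
    [ "$actual" = "$expected" ]
}
```

Running `check_cluster_size 3` on each node should succeed while all three members are joined.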

This is the basic Galera config, shown here for galera2:

# cat /etc/mysql/conf.d/cluster.cnf
#########################################################
# Galera config
#########################################################
[mysqld]
wsrep_on                                  = ON
wsrep_provider                            = /usr/lib/libgalera_smm.so
wsrep_provider_options                    =
wsrep_cluster_name                        = pxc_bootstrap
wsrep_cluster_address                     = gcomm://192.168.55.100,192.168.55.101,192.168.55.102
wsrep_node_address                        = 192.168.55.101
wsrep_log_conflicts                       = 1
wsrep_sst_method                          = xtrabackup-v2
wsrep_sst_auth                            = sstuser:sstpass
# Galera overrides
binlog_format                             = ROW
innodb_autoinc_lock_mode                  = 2
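
Before blaming the upgrade, it is worth ruling out a typo in this file. The sketch below reads two settings from a config file like the one above and checks that the node's own address is one of the cluster members. The helper names are hypothetical, and the parser only handles plain `key = value` lines.

```shell
# Print the value of one "key = value" setting from a my.cnf-style file.
cnf_value() {
    awk -v key="$2" '
        { k = $0; sub(/=.*/, "", k); gsub(/[ \t]/, "", k) }
        k == key {
            v = $0
            sub(/^[^=]*=[ \t]*/, "", v)
            sub(/[ \t]+$/, "", v)
            print v
            exit
        }
    ' "$1"
}

# Succeed if wsrep_node_address is listed in wsrep_cluster_address.
node_in_cluster_address() {
    local addr members
    addr=$(cnf_value "$1" wsrep_node_address)
    members=$(cnf_value "$1" wsrep_cluster_address)
    members=${members#gcomm://}
    [ -n "$addr" ] || return 1
    case ",$members," in
        *",$addr,"*) return 0 ;;
        *) return 1 ;;
    esac
}
```

For the file above, `node_in_cluster_address /etc/mysql/conf.d/cluster.cnf` would succeed, because 192.168.55.101 appears in the gcomm:// list.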

When I upgrade the galera2 node to 10.1.31, it fails to rejoin the cluster. This is its error log:

2018-02-19 18:07:09 140541460941568 [Note] WSREP: State transfer required:
	Group state: ba08f7ac-1589-11e8-8944-27e143bb408f:285676
	Local state: ba08f7ac-1589-11e8-8944-27e143bb408f:273759
2018-02-19 18:07:09 140541460941568 [Note] WSREP: New cluster view: global state: ba08f7ac-1589-11e8-8944-27e143bb408f:285676, view# 33: Primary, number of nodes: 3, my index: 0, protocol version 3
2018-02-19 18:07:09 140541460941568 [Warning] WSREP: Gap in state sequence. Need state transfer.
2018-02-19 18:07:09 140541165565696 [Note] WSREP: Running: 'wsrep_sst_xtrabackup-v2 --role 'joiner' --address '192.168.55.101' --datadir '/var/lib/mysql/'   --parent '4754'  '' '
WSREP_SST: [INFO] Streaming with xbstream (20180219 18:07:09.327)
WSREP_SST: [INFO] Using socat as streamer (20180219 18:07:09.329)
WSREP_SST: [INFO] Stale sst_in_progress file: /var/lib/mysql//sst_in_progress (20180219 18:07:09.332)
WSREP_SST: [INFO] Evaluating timeout -k 110 100 socat -u TCP-LISTEN:4444,reuseaddr stdio | xbstream -x; RC=( ${PIPESTATUS[@]} ) (20180219 18:07:09.354)
2018-02-19 18:07:11 140541190731520 [Note] WSREP: (b2ffafb5, 'tcp://0.0.0.0:4567') connection to peer b2ffafb5 with addr tcp://192.168.55.101:4567 timed out, no messages seen in PT3S
2018-02-19 18:07:11 140541190731520 [Note] WSREP: (b2ffafb5, 'tcp://0.0.0.0:4567') turning message relay requesting off
WSREP_SST: [ERROR] Possible timeout in receving first data from donor in gtid stage (20180219 18:08:49.362)
WSREP_SST: [ERROR] Cleanup after exit with status:32 (20180219 18:08:49.364)
2018-02-19 18:08:49 140541165565696 [ERROR] WSREP: Failed to read 'ready <addr>' from: wsrep_sst_xtrabackup-v2 --role 'joiner' --address '192.168.55.101' --datadir '/var/lib/mysql/'   --parent '4754'  ''
	Read: '(null)'
2018-02-19 18:08:49 140541165565696 [ERROR] WSREP: Process completed with error: wsrep_sst_xtrabackup-v2 --role 'joiner' --address '192.168.55.101' --datadir '/var/lib/mysql/'   --parent '4754'  '' : 32 (Broken pipe)
2018-02-19 18:08:49 140541460941568 [ERROR] WSREP: Failed to prepare for 'xtrabackup-v2' SST. Unrecoverable.
2018-02-19 18:08:49 140541460941568 [ERROR] Aborting
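
In a log this noisy, the failure is easier to see once the informational lines are filtered out. A minimal sketch of such a filter follows; the function name is made up, and the log path in the usage note is an assumption that depends on your configuration.

```shell
# Print the error lines written by the SST script itself (prefixed
# WSREP_SST) and by the server's wsrep layer, in file order.
sst_errors() {
    grep -E 'WSREP_SST: \[ERROR\]|\[ERROR\] WSREP:' "$1"
}
```

On the joiner, something like `sst_errors /var/log/mysql/error.log` would reduce the excerpt above to the timeout, the failed read of the 'ready' line, and the "Unrecoverable" message.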

The xtrabackup script on the donor side is never executed. This is the log from galera1 over the same period:

2018-02-19 18:07:08 139902381455104 [Note] WSREP: Node b91cf7da state prim
2018-02-19 18:07:08 139902381455104 [Note] WSREP: view(view_id(PRIM,b2ffafb5,33) memb {
	b2ffafb5,0
	b91cf7da,0
	c9241678,0
} joined {
} left {
} partitioned {
})
2018-02-19 18:07:08 139902381455104 [Note] WSREP: save pc into disk
2018-02-19 18:07:08 139902373062400 [Note] WSREP: New COMPONENT: primary = yes, bootstrap = no, my_idx = 2, memb_num = 3
2018-02-19 18:07:08 139902373062400 [Note] WSREP: STATE EXCHANGE: Waiting for state UUID.
2018-02-19 18:07:09 139902373062400 [Note] WSREP: STATE EXCHANGE: sent state msg: b3995870-159f-11e8-8bdd-c675484f2591
2018-02-19 18:07:09 139902373062400 [Note] WSREP: STATE EXCHANGE: got state msg: b3995870-159f-11e8-8bdd-c675484f2591 from 0 (galera2)
2018-02-19 18:07:09 139902373062400 [Note] WSREP: STATE EXCHANGE: got state msg: b3995870-159f-11e8-8bdd-c675484f2591 from 1 (galera3)
2018-02-19 18:07:09 139902373062400 [Note] WSREP: STATE EXCHANGE: got state msg: b3995870-159f-11e8-8bdd-c675484f2591 from 2 (galera1)
2018-02-19 18:07:09 139902373062400 [Note] WSREP: Quorum results:
	version    = 4,
	component  = PRIMARY,
	conf_id    = 32,
	members    = 2/3 (joined/total),
	act_id     = 285676,
	last_appl. = 0,
	protocols  = 0/7/3 (gcs/repl/appl),
	group UUID = ba08f7ac-1589-11e8-8944-27e143bb408f
2018-02-19 18:07:09 139902373062400 [Note] WSREP: Flow-control interval: [28, 28]
2018-02-19 18:07:09 139902373062400 [Note] WSREP: Trying to continue unpaused monitor
2018-02-19 18:07:09 139902653332224 [Note] WSREP: New cluster view: global state: ba08f7ac-1589-11e8-8944-27e143bb408f:285676, view# 33: Primary, number of nodes: 3, my index: 2, protocol version 3
2018-02-19 18:07:09 139902653332224 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
2018-02-19 18:07:09 139902653332224 [Note] WSREP: REPL Protocols: 7 (3, 2)
2018-02-19 18:07:09 139902653332224 [Note] WSREP: Assign initial position for certification: 285676, protocol version: 3
2018-02-19 18:07:09 139902431176448 [Note] WSREP: Service thread queue flushed.
2018-02-19 18:07:11 139902381455104 [Note] WSREP: (c9241678, 'tcp://0.0.0.0:4567') turning message relay requesting off
2018-02-19 18:08:51 139902381455104 [Note] WSREP: declaring b91cf7da at tcp://192.168.55.102:4567 stable
2018-02-19 18:08:51 139902381455104 [Note] WSREP: forgetting b2ffafb5 (tcp://192.168.55.101:4567)
2018-02-19 18:08:51 139902381455104 [Note] WSREP: Node b91cf7da state prim
2018-02-19 18:08:51 139902381455104 [Note] WSREP: view(view_id(PRIM,b91cf7da,34) memb {
	b91cf7da,0
	c9241678,0
} joined {
} left {
} partitioned {
	b2ffafb5,0
})
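
The absence of a donor-side SST run can also be checked mechanically. The sketch below assumes that a donor logs the launch of its SST script in the same "Running: ..." format that the joiner log shows, only with the role set to 'donor'; the function name is hypothetical.

```shell
# Succeed if the log shows the server launching an SST script in the
# donor role; fail if no such line is present.
donor_sst_started() {
    grep -q -e "--role 'donor'" "$1"
}
```

Run against the error log of each surviving node, a failure on all of them is consistent with the state transfer never having been requested from a donor.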

After some digging into the wsrep_sst_xtrabackup-v2 script, I traced the problem back to the function wait_for_listen. This function was rewritten for MariaDB 10.1.31 to add support for FreeBSD, and the rewrite appears to have broken it on Linux.
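
To make the role of such a function concrete: the joiner log shows the server waiting to read a 'ready <addr>' line from the script, and the script can only report that once its socat listener is up. The sketch below is not the code shipped with MariaDB. It is a minimal illustration of the general technique, assuming a Linux host with the `ss` utility from iproute2, and the function name is made up.

```shell
# Illustrative only: NOT the wait_for_listen shipped with MariaDB.
# Poll the kernel's socket table until something listens on the given
# TCP port, or give up after the given number of tries.
wait_for_listener() {
    local port tries i
    port=$1
    tries=${2:-50}
    i=0
    while [ "$i" -lt "$tries" ]; do
        if ss -ltn "sport = :$port" 2>/dev/null | grep -q LISTEN; then
            return 0
        fi
        i=$((i + 1))
        sleep 0.2
    done
    return 1
}
```

A joiner-style script would start its listener in the background, call `wait_for_listener 4444`, and only then announce that it is ready. Inspecting the socket table rather than opening a test connection matters here: the socat command in the log listens without forking, so a probe connection would likely use up the one connection meant for the donor. If the check never succeeds, for example because the tool it relies on is missing or its output does not match, the ready line is never sent and the SST times out as shown above.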

I have filed a bug report in the MariaDB Jira. If you are using MariaDB with Galera cluster, I suggest holding off on upgrading your installation to 10.1.31 for now.
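
If an automated tool might pull in the new package anyway, the upgrade can be blocked at the package-manager level. The sketch below assumes an apt-based system (the package name and config path in this post suggest Debian or Ubuntu, but that is an assumption) and needs root for the hold itself. The function names are made up.

```shell
# Succeed if a package version string is a 10.1.31 build.
# An optional epoch prefix such as "1:" is ignored.
is_affected_version() {
    local v
    v=${1#*:}
    case $v in
        10.1.31|10.1.31[!0-9]*) return 0 ;;
        *) return 1 ;;
    esac
}

# Hypothetical wrapper: hold the package when the upgrade candidate
# offered by apt is the affected release.
hold_if_affected() {
    local pkg candidate
    pkg=${1:-mariadb-server-10.1}
    candidate=$(apt-cache policy "$pkg" | awk '/Candidate:/ { print $2 }')
    if is_affected_version "$candidate"; then
        apt-mark hold "$pkg"
    fi
}
```

Once a fixed release is available, `apt-mark unhold mariadb-server-10.1` lifts the hold again.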


About the Author

Principal Consultant
Matthias has been passionate about computers since the age of 10. He has been working with them ever since. Currently he's a Lead Database Consultant in one of the MySQL teams at Pythian where he's the technical lead for his team. Together with his team he works to provide the best possible service to the customers.

3 Comments

Francis Guslinski
February 24, 2018 8:38 am

You can download https://raw.githubusercontent.com/MariaDB/server/10.2/scripts/wsrep_sst_xtrabackup-v2.sh and use it to replace the script at /usr/bin/wsrep_sst_xtrabackup-v2.

It will work again.

It seems to be another good update for MariaDB, which will be helpful in different respects.

trangtriquangcao
June 26, 2020 12:35 pm

I am having the same problem when deploying xtrabackup-v2 with MariaDB 10.2. When I then deploy with MariaDB 10.4, I always get the log message “Resource Limits:” although my server has few
