How to Set Up REPMGR with WITNESS for PostgreSQL 10


This blog post will go over how to set up and implement repmgr, an open-source tool that manages replication between PostgreSQL primary and replica nodes, allowing for quick and easy failover and rebuilding of replicas. For reference, all commands are run as root. Commands that need to be run as the postgres user, I will run using su. As an example:

su - postgres -c 'COMMAND RUN AS POSTGRES USER'

Outline

  • The Setup
  • SSH
  • REPMGR
  • PostgreSQL Primary Configuration
  • PostgreSQL Replica Configuration
  • PostgreSQL Witness Configuration
  • Summary

The Setup

The environment that I will be working in consists of four CentOS 7 servers with a default install of PostgreSQL 10, installed using yum from the PostgreSQL 10 repo. With PostgreSQL installed, the postgres user will already exist on each server. I will configure the first server as the primary master. The second and third servers will become replicas of the primary master. The fourth and final server will become a witness server, used for voting in automatic failover scenarios.

base-centos-1 = primary master
base-centos-2 = replica
base-centos-3 = replica
base-centos-4 = witness

SSH

In order for repmgr to work correctly, the postgres user account will need to be able to SSH to each of the other servers with no prompt. To do this, I went to each server and generated an SSH key, then collected all of the public SSH keys and placed them in an authorized_keys file on each server. This allowed me to SSH between the servers with no prompt. I have SELinux disabled by default. If you are using SELinux, make sure it is set up correctly for SSH and authorized keys.

On each server, I did the following:

1) [ON ALL SERVERS] Become the Postgres user, create the ssh key, then cat out the public key:

su - postgres -c 'ssh-keygen'
su - postgres -c 'cat ~/.ssh/id_rsa.pub'

2) [ON ALL SERVERS] Create an authorized keys file and set the permissions:

su - postgres -c 'touch ~/.ssh/authorized_keys'
su - postgres -c 'chmod 600 ~/.ssh/authorized_keys'
su - postgres -c 'vi ~/.ssh/authorized_keys'

3) [ON ALL SERVERS] Place all of the public keys from Step 1 into the authorized_keys files created in Step 2.  The authorized_keys file contents looked like the following on all of my servers, with each key as a new line:

ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQDTXxjFY8dLs2GVRpDY7asAK5SvwITPVSJN9ItnwsVtzCpZgX/Mbnkc/jHgwuIGb0srh/KthByyYJi14QViI+x7xVQm8eyuqMBtORt9rl6rLF73H/gYPO9jONIbD/yihxJMWmJK1Ro6Armhfj5OyTXyW6vjbXqvl7fuSi4n13ubW+G7Pnk8jDK+5rOFYve6Czmde7cQPeueo7sY4oZCbhO+Vr+HwK6qUNUsP3/iRj/bjpbNh8gaqQDl5y5XCNV32y3IhgMgsimO4EkSbb03y0lcWI8dJ6asnXUZumvUwasrFcQ2SFRqb/F3kK4L2Ofy9qYo0yV+FyCuJlsoEur1IRKF postgres@base-centos-1
ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQDYHmDFOtYc/VDxccNRQEnDYBTE8QDiUTMX46PX1p5tvs6qvP3VPMEccs4um0YVFXZTmbnvyeN3bBPe23NS5Pal6ySfAxIdAAOt5YzKoJ0BJAoHksfTchXsCvSs1zIBLXLSRTYdmjb2s+EWxB9elle4Z2KbZrXbzVkSQuUOdMtbjTPUfIqcVh8kokCznpVjKSEXtHg9Vx1Whg20brw99EzhcBeS5q+jJtQbaSa6VG3VUmsznxtoWiA9EgMZ9C7hcmxTrtKNEJrQvd1LpBObTYbueV7+MVlgShTct88alun12iheT6x6ien445X1lYjLmkjGT8CHNqdos+VQR9sgT5R7 postgres@base-centos-2
ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQCgQNra4/rDOVWr5uV5nSa49yPLlgAJ6crsYKpvhfRr3L5J/T48QE468Am5t3lA318Nst2FbObq7dduqBNhOQDurPlTiPd9cWQsj0mVySnZ2gD7sc8epyRcNbR5cNBum2JKkez4X8FxArMZwYkvPd28dAPFd30NwLYgIFhQ/jzScweyfX1RYp4s+AOI0XpBAiuQZ6r5Gxz+h61jmsbCwM4ZmT4J8OFIF/x/GFExF126PB2PqEg41OK2AGZ2eneTtaoDxMoVDPuy1vhdUw/aODf1dPZYRTn2rwbdXDW7E6Y8/dxT3Lo1n8v/syboy/MsF50NdYzwOb0/Q4dtYsVPpemV postgres@base-centos-3
ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQC5UAsmEkw9INHwXL6cpHqOy5O8VIpvLsfklzoHYfmGxBhGqi6nZzV/+TAzpotrmAf7PIUEdzWOm1lTfii1iRU821ks1bSPN2FeBXzK5Mhtl6KHpIXeDOS64WEJRcen2cQIw9f5f6Ji6PBlc3AQBvspxuNtth6Y0vOTq2N+kFTGOM95kuuIzRs7QvU0XrCoMKOn/BBE+e8ad72gwjCP1q9aNheDJBCsB9WhJerN+LOLr2gJiA0Ha6xxSGQ5WoAB17nYSQunsSvvAU079wfVt/e6OMozyNC6X/+I7IRcRg8IHS+aMKvKrasfi5PTawqoideB4QngM6lHmUR9/MFf9Ikt postgres@base-centos-4

4) I am now able to become the Postgres user and SSH to any of the other servers and log in without a password prompt. This is an example on base-centos-1:

[root@base-centos-1 vagrant]# su - postgres
Last login: Mon Jul 22 18:27:34 UTC 2019 on pts/0
-bash-4.2$ ssh base-centos-2
Last login: Mon Jul 22 18:22:31 2019
-bash-4.2$ hostname
base-centos-2
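Rather than testing each pair of hosts by hand, the passwordless setup can be verified from each server with a small loop. This is only a sketch: the check_ssh helper is hypothetical, it assumes the four hostnames from this post resolve, and the SSH_CMD override exists purely so the loop can be exercised without live servers.

```shell
#!/bin/sh
# Sketch (hypothetical helper): confirm passwordless SSH to every peer.
# BatchMode makes ssh fail immediately instead of prompting for a password.
check_ssh() {
  ssh_cmd="${SSH_CMD:-ssh}"   # overridable so the loop can be dry-run
  for host in base-centos-1 base-centos-2 base-centos-3 base-centos-4; do
    if $ssh_cmd -o BatchMode=yes -o ConnectTimeout=5 "$host" true 2>/dev/null; then
      echo "$host: ok"
    else
      echo "$host: FAILED"
    fi
  done
}
```

Run it as the postgres user on each server; any FAILED line means that host is missing a key in its authorized_keys file.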

REPMGR

For repmgr, I first installed the repmgr software, and then configured it on each server to make sure all of the configurations were correct. When installing PostgreSQL 10 on my servers, I installed the PostgreSQL repository using the following RPM. This repo also includes repmgr, which is where I am installing the software from. You can find the different yum repositories at https://yum.postgresql.org/repopackages.php:

yum install -y https://download.postgresql.org/pub/repos/yum/reporpms/EL-7-x86_64/pgdg-redhat-repo-latest.noarch.rpm

1) [ON ALL SERVERS] I installed repmgr from the PostgreSQL 10 repository. Make sure you install the correct repmgr for the version of PostgreSQL that you are running:

yum install -y repmgr10.x86_64

2) [ON ALL SERVERS] Next, I configured repmgr on each server with that server's specific details. Make sure to update each file according to the server and its paths. As an example, for each server I updated the node_id, node_name, and conninfo to match the server I was configuring. If you are using a different version of PostgreSQL, make sure you update the paths for the version you are using. Do NOT assume the paths in my example will be the same on your system. In my setup, the configuration file was located at /etc/repmgr/10/repmgr.conf.

node_id=101
node_name='base-centos-1'
conninfo='host=base-centos-1 dbname=repmgr user=repmgr'
data_directory='/var/lib/pgsql/10/data/'
config_directory='/var/lib/pgsql/10/data'
log_file='/var/log/repmgr.log'
repmgrd_service_start_command = '/usr/pgsql-10/bin/repmgrd -d'
repmgrd_service_stop_command = 'kill `cat $(/usr/pgsql-10/bin/repmgrd --show-pid-file)`'
promote_command='repmgr standby promote -f /etc/repmgr/10/repmgr.conf --siblings-follow --log-to-file'
follow_command='repmgr standby follow -f /etc/repmgr/10/repmgr.conf --log-to-file'
failover=automatic
reconnect_attempts=3
reconnect_interval=5
ssh_options='-q -o StrictHostKeyChecking=no -o ConnectTimeout=10'
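Since only node_id, node_name, and conninfo differ between servers, the per-node portion of the file can be generated rather than hand-edited. This is a sketch under the assumptions of this post (base-centos-N hostnames, node IDs of the form 10N); the gen_repmgr_conf helper is hypothetical.

```shell
#!/bin/sh
# Sketch (hypothetical helper): emit the node-specific portion of
# repmgr.conf for a given host, following this post's naming scheme.
gen_repmgr_conf() {
  host="$1"
  n="${host##*-}"   # base-centos-2 -> 2
  cat <<EOF
node_id=10$n
node_name='$host'
conninfo='host=$host dbname=repmgr user=repmgr'
EOF
}

# e.g. on base-centos-2:
# gen_repmgr_conf "$(hostname -s)" > /tmp/repmgr-node.conf
```

The remaining settings (paths, failover options) are identical on every node and can be appended from a shared template.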

3) [ON ALL SERVERS] I created the log file that I configured in Step 2 so I would not get an error when starting the service:

su - postgres -c 'touch /var/log/repmgr.log'

 

PostgreSQL Primary Configuration

Next, I started working on the primary server to get it set up within repmgr. For the configuration of pg_hba.conf and postgresql.conf, I am only updating them on the primary master, because when I rebuild the replicas from the master, these changes will get copied over. If your server keeps the configuration files separate from the data directory, you will need to update these configurations on all of the servers, not just on the primary master.

1) Create the repmgr user account and repmgr database that repmgr will use to manage the cluster. The repmgr user account will also be used by the replica servers to replicate from the primary master.

su - postgres -c 'createuser --replication --createdb --createrole --superuser repmgr'
su - postgres -c "psql -c 'ALTER USER repmgr SET search_path TO repmgr, \"\$user\", public;'"
su - postgres -c 'createdb repmgr --owner=repmgr'

2) Update pg_hba.conf to allow the repmgr account to authenticate. With trust being used, the repmgr user account in the database can authenticate without a password. If you are building a production environment, you will want a more secure method using md5 and passwords. These changes won't take effect until the PostgreSQL service is restarted, which I will do after the next step. In my setup, this file is located at /var/lib/pgsql/10/data/pg_hba.conf. I found that I had the best results by specifying the IPs.

host    replication     repmgr          192.168.56.101/32       trust
host    replication     repmgr          192.168.56.102/32       trust
host    replication     repmgr          192.168.56.103/32       trust
host    replication     repmgr          192.168.56.104/32       trust
host    repmgr          repmgr          192.168.56.101/32       trust
host    repmgr          repmgr          192.168.56.102/32       trust
host    repmgr          repmgr          192.168.56.103/32       trust
host    repmgr          repmgr          192.168.56.104/32       trust
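Because the same eight entries are needed again later on the witness, they can be appended with a small idempotent helper instead of editing by hand. This is only a sketch; add_repmgr_hba_entries is a hypothetical function, and the IP range matches the one used in this post.

```shell
#!/bin/sh
# Sketch (hypothetical helper): append the trust entries for repmgr to a
# pg_hba.conf, skipping any line that is already present.
add_repmgr_hba_entries() {
  hba="$1"   # path to pg_hba.conf
  for ip in 101 102 103 104; do
    for db in replication repmgr; do
      line="host    $db    repmgr    192.168.56.$ip/32    trust"
      grep -qxF "$line" "$hba" || printf '%s\n' "$line" >> "$hba"
    done
  done
}

# e.g.: add_repmgr_hba_entries /var/lib/pgsql/10/data/pg_hba.conf
```

Running it a second time is a no-op, which makes it safe to reuse during the witness setup.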

3) Next, I configured the PostgreSQL configuration file to allow replication to occur by setting wal_level and other settings. I also added the repmgr shared library to the postgresql.conf file. In my setup, the PostgreSQL configuration file is at /var/lib/pgsql/10/data/postgresql.conf.

listen_addresses = '*'
shared_preload_libraries = 'repmgr'
wal_level = replica
archive_mode = on
max_wal_senders = 10
hot_standby = on
archive_command = 'cp -i %p /var/lib/pgsql/10/data/archive/%f'
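Before restarting, the file can be sanity-checked for the settings above. A minimal sketch, assuming the conf path from this setup; check_replication_settings is a hypothetical helper that matches the exact lines as written in this step.

```shell
#!/bin/sh
# Sketch (hypothetical helper): verify that postgresql.conf contains the
# replication settings from this step; prints anything that is missing.
check_replication_settings() {
  conf="$1"; missing=0
  for setting in "shared_preload_libraries = 'repmgr'" \
                 "wal_level = replica" \
                 "archive_mode = on" \
                 "hot_standby = on"; do
    grep -qF "$setting" "$conf" || { echo "missing: $setting"; missing=1; }
  done
  return "$missing"
}

# e.g.: check_replication_settings /var/lib/pgsql/10/data/postgresql.conf
```

A non-zero exit status (and a "missing:" line) means the restart in the next step would not bring up replication as intended.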

4) I then created the archive directory that I specified in the PostgreSQL configuration file, using the postgres user to make sure it had the correct permissions, and then I restarted the PostgreSQL server to pick up the new settings.

su - postgres -c 'mkdir /var/lib/pgsql/10/data/archive'
systemctl enable postgresql-10.service
systemctl restart postgresql-10.service
systemctl status postgresql-10.service

5) Now that repmgr and PostgreSQL are both configured, I will register my PostgreSQL server with repmgr and then start the repmgr daemon service so that it monitors the status of the replication cluster.

su - postgres -c 'repmgr primary register'
su - postgres -c 'repmgr daemon start'
su - postgres -c 'repmgr daemon status'

 ID  | Name          | Role    | Status    | Upstream      | repmgrd | PID  | Paused? | Upstream last seen
-----+---------------+---------+-----------+---------------+---------+------+---------+--------------------
 101 | base-centos-1 | primary | * running |               | running | 5496 | no      | n/a

 

PostgreSQL Replica Configuration

At this stage, I already have SSH set up and repmgr installed and configured on all of the servers. Now I will use repmgr to back up the database from the primary server and restore it onto the two replica servers, base-centos-2 and base-centos-3. Because my pg_hba.conf and postgresql.conf files are in the same directory as my data, these files will also get copied over during the backup and restore process. I will be running the following commands on both of the replica servers, but not on the final witness server, which I will configure later.

1) Stop PostgreSQL and clear the data directory:

systemctl stop postgresql-10.service
rm -rf /var/lib/pgsql/10/data/*

2) Next, I backed up and restored the data from the primary server, and then started the PostgreSQL server and viewed the status to make sure it was running:

su - postgres -c "repmgr -h base-centos-1 -U repmgr -d repmgr standby clone"
systemctl start postgresql-10.service
systemctl status postgresql-10.service

3) Then I registered the replica with the repmgr cluster and started the repmgr daemon service to monitor the server in the cluster:

su - postgres -c 'repmgr standby register -h base-centos-1 -U repmgr'
su - postgres -c 'repmgr daemon start'
su - postgres -c 'repmgr daemon status'

 ID  | Name          | Role    | Status    | Upstream      | repmgrd | PID  | Paused? | Upstream last seen
-----+---------------+---------+-----------+---------------+---------+------+---------+--------------------
 101 | base-centos-1 | primary | * running |               | running | 5496 | no      | n/a
 102 | base-centos-2 | standby |   running | base-centos-1 | running | 3414 | no      | 1 second(s) ago
 103 | base-centos-3 | standby |   running | base-centos-1 | running | 4549 | no      | 1 second(s) ago

 

4) To verify that replication is up and running, I can run the following commands:

ON PRIMARY


su - postgres -c 'psql -c "select pid, usename, client_addr, backend_start, state, sync_state from pg_stat_replication;"'

  pid  | usename |  client_addr   |         backend_start         |   state   | sync_state
-------+---------+----------------+-------------------------------+-----------+------------
 15347 | repmgr  | 192.168.56.102 | 2019-07-22 18:19:31.232492+00 | streaming | async
 15363 | repmgr  | 192.168.56.103 | 2019-07-22 18:19:36.566369+00 | streaming | async

ON REPLICAS


su - postgres -c 'psql --pset expanded=auto -c "select * from pg_stat_wal_receiver;"'

-[ RECORD 1 ]---------+------------------------------------------------------------------
pid                   | 5408
status                | streaming
receive_start_lsn     | 0/B000000
receive_start_tli     | 3
received_lsn          | 0/B0025B8
received_tli          | 3
last_msg_send_time    | 2019-07-22 18:22:17.942712+00
last_msg_receipt_time | 2019-07-22 18:22:17.943306+00
latest_end_lsn        | 0/B0025B8
latest_end_time       | 2019-07-22 18:19:47.617303+00
slot_name             |
conninfo              | user=repmgr host='base-centos-1' application_name='base-centos-3'
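The LSN values in the output above (for example, received_lsn = 0/B0025B8) can be turned into a byte-based lag figure. A pg_lsn is two hex numbers, high/low, where the absolute position is high * 2^32 + low. A sketch, with hypothetical helper names:

```shell
#!/bin/sh
# Sketch (hypothetical helpers): convert a pg_lsn like 0/B0025B8 into an
# absolute byte position, and diff two LSNs to get lag in bytes.
lsn_to_bytes() {
  hi="${1%/*}"; lo="${1#*/}"
  echo $(( 0x$hi * 4294967296 + 0x$lo ))   # high * 2^32 + low
}
lag_bytes() {
  echo $(( $(lsn_to_bytes "$1") - $(lsn_to_bytes "$2") ))
}

# e.g. primary write position vs replica receive position:
lag_bytes 0/B0025B8 0/B000000   # -> 9656
```

On a live cluster, the two inputs would come from pg_current_wal_lsn() on the primary and pg_last_wal_receive_lsn() on a replica.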

PostgreSQL Witness Configuration

Having a separate witness is good in the event of a network outage, when you want an extra server to vote on which node should become the new master. This helps prevent split-brain scenarios, with the understanding that you cannot truly prevent split brain unless you implement a consensus-based coordination solution such as Consul or ZooKeeper. The important thing to note when setting up a witness is that the witness has to use its own repmgr database. You CANNOT make a standby replica server a witness. I will NOT be restoring the database from the primary onto the witness server, so I will have to repeat a couple of the steps to create a separate repmgr user and database. I then have to configure the pg_hba settings to allow the repmgr user from the other servers to talk to the witness.

1) Create the repmgr user account and repmgr database that repmgr will use to manage the cluster, just as on the primary.

su - postgres -c 'createuser --replication --createdb --createrole --superuser repmgr'
su - postgres -c "psql -c 'ALTER USER repmgr SET search_path TO repmgr, \"\$user\", public;'"
su - postgres -c 'createdb repmgr --owner=repmgr'

2) Update pg_hba.conf to allow the repmgr account to authenticate. With trust being used, the repmgr user can authenticate without a password. If you are building a production environment, you will want a more secure method using md5 and passwords. These changes won't take effect until PostgreSQL reloads its configuration, which you can trigger with systemctl reload postgresql-10.service. In my setup, this file is located at /var/lib/pgsql/10/data/pg_hba.conf.

host    replication     repmgr          192.168.56.101/32       trust
host    replication     repmgr          192.168.56.102/32       trust
host    replication     repmgr          192.168.56.103/32       trust
host    replication     repmgr          192.168.56.104/32       trust
host    repmgr          repmgr          192.168.56.101/32       trust
host    repmgr          repmgr          192.168.56.102/32       trust
host    repmgr          repmgr          192.168.56.103/32       trust
host    repmgr          repmgr          192.168.56.104/32       trust

3) Register the witness with the primary server in the cluster:

su - postgres -c 'repmgr witness register -h base-centos-1 -U repmgr'
su - postgres -c 'repmgr daemon start'
su - postgres -c 'repmgr daemon status'

 ID  | Name          | Role    | Status    | Upstream      | repmgrd | PID  | Paused? | Upstream last seen
-----+---------------+---------+-----------+---------------+---------+------+---------+--------------------
 101 | base-centos-1 | primary | * running |               | running | 5496 | no      | n/a
 102 | base-centos-2 | standby |   running | base-centos-1 | running | 3414 | no      | 1 second(s) ago
 103 | base-centos-3 | standby |   running | base-centos-1 | running | 4549 | no      | 1 second(s) ago
 104 | base-centos-4 | witness | * running | base-centos-1 | running | 3226 | no      | 0 second(s) ago

Summary

Now I have a fully configured repmgr cluster that allows me to fail over from the primary to one of the replicas. As an example, I can run the following to migrate the primary role to one of the standbys. The switchover command has to be run on the standby that is becoming the new primary:

su - postgres -c 'repmgr --dry-run -h base-centos-2 standby switchover --siblings-follow'
su - postgres -c 'repmgr -h base-centos-2 standby switchover --siblings-follow'
su - postgres -c 'repmgr cluster show'
su - postgres -c 'repmgr cluster event'

SWITCHOVER TO base-centos-2

su - postgres -c 'repmgr -h base-centos-2 standby switchover --siblings-follow'

WARNING: following problems with command line parameters detected:
  database connection parameters not required when executing UNKNOWN ACTION
NOTICE: executing switchover on node "base-centos-2" (ID: 102)
NOTICE: local node "base-centos-2" (ID: 102) will be promoted to primary; current primary "base-centos-1" (ID: 101) will be demoted to standby
NOTICE: stopping current primary node "base-centos-1" (ID: 101)
NOTICE: issuing CHECKPOINT
DETAIL: executing server command "/usr/pgsql-10/bin/pg_ctl  -D '/var/lib/pgsql/10/data' -W -m fast stop"
INFO: checking for primary shutdown; 1 of 60 attempts ("shutdown_check_timeout")
INFO: checking for primary shutdown; 2 of 60 attempts ("shutdown_check_timeout")
NOTICE: current primary has been cleanly shut down at location 0/C000028
NOTICE: promoting standby to primary
DETAIL: promoting server "base-centos-2" (ID: 102) using "/usr/pgsql-10/bin/pg_ctl  -w -D '/var/lib/pgsql/10/data' promote"
waiting for server to promote.... done
server promoted
NOTICE: waiting up to 60 seconds (parameter "promote_check_timeout") for promotion to complete
NOTICE: STANDBY PROMOTE successful
DETAIL: server "base-centos-2" (ID: 102) was successfully promoted to primary
INFO: local node 101 can attach to rejoin target node 102
DETAIL: local node's recovery point: 0/C000028; rejoin target node's fork point: 0/C000098
NOTICE: setting node 101's upstream to node 102
WARNING: unable to ping "host=base-centos-1 dbname=repmgr user=repmgr"
DETAIL: PQping() returned "PQPING_NO_RESPONSE"
NOTICE: starting server using "/usr/pgsql-10/bin/pg_ctl  -w -D '/var/lib/pgsql/10/data' start"
NOTICE: NODE REJOIN successful
DETAIL: node 101 is now attached to node 102
NOTICE: executing STANDBY FOLLOW on 2 of 2 siblings
INFO:  node 104 received notification to follow node 102
INFO: STANDBY FOLLOW successfully executed on all reachable sibling nodes
NOTICE: switchover was successful
DETAIL: node "base-centos-2" is now primary and node "base-centos-1" is attached as standby
NOTICE: STANDBY SWITCHOVER has completed successfully

CLUSTER STATUS

su - postgres -c 'repmgr cluster show'

 ID  | Name          | Role    | Status    | Upstream      | Location | Priority | Timeline | Connection string
-----+---------------+---------+-----------+---------------+----------+----------+----------+----------------------------------------------
 101 | base-centos-1 | standby |   running | base-centos-2 | default  | 100      | 3        | host=base-centos-1 dbname=repmgr user=repmgr
 102 | base-centos-2 | primary | * running |               | default  | 100      | 4        | host=base-centos-2 dbname=repmgr user=repmgr
 103 | base-centos-3 | standby |   running | base-centos-2 | default  | 100      | 3        | host=base-centos-3 dbname=repmgr user=repmgr
 104 | base-centos-4 | witness | * running | base-centos-2 | default  | 0        | 1        | host=base-centos-4 dbname=repmgr user=repmgr

CLUSTER EVENTS FOR THE SWITCHOVER


su - postgres -c 'repmgr cluster event'

 Node ID | Name          | Event                      | OK | Timestamp           | Details
---------+---------------+----------------------------+----+---------------------+-------------------------------------------------------------------------
 102     | base-centos-2 | child_node_new_connect     | t  | 2019-07-22 20:00:04 | new witness "base-centos-4" (ID: 104) has connected
 102     | base-centos-2 | child_node_new_connect     | t  | 2019-07-22 20:00:04 | new standby "base-centos-3" (ID: 103) has connected
 104     | base-centos-4 | witness_register           | t  | 2019-07-22 19:59:59 | witness registration succeeded; upstream node ID is 102
 103     | base-centos-3 | standby_follow             | t  | 2019-07-22 19:59:59 | standby attached to upstream node "base-centos-2" (ID: 102)
 104     | base-centos-4 | repmgrd_upstream_reconnect | t  | 2019-07-22 19:59:59 | witness monitoring connection to primary node "base-centos-2" (ID: 102)
 102     | base-centos-2 | repmgrd_reload             | t  | 2019-07-22 19:59:58 | monitoring cluster primary "base-centos-2" (ID: 102)
 101     | base-centos-1 | repmgrd_standby_reconnect  | t  | 2019-07-22 19:59:58 | node has become a standby, monitoring connection to upstream node 102
 102     | base-centos-2 | standby_switchover         | t  | 2019-07-22 19:59:54 | node 102 promoted to primary, node 101 demoted to standby
 101     | base-centos-1 | node_rejoin                | t  | 2019-07-22 19:59:54 | node 101 is now attached to node 102
 102     | base-centos-2 | standby_promote            | t  | 2019-07-22 19:59:54 | server "base-centos-2" (ID: 102) was successfully promoted to primary

About the Author

MySQL Database Consultant
Kevin Markwardt has twenty years of system administration experience spanning MySQL, Linux, Windows, and VMware. Over the last six years he has been dedicated to MySQL and Linux administration with a focus on scripting, automation, HA, and cloud solutions. Kevin has led and assisted with many projects focusing on larger-scale implementations of technologies, including ProxySQL, Orchestrator, Pacemaker, GCP, AWS RDS, and MySQL. Kevin Markwardt is a certified GCP Professional Cloud Architect and a certified AWS Solutions Architect - Associate. Currently he is a Project Engineer at Pythian specializing in MySQL and large-scale client projects. One of his new directives is Postgres, and he currently supports multiple internal production Postgres instances.
