Testing Postgres Replication Health

Postgres Replication Tester

This repository houses pgrt, a small utility that will connect to all of the nodes in a PostgreSQL streaming replication cluster and verify the health and well-being of each node.

Example

A healthy cluster, with some leeway on the replication lag (-l):

$ pgrt -M 10.244.232.2 -S 10.244.232.3 -S 10.244.232.4 -l 31768
10.244.232.2: 0/B30BA98
10.244.232.3: 0/B30BA98 (0)            to 0/B30BC28 (-400)
10.244.232.4: 0/B30BDC0 (-808)         to 0/B30BDC0 (0)

The same cluster, reporting as unhealthy because we only tolerate 800 bytes of replication lag (admittedly, fairly unrealistic):

$ pgrt -M 10.244.232.2 -S 10.244.232.3 -S 10.244.232.4 -l 800
10.244.232.2: 0/B17F098
10.244.232.3: 0/B17F230 (-408)         to 0/B17F230 (0)
10.244.232.4: 0/B17F558 (-1216)        to 0/B17F6E8 (-400)          !! too far behind write master
FAILED
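
The parenthesized figures in the output above can be reproduced by hand. The following is a minimal sketch, not pgrt's actual code: it assumes each position is a standard 64-bit Postgres xlog location (high 32 bits in hex before the slash, low 32 bits after) and that the magnitude of the difference is what gets compared against -l. The names parseLSN, lagBytes, and withinLag are hypothetical helpers introduced for illustration.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseLSN converts a position such as "0/B30BA98" into an absolute byte
// offset. Assumption: high 32 bits before the slash, low 32 bits after.
func parseLSN(s string) (uint64, error) {
	parts := strings.SplitN(s, "/", 2)
	if len(parts) != 2 {
		return 0, fmt.Errorf("malformed xlog position %q", s)
	}
	hi, err := strconv.ParseUint(parts[0], 16, 32)
	if err != nil {
		return 0, fmt.Errorf("bad high word in %q: %v", s, err)
	}
	lo, err := strconv.ParseUint(parts[1], 16, 32)
	if err != nil {
		return 0, fmt.Errorf("bad low word in %q: %v", s, err)
	}
	return hi<<32 | lo, nil
}

// lagBytes returns master minus slave as a signed byte count, which matches
// the sign convention seen in the sample output.
func lagBytes(master, slave uint64) int64 {
	return int64(master - slave)
}

// withinLag reports whether the magnitude of delta is at most maxLag.
// Assumption: pgrt compares the absolute difference against -l.
func withinLag(delta, maxLag int64) bool {
	if delta < 0 {
		delta = -delta
	}
	return delta <= maxLag
}

func main() {
	master, err := parseLSN("0/B17F098")
	if err != nil {
		fmt.Println(err)
		return
	}
	for _, pos := range []string{"0/B17F230", "0/B17F558"} {
		slave, err := parseLSN(pos)
		if err != nil {
			fmt.Println(err)
			return
		}
		d := lagBytes(master, slave)
		fmt.Printf("%s: %d bytes, within 800: %v\n", pos, d, withinLag(d, 800))
	}
	// Prints:
	// 0/B17F230: -408 bytes, within 800: true
	// 0/B17F558: -1216 bytes, within 800: false
}
```

With the positions from the unhealthy run, this yields -408 and -1216, the same first figures that pgrt printed for the two slaves.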

Exit Codes

pgrt exits 0 if it can contact all nodes, each node is playing its specified role (i.e., the write master is a write master, and the read slaves are actually read slaves), and the replication lag (the first parenthesized figure) is below the acceptable lag (per -l).

It exits non-zero on failure, with the following meanings:

1 - Option processing or other non-runtime error. Check your flags.
2 - Connectivity to at least one node failed.
3 - A query to the write master failed.
4 - A query to one of the read slaves failed.
5 - xlog conversion failed (if this happens, something is terribly broken...).
6 - One or more of the read slaves was lagging too far behind the master (based on -l).

Options

-M, --master   Replication master host.  May only be specified once
-S, --slave    Replication slave host(s).  May be specified more than once
-p, --port     TCP port that Postgres listens on (default: 6432)
-u, --user     User to connect as
-w, --password Password to connect with
-D, --debug    Enable debugging output (to standard error)
-l, --lag      Maximum acceptable lag behind the master xlog position (bytes) (default: 8192)