1 of 3 nodes failing with "no recovery target specified"

See original GitHub issue

Hi,

I have been struggling for a couple of days now to get a 3-node Patroni/PostgreSQL setup running properly. I've been following this guide: https://www.linode.com/docs/databases/postgresql/create-a-highly-available-postgresql-cluster-using-patroni-and-haproxy/.

The master and one secondary node start fine every time, but the third node always fails to start Postgres with the following:

```
Jul 16 14:30:45 devpostgres03 patroni[10951]: 2018-07-16 14:30:45,213 INFO: Lock owner: devpostgres01; I am devpostgres03
Jul 16 14:30:45 devpostgres03 patroni[10951]: 2018-07-16 14:30:45,228 INFO: Lock owner: devpostgres01; I am devpostgres03
Jul 16 14:30:45 devpostgres03 patroni[10951]: 2018-07-16 14:30:45,255 INFO: Local timeline=14 lsn=0/60005F8
Jul 16 14:30:45 devpostgres03 patroni[10951]: 2018-07-16 14:30:45,269 INFO: master_timeline=22
Jul 16 14:30:45 devpostgres03 patroni[10951]: 2018-07-16 14:30:45,271 INFO: master: history=1  0/19C4218  no recovery target specified
Jul 16 14:30:45 devpostgres03 patroni[10951]: 2  0/19C4638  no recovery target specified
Jul 16 14:30:45 devpostgres03 patroni[10951]: 3  0/40000D0  no recovery target specified
Jul 16 14:30:45 devpostgres03 patroni[10951]: 4  0/4000410  no recovery target specified
Jul 16 14:30:45 devpostgres03 patroni[10951]: 5  0/4000830  no recovery target specified
Jul 16 14:30:45 devpostgres03 patroni[10951]: 6  0/4000C50  no recovery target specified
Jul 16 14:30:45 devpostgres03 patroni[10951]: 7  0/4000F90  no recovery target specified
Jul 16 14:30:45 devpostgres03 patroni[10951]: 8  0/40011F0  no recovery target specified
Jul 16 14:30:45 devpostgres03 patroni[10951]: 9  0/4001370  no recovery target specified
Jul 16 14:30:45 devpostgres03 patroni[10951]: 10  0/40015D0  no recovery target specified
Jul 16 14:30:45 devpostgres03 patroni[10951]: 11  0/6000178  no recovery target specified
Jul 16 14:30:45 devpostgres03 patroni[10951]: 12  0/60002F8  no recovery target specified
Jul 16 14:30:45 devpostgres03 patroni[10951]: 13  0/6000478  no recovery target specified
Jul 16 14:30:45 devpostgres03 patroni[10951]: 14  0/60007B8  no recovery target specified
Jul 16 14:30:45 devpostgres03 patroni[10951]: 15  0/6000938  no recovery target specified
Jul 16 14:30:45 devpostgres03 patroni[10951]: 16  0/80000D0  no recovery target specified
Jul 16 14:30:45 devpostgres03 patroni[10951]: 17  0/8000330  no recovery target specified
Jul 16 14:30:45 devpostgres03 patroni[10951]: 18  0/80004B0  no recovery target specified
Jul 16 14:30:45 devpostgres03 patroni[10951]: 19  0/8000630  no recovery target specified
Jul 16 14:30:45 devpostgres03 patroni[10951]: 20  0/8000970  no recovery target specified
Jul 16 14:30:45 devpostgres03 patroni[10951]: 21  0/8000AF0  no recovery target specified
Jul 16 14:30:45 devpostgres03 patroni[10951]: 2018-07-16 14:30:45,280 INFO: starting as a secondary
Jul 16 14:30:46 devpostgres03 patroni[10951]: 2018-07-16 14:30:46,004 INFO: postmaster pid=11404
Jul 16 14:30:46 devpostgres03 patroni[10951]: < 2018-07-16 14:30:46.008 UTC >LOG: redirecting log output to logging collector process
Jul 16 14:30:46 devpostgres03 patroni[10951]: < 2018-07-16 14:30:46.008 UTC >HINT: Future log output will appear in directory "pg_log".
Jul 16 14:30:46 devpostgres03 patroni[10951]: 10.10.10.213:5432 - no response
Jul 16 14:30:47 devpostgres03 patroni[10951]: 2018-07-16 14:30:47,020 ERROR: postmaster is not running
Jul 16 14:30:47 devpostgres03 patroni[10951]: 2018-07-16 14:30:47,056 INFO: Lock owner: devpostgres01; I am devpostgres03
Jul 16 14:30:47 devpostgres03 patroni[10951]: 2018-07-16 14:30:47,061 INFO: failed to start postgres
```

Here is my configuration:

```yaml
scope: postgres
namespace: /db/
name: devpostgres03

restapi:
  listen: 10.10.10.213:8008
  connect_address: 10.10.10.213:8008

etcd:
  host: 10.10.10.214:2379

bootstrap:
  dcs:
    ttl: 30
    loop_wait: 10
    retry_timeout: 10
    maximum_lag_on_failover: 1048576
    postgresql:
      use_pg_rewind: true
      recovery_conf:
        recovery_target: immediate
  initdb:
    - encoding: UTF8
    - data-checksums

  pg_hba:
    - host replication replicator 127.0.0.1/32 md5
    - host replication replicator 10.10.10.211/0 md5
    - host replication replicator 10.10.10.213/0 md5
    - host replication replicator 10.10.10.212/0 md5
    - host all all 0.0.0.0/0 md5

  users:
    admin:
      password: admin
      options:
        - createrole
        - createdb

postgresql:
  listen: 10.10.10.213:5432
  connect_address: 10.10.10.213:5432
  data_dir: /data/patroni
  pgpass: /tmp/pgpass
  authentication:
    replication:
      username: replicator
      password: rep-pass
    superuser:
      username: postgres
      password: secretpassword
  parameters:
    unix_socket_directories: '.'

tags:
  nofailover: false
  noloadbalance: false
  clonefrom: false
  nosync: false
```
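Since indentation is significant in this file, a quick way to confirm the YAML at least parses is a one-liner like the following (a sketch; it assumes PyYAML, which Patroni itself depends on, and the config path used in the unit file below):

```sh
# Parse the config and dump it back out; any indentation mistake raises an error.
python -c "import yaml, pprint; pprint.pprint(yaml.safe_load(open('/etc/patroni/patroni.yml')))"
```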

And finally, I am starting via systemd with the following service file:

```ini
[Unit]
Description=Patroni runners to orchestrate a high-availability PostgreSQL
After=syslog.target network.target

[Service]
Type=simple
User=postgres
Group=postgres
ExecStart=/usr/bin/patroni /etc/patroni/patroni.yml
KillMode=process
TimeoutSec=30
Restart=no

[Install]
WantedBy=multi-user.target
```
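For reference, this is roughly how I wire it up (a sketch; the unit file location /etc/systemd/system/patroni.service is my choice, not something from the guide):

```sh
# Install the unit, reload systemd, start Patroni, and follow its logs.
sudo cp patroni.service /etc/systemd/system/patroni.service
sudo systemctl daemon-reload
sudo systemctl enable --now patroni
journalctl -u patroni -f
```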

I tried specifying the recovery_target key in my patroni config, but I still hit this issue constantly on one of the slaves. Any help is hugely appreciated, as I've been struggling with this for a while.

Thanks,

Carey


Top GitHub Comments

alexeyklyukin commented on Jul 17, 2018:
```
2018-07-17 10:17:34 UTC,,0,LOG,00000,"invalid resource manager ID in primary checkpoint record"
2018-07-17 10:17:34.762 UTC,,,10561,,5b4dc23e.2941,4,,2018-07-17 10:17:34 UTC,,0,PANIC,XX000,"could not locate a valid checkpoint record"
```

Looks like your replica was not initialized properly (perhaps something happened during the basebackup that made it abort prematurely). You can run `patronictl -c patroni.yaml reinit postgres devpostgres03` to reinitialize this node and see whether that fixes the issue.
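If you want to watch the member states before and after, something like this works (a sketch; the config path is an assumption):

```sh
# Show cluster members, their roles, timelines, and replication lag.
patronictl -c /etc/patroni/patroni.yml list

# Wipe the broken replica's data directory and re-bootstrap it from the master.
patronictl -c /etc/patroni/patroni.yml reinit postgres devpostgres03
```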

I also noticed there are no archive_command and restore_command settings in your configuration. This means that if the replica falls behind while streaming (because, for instance, it cannot keep up with the rate of changes, or a network issue means the master has already recycled the WAL segments the replica still needs), it won't be able to continue without being reinitialized. However, it looks like your issue is not caused by this: with missing WAL segments the error message would be different.
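For reference, a minimal sketch of what that could look like in the Patroni config, assuming a shared archive directory at /mnt/wal_archive (the path, and using plain cp rather than a proper archiving tool such as WAL-E or pgBackRest, are illustrative assumptions):

```yaml
postgresql:
  parameters:
    archive_mode: "on"
    # Refuse to overwrite an already-archived segment (directory is hypothetical).
    archive_command: "test ! -f /mnt/wal_archive/%f && cp %p /mnt/wal_archive/%f"
  recovery_conf:
    # Lets a lagging replica fetch segments the master has already recycled.
    restore_command: "cp /mnt/wal_archive/%f %p"
```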

devopsadmin-james commented on Apr 14, 2020:

Yes, I made a mistake: I thought postgres was the name required to reinitialize the cluster, when it should have been the cluster name "postgres-cluster". The following command fixed this issue ("2020-04-14 05:43:51,739 INFO: master: history=1 0/678BA08 no recovery target specified"):

```sh
docker exec -it UAT_postgres2.1.z7i0uuwifizmap7rhgq900fvx patronictl -c patroni.yml reinit postgres-cluster postgres2
```

```
+------------------+-----------+-----------+--------+---------+----+-----------+-----------------+
|     Cluster      |   Member  |    Host   |  Role  |  State  | TL | Lag in MB | Pending restart |
+------------------+-----------+-----------+--------+---------+----+-----------+-----------------+
| postgres-cluster | postgres1 | postgres1 |        | running | 10 |         0 |        *        |
| postgres-cluster | postgres2 | postgres2 |        | running | 10 |       659 |        *        |
| postgres-cluster | postgres3 | postgres3 | Leader | running | 10 |         0 |        *        |
+------------------+-----------+-----------+--------+---------+----+-----------+-----------------+
Are you sure you want to reinitialize members postgres2? [y/N]: y
Success: reinitialize for member postgres2
```

Current status:

```
Tue Apr 14 07:21:01 UTC 2020
root@UAT:~# docker exec -it UAT_postgres2.1.z7i0uuwifizmap7rhgq900fvx patronictl -c patroni.yml list
+------------------+-----------+-----------+--------+---------+----+-----------+-----------------+
|     Cluster      |   Member  |    Host   |  Role  |  State  | TL | Lag in MB | Pending restart |
+------------------+-----------+-----------+--------+---------+----+-----------+-----------------+
| postgres-cluster | postgres1 | postgres1 |        | running | 10 |         0 |        *        |
| postgres-cluster | postgres2 | postgres2 |        | running | 10 |         0 |        *        |
| postgres-cluster | postgres3 | postgres3 | Leader | running | 10 |         0 |        *        |
+------------------+-----------+-----------+--------+---------+----+-----------+-----------------+
```

I will work on your suggestions for the second query.

One last thing: I don't understand why, in this table, every member is marked as "Pending restart". Even after recreating all the services (containers) of this cluster (I am running it in Docker Swarm mode), all of them are still marked as pending restart.

Is this something to worry about? It didn't appear when I was testing on a brand-new setup.

As far as I remember, it only started appearing once I joined my new postgres-cluster to an existing postgres database.
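For anyone hitting the same thing, one way to see what is behind the flag (a sketch; pending_restart is a standard pg_settings column, and the container name, config path, and availability of psql inside the container are assumed from above):

```sh
# List parameters whose new values only take effect after a postmaster restart.
docker exec -it UAT_postgres2.1.z7i0uuwifizmap7rhgq900fvx \
  psql -U postgres -c "SELECT name, setting FROM pg_settings WHERE pending_restart;"

# A Patroni-managed restart of the member clears the flag.
docker exec -it UAT_postgres2.1.z7i0uuwifizmap7rhgq900fvx \
  patronictl -c patroni.yml restart postgres-cluster postgres2
```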

Thanks CyberDem0n
