ERROR: Error when fetching backup: pg_basebackup exited with code=1

Hi, can someone please guide me on this? I am configuring Postgres 12.1 with Patroni. I have a cluster with 7 nodes. Every time I scale up, I end up in this situation: the master/leader starts and works, but each slave/replica either stops or stays stuck in 'creating replica' state.

root@103d123f2d5c:/bp2/src# patronictl -c pg_patroni.yml list
+---------+--------+----------------+--------+------------------+----+-----------+
| Cluster | Member |      Host      |  Role  |      State       | TL | Lag in MB |
+---------+--------+----------------+--------+------------------+----+-----------+
|  blue0  |  pg_0  | 127.0.0.1:5432 | Leader |     running      |  1 |           |
|  blue0  |  pg_1  | 127.0.0.1:5432 |        |     stopped      |    |   unknown |
|  blue0  |  pg_2  | 127.0.0.1:5432 |        |     stopped      |    |   unknown |
|  blue0  |  pg_3  | 127.0.0.1:5432 |        |     stopped      |    |   unknown |
|  blue0  |  pg_4  | 127.0.0.1:5432 |        |     stopped      |    |   unknown |
|  blue0  |  pg_5  | 127.0.0.1:5432 |        |     stopped      |    |   unknown |
|  blue0  |  pg_6  | 127.0.0.1:5432 |        | creating replica |    |   unknown |
+---------+--------+----------------+--------+------------------+----+-----------+

The logs constantly show errors such as these:

2020-01-02 23:04:27,006 DEBUG: Sending request(xid=717): SetData(path='/bp/blue0/members/pg_1', data=b'{"conn_url":"postgres://127.0.0.1:5432/postgres","api_url":"http://127.0.0.1:8008/patroni","state":"stopped","role":"uninitialized","version":"1.6.3"}', version=-1)
2020-01-02 23:04:27,011 DEBUG: Received response(xid=717): ZnodeStat(czxid=12885154300, mzxid=12885161237, ctime=1577999766021, mtime=1578006267006, version=652, cversion=0, aversion=0, ephemeralOwner=31334655534829047, dataLength=150, numChildren=0, pzxid=12885154300)
2020-01-02 23:04:27,011 INFO: trying to bootstrap from leader 'pg_0'
2020-01-02 23:04:27,025 ERROR: Error when fetching backup: pg_basebackup exited with code=1
2020-01-02 23:04:27,025 WARNING: Trying again in 5 seconds
2020-01-02 23:04:32,037 ERROR: Error when fetching backup: pg_basebackup exited with code=1
2020-01-02 23:04:32,037 ERROR: failed to bootstrap from leader 'pg_0'
2020-01-02 23:04:32,037 INFO: Removing data directory: /bp2/data/psql
2020-01-02 23:04:37,004 INFO: Lock owner: pg_0; I am pg_1
2020-01-02 23:04:37,006 DEBUG: Sending request(xid=718): SetData(path='/bp/blue0/members/pg_1', data=b'{"conn_url":"postgres://127.0.0.1:5432/postgres","api_url":"http://127.0.0.1:8008/patroni","state":"stopped","role":"uninitialized","version":"1.6.3"}', version=-1)
2020-01-02 23:04:37,011 DEBUG: Received response(xid=718): ZnodeStat(czxid=12885154300, mzxid=12885161248, ctime=1577999766021, mtime=1578006277007, version=653, cversion=0, aversion=0, ephemeralOwner=31334655534829047, dataLength=150, numChildren=0, pzxid=12885154300)
2020-01-02 23:04:37,012 INFO: trying to bootstrap from leader 'pg_0'
2020-01-02 23:04:37,022 ERROR: Error when fetching backup: pg_basebackup exited with code=1
2020-01-02 23:04:37,023 WARNING: Trying again in 5 seconds
2020-01-02 23:04:42,036 ERROR: Error when fetching backup: pg_basebackup exited with code=1
2020-01-02 23:04:42,036 ERROR: failed to bootstrap from leader 'pg_0'
2020-01-02 23:04:42,036 INFO: Removing data directory: /bp2/data/psql

And here is my YAML file:

scope: blue0
namespace: /bp/
name: pg_1

log:
  level: DEBUG
  traceback_level: debug
  dir: /bp2/log/
   
restapi:
  listen: 127.0.0.1:8008
  connect_address: 127.0.0.1:8008

zookeeper:
  hosts: zookeeper:2181

bootstrap:
  dcs:
    ttl: 30
    loop_wait: 10
    retry_timeout: 10
    maximum_lag_on_failover: 1048576
    postgresql:
      use_pg_rewind: true
      parameters:
  initdb: 
    - encoding: UTF8
    - data-checksums
  pg_hba: 
    - host replication replicator 127.0.0.1/32 md5
    - host all all 0.0.0.0/0 md5
  users:
    sbpadmin:
        password: sbpadminpw
        options:
            - createrole
            - createdb
    bpadmin:
        password: bpadminpw
        options:
            - replication
    wbpadmin:
        password: wbpadminpw
        options:
            - rewind

postgresql:
  listen: 127.0.0.1:5432
  connect_address: 127.0.0.1:5432
  config_dir: /bp2/data/psql
  data_dir: /bp2/data/psql
  bin_dir: /usr/lib/postgresql/12/bin/

  pgpass: /bp2/log/pgpass
  authentication:
    replication:
      username: bpadmin
      password: bpadminpw
    superuser:
      username: sbpadmin
      password: sbpadminpw
    rewind:
      username: wbpadmin
      password: wbpadminpw
  parameters:
    unix_socket_directories: '.'
tags:
    nofailover: false
    noloadbalance: false
    clonefrom: false
    nosync: false

Issue Analytics

  • State: closed
  • Created 4 years ago
  • Comments: 10

Top GitHub Comments

2 reactions
sandeepkalra commented, Jan 10, 2020

Thanks. I think I found the issue, but I do not know whether my fix is correct. There were 2 problems:

  1. There was no user called 'replicator'.
  2. The pg_hba entry wasn't correct (permissions-wise).

I changed the pg_hba block to the following:
pg_hba:
  - host replication $rep_username $localhost_ip/32 md5
  - host all $rep_username $localhost_ip/32 md5
  - host replication $rep_username 0.0.0.0/0 md5
  - host all $rep_username 0.0.0.0/0 md5
  - host all all 0.0.0.0/0 md5

Here $rep_username and $localhost_ip stand for the actual replication username and host IP.

After this, replication completes for the replicas.
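
For anyone who wants to sanity-check their own rules, here is a minimal Python sketch of how a pg_hba-style rule list is matched against a physical replication connection. It is an illustration only, not Patroni or PostgreSQL code: the name `allows_replication` is made up, it only understands `host`-type lines whose address is a CIDR block or `all`, and it does not model the password check. The one fact it leans on is that `all` in the database column does not match replication connections, so a `host all all 0.0.0.0/0 md5` line by itself cannot admit the replication user for pg_basebackup.

```python
import ipaddress


def allows_replication(rules, user, client_ip):
    """Return True if the first matching rule admits a physical
    replication connection for `user` coming from `client_ip`.

    Simplified on purpose: only host/hostssl/hostnossl lines with a
    CIDR address (or the keyword "all") are considered.
    """
    client = ipaddress.ip_address(client_ip)
    for rule in rules:
        fields = rule.split()
        if len(fields) < 5 or fields[0] not in ("host", "hostssl", "hostnossl"):
            continue
        database, rule_user, address, method = fields[1:5]

        # "all" in the database column does NOT cover replication
        # connections; the keyword "replication" must be listed.
        if "replication" not in database.split(","):
            continue
        if rule_user != "all" and user not in rule_user.split(","):
            continue

        if address != "all":
            try:
                network = ipaddress.ip_network(address, strict=False)
            except ValueError:
                continue  # hostnames and special keywords are not handled here
            if client not in network:
                continue

        # The first matching line decides the outcome.
        return method != "reject"
    return False
```

With the two rules from the original YAML, `allows_replication(rules, "bpadmin", "127.0.0.1")` returns False, because the only replication rule names `replicator`. With `bpadmin` substituted into the changed block above, it returns True.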

Thanks,

1 reaction
CyberDem0n commented, Jan 4, 2020

Why do you set listen and connect_address to 127.0.0.1? How will the nodes discover each other?
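
The concern is that every member advertises a loopback address, which is also what `patronictl list` shows in the Host column for all seven nodes. Below is a small Python sketch for spotting that in a set of settings. `is_loopback_endpoint` and `loopback_settings` are hypothetical helper names, and the dictionary simply mirrors values from the YAML in the question rather than reading the file.

```python
import ipaddress


def is_loopback_endpoint(endpoint):
    """Return True if a 'host:port' string (or a bare hostname or IPv4
    address) points at the loopback interface.

    Bare IPv6 addresses without brackets are not handled.
    """
    host, sep, _port = endpoint.rpartition(":")
    if not sep:  # no port given, so the whole string is the host
        host = endpoint
    host = host.strip("[]")  # tolerate bracketed IPv6 such as [::1]:5432
    try:
        return ipaddress.ip_address(host).is_loopback
    except ValueError:
        # Not an IP literal; only the conventional name is recognised.
        return host.lower() == "localhost"


def loopback_settings(settings):
    """Return the sorted names of settings whose value is a loopback endpoint."""
    return sorted(name for name, value in settings.items()
                  if is_loopback_endpoint(value))


# Values copied from the configuration in the question.
question_settings = {
    "restapi.connect_address": "127.0.0.1:8008",
    "postgresql.listen": "127.0.0.1:5432",
    "postgresql.connect_address": "127.0.0.1:5432",
}
```

For the configuration in the question, `loopback_settings(question_settings)` flags all three entries. A `connect_address` flagged this way is only reachable from the node itself, so a replica following the leader's advertised address would most likely end up talking to its own host.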
