member is listed as pending restart, even after restart

Describe the bug After restarting my EC2 instance, and again after running patronictl restart <cluster> <member>, patronictl list <cluster> still shows the member as pending restart.

Restarting the primary member (using patronictl) resulted in the pending_restart no longer being flagged on the member.

To Reproduce This is a basic setup on some test EC2 instances using Consul as the DCS.

Expected behavior After the PostgreSQL server is restarted, I would expect the DCS state to no longer flag a pending restart for this cluster member.

Screenshots

USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
postgres     686  0.1  3.0 359532 123920 ?       S    05:14   0:00 /usr/lib/postgresql/13/bin/postgres -D /var/lib/postgresql/13/main --config-file=/var/lib/postgresql/13/main/postgresql.conf
postgres     758  0.0  0.1  77620  7124 ?        Ss   05:14   0:00  \_ postgres: 13-main: logger
postgres     762  0.0  0.2 359676  9880 ?        Ss   05:14   0:00  \_ postgres: 13-main: startup recovering 000000050000000000000003
postgres     763  0.0  0.1 359532  6588 ?        Ss   05:14   0:00  \_ postgres: 13-main: checkpointer
postgres     764  0.0  0.2 359532  8400 ?        Ss   05:14   0:00  \_ postgres: 13-main: background writer
postgres     765  0.0  0.1  77620  7160 ?        Ss   05:14   0:00  \_ postgres: 13-main: stats collector
postgres     781  0.0  0.4 362440 19908 ?        Ss   05:14   0:00  \_ postgres: 13-main: postgres postgres 127.0.0.1(55624) idle
postgres     829  0.0  0.3 359568 13112 ?        Ss   05:14   0:00  \_ postgres: 13-main: walreceiver
root@patroni-testing-a:~# patronictl list 13-main
+ Cluster: 13-main (6955339462508686956) ----+---------+----+-----------+-----------------+
| Member            | Host         | Role    | State   | TL | Lag in MB | Pending restart |
+-------------------+--------------+---------+---------+----+-----------+-----------------+
| patroni-testing-a | 10.0.198.26  | Replica | running |  5 |         0 | *               |
| patroni-testing-b | 10.0.198.149 | Leader  | running |  5 |           |                 |
+-------------------+--------------+---------+---------+----+-----------+-----------------+
root@patroni-testing-a:~# patronictl list 13-main
+ Cluster: 13-main (6955339462508686956) ----+---------+----+-----------+-----------------+
| Member            | Host         | Role    | State   | TL | Lag in MB | Pending restart |
+-------------------+--------------+---------+---------+----+-----------+-----------------+
| patroni-testing-a | 10.0.198.26  | Replica | running |  5 |         0 | *               |
| patroni-testing-b | 10.0.198.149 | Leader  | running |  5 |           |                 |
+-------------------+--------------+---------+---------+----+-----------+-----------------+
root@patroni-testing-a:~# patronictl restart 13-main patroni-testing-a
+ Cluster: 13-main (6955339462508686956) ----+---------+----+-----------+-----------------+
| Member            | Host         | Role    | State   | TL | Lag in MB | Pending restart |
+-------------------+--------------+---------+---------+----+-----------+-----------------+
| patroni-testing-a | 10.0.198.26  | Replica | running |  5 |         0 | *               |
| patroni-testing-b | 10.0.198.149 | Leader  | running |  5 |           |                 |
+-------------------+--------------+---------+---------+----+-----------+-----------------+
When should the restart take place (e.g. 2021-04-27T06:16)  [now]:
Are you sure you want to restart members patroni-testing-a? [y/N]: y
Restart if the PostgreSQL version is less than provided (e.g. 9.5.2)  []:
Success: restart on member patroni-testing-a
USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
postgres    1150  0.9  3.0 359528 123864 ?       S    05:16   0:00 /usr/lib/postgresql/13/bin/postgres -D /var/lib/postgresql/13/main --config-file=/var/lib/postgresql/13/main/postgresql.conf
postgres    1152  0.0  0.1  77616  7028 ?        Ss   05:16   0:00  \_ postgres: 13-main: logger
postgres    1153  0.0  0.2 359672  8288 ?        Ss   05:16   0:00  \_ postgres: 13-main: startup recovering 000000050000000000000003
postgres    1154  0.0  0.1 359528  6548 ?        Ss   05:16   0:00  \_ postgres: 13-main: checkpointer
postgres    1155  0.0  0.1 359528  6548 ?        Ss   05:16   0:00  \_ postgres: 13-main: background writer
postgres    1156  0.0  0.1  77616  7064 ?        Ss   05:16   0:00  \_ postgres: 13-main: stats collector
postgres    1157  0.0  0.3 359564 13052 ?        Ss   05:16   0:00  \_ postgres: 13-main: walreceiver
postgres    1170  0.0  0.4 362436 19764 ?        Ss   05:16   0:00  \_ postgres: 13-main: postgres postgres 127.0.0.1(55688) idle
root@patroni-testing-a:~# patronictl list 13-main
+ Cluster: 13-main (6955339462508686956) ----+---------+----+-----------+-----------------+
| Member            | Host         | Role    | State   | TL | Lag in MB | Pending restart |
+-------------------+--------------+---------+---------+----+-----------+-----------------+
| patroni-testing-a | 10.0.198.26  | Replica | running |  5 |         0 | *               |
| patroni-testing-b | 10.0.198.149 | Leader  | running |  5 |           |                 |
+-------------------+--------------+---------+---------+----+-----------+-----------------+

Environment

  • Patroni version: 2.0.2 (pip installed)
  • PostgreSQL version: 13.2 (Ubuntu 13.2-1.pgdg20.04+1)
  • DCS (and its version): Consul 1.5.2 (protocol version 2).

Patroni configuration file

scope: 13-main
namespace: /patroni/
name: patroni-testing-a

restapi:
  listen: 0.0.0.0:8008
  connect_address: 10.0.198.26:8008

consul:
  host: 127.0.0.1:8500

bootstrap:
  dcs:
    ttl: 30
    loop_wait: 10
    retry_timeout: 10
    maximum_lag_on_failover: 1048576
    postgresql:
      use_pg_rewind: true
      parameters:
        wal_level: logical
        hot_standby: "on"
        max_connections: 1000
        vacuum_cost_limit: 400
        max_wal_size: "16GB"
        min_wal_size: "4GB"
        checkpoint_completion_target: 0.9
        random_page_cost: 1
        autovacuum_analyze_scale_factor: 0.05
        shared_preload_libraries: "citus"
        max_worker_processes: 8
        max_wal_senders: 10
        max_replication_slots: 10
        max_prepared_transactions: 0
        max_locks_per_transaction: 64
        wal_log_hints: "on"
#        track_commit_timestamp: "off"
#        archive_mode: "on"
#        archive_timeout: 1800s
#        archive_command: mkdir -p ../wal_archive && test ! -f ../wal_archive/%f && cp %p ../wal_archive/%f

  # some desired options for 'initdb'
  initdb:  # Note: It needs to be a list (some options need values, others are switches)
  - encoding: UTF8
  - data-checksums

  pg_hba:
  - local all postgres peer
  - local replication all peer
  - host replication replicator 127.0.0.1/32 md5
  - host all all 0.0.0.0/0 md5

  # Additional script to be launched after initial cluster creation (will be passed the connection URL as parameter)
# post_init: /usr/local/bin/setup_cluster.sh

  # Some additional users which need to be created after initializing a new cluster
  users:
    admin:
      password: admin
      options:
        - createrole
        - createdb

postgresql:
  listen: 0.0.0.0:5432
  connect_address: 10.0.198.26:5432
  data_dir: /var/lib/postgresql/13/main
  bin_dir: /usr/lib/postgresql/13/bin
  #config_dir: /etc/postgresql/13/main
  pgpass: /var/lib/postgresql/13-main.pgpass
  authentication:
    replication:
      username: replicator
      password: reppass
    superuser:
      username: postgres
      password: postgres
    rewind:  # Has no effect on postgres 10 and lower
      username: rewind_user
      password: rewind_password
  parameters:
    unix_socket_directories: '/var/run/postgresql/'
    logging_collector: 'on'
    log_directory: '/var/log/postgresql'
    log_filename: 'postgresql-13-main.log'
  # Additional fencing script executed after acquiring the leader lock but before promoting the replica
  #pre_promote: /path/to/pre_promote.sh

tags:
    nofailover: false
    noloadbalance: false
    clonefrom: false
    nosync: false

patronictl show-config

root@patroni-testing-a:~# patronictl show-config
Traceback (most recent call last):
  File "/usr/local/bin/patronictl", line 8, in <module>
    sys.exit(ctl())
  File "/usr/lib/python3/dist-packages/click/core.py", line 764, in __call__
    return self.main(*args, **kwargs)
  File "/usr/lib/python3/dist-packages/click/core.py", line 717, in main
    rv = self.invoke(ctx)
  File "/usr/lib/python3/dist-packages/click/core.py", line 1137, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/usr/lib/python3/dist-packages/click/core.py", line 956, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/lib/python3/dist-packages/click/core.py", line 555, in invoke
    return callback(*args, **kwargs)
  File "/usr/lib/python3/dist-packages/click/decorators.py", line 27, in new_func
    return f(get_current_context().obj, *args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/patroni/ctl.py", line 1264, in show_config
    cluster = get_dcs(obj, cluster_name).get_cluster()
  File "/usr/local/lib/python3.8/dist-packages/patroni/ctl.py", line 161, in get_dcs
    return _get_dcs(config)
  File "/usr/local/lib/python3.8/dist-packages/patroni/dcs/__init__.py", line 92, in get_dcs
    return item(config[name])
  File "/usr/local/lib/python3.8/dist-packages/patroni/dcs/consul.py", line 191, in __init__
    super(Consul, self).__init__(config)
  File "/usr/local/lib/python3.8/dist-packages/patroni/dcs/__init__.py", line 574, in __init__
    self._base_path = re.sub('/+', '/', '/'.join(['', config.get('namespace', 'service'), config['scope']]))
TypeError: sequence item 2: expected str instance, NoneType found
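
The traceback above ends where Patroni builds the DCS base path, and the failing element is `config['scope']`, which is `None`. The sketch below reproduces only that join in isolation. The helper name `build_base_path` is hypothetical. The suggestion that the scope was missing because `patronictl show-config` was run without a cluster name or configuration file is an assumption that the issue does not confirm.

```python
import re


def build_base_path(config):
    """Mirror the join from the traceback: namespace and scope must both be strings."""
    parts = ['', config.get('namespace', 'service'), config['scope']]
    return re.sub('/+', '/', '/'.join(parts))


# With the scope from the posted configuration the path is built normally.
ok_path = build_base_path({'namespace': '/patroni/', 'scope': '13-main'})

# With scope=None the join raises the same TypeError as in the traceback.
try:
    build_base_path({'namespace': '/patroni/', 'scope': None})
    error_message = None
except TypeError as exc:
    error_message = str(exc)

print(ok_path)
print(error_message)
```

If the assumption holds, passing the cluster name explicitly, as the `patronictl list 13-main` calls in this report do, would give the scope a value.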

Have you checked Patroni logs?

Apr 27 05:27:44 patroni-testing-a patroni[473]: 2021-04-27 05:27:44,496 INFO: Lock owner: patroni-testing-b; I am patroni-testing-a
Apr 27 05:27:44 patroni-testing-a patroni[473]: 2021-04-27 05:27:44,505 INFO: restart in progress
Apr 27 05:27:44 patroni-testing-a patroni[473]: 2021-04-27 05:27:44,508 INFO: Lock owner: patroni-testing-b; I am patroni-testing-a
Apr 27 05:27:44 patroni-testing-a patroni[473]: 2021-04-27 05:27:44,511 INFO: restart in progress
Apr 27 05:27:44 patroni-testing-a patroni[1953]: localhost:5432 - accepting connections
Apr 27 05:27:44 patroni-testing-a patroni[1955]: localhost:5432 - accepting connections
Apr 27 05:27:44 patroni-testing-a patroni[473]: 2021-04-27 05:27:44,928 INFO: Lock owner: patroni-testing-b; I am patroni-testing-a
Apr 27 05:27:44 patroni-testing-a patroni[473]: 2021-04-27 05:27:44,929 INFO: does not have lock
Apr 27 05:27:44 patroni-testing-a patroni[473]: 2021-04-27 05:27:44,929 INFO: establishing a new patroni connection to the postgres cluster
Apr 27 05:27:44 patroni-testing-a patroni[473]: 2021-04-27 05:27:44,957 INFO: no action.  i am a secondary and i am following a leader

Have you checked PostgreSQL logs?

2021-04-27 05:27:43.979 UTC [1944] LOG:  starting PostgreSQL 13.2 (Ubuntu 13.2-1.pgdg20.04+1) on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0, 64-bit
2021-04-27 05:27:43.980 UTC [1944] LOG:  listening on IPv4 address "0.0.0.0", port 5432
2021-04-27 05:27:43.982 UTC [1944] LOG:  listening on Unix socket "/var/run/postgresql/.s.PGSQL.5432"
2021-04-27 05:27:43.987 UTC [1947] LOG:  database system was shut down in recovery at 2021-04-27 05:27:43 UTC
2021-04-27 05:27:43.988 UTC [1947] LOG:  entering standby mode
2021-04-27 05:27:43.990 UTC [1947] LOG:  redo starts at 0/30005D0
2021-04-27 05:27:43.990 UTC [1947] LOG:  consistent recovery state reached at 0/30006B8
2021-04-27 05:27:43.990 UTC [1947] LOG:  invalid record length at 0/30006B8: wanted 24, got 0
2021-04-27 05:27:43.992 UTC [1944] LOG:  database system is ready to accept read only connections
2021-04-27 05:27:44.000 UTC [1951] LOG:  started streaming WAL from primary at 0/3000000 on timeline 5

Have you tried to use GitHub issue search? Couldn’t find an open bug for this issue.

Additional context OS Ubuntu 20.04 LTS.

/cluster route also shows pending_restart true.

{
  "members": [
    {
      "name": "patroni-testing-a",
      "role": "replica",
      "state": "running",
      "api_url": "http://10.0.198.26:8008/patroni",
      "host": "10.0.198.26",
      "port": 5432,
      "timeline": 5,
      "pending_restart": true,
      "lag": 0
    },
    {
      "name": "patroni-testing-b",
      "role": "leader",
      "state": "running",
      "api_url": "http://10.0.198.149:8008/patroni",
      "host": "10.0.198.149",
      "port": 5432,
      "timeline": 5
    }
  ]
}
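
As a rough illustration of how the response above could be checked from a script, the following sketch pulls out the members that still carry `pending_restart`. The helper name `members_pending_restart` is hypothetical, and the JSON is a trimmed copy of the response shown above rather than a live call to the REST API.

```python
import json

# Trimmed copy of the /cluster response from this report.
cluster_json = """
{
  "members": [
    {"name": "patroni-testing-a", "role": "replica", "state": "running",
     "timeline": 5, "pending_restart": true, "lag": 0},
    {"name": "patroni-testing-b", "role": "leader", "state": "running",
     "timeline": 5}
  ]
}
"""


def members_pending_restart(cluster):
    """Return the names of members whose pending_restart flag is set."""
    return [
        member["name"]
        for member in cluster.get("members", [])
        if member.get("pending_restart")
    ]


pending = members_pending_restart(json.loads(cluster_json))
print(pending)
```

The leader entry has no `pending_restart` key at all, so the sketch uses `.get()` and treats a missing key as "not pending".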

pip3 freeze
ansible==2.9.6
apache-libcloud==2.8.0
argcomplete==1.8.1
attrs==19.3.0
Automat==0.8.0
awscli==1.18.69
blinker==1.4
boto3==1.9.253
botocore==1.16.19
certifi==2019.11.28
chardet==3.0.4
Click==7.0
cloud-init==21.1
colorama==0.4.3
command-not-found==0.3
configobj==5.0.6
constantly==15.1.0
cryptography==2.8
dbus-python==1.2.16
distro==1.4.0
distro-info===0.23ubuntu1
dnspython==1.16.0
docutils==0.16
ec2-hibinit-agent==1.0.0
entrypoints==0.3
hibagent==1.0.1
httplib2==0.14.0
hyperlink==19.0.0
idna==2.8
importlib-metadata==1.5.0
incremental==16.10.1
Jinja2==2.10.1
jmespath==0.9.4
jsonpatch==1.22
jsonpointer==2.0
jsonschema==3.2.0
keyring==18.0.1
language-selector==0.1
launchpadlib==1.10.13
lazr.restfulclient==0.14.2
lazr.uri==1.0.3
lockfile==0.12.2
MarkupSafe==1.1.0
more-itertools==4.2.0
netaddr==0.7.19
netifaces==0.10.4
ntlm-auth==1.1.0
oauthlib==3.1.0
olefile==0.46
patroni==2.0.2
pexpect==4.6.0
Pillow==7.0.0
prettytable==2.1.0
psutil==5.8.0
psycopg2==2.8.4
pyasn1==0.4.2
pyasn1-modules==0.2.1
pycrypto==2.6.1
Pygments==2.3.1
PyGObject==3.36.0
PyHamcrest==1.9.0
PyJWT==1.7.1
pykerberos==1.1.14
pymacaroons==0.13.0
PyNaCl==1.3.0
pyOpenSSL==19.0.0
pyrsistent==0.15.5
pyserial==3.4
python-apt==2.0.0+ubuntu0.20.4.4
python-consul==1.1.0
python-dateutil==2.7.3
python-debian===0.1.36ubuntu1
pywinrm==0.3.0
PyYAML==5.3.1
requests==2.22.0
requests-kerberos==0.12.0
requests-ntlm==1.1.0
requests-unixsocket==0.2.0
roman==2.0.0
rsa==4.0
s3transfer==0.3.3
SecretStorage==2.3.1
selinux==3.0
service-identity==18.1.0
simplejson==3.16.0
six==1.14.0
sos==4.1
ssh-import-id==5.10
systemd-python==234
Twisted==18.9.0
ubuntu-advantage-tools==20.3
ufw==0.36
unattended-upgrades==0.1
urllib3==1.25.8
wadllib==1.3.3
wcwidth==0.2.5
xmltodict==0.12.0
ydiff==1.2
zipp==1.0.0
zope.interface==4.7.1

Issue Analytics

  • State: closed
  • Created 2 years ago
  • Comments: 14

Top GitHub Comments

1 reaction
CyberDem0n commented, Apr 27, 2021

pg_control:

max_prepared_xacts setting:           2000

But in the config (patronictl show-config) you have max_prepared_transactions: 0.

This is an inconsistency that clearly needs a restart, and the restarts must be performed in a very specific order. max_prepared_transactions is one of the parameters whose value on the replica must not be smaller than on the primary. Because the value is being decreased here, the primary must be restarted first, and the replicas only after that. But in order to succeed, the replica should be restarted only after it has received the WAL record with the new value of max_prepared_transactions.
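
The comparison described above can be sketched as follows. This is only an illustration of the ordering rule, not Patroni's implementation: the helper names `parse_pg_controldata` and `restart_order` are hypothetical, and the "replicas first" branch for an increase follows PostgreSQL's hot standby documentation rather than anything stated in this thread.

```python
def parse_pg_controldata(output):
    """Parse the 'key: value' lines of pg_controldata output into a dict."""
    settings = {}
    for line in output.splitlines():
        key, sep, value = line.partition(':')
        if sep:
            settings[key.strip()] = value.strip()
    return settings


def restart_order(control_value, configured_value):
    """Suggest a restart order for a parameter that replicas must not set lower."""
    if configured_value < control_value:
        # Decrease: the primary must write the new value to WAL first.
        return 'primary first, then replicas'
    if configured_value > control_value:
        # Increase: replicas must already accept the larger value.
        return 'replicas first, then primary'
    return 'no ordering constraint'


# Values quoted in the comment above: pg_control says 2000, the config says 0.
controldata = "max_prepared_xacts setting:           2000"
control = int(parse_pg_controldata(controldata)['max_prepared_xacts setting'])
order = restart_order(control, 0)
print(control, order)
```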

0 reactions
pdeaudney commented, Apr 27, 2021

I have initialised the database again (without the citus extension loaded) and the max_prepared_transactions value is set to 0 as expected from the configuration.

root@patroni-testing-b:/var/lib/postgresql# /usr/lib/postgresql/13/bin/pg_controldata -D /var/lib/postgresql/13/main/
pg_control version number:            1300
Catalog version number:               202007201
Database system identifier:           6955757562451057988
Database cluster state:               in archive recovery
pg_control last modified:             Tue Apr 27 08:52:02 2021
Latest checkpoint location:           0/2000060
Latest checkpoint's REDO location:    0/2000028
Latest checkpoint's REDO WAL file:    000000010000000000000002
Latest checkpoint's TimeLineID:       1
Latest checkpoint's PrevTimeLineID:   1
Latest checkpoint's full_page_writes: on
Latest checkpoint's NextXID:          0:494
Latest checkpoint's NextOID:          24576
Latest checkpoint's NextMultiXactId:  1
Latest checkpoint's NextMultiOffset:  0
Latest checkpoint's oldestXID:        479
Latest checkpoint's oldestXID's DB:   1
Latest checkpoint's oldestActiveXID:  494
Latest checkpoint's oldestMultiXid:   1
Latest checkpoint's oldestMulti's DB: 1
Latest checkpoint's oldestCommitTsXid:0
Latest checkpoint's newestCommitTsXid:0
Time of latest checkpoint:            Tue Apr 27 08:51:58 2021
Fake LSN counter for unlogged rels:   0/3E8
Minimum recovery ending location:     0/2000100
Min recovery ending loc's timeline:   1
Backup start location:                0/0
Backup end location:                  0/0
End-of-backup record required:        no
wal_level setting:                    logical
wal_log_hints setting:                on
max_connections setting:              1000
max_worker_processes setting:         8
max_wal_senders setting:              10
max_prepared_xacts setting:           0
max_locks_per_xact setting:           64
track_commit_timestamp setting:       off
Maximum data alignment:               8
Database block size:                  8192
Blocks per segment of large relation: 131072
WAL block size:                       8192
Bytes per WAL segment:                16777216
Maximum length of identifiers:        64
Maximum columns in an index:          32
Maximum size of a TOAST chunk:        1996
Size of a large-object chunk:         2048
Date/time type storage:               64-bit integers
Float8 argument passing:              by value
Data page checksum version:           1
Mock authentication nonce:            086d675124b8ad960b182b8e92d58743647b9bdf1580e05c4ee94dc9bb827003
/lib/postgresql/13/main# cat patroni.dynamic.json  |jq
{
  "ttl": 30,
  "loop_wait": 10,
  "retry_timeout": 10,
  "maximum_lag_on_failover": 1048576,
  "postgresql": {
    "use_pg_rewind": true,
    "parameters": {
      "wal_level": "logical",
      "hot_standby": "on",
      "max_connections": 1000,
      "vacuum_cost_limit": 400,
      "max_wal_size": "16GB",
      "min_wal_size": "4GB",
      "checkpoint_completion_target": 0.9,
      "random_page_cost": 1,
      "autovacuum_analyze_scale_factor": 0.05,
      "shared_preload_libraries": "auto_explain,pg_stat_statements",
      "max_worker_processes": 8,
      "max_wal_senders": 10,
      "max_replication_slots": 10,
      "wal_log_hints": "on"
    }
  }
}
root@patroni-testing-b:/var/lib/postgresql/13/main# patronictl list 13-main
+ Cluster: 13-main (6955757562451057988) ----+---------+----+-----------+
| Member            | Host         | Role    | State   | TL | Lag in MB |
+-------------------+--------------+---------+---------+----+-----------+
| patroni-testing-a | 10.0.198.26  | Leader  | running |  1 |           |
| patroni-testing-b | 10.0.198.149 | Replica | running |  1 |         0 |
+-------------------+--------------+---------+---------+----+-----------+

The Citus documentation indicates that max_prepared_transactions needs to be set to a higher value for the default multi-shard commit protocol to work. This leads me to believe the extension is modifying this value, overriding the configured value. http://docs.citusdata.com/en/v10.0/develop/api_guc.html#citus-multi-shard-commit-protocol-enum
