member is listed as pending restart, even after restart
Describe the bug
After restarting my EC2 instance, and also after running patronictl restart <cluster> <member>, patronictl list <cluster> still shows a pending restart as required.
Restarting the primary member (using patronictl) resulted in the pending_restart no longer being flagged on the member.
To Reproduce
This is a basic setup on some test EC2 instances using Consul as the DCS.
Expected behavior
After the PostgreSQL server is restarted, I would expect the DCS state to no longer flag a pending restart as required for this cluster member.
Screenshots
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
postgres 686 0.1 3.0 359532 123920 ? S 05:14 0:00 /usr/lib/postgresql/13/bin/postgres -D /var/lib/postgresql/13/main --config-file=/var/lib/postgresql/13/main/postgresql.conf
postgres 758 0.0 0.1 77620 7124 ? Ss 05:14 0:00 \_ postgres: 13-main: logger
postgres 762 0.0 0.2 359676 9880 ? Ss 05:14 0:00 \_ postgres: 13-main: startup recovering 000000050000000000000003
postgres 763 0.0 0.1 359532 6588 ? Ss 05:14 0:00 \_ postgres: 13-main: checkpointer
postgres 764 0.0 0.2 359532 8400 ? Ss 05:14 0:00 \_ postgres: 13-main: background writer
postgres 765 0.0 0.1 77620 7160 ? Ss 05:14 0:00 \_ postgres: 13-main: stats collector
postgres 781 0.0 0.4 362440 19908 ? Ss 05:14 0:00 \_ postgres: 13-main: postgres postgres 127.0.0.1(55624) idle
postgres 829 0.0 0.3 359568 13112 ? Ss 05:14 0:00 \_ postgres: 13-main: walreceiver
root@patroni-testing-a:~# patronictl list 13-main
+ Cluster: 13-main (6955339462508686956) ----+---------+----+-----------+-----------------+
| Member | Host | Role | State | TL | Lag in MB | Pending restart |
+-------------------+--------------+---------+---------+----+-----------+-----------------+
| patroni-testing-a | 10.0.198.26 | Replica | running | 5 | 0 | * |
| patroni-testing-b | 10.0.198.149 | Leader | running | 5 | | |
+-------------------+--------------+---------+---------+----+-----------+-----------------+
root@patroni-testing-a:~# patronictl list 13-main
+ Cluster: 13-main (6955339462508686956) ----+---------+----+-----------+-----------------+
| Member | Host | Role | State | TL | Lag in MB | Pending restart |
+-------------------+--------------+---------+---------+----+-----------+-----------------+
| patroni-testing-a | 10.0.198.26 | Replica | running | 5 | 0 | * |
| patroni-testing-b | 10.0.198.149 | Leader | running | 5 | | |
+-------------------+--------------+---------+---------+----+-----------+-----------------+
root@patroni-testing-a:~# patronictl restart 13-main patroni-testing-a
+ Cluster: 13-main (6955339462508686956) ----+---------+----+-----------+-----------------+
| Member | Host | Role | State | TL | Lag in MB | Pending restart |
+-------------------+--------------+---------+---------+----+-----------+-----------------+
| patroni-testing-a | 10.0.198.26 | Replica | running | 5 | 0 | * |
| patroni-testing-b | 10.0.198.149 | Leader | running | 5 | | |
+-------------------+--------------+---------+---------+----+-----------+-----------------+
When should the restart take place (e.g. 2021-04-27T06:16) [now]:
Are you sure you want to restart members patroni-testing-a? [y/N]: y
Restart if the PostgreSQL version is less than provided (e.g. 9.5.2) []:
Success: restart on member patroni-testing-a
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
postgres 1150 0.9 3.0 359528 123864 ? S 05:16 0:00 /usr/lib/postgresql/13/bin/postgres -D /var/lib/postgresql/13/main --config-file=/var/lib/postgresql/13/main/postgresql.conf
postgres 1152 0.0 0.1 77616 7028 ? Ss 05:16 0:00 \_ postgres: 13-main: logger
postgres 1153 0.0 0.2 359672 8288 ? Ss 05:16 0:00 \_ postgres: 13-main: startup recovering 000000050000000000000003
postgres 1154 0.0 0.1 359528 6548 ? Ss 05:16 0:00 \_ postgres: 13-main: checkpointer
postgres 1155 0.0 0.1 359528 6548 ? Ss 05:16 0:00 \_ postgres: 13-main: background writer
postgres 1156 0.0 0.1 77616 7064 ? Ss 05:16 0:00 \_ postgres: 13-main: stats collector
postgres 1157 0.0 0.3 359564 13052 ? Ss 05:16 0:00 \_ postgres: 13-main: walreceiver
postgres 1170 0.0 0.4 362436 19764 ? Ss 05:16 0:00 \_ postgres: 13-main: postgres postgres 127.0.0.1(55688) idle
root@patroni-testing-a:~# patronictl list 13-main
+ Cluster: 13-main (6955339462508686956) ----+---------+----+-----------+-----------------+
| Member | Host | Role | State | TL | Lag in MB | Pending restart |
+-------------------+--------------+---------+---------+----+-----------+-----------------+
| patroni-testing-a | 10.0.198.26 | Replica | running | 5 | 0 | * |
| patroni-testing-b | 10.0.198.149 | Leader | running | 5 | | |
+-------------------+--------------+---------+---------+----+-----------+-----------------+
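As an aside, PostgreSQL's own view of parameters awaiting a restart can be queried on the member itself; a quick cross-check sketch (note that Patroni derives its pending_restart flag independently, e.g. from pg_control, so the two views can disagree, as turns out to be the case here):

# Ask PostgreSQL which settings were changed but need a restart to take effect:
sudo -u postgres psql -c "SELECT name, setting, pending_restart FROM pg_settings WHERE pending_restart;"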
Environment
- Patroni version: 2.0.2 (pip installed)
- PostgreSQL version: 13.2 (Ubuntu 13.2-1.pgdg20.04+1)
- DCS (and its version): Consul 1.5.2 (protocol version 2).
Patroni configuration file
scope: 13-main
namespace: /patroni/
name: patroni-testing-a

restapi:
  listen: 0.0.0.0:8008
  connect_address: 10.0.198.26:8008

consul:
  host: 127.0.0.1:8500

bootstrap:
  dcs:
    ttl: 30
    loop_wait: 10
    retry_timeout: 10
    maximum_lag_on_failover: 1048576
    postgresql:
      use_pg_rewind: true
      parameters:
        wal_level: logical
        hot_standby: "on"
        max_connections: 1000
        vacuum_cost_limit: 400
        max_wal_size: "16GB"
        min_wal_size: "4GB"
        checkpoint_completion_target: 0.9
        random_page_cost: 1
        autovacuum_analyze_scale_factor: 0.05
        shared_preload_libraries: "citus"
        max_worker_processes: 8
        max_wal_senders: 10
        max_replication_slots: 10
        max_prepared_transactions: 0
        max_locks_per_transaction: 64
        wal_log_hints: "on"
        # track_commit_timestamp: "off"
        # archive_mode: "on"
        # archive_timeout: 1800s
        # archive_command: mkdir -p ../wal_archive && test ! -f ../wal_archive/%f && cp %p ../wal_archive/%f

  # some desired options for 'initdb'
  initdb:  # Note: it needs to be a list (some options need values, others are switches)
    - encoding: UTF8
    - data-checksums

  pg_hba:
    - local all postgres peer
    - local replication all peer
    - host replication replicator 127.0.0.1/32 md5
    - host all all 0.0.0.0/0 md5

  # Additional script to be launched after initial cluster creation (will be passed the connection URL as parameter)
  # post_init: /usr/local/bin/setup_cluster.sh

  # Some additional users which need to be created after initializing the new cluster
  users:
    admin:
      password: admin
      options:
        - createrole
        - createdb

postgresql:
  listen: 0.0.0.0:5432
  connect_address: 10.0.198.26:5432
  data_dir: /var/lib/postgresql/13/main
  bin_dir: /usr/lib/postgresql/13/bin
  # config_dir: /etc/postgresql/13/main
  pgpass: /var/lib/postgresql/13-main.pgpass
  authentication:
    replication:
      username: replicator
      password: reppass
    superuser:
      username: postgres
      password: postgres
    rewind:  # Has no effect on postgres 10 and lower
      username: rewind_user
      password: rewind_password
  parameters:
    unix_socket_directories: '/var/run/postgresql/'
    logging_collector: 'on'
    log_directory: '/var/log/postgresql'
    log_filename: 'postgresql-13-main.log'

# Additional fencing script executed after acquiring the leader lock but before promoting the replica
# pre_promote: /path/to/pre_promote.sh

tags:
  nofailover: false
  noloadbalance: false
  clonefrom: false
  nosync: false
patronictl show-config
root@patroni-testing-a:~# patronictl show-config
Traceback (most recent call last):
  File "/usr/local/bin/patronictl", line 8, in <module>
    sys.exit(ctl())
  File "/usr/lib/python3/dist-packages/click/core.py", line 764, in __call__
    return self.main(*args, **kwargs)
  File "/usr/lib/python3/dist-packages/click/core.py", line 717, in main
    rv = self.invoke(ctx)
  File "/usr/lib/python3/dist-packages/click/core.py", line 1137, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/usr/lib/python3/dist-packages/click/core.py", line 956, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/lib/python3/dist-packages/click/core.py", line 555, in invoke
    return callback(*args, **kwargs)
  File "/usr/lib/python3/dist-packages/click/decorators.py", line 27, in new_func
    return f(get_current_context().obj, *args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/patroni/ctl.py", line 1264, in show_config
    cluster = get_dcs(obj, cluster_name).get_cluster()
  File "/usr/local/lib/python3.8/dist-packages/patroni/ctl.py", line 161, in get_dcs
    return _get_dcs(config)
  File "/usr/local/lib/python3.8/dist-packages/patroni/dcs/__init__.py", line 92, in get_dcs
    return item(config[name])
  File "/usr/local/lib/python3.8/dist-packages/patroni/dcs/consul.py", line 191, in __init__
    super(Consul, self).__init__(config)
  File "/usr/local/lib/python3.8/dist-packages/patroni/dcs/__init__.py", line 574, in __init__
    self._base_path = re.sub('/+', '/', '/'.join(['', config.get('namespace', 'service'), config['scope']]))
TypeError: sequence item 2: expected str instance, NoneType found
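The TypeError is raised while patronictl joins namespace and scope into the DCS base path: config['scope'] ends up as None, most likely because show-config was invoked without a cluster name and without a config file that defines scope. A hedged workaround sketch (the config path here is hypothetical; point it at wherever the member's YAML actually lives):

# Name the cluster explicitly and/or hand patronictl the member's config:
patronictl -c /etc/patroni/config.yml show-config 13-main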
Have you checked Patroni logs?
Apr 27 05:27:44 patroni-testing-a patroni[473]: 2021-04-27 05:27:44,496 INFO: Lock owner: patroni-testing-b; I am patroni-testing-a
Apr 27 05:27:44 patroni-testing-a patroni[473]: 2021-04-27 05:27:44,505 INFO: restart in progress
Apr 27 05:27:44 patroni-testing-a patroni[473]: 2021-04-27 05:27:44,508 INFO: Lock owner: patroni-testing-b; I am patroni-testing-a
Apr 27 05:27:44 patroni-testing-a patroni[473]: 2021-04-27 05:27:44,511 INFO: restart in progress
Apr 27 05:27:44 patroni-testing-a patroni[1953]: localhost:5432 - accepting connections
Apr 27 05:27:44 patroni-testing-a patroni[1955]: localhost:5432 - accepting connections
Apr 27 05:27:44 patroni-testing-a patroni[473]: 2021-04-27 05:27:44,928 INFO: Lock owner: patroni-testing-b; I am patroni-testing-a
Apr 27 05:27:44 patroni-testing-a patroni[473]: 2021-04-27 05:27:44,929 INFO: does not have lock
Apr 27 05:27:44 patroni-testing-a patroni[473]: 2021-04-27 05:27:44,929 INFO: establishing a new patroni connection to the postgres cluster
Apr 27 05:27:44 patroni-testing-a patroni[473]: 2021-04-27 05:27:44,957 INFO: no action. i am a secondary and i am following a leader
Have you checked PostgreSQL logs?
2021-04-27 05:27:43.979 UTC [1944] LOG: starting PostgreSQL 13.2 (Ubuntu 13.2-1.pgdg20.04+1) on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0, 64-bit
2021-04-27 05:27:43.980 UTC [1944] LOG: listening on IPv4 address "0.0.0.0", port 5432
2021-04-27 05:27:43.982 UTC [1944] LOG: listening on Unix socket "/var/run/postgresql/.s.PGSQL.5432"
2021-04-27 05:27:43.987 UTC [1947] LOG: database system was shut down in recovery at 2021-04-27 05:27:43 UTC
2021-04-27 05:27:43.988 UTC [1947] LOG: entering standby mode
2021-04-27 05:27:43.990 UTC [1947] LOG: redo starts at 0/30005D0
2021-04-27 05:27:43.990 UTC [1947] LOG: consistent recovery state reached at 0/30006B8
2021-04-27 05:27:43.990 UTC [1947] LOG: invalid record length at 0/30006B8: wanted 24, got 0
2021-04-27 05:27:43.992 UTC [1944] LOG: database system is ready to accept read only connections
2021-04-27 05:27:44.000 UTC [1951] LOG: started streaming WAL from primary at 0/3000000 on timeline 5
Have you tried to use GitHub issue search?
Couldn't find an open bug with this issue.
Additional context
OS: Ubuntu 20.04 LTS.
The /cluster REST API route also shows pending_restart: true.
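For reference, the JSON below can be fetched straight from the member's REST API (a sketch using the connect_address from the config above):

curl -s http://10.0.198.26:8008/cluster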
{
  "members": [
    {
      "name": "patroni-testing-a",
      "role": "replica",
      "state": "running",
      "api_url": "http://10.0.198.26:8008/patroni",
      "host": "10.0.198.26",
      "port": 5432,
      "timeline": 5,
      "pending_restart": true,
      "lag": 0
    },
    {
      "name": "patroni-testing-b",
      "role": "leader",
      "state": "running",
      "api_url": "http://10.0.198.149:8008/patroni",
      "host": "10.0.198.149",
      "port": 5432,
      "timeline": 5
    }
  ]
}
pip3 freeze
ansible==2.9.6
apache-libcloud==2.8.0
argcomplete==1.8.1
attrs==19.3.0
Automat==0.8.0
awscli==1.18.69
blinker==1.4
boto3==1.9.253
botocore==1.16.19
certifi==2019.11.28
chardet==3.0.4
Click==7.0
cloud-init==21.1
colorama==0.4.3
command-not-found==0.3
configobj==5.0.6
constantly==15.1.0
cryptography==2.8
dbus-python==1.2.16
distro==1.4.0
distro-info===0.23ubuntu1
dnspython==1.16.0
docutils==0.16
ec2-hibinit-agent==1.0.0
entrypoints==0.3
hibagent==1.0.1
httplib2==0.14.0
hyperlink==19.0.0
idna==2.8
importlib-metadata==1.5.0
incremental==16.10.1
Jinja2==2.10.1
jmespath==0.9.4
jsonpatch==1.22
jsonpointer==2.0
jsonschema==3.2.0
keyring==18.0.1
language-selector==0.1
launchpadlib==1.10.13
lazr.restfulclient==0.14.2
lazr.uri==1.0.3
lockfile==0.12.2
MarkupSafe==1.1.0
more-itertools==4.2.0
netaddr==0.7.19
netifaces==0.10.4
ntlm-auth==1.1.0
oauthlib==3.1.0
olefile==0.46
patroni==2.0.2
pexpect==4.6.0
Pillow==7.0.0
prettytable==2.1.0
psutil==5.8.0
psycopg2==2.8.4
pyasn1==0.4.2
pyasn1-modules==0.2.1
pycrypto==2.6.1
Pygments==2.3.1
PyGObject==3.36.0
PyHamcrest==1.9.0
PyJWT==1.7.1
pykerberos==1.1.14
pymacaroons==0.13.0
PyNaCl==1.3.0
pyOpenSSL==19.0.0
pyrsistent==0.15.5
pyserial==3.4
python-apt==2.0.0+ubuntu0.20.4.4
python-consul==1.1.0
python-dateutil==2.7.3
python-debian===0.1.36ubuntu1
pywinrm==0.3.0
PyYAML==5.3.1
requests==2.22.0
requests-kerberos==0.12.0
requests-ntlm==1.1.0
requests-unixsocket==0.2.0
roman==2.0.0
rsa==4.0
s3transfer==0.3.3
SecretStorage==2.3.1
selinux==3.0
service-identity==18.1.0
simplejson==3.16.0
six==1.14.0
sos==4.1
ssh-import-id==5.10
systemd-python==234
Twisted==18.9.0
ubuntu-advantage-tools==20.3
ufw==0.36
unattended-upgrades==0.1
urllib3==1.25.8
wadllib==1.3.3
wcwidth==0.2.5
xmltodict==0.12.0
ydiff==1.2
zipp==1.0.0
zope.interface==4.7.1
From the issue comments:

pg_control records a different value for max_prepared_transactions than the one in the config (patronictl show-config), where you have max_prepared_transactions: 0. This is an inconsistency that clearly needs a restart, which must be performed in a very specific order. Since max_prepared_transactions is one of the parameters whose value on a replica must not be smaller than on the primary, the primary must be restarted first, and only after that the replicas. But! In order to succeed, a replica should be restarted only once it has received the WAL record with the new value of max_prepared_transactions.

Reporter's follow-up: I have initialised the database again (without the Citus extension loaded) and the max_prepared_transactions value is set to 0, as expected from the configuration. Checking the Citus documentation, it indicates that max_prepared_transactions needs to be set to a higher value for the default multi-shard commit protocol to work. This leads me to believe the extension is modifying this value rather than it coming from the configuration. http://docs.citusdata.com/en/v10.0/develop/api_guc.html#citus-multi-shard-commit-protocol-enum
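A quick way to see the mismatch on the affected member (a sketch assuming the Debian paths from this setup; pg_controldata reports the value the running server actually started with, while the Patroni config asks for 0):

# Value recorded in pg_control ("max_prepared_xacts setting"):
/usr/lib/postgresql/13/bin/pg_controldata /var/lib/postgresql/13/main | grep -i prepared
# The same running value as seen from SQL (likely non-zero here, if Citus raised it at startup as suspected above):
sudo -u postgres psql -c "SHOW max_prepared_transactions;"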