My wal position exceeds maximum replication lag
Describe the bug
We are running a postgres cluster with the postgres-operator on Kubernetes in a hosted VM environment. We use at least one standby node, and for testing purposes sometimes more than one. A failover between nodes has never been successful; we always get these messages from the secondary:
INFO: My wal position exceeds maximum replication lag
2021-09-28 08:00:23,141 INFO: following a different leader because i am not the healthiest node
2021-09-28 08:00:33,130 INFO: My wal position exceeds maximum replication lag
2021-09-28 08:00:33,140 INFO: following a different leader because i am not the healthiest node
2021-09-28 08:00:43,130 INFO: My wal position exceeds maximum replication lag
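Patroni prints this message during a leader race when a candidate replica's WAL position trails the last known leader position by more than maximum_lag_on_failover (see the configuration further below), so the node steps aside instead of promoting itself. The WAL positions Patroni works with are exposed on its REST API, as the response captured in the recovery log further down shows; a minimal sketch of checking them by hand, reusing the pod address from that log line and assuming jq is available on the caller's side:
# Sketch: inspect the WAL positions a member reports over the Patroni REST API.
# 10.42.1.180:8008 is the pod address taken from the log below; substitute your own pods.
curl -s http://10.42.1.180:8008/patroni | jq '{role, timeline, xlog}'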
After the “old” primary comes back to life, everything recovers again:
2021-09-28 08:06:38,474 INFO: Lock owner: None; I am acid-common-cluster-1
2021-09-28 08:06:38,620 INFO: starting as a secondary
2021-09-28 08:06:39,575 INFO: postmaster pid=59
/var/run/postgresql:5432 - no response
2021-09-28 08:06:39 UTC [59]: [1-1] 6152cd0f.3b 0 LOG: Auto detecting pg_stat_kcache.linux_hz parameter...
2021-09-28 08:06:39 UTC [59]: [2-1] 6152cd0f.3b 0 LOG: pg_stat_kcache.linux_hz is set to 1000000
2021-09-28 08:06:39 UTC [59]: [3-1] 6152cd0f.3b 0 LOG: redirecting log output to logging collector process
2021-09-28 08:06:39 UTC [59]: [4-1] 6152cd0f.3b 0 HINT: Future log output will appear in directory "../pg_log".
/var/run/postgresql:5432 - accepting connections
/var/run/postgresql:5432 - accepting connections
2021-09-28 08:06:40,874 INFO: establishing a new patroni connection to the postgres cluster
2021-09-28 08:06:40,962 INFO: Got response from acid-common-cluster-0 http://10.42.1.180:8008/patroni: {"state": "running", "postmaster_start_time": "2021-09-20 13:35:56.575741+00:00", "role": "replica", "server_version": 130004, "cluster_unlocked": true, "xlog": {"received_location": 436207616, "replayed_location": 436207616, "replayed_timestamp": null, "paused": false}, "timeline": 8, "database_system_identifier": "6922433781832740933", "patroni": {"version": "2.1.1", "scope": "acid-common-cluster"}}
2021-09-28 08:06:41,068 WARNING: Could not activate Linux watchdog device: "Can't open watchdog device: [Errno 2] No such file or directory: '/dev/watchdog'"
2021-09-28 08:06:41,172 INFO: promoted self to leader by acquiring session lock
2021-09-28 08:06:41,173 INFO: Lock owner: acid-common-cluster-1; I am acid-common-cluster-1
2021-09-28 08:06:41,240 INFO: updated leader lock during promote
server promoting
2021-09-28 08:06:41,245 INFO: cleared rewind state after becoming the leader
2021-09-28 08:06:42,775 INFO: no action. I am (acid-common-cluster-1) the leader with the lock
2021-09-28 08:06:43,177 INFO: no action. I am (acid-common-cluster-1) the leader with the lock
Expected behavior
The failover should happen without significant downtime.
Environment
- We are using this docker image: registry.opensource.zalan.do/acid/spilo-13:2.1-p1
Patroni configuration file
patronictl show-config
loop_wait: 10
maximum_lag_on_failover: 33554432
postgresql:
  parameters:
    archive_mode: 'on'
    archive_timeout: 1800s
    autovacuum_analyze_scale_factor: 0.02
    autovacuum_max_workers: 5
    autovacuum_vacuum_scale_factor: 0.05
    checkpoint_completion_target: 0.9
    hot_standby: 'on'
    log_autovacuum_min_duration: 0
    log_checkpoints: 'on'
    log_connections: 'on'
    log_disconnections: 'on'
    log_line_prefix: '%t [%p]: [%l-1] %c %x %d %u %a %h '
    log_lock_waits: 'on'
    log_min_duration_statement: 500
    log_statement: ddl
    log_temp_files: 0
    max_connections: 100
    max_replication_slots: 10
    max_wal_senders: 10
    tcp_keepalives_idle: 900
    tcp_keepalives_interval: 100
    track_functions: all
    wal_level: hot_standby
    wal_log_hints: 'on'
  use_pg_rewind: true
  use_slots: true
retry_timeout: 10
synchronous_mode: false
synchronous_mode_strict: false
ttl: 30
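With the configuration above, maximum_lag_on_failover is 33554432 bytes (32 MiB): during an election, a replica whose WAL position trails the last recorded leader position by more than that does not try to promote itself, which matches the messages at the top of this report. If the replicas legitimately fall further behind under load, the threshold can be raised at runtime; a minimal sketch, assuming the Spilo config path and using a purely illustrative 64 MiB value:
# Sketch: raise the failover lag threshold cluster-wide via patronictl.
# /home/postgres/postgres.yml is the Patroni config path used in Spilo images;
# 67108864 bytes (64 MiB) is an illustrative value, not a recommendation.
patronictl -c /home/postgres/postgres.yml edit-config -s "maximum_lag_on_failover=67108864"
If the cluster is managed through the Zalando postgres-operator manifest, the same setting appears to be exposed under spec.patroni as well, which keeps it in place when the cluster is re-created.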
IMHO you must monitor replica lag and react as soon as it appears. There are more efficient ways to get this metric:
- the pg_stat_replication view
- patronictl list, which outputs info about lag =)
Guess how many people actually check the logs regularly? And how many people are unhappy about the logs being too verbose? If you don’t know, here is a hint: the second group probably exceeds the first by an order of magnitude. Hence, it doesn’t make sense.
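To make that concrete, a minimal sketch of both checks, run from the current primary (pg_wal_lsn_diff and the replay_lsn column exist as of PostgreSQL 10; the config path is again the one used in Spilo images):
# Sketch: per-replica lag in bytes, as seen from the primary.
psql -x -c "SELECT application_name, state,
                   pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS replay_lag_bytes
            FROM pg_stat_replication;"
# Member overview, including a lag column for replicas.
patronictl -c /home/postgres/postgres.yml list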