My wal position exceeds maximum replication lag


Describe the bug We are running a Postgres cluster with the postgres-operator on Kubernetes in a hosted VM environment. We use at least one standby node; for testing purposes, sometimes more than one. A failover between nodes was never successful, as we always get these messages from the secondary:

INFO: My wal position exceeds maximum replication lag
2021-09-28 08:00:23,141 INFO: following a different leader because i am not the healthiest node
2021-09-28 08:00:33,130 INFO: My wal position exceeds maximum replication lag
2021-09-28 08:00:33,140 INFO: following a different leader because i am not the healthiest node
2021-09-28 08:00:43,130 INFO: My wal position exceeds maximum replication lag
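
For context, this message is logged when a replica's WAL position trails the last known leader position by more than maximum_lag_on_failover bytes, so the replica declines to promote itself. How far behind a member actually is can be checked via the same /patroni REST endpoint that shows up in the recovery log below; a minimal sketch using only the Python standard library (port 8008 is Patroni's default REST port, 10.42.1.180 appears in the log further down, and the second member's address is a placeholder that must be adjusted to the actual pod IPs):

# wal_positions.py - print each member's role and reported WAL positions via Patroni's REST API.
import json
import urllib.request

MEMBERS = {
    "acid-common-cluster-0": "http://10.42.1.180:8008",
    "acid-common-cluster-1": "http://10.42.1.181:8008",  # placeholder address
}

for name, base_url in MEMBERS.items():
    with urllib.request.urlopen(f"{base_url}/patroni", timeout=5) as resp:
        info = json.load(resp)
    print(f"{name}: role={info.get('role')} timeline={info.get('timeline')} xlog={info.get('xlog')}")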

After the “old” primary comes back to life, everything recovers again:

2021-09-28 08:06:38,474 INFO: Lock owner: None; I am acid-common-cluster-1
2021-09-28 08:06:38,620 INFO: starting as a secondary
2021-09-28 08:06:39,575 INFO: postmaster pid=59
/var/run/postgresql:5432 - no response
2021-09-28 08:06:39 UTC [59]: [1-1] 6152cd0f.3b 0     LOG:  Auto detecting pg_stat_kcache.linux_hz parameter...
2021-09-28 08:06:39 UTC [59]: [2-1] 6152cd0f.3b 0     LOG:  pg_stat_kcache.linux_hz is set to 1000000
2021-09-28 08:06:39 UTC [59]: [3-1] 6152cd0f.3b 0     LOG:  redirecting log output to logging collector process
2021-09-28 08:06:39 UTC [59]: [4-1] 6152cd0f.3b 0     HINT:  Future log output will appear in directory "../pg_log".
/var/run/postgresql:5432 - accepting connections
/var/run/postgresql:5432 - accepting connections
2021-09-28 08:06:40,874 INFO: establishing a new patroni connection to the postgres cluster
2021-09-28 08:06:40,962 INFO: Got response from acid-common-cluster-0 http://10.42.1.180:8008/patroni: {"state": "running", "postmaster_start_time": "2021-09-20 13:35:56.575741+00:00", "role": "replica", "server_version": 130004, "cluster_unlocked": true, "xlog": {"received_location": 436207616, "replayed_location": 436207616, "replayed_timestamp": null, "paused": false}, "timeline": 8, "database_system_identifier": "6922433781832740933", "patroni": {"version": "2.1.1", "scope": "acid-common-cluster"}}
2021-09-28 08:06:41,068 WARNING: Could not activate Linux watchdog device: "Can't open watchdog device: [Errno 2] No such file or directory: '/dev/watchdog'"
2021-09-28 08:06:41,172 INFO: promoted self to leader by acquiring session lock
2021-09-28 08:06:41,173 INFO: Lock owner: acid-common-cluster-1; I am acid-common-cluster-1
2021-09-28 08:06:41,240 INFO: updated leader lock during promote
server promoting
2021-09-28 08:06:41,245 INFO: cleared rewind state after becoming the leader
2021-09-28 08:06:42,775 INFO: no action. I am (acid-common-cluster-1) the leader with the lock
2021-09-28 08:06:43,177 INFO: no action. I am (acid-common-cluster-1) the leader with the lock

To Reproduce Steps to reproduce the behavior:

Expected behavior The failover should happen without significant downtime.

Screenshots If applicable, add screenshots to help explain your problem.

Environment

  • We are using this docker image: registry.opensource.zalan.do/acid/spilo-13:2.1-p1

Patroni configuration file

patronictl show-config

loop_wait: 10
maximum_lag_on_failover: 33554432
postgresql:
  parameters:
    archive_mode: 'on'
    archive_timeout: 1800s
    autovacuum_analyze_scale_factor: 0.02
    autovacuum_max_workers: 5
    autovacuum_vacuum_scale_factor: 0.05
    checkpoint_completion_target: 0.9
    hot_standby: 'on'
    log_autovacuum_min_duration: 0
    log_checkpoints: 'on'
    log_connections: 'on'
    log_disconnections: 'on'
    log_line_prefix: '%t [%p]: [%l-1] %c %x %d %u %a %h '
    log_lock_waits: 'on'
    log_min_duration_statement: 500
    log_statement: ddl
    log_temp_files: 0
    max_connections: 100
    max_replication_slots: 10
    max_wal_senders: 10
    tcp_keepalives_idle: 900
    tcp_keepalives_interval: 100
    track_functions: all
    wal_level: hot_standby
    wal_log_hints: 'on'
  use_pg_rewind: true
  use_slots: true
retry_timeout: 10
synchronous_mode: false
synchronous_mode_strict: false
ttl: 30
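
For reference, the maximum_lag_on_failover value above is 33554432 bytes, i.e. 32 MiB. Roughly speaking, during leader election a replica compares its own WAL position with the last WAL position the leader reported to the DCS and refuses to promote itself when the difference exceeds this limit, which is what produces the "My wal position exceeds maximum replication lag" message. A simplified, illustrative sketch of that comparison (not Patroni's actual code):

# Simplified illustration of the failover lag check; not Patroni's real implementation.
MAXIMUM_LAG_ON_FAILOVER = 33554432  # 32 * 1024 * 1024 bytes = 32 MiB

def may_promote(last_leader_lsn: int, my_wal_lsn: int,
                max_lag: int = MAXIMUM_LAG_ON_FAILOVER) -> bool:
    """True if this member's WAL position is close enough to the last known leader position."""
    return (last_leader_lsn - my_wal_lsn) <= max_lag

# Using the replayed_location from the log above (436207616) as the member position:
print(may_promote(436207616 + 64 * 1024 * 1024, 436207616))  # False - 64 MiB behind, too far
print(may_promote(436207616 + 1024 * 1024, 436207616))       # True  - 1 MiB behind, within the limit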

Have you checked Patroni logs? Please provide a snippet of Patroni log files here

Have you checked PostgreSQL logs? Please provide a snippet here

Have you tried to use GitHub issue search? Maybe there is already a similar issue solved.

Additional context Add any other context about the problem here.

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Comments: 7

Top GitHub Comments

lalbrecht commented, Sep 28, 2021 (1 reaction)

Imho you must monitor replica lag and react as soon as it appears. There are more efficient ways to get these metrics (a minimal sketch of the first two follows the list):

  • From inside the database - the pg_stat_replication view
  • From the Patroni REST API
  • Also, patronictl list outputs info about lag =)
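
A minimal sketch of those first two options, assuming psycopg2 is installed, that the host names and credentials below are placeholders for the actual cluster, and that Patroni 2.x's /cluster endpoint is reachable on port 8008:

# monitor_lag.py - sketch of replication-lag monitoring; hosts and credentials are placeholders.
import json
import urllib.request

import psycopg2  # assumed to be installed

# 1) From inside the database: pg_stat_replication on the current primary.
conn = psycopg2.connect(host="acid-common-cluster", dbname="postgres", user="postgres")
with conn, conn.cursor() as cur:
    cur.execute("""
        SELECT application_name,
               pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS replay_lag_bytes
        FROM pg_stat_replication
    """)
    for application_name, replay_lag_bytes in cur.fetchall():
        print(f"{application_name}: {replay_lag_bytes} bytes behind")

# 2) From the Patroni REST API: the /cluster endpoint lists members with a per-replica lag field.
with urllib.request.urlopen("http://acid-common-cluster:8008/cluster", timeout=5) as resp:
    for member in json.load(resp).get("members", []):
        print(member.get("name"), member.get("role"), "lag:", member.get("lag", 0))
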
CyberDem0n commented, Sep 28, 2021 (0 reactions)

Wouldn’t it be cool if Patroni told us in the log output, every second, that the WAL lag is too big to fail over?

Guess how many people actually check their logs regularly? And how many people are unhappy about logs being too verbose?

If you don’t know, here’s a hint: the second group probably exceeds the first by an order of magnitude. Hence, it doesn’t make sense.


