My wal position exceeds maximum replication lag


Describe the bug We are running a Postgres cluster with the postgres-operator on Kubernetes in a hosted VM environment. We use at least one standby node; for testing purposes, sometimes more than one. A failover between nodes was never successful, as we always get these messages from the secondary:

INFO: My wal position exceeds maximum replication lag
2021-09-28 08:00:23,141 INFO: following a different leader because i am not the healthiest node
2021-09-28 08:00:33,130 INFO: My wal position exceeds maximum replication lag
2021-09-28 08:00:33,140 INFO: following a different leader because i am not the healthiest node
2021-09-28 08:00:43,130 INFO: My wal position exceeds maximum replication lag
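
For context, this message is logged when a replica's WAL position trails the last known leader position by more than maximum_lag_on_failover bytes, so the replica declines to promote itself. How far behind a member actually is can be checked via the same /patroni REST endpoint that shows up in the recovery log below; a minimal sketch using only the Python standard library (port 8008 is Patroni's default REST port, 10.42.1.180 appears in the log further down, and the second member's address is a placeholder that must be adjusted to the actual pod IPs):

# wal_positions.py - print each member's role and reported WAL positions via Patroni's REST API.
import json
import urllib.request

MEMBERS = {
    "acid-common-cluster-0": "http://10.42.1.180:8008",
    "acid-common-cluster-1": "http://10.42.1.181:8008",  # placeholder address
}

for name, base_url in MEMBERS.items():
    with urllib.request.urlopen(f"{base_url}/patroni", timeout=5) as resp:
        info = json.load(resp)
    print(f"{name}: role={info.get('role')} timeline={info.get('timeline')} xlog={info.get('xlog')}")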

After the “old” primary comes back to life, everything recovers again:

2021-09-28 08:06:38,474 INFO: Lock owner: None; I am acid-common-cluster-1
2021-09-28 08:06:38,620 INFO: starting as a secondary
2021-09-28 08:06:39,575 INFO: postmaster pid=59
/var/run/postgresql:5432 - no response
2021-09-28 08:06:39 UTC [59]: [1-1] 6152cd0f.3b 0     LOG:  Auto detecting pg_stat_kcache.linux_hz parameter...
2021-09-28 08:06:39 UTC [59]: [2-1] 6152cd0f.3b 0     LOG:  pg_stat_kcache.linux_hz is set to 1000000
2021-09-28 08:06:39 UTC [59]: [3-1] 6152cd0f.3b 0     LOG:  redirecting log output to logging collector process
2021-09-28 08:06:39 UTC [59]: [4-1] 6152cd0f.3b 0     HINT:  Future log output will appear in directory "../pg_log".
/var/run/postgresql:5432 - accepting connections
/var/run/postgresql:5432 - accepting connections
2021-09-28 08:06:40,874 INFO: establishing a new patroni connection to the postgres cluster
2021-09-28 08:06:40,962 INFO: Got response from acid-common-cluster-0 http://10.42.1.180:8008/patroni: {"state": "running", "postmaster_start_time": "2021-09-20 13:35:56.575741+00:00", "role": "replica", "server_version": 130004, "cluster_unlocked": true, "xlog": {"received_location": 436207616, "replayed_location": 436207616, "replayed_timestamp": null, "paused": false}, "timeline": 8, "database_system_identifier": "6922433781832740933", "patroni": {"version": "2.1.1", "scope": "acid-common-cluster"}}
2021-09-28 08:06:41,068 WARNING: Could not activate Linux watchdog device: "Can't open watchdog device: [Errno 2] No such file or directory: '/dev/watchdog'"
2021-09-28 08:06:41,172 INFO: promoted self to leader by acquiring session lock
2021-09-28 08:06:41,173 INFO: Lock owner: acid-common-cluster-1; I am acid-common-cluster-1
2021-09-28 08:06:41,240 INFO: updated leader lock during promote
server promoting
2021-09-28 08:06:41,245 INFO: cleared rewind state after becoming the leader
2021-09-28 08:06:42,775 INFO: no action. I am (acid-common-cluster-1) the leader with the lock
2021-09-28 08:06:43,177 INFO: no action. I am (acid-common-cluster-1) the leader with the lock

To Reproduce Steps to reproduce the behavior:

Expected behavior The failover should happen without significant downtime.

Screenshots If applicable, add screenshots to help explain your problem.

Environment

  • We are using this docker image: registry.opensource.zalan.do/acid/spilo-13:2.1-p1

Patroni configuration file

patronictl show-config

loop_wait: 10
maximum_lag_on_failover: 33554432
postgresql:
  parameters:
    archive_mode: 'on'
    archive_timeout: 1800s
    autovacuum_analyze_scale_factor: 0.02
    autovacuum_max_workers: 5
    autovacuum_vacuum_scale_factor: 0.05
    checkpoint_completion_target: 0.9
    hot_standby: 'on'
    log_autovacuum_min_duration: 0
    log_checkpoints: 'on'
    log_connections: 'on'
    log_disconnections: 'on'
    log_line_prefix: '%t [%p]: [%l-1] %c %x %d %u %a %h '
    log_lock_waits: 'on'
    log_min_duration_statement: 500
    log_statement: ddl
    log_temp_files: 0
    max_connections: 100
    max_replication_slots: 10
    max_wal_senders: 10
    tcp_keepalives_idle: 900
    tcp_keepalives_interval: 100
    track_functions: all
    wal_level: hot_standby
    wal_log_hints: 'on'
  use_pg_rewind: true
  use_slots: true
retry_timeout: 10
synchronous_mode: false
synchronous_mode_strict: false
ttl: 30
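
For reference, the maximum_lag_on_failover value above is 33554432 bytes, i.e. 32 MiB. Roughly speaking, during leader election a replica compares its own WAL position with the last WAL position the leader reported to the DCS and refuses to promote itself when the difference exceeds this limit, which is what produces the "My wal position exceeds maximum replication lag" message. A simplified, illustrative sketch of that comparison (not Patroni's actual code):

# Simplified illustration of the failover lag check; not Patroni's real implementation.
MAXIMUM_LAG_ON_FAILOVER = 33554432  # 32 * 1024 * 1024 bytes = 32 MiB

def may_promote(last_leader_lsn: int, my_wal_lsn: int,
                max_lag: int = MAXIMUM_LAG_ON_FAILOVER) -> bool:
    """True if this member's WAL position is close enough to the last known leader position."""
    return (last_leader_lsn - my_wal_lsn) <= max_lag

# Using the replayed_location from the log above (436207616) as the member position:
print(may_promote(436207616 + 64 * 1024 * 1024, 436207616))  # False - 64 MiB behind, too far
print(may_promote(436207616 + 1024 * 1024, 436207616))       # True  - 1 MiB behind, within the limit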

Have you checked Patroni logs? Please provide a snippet of Patroni log files here

Have you checked PostgreSQL logs? Please provide a snippet here

Have you tried to use GitHub issue search? Maybe there is already a similar issue solved.

Additional context Add any other context about the problem here.

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Comments: 7

Top GitHub Comments

lalbrecht commented, Sep 28, 2021 (1 reaction)

Imho you must monitor replica lag and react as soon as it appears. There are more efficient ways to get these metrics (a minimal sketch of the first two follows the list):

  • From inside the database - the pg_stat_replication view
  • From the Patroni REST API
  • Also, patronictl list outputs info about lag =)
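
A minimal sketch of those first two options, assuming psycopg2 is installed, that the host names and credentials below are placeholders for the actual cluster, and that Patroni 2.x's /cluster endpoint is reachable on port 8008:

# monitor_lag.py - sketch of replication-lag monitoring; hosts and credentials are placeholders.
import json
import urllib.request

import psycopg2  # assumed to be installed

# 1) From inside the database: pg_stat_replication on the current primary.
conn = psycopg2.connect(host="acid-common-cluster", dbname="postgres", user="postgres")
with conn, conn.cursor() as cur:
    cur.execute("""
        SELECT application_name,
               pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS replay_lag_bytes
        FROM pg_stat_replication
    """)
    for application_name, replay_lag_bytes in cur.fetchall():
        print(f"{application_name}: {replay_lag_bytes} bytes behind")

# 2) From the Patroni REST API: the /cluster endpoint lists members with a per-replica lag field.
with urllib.request.urlopen("http://acid-common-cluster:8008/cluster", timeout=5) as resp:
    for member in json.load(resp).get("members", []):
        print(member.get("name"), member.get("role"), "lag:", member.get("lag", 0))
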
CyberDem0n commented, Sep 28, 2021 (0 reactions)

Wouldn’t it be cool if Patroni told us in the log output, every second, that the WAL lag is too big to fail over?

Guess how many people actually check their logs regularly? And how many people are unhappy about logs being too verbose?

If you don’t know, here’s a hint: the second group probably exceeds the first by an order of magnitude. Hence, it doesn’t make sense.


