stolon-keeper deadlock with Postgres: too many clients already #930

@nh2

Description

Observed situation

After the system got into a degraded state (network disconnections, OOM crashes), my Stolon cluster was broken and did not auto-heal.

stolon-keeper looped on the error pq: sorry, too many clients already:

2025-12-01T12:57:18.169Z        INFO        cmd/keeper.go:1505        our db requested role is master
2025-12-01T12:57:18.170Z        INFO        cmd/keeper.go:1543        already master
2025-12-01T12:57:18.172Z        ERROR        cmd/keeper.go:1022        failed to get replication slots        {"error": "pq: sorry, too many clients already"}
2025-12-01T12:57:18.172Z        ERROR        cmd/keeper.go:1547        error updating replication slots        {"error": "pq: sorry, too many clients already"}
2025-12-01T12:57:19.024Z        ERROR        cmd/keeper.go:720        cannot get configured pg parameters        {"error": "pq: sorry, too many clients already"}
2025-12-01T12:57:21.526Z        ERROR        cmd/keeper.go:720        cannot get configured pg parameters        {"error": "pq: sorry, too many clients already"}

Trying to connect to Postgres with psql (via stolon-proxy or directly against the local postgres port) failed with the same error: pq: sorry, too many clients already.

In htop / ps, the Postgres processes are as follows:

postgres  298065  0.0  0.1  67536  5888 ?        Ss   Nov27   0:00 postgres: logger 
postgres  298067  0.0  0.2 215100  9888 ?        Ss   Nov27   0:01 postgres: checkpointer 
postgres  298068  0.0  0.2 215116  8352 ?        Ss   Nov27   0:00 postgres: background writer 
postgres  298069  0.0  0.1  67676  5920 ?        Ss   Nov27   0:21 postgres: stats collector 
postgres  803775  0.0  0.2 214984 10400 ?        Ss   08:48   0:00 postgres: walwriter 
postgres  803776  0.0  0.1 215680  7840 ?        Ss   08:48   0:00 postgres: autovacuum launcher 
postgres  803777  0.0  0.1 215556  7328 ?        Ss   08:48   0:00 postgres: logical replication launcher 
postgres  803865  0.0  0.3 216316 13996 ?        Ss   08:48   0:00 postgres: cam postgres 10.0.0.3(37726) COMMIT waiting for 2/CF889F90
postgres  803871  0.0  0.3 216448 14380 ?        Ss   08:48   0:00 postgres: cam postgres 10.0.0.3(37730) UPDATE waiting
postgres  803891  0.0  0.3 216448 14380 ?        Ss   08:48   0:00 postgres: cam postgres 10.0.0.3(37734) UPDATE waiting
postgres  803917  0.0  0.3 216448 14380 ?        Ss   08:48   0:00 postgres: cam postgres 10.0.0.3(37740) UPDATE waiting
postgres  803977  0.0  0.3 216096 12844 ?        Ss   08:48   0:00 postgres: cam postgres 10.0.0.1(47334) UPDATE waiting
  ... 90 more lines like the above ...
postgres  806711  0.0  0.3 216336 14260 ?        Ss   08:50   0:00 postgres: cam postgres 10.0.0.3(60848) UPDATE waiting
postgres  806712  0.0  0.3 216448 14388 ?        Ss   08:50   0:00 postgres: cam postgres 10.0.0.1(42068) UPDATE waiting
postgres  808225  0.0  0.3 216608 15412 ?        Ss   08:50   0:00 postgres: postgres postgres 10.0.0.1(49002) DO waiting for 2/CF88B348
postgres  808295  0.0  0.3 216608 15412 ?        Ss   08:50   0:00 postgres: postgres postgres 10.0.0.3(33282) DO waiting
postgres  824857  0.0  0.2 215824  9524 ?        Ss   09:00   0:00 postgres: walsender postgres 10.0.0.2(52470) streaming 2/D10005F8
postgres  824914  0.0  0.3 216608 15284 ?        Ss   09:00   0:00 postgres: postgres postgres 10.0.0.2(52574) DO waiting

Manual workaround

I ran kill 803871 to terminate a single postgres backend process. This freed a connection slot and allowed me to connect using psql.

It also fixed the stolon-keeper problem.

Environment

Hypothesis

I suspect that:

  • All of Postgres's max_connections = 100 slots are exhausted, mainly by the queued UPDATE waiting transactions.
    • Postgres seems to count these processes as connections/clients, even though the actual client process (my web server) that issued them (via stolon-proxy) was terminated and has been gone for over 8 hours.
  • The transactions cannot complete because there are not enough synchronous standbys.
  • Stolon cannot update the synchronous standbys because it itself acts as a postgres client to do that, and gets too many clients already.

Thus, there is a deadlock between

  • stolon-keeper needing a connection slot to change postgres settings, and
  • postgres needing the changed settings to free up connection slots.
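
For context on the hypothesis above: Postgres by default sets aside a few connection slots that only superuser roles may use. So for the keeper to see "too many clients already", either it connects as a non-superuser role, or the reserved slots were also consumed. This is an assumption on my part, not something I verified on this cluster; the stock defaults are shown for illustration:

```
# postgresql.conf (stock defaults, illustrative)
max_connections = 100                 # total backend slots
superuser_reserved_connections = 3    # of those, usable only by superusers
```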

Possible fix

stolon-keeper should probably keep one connection to postgres open at all times, so that it can still change settings even when all other connection slots are taken.
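
The idea can be sketched as follows (my interpretation of the fix, not stolon's actual code): model Postgres connection slots as a bounded pool. Stuck client backends hold every slot, so a keeper that acquires a connection on demand gets "too many clients already", while a keeper that reserved one slot at startup can still issue its admin commands.

```go
package main

import "fmt"

// pool models Postgres connection slots as a buffered channel:
// each held slot is one backend process counted against max_connections.
type pool struct{ slots chan struct{} }

func newPool(n int) *pool { return &pool{slots: make(chan struct{}, n)} }

// tryAcquire returns false when all slots are taken, i.e. the
// "pq: sorry, too many clients already" case.
func (p *pool) tryAcquire() bool {
	select {
	case p.slots <- struct{}{}:
		return true
	default:
		return false
	}
}

// release frees one previously acquired slot.
func (p *pool) release() { <-p.slots }

func main() {
	const maxConnections = 100
	pg := newPool(maxConnections)

	// Proposed fix: the keeper reserves its admin connection at startup,
	// before the cluster gets into trouble.
	reserved := pg.tryAcquire()

	// Later: stuck "UPDATE waiting" backends fill every remaining slot.
	for pg.tryAcquire() {
	}

	// An on-demand connection now fails, reproducing the deadlock...
	fmt.Println("on-demand connect ok:", pg.tryAcquire()) // false
	// ...but the keeper's pre-reserved connection is still usable.
	fmt.Println("reserved connection ok:", reserved) // true
}
```

The keeper would of course also need to re-establish the reserved connection if it drops (e.g. after a postgres restart), but the key point is that acquisition happens eagerly, not at the moment the settings change is needed.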
