Observed situation
After some degenerate system state (network disconnections, OOM crashes), my Stolon cluster became degraded and did not auto-heal.
`stolon-keeper` looped on the error `pq: sorry, too many clients already`:
2025-12-01T12:57:18.169Z INFO cmd/keeper.go:1505 our db requested role is master
2025-12-01T12:57:18.170Z INFO cmd/keeper.go:1543 already master
2025-12-01T12:57:18.172Z ERROR cmd/keeper.go:1022 failed to get replication slots {"error": "pq: sorry, too many clients already"}
2025-12-01T12:57:18.172Z ERROR cmd/keeper.go:1547 error updating replication slots {"error": "pq: sorry, too many clients already"}
2025-12-01T12:57:19.024Z ERROR cmd/keeper.go:720 cannot get configured pg parameters {"error": "pq: sorry, too many clients already"}
2025-12-01T12:57:21.526Z ERROR cmd/keeper.go:720 cannot get configured pg parameters {"error": "pq: sorry, too many clients already"}
Trying to log into Postgres with `psql` (via `stolon-proxy` or directly on the local Postgres port) failed with the same error `pq: sorry, too many clients already`.
In `htop`/`ps`, the Postgres processes are as follows:
postgres 298065 0.0 0.1 67536 5888 ? Ss Nov27 0:00 postgres: logger
postgres 298067 0.0 0.2 215100 9888 ? Ss Nov27 0:01 postgres: checkpointer
postgres 298068 0.0 0.2 215116 8352 ? Ss Nov27 0:00 postgres: background writer
postgres 298069 0.0 0.1 67676 5920 ? Ss Nov27 0:21 postgres: stats collector
postgres 803775 0.0 0.2 214984 10400 ? Ss 08:48 0:00 postgres: walwriter
postgres 803776 0.0 0.1 215680 7840 ? Ss 08:48 0:00 postgres: autovacuum launcher
postgres 803777 0.0 0.1 215556 7328 ? Ss 08:48 0:00 postgres: logical replication launcher
postgres 803865 0.0 0.3 216316 13996 ? Ss 08:48 0:00 postgres: cam postgres 10.0.0.3(37726) COMMIT waiting for 2/CF889F90
postgres 803871 0.0 0.3 216448 14380 ? Ss 08:48 0:00 postgres: cam postgres 10.0.0.3(37730) UPDATE waiting
postgres 803891 0.0 0.3 216448 14380 ? Ss 08:48 0:00 postgres: cam postgres 10.0.0.3(37734) UPDATE waiting
postgres 803917 0.0 0.3 216448 14380 ? Ss 08:48 0:00 postgres: cam postgres 10.0.0.3(37740) UPDATE waiting
postgres 803977 0.0 0.3 216096 12844 ? Ss 08:48 0:00 postgres: cam postgres 10.0.0.1(47334) UPDATE waiting
(90 more such lines omitted)
postgres 806711 0.0 0.3 216336 14260 ? Ss 08:50 0:00 postgres: cam postgres 10.0.0.3(60848) UPDATE waiting
postgres 806712 0.0 0.3 216448 14388 ? Ss 08:50 0:00 postgres: cam postgres 10.0.0.1(42068) UPDATE waiting
postgres 808225 0.0 0.3 216608 15412 ? Ss 08:50 0:00 postgres: postgres postgres 10.0.0.1(49002) DO waiting for 2/CF88B348
postgres 808295 0.0 0.3 216608 15412 ? Ss 08:50 0:00 postgres: postgres postgres 10.0.0.3(33282) DO waiting
postgres 824857 0.0 0.2 215824 9524 ? Ss 09:00 0:00 postgres: walsender postgres 10.0.0.2(52470) streaming 2/D10005F8
postgres 824914 0.0 0.3 216608 15284 ? Ss 09:00 0:00 postgres: postgres postgres 10.0.0.2(52574) DO waiting
Manual workaround
I ran `kill 803871` to terminate a single stuck Postgres backend. This freed a connection slot and allowed me to connect using `psql`.
It also fixed the `stolon-keeper` problem.
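For reference, here is a small sketch of how a victim backend could be picked deliberately instead of by eye. It only assumes `ps aux`-style output in the shape shown above; the exact "waiting" marker text can differ between Postgres versions, so treat the pattern as an assumption:

```shell
# Print the PIDs of Postgres backends whose status line contains a
# "waiting" state (the stuck COMMIT/UPDATE backends shown above).
# Assumption: input is `ps aux`-style output where field 2 is the PID.
list_waiting_backends() {
  awk '/postgres: .* waiting/ { print $2 }'
}

# Example usage (destructive -- pick your victim carefully):
#   ps aux | list_waiting_backends | head -n 1 | xargs kill
```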
Environment
Hypothesis
I suspect that:
- All Postgres connection slots (`max_connections = '100'`) are exhausted, mainly by the queued `UPDATE waiting` transactions.
- Postgres seems to count these processes as connections/clients, even though the actual client process that issued them (my web server, via `stolon-proxy`) was terminated and has been gone for over 8 hours.
- The transactions cannot complete because there are not enough synchronous standbys.
- Stolon cannot update the synchronous standbys, because it acts as a Postgres client itself to do that, and gets `too many clients already`.
Thus, there is a deadlock between
- `stolon-keeper`, which needs a connection slot to change Postgres settings, and
- Postgres, which needs changed settings to free up connection slots.
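Since `psql` cannot connect, the exhaustion part of this hypothesis can only be checked from the OS side. A rough sketch is to count the `ps` lines that look like per-connection client backends, i.e. `postgres: <user> <db> <addr>(<port>) ...`. The pattern below is an assumption based on the listing above; note that walsender lines match it too, even though recent Postgres versions account for those against `max_wal_senders` rather than `max_connections`:

```shell
# Count ps lines that look like per-connection client backends
# ("postgres: <user> <db> <addr>(<port>) ..."), i.e. processes that
# occupy connection slots. Background processes (checkpointer,
# walwriter, ...) have no user/db/address and are not counted.
count_client_backends() {
  grep -cE 'postgres: [^ ]+ [^ ]+ [0-9.]+\('
}

# Usage: ps aux | count_client_backends   # compare against max_connections
```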
Possible fix
`stolon-keeper` should probably keep a connection to Postgres open at all times, so that it always has one available for changing settings.
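Until something like that exists, one possible mitigation on the Postgres side (an assumption on my part, not verified in this cluster) is to make sure slots are reserved for the superuser role the keeper connects as:

```
# postgresql.conf fragment -- in Stolon these would be set via
# pgParameters in the cluster specification, not edited by hand
max_connections = '100'
superuser_reserved_connections = '3'  # slots ordinary roles cannot use
```

This only helps if the keeper's role is a superuser and the reserved slots are not themselves consumed; in the listing above, the `postgres postgres ... DO waiting` backends may have been doing exactly that.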