stolon-keeper deadlock with Postgres: too many clients already #930

@nh2

Description

Observed situation

After the system got into a degraded state (network disconnections, OOM crashes), my Stolon cluster was broken and did not auto-heal.

stolon-keeper looped on the error pq: sorry, too many clients already:

2025-12-01T12:57:18.169Z        INFO        cmd/keeper.go:1505        our db requested role is master
2025-12-01T12:57:18.170Z        INFO        cmd/keeper.go:1543        already master
2025-12-01T12:57:18.172Z        ERROR        cmd/keeper.go:1022        failed to get replication slots        {"error": "pq: sorry, too many clients already"}
2025-12-01T12:57:18.172Z        ERROR        cmd/keeper.go:1547        error updating replication slots        {"error": "pq: sorry, too many clients already"}
2025-12-01T12:57:19.024Z        ERROR        cmd/keeper.go:720        cannot get configured pg parameters        {"error": "pq: sorry, too many clients already"}
2025-12-01T12:57:21.526Z        ERROR        cmd/keeper.go:720        cannot get configured pg parameters        {"error": "pq: sorry, too many clients already"}

Trying to connect to Postgres with psql (via stolon-proxy or directly against the local postgres port) failed with the same error: pq: sorry, too many clients already.

In htop / ps, the Postgres processes are as follows:

postgres  298065  0.0  0.1  67536  5888 ?        Ss   Nov27   0:00 postgres: logger 
postgres  298067  0.0  0.2 215100  9888 ?        Ss   Nov27   0:01 postgres: checkpointer 
postgres  298068  0.0  0.2 215116  8352 ?        Ss   Nov27   0:00 postgres: background writer 
postgres  298069  0.0  0.1  67676  5920 ?        Ss   Nov27   0:21 postgres: stats collector 
postgres  803775  0.0  0.2 214984 10400 ?        Ss   08:48   0:00 postgres: walwriter 
postgres  803776  0.0  0.1 215680  7840 ?        Ss   08:48   0:00 postgres: autovacuum launcher 
postgres  803777  0.0  0.1 215556  7328 ?        Ss   08:48   0:00 postgres: logical replication launcher 
postgres  803865  0.0  0.3 216316 13996 ?        Ss   08:48   0:00 postgres: cam postgres 10.0.0.3(37726) COMMIT waiting for 2/CF889F90
postgres  803871  0.0  0.3 216448 14380 ?        Ss   08:48   0:00 postgres: cam postgres 10.0.0.3(37730) UPDATE waiting
postgres  803891  0.0  0.3 216448 14380 ?        Ss   08:48   0:00 postgres: cam postgres 10.0.0.3(37734) UPDATE waiting
postgres  803917  0.0  0.3 216448 14380 ?        Ss   08:48   0:00 postgres: cam postgres 10.0.0.3(37740) UPDATE waiting
postgres  803977  0.0  0.3 216096 12844 ?        Ss   08:48   0:00 postgres: cam postgres 10.0.0.1(47334) UPDATE waiting
  ... 90 more lines like the above ...
postgres  806711  0.0  0.3 216336 14260 ?        Ss   08:50   0:00 postgres: cam postgres 10.0.0.3(60848) UPDATE waiting
postgres  806712  0.0  0.3 216448 14388 ?        Ss   08:50   0:00 postgres: cam postgres 10.0.0.1(42068) UPDATE waiting
postgres  808225  0.0  0.3 216608 15412 ?        Ss   08:50   0:00 postgres: postgres postgres 10.0.0.1(49002) DO waiting for 2/CF88B348
postgres  808295  0.0  0.3 216608 15412 ?        Ss   08:50   0:00 postgres: postgres postgres 10.0.0.3(33282) DO waiting
postgres  824857  0.0  0.2 215824  9524 ?        Ss   09:00   0:00 postgres: walsender postgres 10.0.0.2(52470) streaming 2/D10005F8
postgres  824914  0.0  0.3 216608 15284 ?        Ss   09:00   0:00 postgres: postgres postgres 10.0.0.2(52574) DO waiting

Manual workaround

I ran kill 803871 to terminate a single postgres backend process. This freed a connection slot and allowed me to connect using psql.

It also fixed the stolon-keeper problem.

Environment

Hypothesis

I suspect that:

  • All of Postgres's max_connections = 100 slots are exhausted, mainly by the queued UPDATE waiting transactions.
    • Postgres seems to count these processes as connections/clients, even though the actual client process (my web server) that issued them (via stolon-proxy) was terminated and has been gone for over 8 hours.
  • The transactions cannot complete because there are not enough synchronous standbys.
  • Stolon cannot update the synchronous standbys because it itself acts as a postgres client to do that, and gets too many clients already.

Thus, there is a deadlock between

  • stolon-keeper needing a connection slot to change postgres settings, and
  • postgres needing the changed settings to free up connection slots.
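
For context on the hypothesis above: Postgres by default sets aside a few connection slots that only superuser roles may use. So for the keeper to see "too many clients already", either it connects as a non-superuser role, or the reserved slots were also consumed. This is an assumption on my part, not something I verified on this cluster; the stock defaults are shown for illustration:

```
# postgresql.conf (stock defaults, illustrative)
max_connections = 100                 # total backend slots
superuser_reserved_connections = 3    # of those, usable only by superusers
```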

Possible fix

stolon-keeper should probably keep one connection to postgres open at all times, so that it can still change settings even when all other connection slots are taken.
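
The idea can be sketched as follows (my interpretation of the fix, not stolon's actual code): model Postgres connection slots as a bounded pool. Stuck client backends hold every slot, so a keeper that acquires a connection on demand gets "too many clients already", while a keeper that reserved one slot at startup can still issue its admin commands.

```go
package main

import "fmt"

// pool models Postgres connection slots as a buffered channel:
// each held slot is one backend process counted against max_connections.
type pool struct{ slots chan struct{} }

func newPool(n int) *pool { return &pool{slots: make(chan struct{}, n)} }

// tryAcquire returns false when all slots are taken, i.e. the
// "pq: sorry, too many clients already" case.
func (p *pool) tryAcquire() bool {
	select {
	case p.slots <- struct{}{}:
		return true
	default:
		return false
	}
}

// release frees one previously acquired slot.
func (p *pool) release() { <-p.slots }

func main() {
	const maxConnections = 100
	pg := newPool(maxConnections)

	// Proposed fix: the keeper reserves its admin connection at startup,
	// before the cluster gets into trouble.
	reserved := pg.tryAcquire()

	// Later: stuck "UPDATE waiting" backends fill every remaining slot.
	for pg.tryAcquire() {
	}

	// An on-demand connection now fails, reproducing the deadlock...
	fmt.Println("on-demand connect ok:", pg.tryAcquire()) // false
	// ...but the keeper's pre-reserved connection is still usable.
	fmt.Println("reserved connection ok:", reserved) // true
}
```

The keeper would of course also need to re-establish the reserved connection if it drops (e.g. after a postgres restart), but the key point is that acquisition happens eagerly, not at the moment the settings change is needed.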
