MySQL Connector creates unbounded new sql.DB pools during sustained DB unavailability #9747

@tushdante

Description

Expected Behavior

When a MySQL database goes offline and comes back, Temporal server's DatabaseHandle.reconnect() should recover gracefully while respecting the configured maxConns limit. The total number of MySQL connections from all pods should never exceed the sum of every pool's maxConns across all pods: e.g., 14 pods × (128 maxConns for the default store + 64 maxConns for the visibility store) = 2,688 connections.

Actual Behavior

During sustained database unavailability, reconnect() creates a new sql.DB instance on each attempt (throttled at 1/second/pod), while closing the old instance asynchronously via go prevConn.Close(). The old pool's connections linger in MySQL's view while new pools open fresh connections. This causes the configured maxConns limit to be effectively bypassed, as it is enforced per sql.DB instance rather than globally.

In our incident, a ~2.5-minute MySQL restart resulted in 10,330 simultaneous connections (vs expected max of 2,688) from 14 Temporal pods. This connection storm overwhelmed the database's cold buffer pool, turning a brief restart into a 35-minute outage.

The history service generated 882,402 serviceerror_Unavailable errors per minute at peak, and all workflows stalled for 30 minutes.

The root issue is that multiple generations of sql.DB pools accumulate during the outage window: roughly 150 reconnection cycles per pod over the ~2.5-minute outage, each creating a fresh pool while the previous one drains asynchronously. When the database recovers, the surviving pools race to establish their full maxConns quota simultaneously. Even if only about six pool generations per pod are still alive at that point, 14 pods × ~6 surviving pools × 128 maxConns ≈ 10,752 connections, consistent with the 10,330 observed.

Steps to Reproduce the Problem

  1. Deploy Temporal cluster with maxConns: 128 (default store) and maxConns: 64 (visibility store), 6 history pods, 14 total pods, and numHistoryShards: 4096 against a MySQL 8.0 database.

  2. Restart or take the MySQL database offline for 2–3 minutes (simulating an HA failover or planned maintenance).

  3. Observe MySQL's Threads_connected after the database comes back online; it will spike far beyond (pod count × sum of per-pool maxConns) because of connection pools accumulated from repeated reconnect() calls during the outage.
