MySQL Connector creates unbounded new sql.DB pools during sustained DB unavailability #9747

@tushdante

Description

Expected Behavior

When a MySQL database goes offline and comes back, Temporal server's DatabaseHandle.reconnect() should recover gracefully while respecting the configured maxConns limit. The total number of MySQL connections from all pods should never exceed the sum of every pool's maxConns across all pods: e.g., 14 pods × (128 maxConns for the default store + 64 maxConns for the visibility store) = 2,688 connections.

Actual Behavior

During sustained database unavailability, reconnect() creates a new sql.DB instance on each attempt (throttled at 1/second/pod), while closing the old instance asynchronously via go prevConn.Close(). The old pool's connections linger in MySQL's view while new pools open fresh connections. This causes the configured maxConns limit to be effectively bypassed, as it is enforced per sql.DB instance rather than globally.

In our incident, a ~2.5-minute MySQL restart resulted in 10,330 simultaneous connections (vs expected max of 2,688) from 14 Temporal pods. This connection storm overwhelmed the database's cold buffer pool, turning a brief restart into a 35-minute outage.

The history service generated 882,402 serviceerror_Unavailable errors per minute at peak, and all workflows stalled for 30 minutes.

The root issue is that multiple generations of sql.DB pools accumulate during the outage window: roughly 150 reconnection cycles per pod over the ~2.5-minute outage, each creating a fresh pool while the previous one drains asynchronously. When the database recovers, the surviving pools race to establish their full maxConns quota simultaneously. Even if only about six pool generations per pod are still alive at that point, 14 pods × ~6 surviving pools × 128 maxConns ≈ 10,752 connections, consistent with the 10,330 observed.

Steps to Reproduce the Problem

  1. Deploy Temporal cluster with maxConns: 128 (default store) and maxConns: 64 (visibility store), 6 history pods, 14 total pods, and numHistoryShards: 4096 against a MySQL 8.0 database.

  2. Restart or take the MySQL database offline for 2–3 minutes (simulating an HA failover or planned maintenance).

  3. Observe MySQL's Threads_connected after the database comes back online; it will spike far beyond (pod count × sum of per-pool maxConns) because of connection pools accumulated from repeated reconnect() calls during the outage.
