Expected Behavior
When a MySQL database goes offline and comes back, Temporal server's DatabaseHandle.reconnect() should recover gracefully while respecting the configured maxConns limit. The total number of MySQL connections from all pods should never exceed (number of pods) × (sum of maxConns across each pod's pools) — e.g., 14 pods × (128 maxConns for the default store + 64 maxConns for the visibility store) = 2,688 connections.
Actual Behavior
During sustained database unavailability, reconnect() creates a new sql.DB instance on each attempt (throttled at 1/second/pod), while closing the old instance asynchronously via go prevConn.Close(). The old pool's connections linger in MySQL's view while new pools open fresh connections. This causes the configured maxConns limit to be effectively bypassed, as it is enforced per sql.DB instance rather than globally.
In our incident, a ~2.5-minute MySQL restart resulted in 10,330 simultaneous connections (vs expected max of 2,688) from 14 Temporal pods. This connection storm overwhelmed the database's cold buffer pool, turning a brief restart into a 35-minute outage.
The history service generated 882,402 serviceerror_Unavailable errors per minute at peak, and all workflows stalled for 30 minutes.
The root issue is that multiple generations of sql.DB pools accumulate during the outage window. When the database recovers, all surviving pools race to establish their full maxConns quota simultaneously. With ~150 reconnection cycles per pod during the outage, only about six pool generations per pod need to survive with live connections to produce the observed spike: 14 pods × ~6 surviving pools × 128 maxConns ≈ 10,752 connections.
Steps to Reproduce the Problem
- Deploy a Temporal cluster with maxConns: 128 (default store) and maxConns: 64 (visibility store), 6 history pods, 14 total pods, and numHistoryShards: 4096 against a MySQL 8.0 database.
- Restart or take the MySQL database offline for 2–3 minutes (simulating an HA failover or planned maintenance).
- Observe MySQL Threads_connected after the database comes back online: it will spike far beyond (pod count × per-pod maxConns across pools) due to accumulated connection pools from repeated reconnect() calls during the outage.
Specifications
Version: Temporal server v1.26.3 (custom build with auth patches; core DatabaseHandle logic from PR Reconnect to SQL databases when connections fail #5926)
Platform: Azure Flexible Server MySQL 8.0.21
Relevant code: common/persistence/sql/sqlplugin/db_handle.go — reconnect() method and ConvertError() trigger
Related issues: Add support for MySql/Aurora failover #1703 (Aurora failover), Stuck Temporal Server #6514 (stuck server after reconnect throttle bug), PR Issue with spike of connection to database. Potential solution #9211 (connection spike report)