hotpatch: Fix/db startup index lock hang #1502

Merged
MODSetter merged 3 commits into main from fix/db-startup-index-lock-hang
Jun 16, 2026

Conversation

@MODSetter MODSetter commented Jun 16, 2026

Owner

Description

Motivation and Context

FIX #

Screenshots

API Changes

  • This PR includes API changes

Change Type

  • Bug fix
  • New feature
  • Performance improvement
  • Refactoring
  • Documentation
  • Dependency/Build system
  • Breaking change
  • Other (specify):

Testing Performed

  • Tested locally
  • Manual/QA verification

Checklist

  • Follows project coding standards and conventions
  • Documentation updated as needed
  • Dependencies updated as needed
  • No lint/build errors or new warnings
  • All relevant tests are passing

High-level PR Summary

This PR fixes a critical database startup hang issue where abandoned "idle in transaction" sessions could indefinitely hold locks and block FastAPI application startup. The fix introduces protective timeouts (idle_in_transaction_session_timeout and lock_timeout) for both web and Celery worker database connections, refactors index creation to use CONCURRENTLY mode (non-blocking ShareUpdateExclusiveLock), adds graceful handling of invalid leftover indexes, and provides configuration knobs (DB_BOOTSTRAP_ON_STARTUP, DB_DDL_LOCK_TIMEOUT_MS) to control bootstrap behavior and ensure fast-fail instead of indefinite hangs.

⏱️ Estimated Review Time: 30-90 minutes

💡 Review Order Suggestion
| Order | File Path |
| --- | --- |
| 1 | surfsense_backend/.env.example |
| 2 | surfsense_backend/app/config/__init__.py |
| 3 | surfsense_backend/app/db.py |
| 4 | surfsense_backend/app/tasks/celery_tasks/__init__.py |

Summary by CodeRabbit

  • New Features

    • Optional database initialization control on startup for flexible deployment scenarios
  • Bug Fixes & Improvements

    • Enhanced concurrent index creation with automatic cleanup of invalid indexes
    • Improved database timeout configuration for better lock and session management
    • Better Celery worker session handling to prevent orphaned database connections during worker failures

MODSetter and others added 3 commits June 16, 2026 16:18
A single abandoned "idle in transaction" session held locks on the
documents table, which blocked the non-concurrent CREATE INDEX (hnsw)
run inside the FastAPI lifespan. Each API restart queued another
CREATE INDEX behind an advisory lock, leaving the server stuck at
"Waiting for application startup." indefinitely and freezing ingestion
writes.

Changes:
- setup_indexes(): build every index with CREATE INDEX CONCURRENTLY
  (non-blocking ShareUpdateExclusiveLock) under a per-session
  lock_timeout, and make each statement non-fatal so a contended/slow
  build is retried next boot instead of wedging startup. Drop leftover
  invalid indexes before rebuilding.
- create_db_and_tables(): apply lock_timeout to extension/create_all
  DDL and gate the whole bootstrap behind DB_BOOTSTRAP_ON_STARTUP.
- engine: set idle_in_transaction_session_timeout (asyncpg) so an
  abandoned transaction is reaped automatically.
- config + .env.example: DB_BOOTSTRAP_ON_STARTUP, DB_DDL_LOCK_TIMEOUT_MS,
  DB_IDLE_IN_TX_TIMEOUT_MS.

Co-authored-by: Cursor <cursoragent@cursor.com>

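The index-build flow described in this commit message could look roughly like the sketch below. It is not the PR's code: `Execute`, `invalid_index_sql`, and `setup_indexes_sketch` are hypothetical names, and the executor is synchronous for brevity, whereas the real code runs on an async engine. It also assumes the connection is in autocommit mode, because PostgreSQL rejects `CREATE INDEX CONCURRENTLY` inside a transaction block.

```python
import logging
from typing import Callable, Iterable, List, Sequence, Tuple

logger = logging.getLogger("index_setup_sketch")

# Hypothetical executor: takes one SQL string, returns the result rows.
Execute = Callable[[str], Sequence[tuple]]


def invalid_index_sql(name: str) -> str:
    """Query that returns a row if `name` exists but is marked invalid.

    A failed concurrent build leaves such an index behind. `name` is assumed
    to come from a trusted, hard-coded list, never from user input.
    """
    return (
        "SELECT 1 FROM pg_class c "
        "JOIN pg_index i ON i.indexrelid = c.oid "
        f"WHERE c.relname = '{name}' AND NOT i.indisvalid"
    )


def setup_indexes_sketch(
    execute: Execute,
    index_defs: Iterable[Tuple[str, str]],
    lock_timeout_ms: int,
) -> List[str]:
    """Build each index concurrently; return the names that failed."""
    failed: List[str] = []
    # Session-level timeout so a contended lock fails fast instead of hanging.
    execute(f"SET lock_timeout = '{int(lock_timeout_ms)}ms'")
    for name, create_sql in index_defs:
        try:
            if execute(invalid_index_sql(name)):
                # Leftover from an earlier failed build: drop before rebuilding.
                execute(f'DROP INDEX CONCURRENTLY IF EXISTS "{name}"')
            execute(create_sql)
        except Exception as exc:  # non-fatal: retried on the next boot
            logger.warning("index %s not built: %s", name, exc)
            failed.append(name)
    return failed
```

Returning the failed names instead of raising keeps one slow or contended build from blocking the rest of startup, which matches the "retried next boot" behavior the commit message describes.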
The long-running ingestion/podcast/video tasks run on a separate Celery
engine (NullPool), so the web engine's idle_in_transaction_session_timeout
did not cover them — which is exactly where the original 11h zombie
(INSERT INTO chunks) came from. Apply the same protection to the Celery
engine with a generous 60-minute default so a worker that hangs/crashes
mid-transaction can't hold locks on documents/chunks indefinitely, while
never reaping a legitimate per-document embed window.

- config + .env.example: DB_CELERY_IDLE_IN_TX_TIMEOUT_MS (default 3600000).

Co-authored-by: Cursor <cursoragent@cursor.com>
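As a rough illustration of how the idle-in-transaction protection might be passed to an asyncpg-backed engine, the sketch below builds a `connect_args` dict. The helper name `connect_args_sketch` is hypothetical and not taken from the PR. The only values drawn from the source are the setting name `idle_in_transaction_session_timeout` and the 3600000 ms Celery default.

```python
from typing import Any, Dict, Optional


def connect_args_sketch(idle_in_tx_timeout_ms: Optional[int]) -> Dict[str, Any]:
    """Return asyncpg connect_args, omitting the setting when it is unset or <= 0.

    PostgreSQL reads a bare integer for this setting as milliseconds, and 0
    disables the timeout. asyncpg expects server_settings values as strings.
    """
    if not idle_in_tx_timeout_ms or idle_in_tx_timeout_ms <= 0:
        return {}
    return {
        "server_settings": {
            "idle_in_transaction_session_timeout": str(idle_in_tx_timeout_ms),
        }
    }


# 60-minute default for the Celery engine, per DB_CELERY_IDLE_IN_TX_TIMEOUT_MS.
CELERY_CONNECT_ARGS = connect_args_sketch(3_600_000)

# Assumed wiring, shown as a comment because it needs SQLAlchemy and a database:
# engine = create_async_engine(DATABASE_URL, poolclass=NullPool,
#                              connect_args=CELERY_CONNECT_ARGS)
```

With this in place the server itself terminates a session that sits idle inside an open transaction for longer than the limit, so the locks are released even if the worker process has hung or crashed.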

vercel Bot commented Jun 16, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

| Project | Deployment | Actions | Updated (UTC) |
| --- | --- | --- | --- |
| surf-sense-frontend | Ready | Preview, Comment | Jun 16, 2026 11:27pm |

coderabbitai Bot commented Jun 16, 2026

Caution

Review failed

The pull request is closed.

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: b6264891-55e1-4e66-b3da-34643e2caf96

📥 Commits

Reviewing files that changed from the base of the PR and between 284df84 and b9702b3.

📒 Files selected for processing (4)
  • surfsense_backend/.env.example
  • surfsense_backend/app/config/__init__.py
  • surfsense_backend/app/db.py
  • surfsense_backend/app/tasks/celery_tasks/__init__.py

Walkthrough

Adds four optional config knobs (DB_BOOTSTRAP_ON_STARTUP, DB_DDL_LOCK_TIMEOUT_MS, DB_IDLE_IN_TX_TIMEOUT_MS, DB_CELERY_IDLE_IN_TX_TIMEOUT_MS) and documents them in .env.example. These knobs are wired into the main asyncpg engine (create_async_engine), the Celery worker engine, create_db_and_tables() (which gains a bootstrap skip-guard), and a refactored setup_indexes() that uses CREATE INDEX CONCURRENTLY with invalid-index cleanup.
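For orientation, the four knobs could be read from the environment along the lines of the sketch below. The class and function names are hypothetical and this is not the project's `Config` class. Only the 3600000 ms Celery default is stated in the PR. The other defaults shown here are placeholder assumptions.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional

_TRUTHY = {"1", "true", "yes", "on"}


@dataclass(frozen=True)
class DbSafetySettings:
    bootstrap_on_startup: bool
    ddl_lock_timeout_ms: int
    idle_in_tx_timeout_ms: int
    celery_idle_in_tx_timeout_ms: int


def load_db_safety_settings(
    env: Optional[Mapping[str, str]] = None,
) -> DbSafetySettings:
    """Read the four knobs from `env` (defaults to os.environ)."""
    env = os.environ if env is None else env

    def as_bool(key: str, default: str) -> bool:
        return env.get(key, default).strip().lower() in _TRUTHY

    def as_int(key: str, default: str) -> int:
        return int(env.get(key, default))

    return DbSafetySettings(
        # Placeholder default (assumption): bootstrap stays enabled.
        bootstrap_on_startup=as_bool("DB_BOOTSTRAP_ON_STARTUP", "true"),
        # Placeholder default (assumption): 5 seconds.
        ddl_lock_timeout_ms=as_int("DB_DDL_LOCK_TIMEOUT_MS", "5000"),
        # Placeholder default (assumption): 0, which means disabled.
        idle_in_tx_timeout_ms=as_int("DB_IDLE_IN_TX_TIMEOUT_MS", "0"),
        # Default stated in the PR: 60 minutes.
        celery_idle_in_tx_timeout_ms=as_int(
            "DB_CELERY_IDLE_IN_TX_TIMEOUT_MS", "3600000"
        ),
    )
```

Passing an explicit mapping, for example `load_db_safety_settings({"DB_BOOTSTRAP_ON_STARTUP": "false"})`, makes the parsing easy to exercise without touching the real process environment.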

Changes

DB Safety Knobs and Startup Bootstrap

| Layer / File(s) | Summary |
| --- | --- |
| Config settings and env.example documentation (surfsense_backend/.env.example, surfsense_backend/app/config/__init__.py) | Adds DB_BOOTSTRAP_ON_STARTUP, DB_DDL_LOCK_TIMEOUT_MS, DB_IDLE_IN_TX_TIMEOUT_MS, and DB_CELERY_IDLE_IN_TX_TIMEOUT_MS to the Config class with env-var bindings and defaults; documents them in .env.example. |
| Asyncpg connect_args, engine wiring, and bootstrap guard (surfsense_backend/app/db.py) | Adds module logger; introduces _build_connect_args() to conditionally inject idle_in_transaction_session_timeout into asyncpg server_settings and passes the result into create_async_engine(); updates create_db_and_tables() to skip bootstrapping when DB_BOOTSTRAP_ON_STARTUP is false and to set a local lock_timeout before running DDL. |
| Concurrent index provisioning and Celery worker timeout (surfsense_backend/app/db.py, surfsense_backend/app/tasks/celery_tasks/__init__.py) | Replaces the old setup_indexes() with a centralized _INDEX_DEFINITIONS list, _drop_invalid_index() to clean up failed concurrent-build leftovers, and a new autocommit-based setup_indexes() running CREATE INDEX CONCURRENTLY per index with per-index failure logging; also wires DB_CELERY_IDLE_IN_TX_TIMEOUT_MS into the Celery worker asyncpg engine via server_settings. |
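The bootstrap skip-guard and local lock timeout summarized above might be sketched as follows. The function name and the executor callable are hypothetical, and the caller is assumed to have already opened a transaction, since `SET LOCAL` only lasts until the end of the current transaction.

```python
import logging
from typing import Callable, Iterable

logger = logging.getLogger("db_bootstrap_sketch")


def bootstrap_schema_sketch(
    execute: Callable[[str], object],
    ddl_statements: Iterable[str],
    *,
    bootstrap_on_startup: bool,
    ddl_lock_timeout_ms: int,
) -> bool:
    """Run bootstrap DDL under a transaction-local lock_timeout.

    Returns False without touching the database when bootstrap is disabled.
    """
    if not bootstrap_on_startup:
        logger.info("DB bootstrap skipped (DB_BOOTSTRAP_ON_STARTUP is false)")
        return False
    # Fail fast if another session holds a conflicting lock; the setting
    # reverts automatically when the surrounding transaction ends.
    execute(f"SET LOCAL lock_timeout = '{int(ddl_lock_timeout_ms)}ms'")
    for statement in ddl_statements:
        execute(statement)
    return True
```

A deployment that applies schema changes out of band could set the flag to false so that API restarts never issue DDL at all.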

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Poem

🐇 Hops through startup, light as dew,
No stale indexes blocking the queue.
Lock timeouts set, the engine won't freeze,
Idle transactions caught in the breeze.
Bootstrap flag set — or skip if you please!

@MODSetter MODSetter merged commit 7ce409c into main Jun 16, 2026
6 of 11 checks passed