fix(dagster): use block-number-scoped row count for pre-DROP safety check #53811

Open
bciaraldi wants to merge 2 commits into master from bc/part-breaker-checks

Conversation

@bciaraldi
Contributor


Problem

The part breaker's safety check before DROP PART compares the number of attached parts against the expected count. ClickHouse background merges can combine freshly attached parts between the ATTACH and the safety check, reducing that count. This causes the safety check to fail with:

Expected at least x new parts on <table> partition x, but only found < x (excluding original). Skipping DROP to avoid data loss.

The INSERT SELECT, ATTACH, and replication all succeed and the data is intact, but the overly strict part-count check prevents the DROP, leaving the old oversized part in place and wasting disk space.

Changes

  • Before ATTACH, record the block number range (min_block, max_block) of our parts from part names
  • After ATTACH + replication, query system.parts for parts that overlap our block range
  • Validate the row count against the staging target with zero tolerance
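
The first step above can be sketched roughly like this (a minimal illustration, not the PR's actual code; the `block_range` helper and the part-name layout assumption are mine):

```python
def block_range(part_names):
    """Compute (min_block, max_block) across a set of ClickHouse part names.

    Assumes the standard layout <partition_id>_<min_block>_<max_block>_<level>
    and a partition_id containing no underscores (true for e.g. toYYYYMM keys);
    a real implementation would need to handle mutation suffixes as well.
    """
    mins, maxs = [], []
    for name in part_names:
        fields = name.split("_")
        mins.append(int(fields[1]))
        maxs.append(int(fields[2]))
    return min(mins), max(maxs)
```

Recording this range before ATTACH means the later safety check does not depend on the parts surviving as-is: only on their block numbers, which ClickHouse carries through merges.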

Why this is safe:

  • Background merges: can combine our parts with pre-existing ones, but the merged result still overlaps our block range, so it is caught by the overlap query
  • Pre-existing partition data: has a lower block range, excluded unless merged with our parts (in which case the row count increases)
  • Concurrent ingestion: has a higher block range, excluded by the max_block upper bound
  • Zero tolerance: the check can only over-count (from cross-range merges including pre-existing rows), never under-count.
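
As a hypothetical illustration of the first point (names and helper are mine, following the `<partition_id>_<min>_<max>_<level>` naming convention): a merge produces a part whose block range spans its inputs, so the merged part still overlaps the recorded range.

```python
def merged_part_name(partition_id, names):
    """Name of the part a merge of `names` would produce (illustrative only).

    ClickHouse names a merged part with the min/max block numbers of its
    inputs and a level one above the highest input level.
    """
    mins = [int(n.split("_")[1]) for n in names]
    maxs = [int(n.split("_")[2]) for n in names]
    level = max(int(n.split("_")[3]) for n in names) + 1
    return f"{partition_id}_{min(mins)}_{max(maxs)}_{level}"

# Merging two freshly attached parts covering blocks 12..15 and 16..18 yields
# a part covering 12..18, which a query for overlap with [12, 18] still finds.
```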

Failure cases:

  • Zero parts found in our block range → fail
  • Row count below staging target → fail
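
Both failure cases reduce to one query over system.parts plus a pure comparison. A rough sketch under stated assumptions: the query shape, parameter names, and `verify_counts` are mine, not the PR's code, and the parameter style assumes a clickhouse-driver-like client.

```python
# Parts whose block range overlaps ours; values are passed as query
# parameters, never interpolated into the SQL string.
OVERLAP_QUERY = """
SELECT count() AS part_count, sum(rows) AS row_count
FROM system.parts
WHERE database = %(database)s
  AND table = %(table)s
  AND partition_id = %(partition_id)s
  AND active
  AND max_block_number >= %(our_min_block)s
  AND min_block_number <= %(our_max_block)s
"""

def verify_counts(part_count, row_count, expected_rows):
    """Zero-tolerance check before DROP: cross-range merges can only add
    pre-existing rows to the range, so any shortfall means data is missing."""
    if part_count == 0:
        raise RuntimeError("no parts found in our block range; skipping DROP")
    if row_count < expected_rows:
        raise RuntimeError(
            f"row count {row_count} below staging target {expected_rows}; skipping DROP"
        )
    return True
```

Checking `part_count` first also guards against `sum(rows)` returning NULL when no parts match.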

How did you test this code?

  • Observed the DAG failure: x parts attached, x merged by a background process, old part-count check failed at < x
  • Manually verified data integrity: compared _part-filtered row counts before and after dropping the old part
  • Verified min_block_number/max_block_number are preserved through merges on prod

Publish to changelog?

no

@bciaraldi bciaraldi requested a review from a team April 9, 2026 00:26
@greptile-apps
Contributor

greptile-apps bot commented Apr 9, 2026

Vulnerabilities

No security concerns identified. The block-range values (our_min_block, our_max_block) are derived from filesystem part names already filtered by partition_id, and are passed as parameterised query arguments — not interpolated directly into SQL.

Reviews (1): Last reviewed commit: "pre|post-check block range"

@github-actions
Contributor

github-actions bot commented Apr 9, 2026

🎭 Playwright report · View test results →

⚠️ 1 flaky test:

  • Materialize view pane (chromium)

These issues are not necessarily caused by your changes.

