
fix(merge): make candidate dedup anchor-insensitive (align with status-preservation keys)#136

Merged
justinjoy merged 1 commit into main from fix/merge-dedup-anchor-insensitive on Jun 21, 2026


@justinjoy

Contributor

Summary

Fixes the lone anchor-sensitive key in the merge path. normalize_rows deduped raw candidates on the full source string, including any #anchor, while every status-preservation key in the same file (existing_superseded_keys, existing_engine_keys, existing_review_keys) deliberately strips the anchor. Because of that asymmetry, the same (subject, relation, object) re-extracted from the same file with a drifted, added, or invalid anchor survived as two candidate rows, even though the status layer treats them as one fact. The result was two candidate rows under a single status identity.
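To make the asymmetry concrete, here is a minimal sketch of the two keying schemes side by side. The row shape (a 4-tuple), the sample values, and both helper names are hypothetical illustrations, not the project's actual code; only the `source.partition("#")[0]` expression comes from this PR.

```python
# Hypothetical row shape: (subject, relation, object, source).
row_bare = ("alice", "works_at", "acme", "sources/a.md")
row_anchored = ("alice", "works_at", "acme", "sources/a.md#employment")


def full_source_key(row):
    """Old dedup behaviour as described above: the anchor is part of the key."""
    subject, relation, obj, source = row
    return (subject, relation, obj, source)


def anchor_stripped_key(row):
    """Preservation-key behaviour: everything from '#' onwards is dropped."""
    subject, relation, obj, source = row
    return (subject, relation, obj, source.partition("#")[0])


# Under the old dedup key the rows look distinct, so both survive...
old_keys_differ = full_source_key(row_bare) != full_source_key(row_anchored)
# ...while an anchor-stripping key sees a single identity.
stripped_keys_match = anchor_stripped_key(row_bare) == anchor_stripped_key(row_anchored)
```

Both flags evaluate to `True`, which is the mismatch this PR removes.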

Change

  • Dedup key is now (subject, relation, object, source.partition("#")[0]) — anchor-insensitive, matching the preservation keys.
  • On collision the surviving row is chosen deterministically: the row whose full source sorts lexicographically first wins. A bare sources/a.md is a prefix of (and therefore sorts before) any sources/a.md#anchor, so bare beats anchored; between two anchored sources, the lexicographically first anchor wins. The winner is determined by value alone, independent of input row order, rather than by iteration order as before.
  • Source-existence validation (partition("#") check) and NFC normalisation are unchanged.

This intentionally collapses cross-section provenance, consistent with how the status layer already treats anchors as insignificant. Per-section provenance, if ever wanted, should be a separate deliberate feature.
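The keying and tie-break rule described above can be sketched roughly as follows. This is an illustrative approximation, not the merged implementation: `dedup_rows` and the 4-tuple row shape are assumptions, and the source-existence validation and NFC normalisation that the real normalize_rows performs are omitted.

```python
def dedup_rows(rows):
    """Collapse rows sharing (subject, relation, object, source file).

    Hypothetical helper. On collision, the row whose full source string
    sorts lexicographically first is kept, regardless of input order.
    """
    winners = {}
    for subject, relation, obj, source in rows:
        # Anchor-insensitive key: drop everything from '#' onwards.
        key = (subject, relation, obj, source.partition("#")[0])
        current = winners.get(key)
        # A bare path is a strict prefix of its anchored variants,
        # so it compares as smaller and wins the tie-break.
        if current is None or source < current[3]:
            winners[key] = (subject, relation, obj, source)
    return list(winners.values())


rows = [
    ("s", "r", "o", "sources/a.md#intro"),
    ("s", "r", "o", "sources/a.md"),
]
forward = dedup_rows(rows)
backward = dedup_rows(list(reversed(rows)))
```

Here both `forward` and `backward` contain the single row with source `sources/a.md`, since the comparison depends only on the source values and not on which row was seen first.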

Tests

  • New tests/unit/test_dedup_anchor.py: bare-vs-anchored collapse, two-anchor collapse, order-independence of the surviving source, and distinct-triple retention — each asserting exactly which source survives.
  • Full pytest unit layer: 215 passed.
  • All 28 tests/test_*.sh shell harnesses pass; golden engine regression passes; ruff clean.
  • All examples synthetic (public repo).
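For readers without the repository at hand, the four cases listed for tests/unit/test_dedup_anchor.py might look something like the sketch below. It is not the actual test file: `dedup` is a hypothetical stand-in for the dedup step (the real tests presumably exercise normalize_rows, whose signature is not shown here), and all rows are synthetic.

```python
from itertools import permutations


def dedup(rows):
    """Hypothetical stand-in for the dedup step; not the project's normalize_rows."""
    best = {}
    for s, r, o, src in rows:
        key = (s, r, o, src.partition("#")[0])
        if key not in best or src < best[key][3]:
            best[key] = (s, r, o, src)
    # Sorted only so the assertions below do not depend on dict order.
    return sorted(best.values())


def surviving_sources(rows):
    return [row[3] for row in dedup(rows)]


def test_bare_beats_anchored():
    rows = [("s", "r", "o", "sources/a.md#intro"), ("s", "r", "o", "sources/a.md")]
    assert surviving_sources(rows) == ["sources/a.md"]


def test_two_anchors_collapse_to_first():
    rows = [("s", "r", "o", "sources/a.md#b"), ("s", "r", "o", "sources/a.md#a")]
    assert surviving_sources(rows) == ["sources/a.md#a"]


def test_survivor_is_order_independent():
    rows = [
        ("s", "r", "o", "sources/a.md#b"),
        ("s", "r", "o", "sources/a.md"),
        ("s", "r", "o", "sources/a.md#a"),
    ]
    # Every input ordering must yield the same surviving source.
    for perm in permutations(rows):
        assert surviving_sources(list(perm)) == ["sources/a.md"]


def test_distinct_triples_are_retained():
    rows = [("s", "r", "o1", "sources/a.md#x"), ("s", "r", "o2", "sources/a.md#x")]
    assert surviving_sources(rows) == ["sources/a.md#x", "sources/a.md#x"]
```

Asserting the exact surviving source, rather than only the row count, is what catches a regression back to order-dependent behaviour.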

Closes #135

…s-preservation keys)

normalize_rows keyed dedup on the full source string including any #anchor,
while every status-preservation key (superseded / engine / review) strips the
anchor. That asymmetry kept the same (subject, relation, object) re-extracted
from the same file with a drifted/added anchor as two candidate rows, even
though the status layer treats them as one fact.

Dedup now keys on (subject, relation, object, source-file), stripping the
anchor like the preservation keys. On collision the surviving row is chosen
deterministically — the full source that sorts lexicographically first wins, so
a bare path beats any anchored variant and otherwise the first anchor wins,
independent of input order. Source-existence validation and NFC normalisation
are untouched.

Closes #135
@justinjoy justinjoy merged commit 9624b5c into main Jun 21, 2026
3 checks passed
@justinjoy justinjoy deleted the fix/merge-dedup-anchor-insensitive branch June 21, 2026 10:55