Motivating case
From the harness corpus (contrib/name_comparison/cases.csv):
| name1 | name2 | label | quality | category | current score |
| --- | --- | --- | --- | --- | --- |
| Bashar xal-Assad | Bashar al-Assad | true | STRONG | Fat Finger Typo | ~0.67 |
Below the 0.7 alert threshold despite being a STRONG match.
Diagnosis
After tokenisation on the hyphen, the residue stage clusters xal ↔ al (0.51 overlap rule fires on the shared al). Lengths 3 and 2 give a tiny per-side cost budget; one edit blows past it and the cluster scores ~0, dragging the weighted average from above-threshold down to 0.67.
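The drop to ~0.67 can be reproduced with a minimal sketch, assuming three one-to-one clusters of equal weight in which the two exact matches score 1.0 and the xal ↔ al cluster scores 0. These per-cluster scores and weights are illustrative assumptions, not values read from the library, and `weighted_average` is a hypothetical helper.

```python
def weighted_average(clusters):
    """Weighted mean of (score, weight) pairs."""
    total_weight = sum(weight for _, weight in clusters)
    return sum(score * weight for score, weight in clusters) / total_weight


# Assumed (score, weight) per cluster, not taken from the library:
# bashar<->bashar, xal<->al, assad<->assad
with_typo = [(1.0, 1.0), (0.0, 1.0), (1.0, 1.0)]
without_typo = [(1.0, 1.0), (1.0, 1.0), (1.0, 1.0)]

typo_score = round(weighted_average(with_typo), 2)
clean_score = round(weighted_average(without_typo), 2)
print(typo_score, clean_score)  # 0.67 1.0
```

Under these assumptions, a single zero-scored cluster at full weight costs one third of the total, which is enough to fall under the 0.7 threshold.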
The cluster-weight policy in nomenklatura/matching/logic_v2/names/distance.py already has a stopword down-weight that would solve this:
    if len(match.qps) == 1 and len(match.rps) == 1:
        if is_stopword(match.qps[0].form):
            match.weight = 0.7
al is in resources/text/stopwords.yml (line 21). But the lookup only checks qps[0] — i.e. xal, the typo'd query side — so the result-side particle never marks the cluster as low-information.
Proposed fix
Two small changes:
- In nomenklatura/matching/logic_v2/names/distance.py, make the stopword check fire when either side is a known stopword:

      if is_stopword(match.qps[0].form) or is_stopword(match.rps[0].form):
          match.weight = 0.7

  Rationale: a cluster is low-information if the intended token (on either side) is a connective particle. The matcher shouldn't require that the noisy side also happens to land on a known particle.
- In resources/text/stopwords.yml, audit particle coverage. Already present: al, de, la, el, del, du, le, da, der, von. Worth adding for Arabic/Persian/Hebrew/Romance/Germanic/Celtic naming particles commonly seen in screening data: bin, ibn, ben, abu, abd, dos, mac, mc, van. (Curated, not exhaustive: precision over quantity, per repo conventions.)
The existing 0.7 weight is already calibrated; this just ensures it triggers in the symmetrical case.
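A minimal, self-contained sketch of the symmetric check follows. `Part`, `Match`, `STOPWORDS` and `apply_stopword_weight` are hypothetical stand-ins for the library's own types and lookup, and the per-cluster scores and unit weights are assumed for illustration only.

```python
from dataclasses import dataclass

# Hypothetical stand-in for the contents of resources/text/stopwords.yml.
STOPWORDS = {"al", "de", "la", "el", "del", "du", "le", "da", "der", "von"}


def is_stopword(form: str) -> bool:
    """Hypothetical lookup; the real one reads the YAML resource."""
    return form.lower() in STOPWORDS


@dataclass
class Part:
    form: str


@dataclass
class Match:
    qps: list  # query-side name parts
    rps: list  # result-side name parts
    score: float = 0.0
    weight: float = 1.0


def apply_stopword_weight(match: Match) -> None:
    """Down-weight a one-to-one cluster when either side is a stopword."""
    if len(match.qps) == 1 and len(match.rps) == 1:
        if is_stopword(match.qps[0].form) or is_stopword(match.rps[0].form):
            match.weight = 0.7


# Assumed cluster scores: exact matches score 1.0, the typo cluster ~0.
matches = [
    Match([Part("bashar")], [Part("bashar")], score=1.0),
    Match([Part("xal")], [Part("al")], score=0.0),
    Match([Part("assad")], [Part("assad")], score=1.0),
]
for m in matches:
    apply_stopword_weight(m)

total = sum(m.score * m.weight for m in matches) / sum(m.weight for m in matches)
print(round(total, 2))  # 0.74 under these assumed scores and weights
```

With the query-only check, the xal ↔ al cluster would keep weight 1.0 and the same arithmetic gives 0.67; the real scorer may weight clusters differently, so the harness result is what counts.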
Out of scope
- New is_name_stopword / NAME_PARTICLES resource: the existing is_stopword + resources/text/stopwords.yml already model the right concept.
- Particle insert/delete across the full alignment (e.g. Mohammed bin Salman vs Mohammed Salman): that's symbol pairing's job, not residue distance.
- Tokenizer changes (e.g. collapsing hyphens): too broad, would affect unrelated cases.
Validation
The harness in contrib/name_comparison/ exercises this end-to-end:
- The Bashar case (Bashar xal-Assad vs Bashar al-Assad) should rise above 0.7.
- Run the full suite to confirm no regressions against the labelled outcomes in cases.csv (especially STRONG-tier negatives that share particles).
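One way to frame the regression check is sketched below. It is not the harness's own runner: `find_regressions`, the two scorer callables, and the assumption that the CSV exposes name1, name2 and label columns under those header names are all hypothetical.

```python
import csv
import io


def find_regressions(rows, score_before, score_after, threshold=0.7):
    """Return pairs classified correctly before the change but not after."""
    regressions = []
    for row in rows:
        expected = row["label"].strip().lower() == "true"
        before = score_before(row["name1"], row["name2"]) >= threshold
        after = score_after(row["name1"], row["name2"]) >= threshold
        if before == expected and after != expected:
            regressions.append((row["name1"], row["name2"]))
    return regressions


# In-memory stand-in for cases.csv (column layout assumed).
sample = io.StringIO(
    "name1,name2,label\n"
    "Bashar xal-Assad,Bashar al-Assad,true\n"
)
rows = list(csv.DictReader(sample))

# Placeholder scorers returning fixed values; swap in the real matcher.
old_scorer = lambda a, b: 0.67
new_scorer = lambda a, b: 0.74

print(find_regressions(rows, old_scorer, new_scorer))  # []
```

An empty list means no previously correct case flipped; negatives that share a particle are the rows most worth inspecting if the list is non-empty.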