Extend is_stopword check in weighted_edit_similarity to either side of a cluster #202

@pudo

Motivating case

From the harness corpus (contrib/name_comparison/cases.csv):

| name1 | name2 | label | quality | category | current score |
|---|---|---|---|---|---|
| Bashar xal-Assad | Bashar al-Assad | true | STRONG | Fat Finger Typo | ~0.67 |

Below the 0.7 alert threshold despite being a STRONG match.

Diagnosis

After tokenisation on the hyphen, the residue stage clusters xal with al (the 0.51 overlap rule fires on the shared al). Lengths 3 and 2 give a tiny per-side cost budget; one edit blows past it and the cluster scores ~0, dragging the weighted average from above-threshold down to 0.67.

The cluster-weight policy in nomenklatura/matching/logic_v2/names/distance.py already has a stopword down-weight that would solve this:

if len(match.qps) == 1 and len(match.rps) == 1:
    if is_stopword(match.qps[0].form):
        match.weight = 0.7

al is in resources/text/stopwords.yml (line 21). But the lookup only checks qps[0] — i.e. xal, the typo'd query side — so the result-side particle never marks the cluster as low-information.
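The asymmetry can be reproduced in isolation. The sketch below is a minimal stand-in, not the library code: `Part`, `Match`, the tiny `STOPWORDS` set and `apply_current_policy` are hypothetical names introduced here, and the real `is_stopword` reads from resources/text/stopwords.yml rather than a hard-coded set.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical stand-in for the stopword resource; the real list is in stopwords.yml.
STOPWORDS = {"al", "de", "la"}


def is_stopword(form: str) -> bool:
    """Stub with the same name as the library helper (assumed behaviour)."""
    return form in STOPWORDS


@dataclass
class Part:
    form: str


@dataclass
class Match:
    qps: List[Part]  # query-side parts in the cluster
    rps: List[Part]  # result-side parts in the cluster
    weight: float = 1.0


def apply_current_policy(match: Match) -> Match:
    # Mirrors the snippet quoted above: only the query side is inspected.
    if len(match.qps) == 1 and len(match.rps) == 1:
        if is_stopword(match.qps[0].form):
            match.weight = 0.7
    return match


# Typo on the query side: qps[0] is "xal", so the down-weight is missed.
typo_on_query = apply_current_policy(Match([Part("xal")], [Part("al")]))
# Same pair with the sides swapped: qps[0] is "al", so the down-weight applies.
typo_on_result = apply_current_policy(Match([Part("al")], [Part("xal")]))

print(typo_on_query.weight)   # 1.0
print(typo_on_result.weight)  # 0.7
```

The same pair of tokens gets a different weight depending only on which side carries the typo.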

Proposed fix

Two small changes:

  1. nomenklatura/matching/logic_v2/names/distance.py — make the stopword check fire when either side is a known stopword:

    if is_stopword(match.qps[0].form) or is_stopword(match.rps[0].form):
        match.weight = 0.7

    Rationale: a cluster is low-information if the intended token (either side) is a connective particle. The matcher shouldn't require that the noisy side also happens to land on a known particle.

  2. resources/text/stopwords.yml — audit particle coverage. Already present: al, de, la, el, del, du, le, da, der, von. Worth adding for Arabic/Persian/Hebrew/Romance/Germanic/Celtic naming particles commonly seen in screening data: bin, ibn, ben, abu, abd, dos, mac, mc, van. (Curated, not exhaustive — precision over quantity per repo conventions.)

The existing 0.7 weight is already calibrated; this just ensures it triggers in the symmetrical case.
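To see why down-weighting alone may be enough to cross the threshold, here is a toy calculation. It assumes a plain weighted average over three clusters with base weight 1.0, perfect scores for the two clean tokens, and a score of exactly 0 for the particle cluster. Those are simplifying assumptions for illustration; the actual aggregation in weighted_edit_similarity may differ.

```python
from typing import List, Tuple


def weighted_average(clusters: List[Tuple[float, float]]) -> float:
    """Average of cluster scores, each given as a (score, weight) pair."""
    total_weight = sum(weight for _, weight in clusters)
    return sum(score * weight for score, weight in clusters) / total_weight


# Assumed clusters: bashar/bashar, xal/al, assad/assad.
before = weighted_average([(1.0, 1.0), (0.0, 1.0), (1.0, 1.0)])
# Same clusters, with the particle cluster down-weighted to 0.7.
after = weighted_average([(1.0, 1.0), (0.0, 0.7), (1.0, 1.0)])

print(round(before, 3), round(after, 3))  # 0.667 0.741
```

Under these assumptions the score moves from about 0.67 to about 0.74, which is the direction and rough size of change the fix is meant to produce.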

Out of scope

  • New is_name_stopword / NAME_PARTICLES resource — the existing is_stopword + resources/text/stopwords.yml already model the right concept.
  • Particle insert/delete across the full alignment (e.g. Mohammed bin Salman vs Mohammed Salman) — that's symbol pairing's job, not residue distance.
  • Tokenizer changes (e.g. collapsing hyphens) — too broad, would affect unrelated cases.

Validation

The harness in contrib/name_comparison/ exercises this end-to-end:

  • The Bashar case (Bashar xal-Assad vs Bashar al-Assad) should rise above 0.7.
  • Run the full suite to confirm no regressions on cases.csv labelled outcomes (especially STRONG-tier negatives that share particles).
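One way to phrase the regression check is sketched below. It is not the harness itself: the `Case` tuple, the `regressions` helper, the second pair of names and all score values are made-up placeholders, and treating a score at or above 0.7 as an alert is an assumption.

```python
from typing import List, NamedTuple

THRESHOLD = 0.7  # alert threshold quoted in this issue


class Case(NamedTuple):
    name1: str
    name2: str
    label: bool    # True if the two names should match
    before: float  # score without the change (placeholder value)
    after: float   # score with the change (placeholder value)


def regressions(cases: List[Case]) -> List[Case]:
    """Return cases whose outcome was correct before the change but is wrong after."""
    flagged = []
    for case in cases:
        was_right = (case.before >= THRESHOLD) == case.label
        is_right = (case.after >= THRESHOLD) == case.label
        if was_right and not is_right:
            flagged.append(case)
    return flagged


# Placeholder rows: scores are invented to exercise the check, not measured.
cases = [
    Case("Bashar xal-Assad", "Bashar al-Assad", True, 0.67, 0.74),
    Case("Anna de Example", "Carl de Sample", False, 0.55, 0.72),
]

flagged = regressions(cases)
for case in flagged:
    print(case.name1, "vs", case.name2)
```

With these placeholder rows only the second pair is flagged: a negative that shares a particle and crosses the threshold after the change, which is the kind of case the full run should rule out.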
