Extend is_stopword check in weighted_edit_similarity to either side of a cluster

## Motivating case

From the harness corpus (`contrib/name_comparison/cases.csv`):

| name1 | name2 | label | quality | category | current score |
|---|---|---|---|---|---|
| Bashar **xal**-Assad | Bashar **al**-Assad | true | STRONG | Fat Finger Typo | ~0.67 |

The pair scores below the 0.7 alert threshold despite being labelled a STRONG match.

## Diagnosis

After tokenisation on the hyphen, the residue stage clusters `xal` ↔ `al` (the 0.51 overlap rule fires on the shared `al`). With lengths 3 and 2, the per-side cost budget is tiny; a single edit exceeds it and the cluster scores ~0, dragging the weighted average from above the threshold down to ~0.67.

The cluster-weight policy in `nomenklatura/matching/logic_v2/names/distance.py` already has a stopword down-weight that would solve this:

```python
if len(match.qps) == 1 and len(match.rps) == 1:
    if is_stopword(match.qps[0].form):
        match.weight = 0.7
```

`al` is in `resources/text/stopwords.yml` (line 21). But the lookup only checks `qps[0]` — i.e. `xal`, the typo'd query side — so the result-side particle never marks the cluster as low-information.
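The asymmetry can be reproduced in isolation. The sketch below is not the repo's code: `STOPWORDS` and `is_stopword` are small stand-ins for the real helper and resource file, `current_weight` is a hypothetical wrapper around the one-to-one check quoted above, and the default weight of 1.0 is an assumption.

```python
# Stand-in stopword set; the real list lives in resources/text/stopwords.yml.
STOPWORDS = {"al", "de", "la", "el"}


def is_stopword(form: str) -> bool:
    """Hypothetical stand-in for the repo's is_stopword helper."""
    return form.lower() in STOPWORDS


def current_weight(query_form: str, result_form: str) -> float:
    """Mirror the existing one-to-one check, which inspects only the query side.

    The 1.0 default is an assumption made for this illustration.
    """
    if is_stopword(query_form):
        return 0.7
    return 1.0


# Typo on the query side: the particle on the result side goes unnoticed.
print(current_weight("xal", "al"))  # 1.0
# Same pair with the sides swapped: the down-weight applies.
print(current_weight("al", "xal"))  # 0.7
```

The same pair of tokens gets a different weight depending on which side carries the typo, which is the behaviour this issue proposes to remove.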

## Proposed fix

**Two small changes:**

1. **`nomenklatura/matching/logic_v2/names/distance.py`** — make the stopword check fire when either side is a known stopword:
   ```python
   if is_stopword(match.qps[0].form) or is_stopword(match.rps[0].form):
       match.weight = 0.7
   ```
   Rationale: a cluster is low-information if the *intended* token (either side) is a connective particle. The matcher shouldn't require that the noisy side also happens to land on a known particle.

2. **`resources/text/stopwords.yml`** — audit particle coverage. Already present: `al`, `de`, `la`, `el`, `del`, `du`, `le`, `da`, `der`, `von`. Worth adding for Arabic/Persian/Hebrew/Romance/Germanic/Celtic naming particles commonly seen in screening data: `bin`, `ibn`, `ben`, `abu`, `abd`, `dos`, `mac`, `mc`, `van`. (Curated, not exhaustive — precision over quantity per repo conventions.)

The existing 0.7 weight is already calibrated; this just ensures it triggers in the symmetrical case.
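To see why a 0.7 weight is enough to move the Bashar case across the threshold, here is a rough back-of-the-envelope sketch. It assumes three clusters (`Bashar`, `xal`/`al`, `Assad`) with scores 1.0, 0.0 and 1.0 and a plain weighted average; the real `weighted_edit_similarity` may weigh clusters differently, so treat the numbers as illustrative only.

```python
def weighted_average(scores, weights):
    """Plain weighted mean; an assumed simplification of the real scoring."""
    total_weight = sum(weights)
    if total_weight == 0:
        return 0.0
    return sum(s * w for s, w in zip(scores, weights)) / total_weight


# Assumed per-cluster scores: Bashar, xal/al, Assad.
scores = [1.0, 0.0, 1.0]

# Today the particle cluster keeps full weight because only qps[0] is checked.
before = weighted_average(scores, [1.0, 1.0, 1.0])
# With the either-side check, the particle cluster is down-weighted to 0.7.
after = weighted_average(scores, [1.0, 0.7, 1.0])

print(round(before, 3), round(after, 3))  # 0.667 0.741
```

Under these assumptions the score moves from roughly 0.67 to roughly 0.74, consistent with the observed current score and with the expectation that the case clears 0.7 after the change.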

## Out of scope

- New `is_name_stopword` / `NAME_PARTICLES` resource — the existing `is_stopword` + `resources/text/stopwords.yml` already model the right concept.
- Particle insert/delete across the full alignment (e.g. `Mohammed bin Salman` vs `Mohammed Salman`) — that's symbol pairing's job, not residue distance.
- Tokenizer changes (e.g. collapsing hyphens) — too broad, would affect unrelated cases.

## Validation

The harness in `contrib/name_comparison/` exercises this end-to-end:
- The Bashar case (`Bashar xal-Assad` vs `Bashar al-Assad`) should rise above 0.7.
- Run the full suite to confirm no regressions against the labelled outcomes in `cases.csv` (especially STRONG-tier negatives that share particles).
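
One way to frame the regression check is to list every case whose above/below-threshold decision changes between the old and new scorer. The sketch below is not the harness's own runner: `load_cases`, `decision_flips` and the toy scorers are hypothetical, the column names `name1`, `name2` and `label` are taken from the table above, and treating a score `>= 0.7` as an alert is an assumption.

```python
import csv
import io
from typing import Callable, Dict, List

Scorer = Callable[[str, str], float]


def load_cases(handle) -> List[Dict[str, str]]:
    """Read harness-style rows; assumes name1, name2 and label columns."""
    return list(csv.DictReader(handle))


def decision_flips(cases, before: Scorer, after: Scorer, threshold: float = 0.7):
    """Return cases whose thresholded decision differs between two scorers."""
    flips = []
    for case in cases:
        old = before(case["name1"], case["name2"]) >= threshold
        new = after(case["name1"], case["name2"]) >= threshold
        if old != new:
            expected = case["label"].strip().lower() == "true"
            # "fixed" is True when the new decision agrees with the label.
            flips.append({**case, "fixed": new == expected})
    return flips


# Toy data and toy scorers, for illustration only.
sample = io.StringIO(
    "name1,name2,label\n"
    "Bashar xal-Assad,Bashar al-Assad,true\n"
    "John Smith,Jane Doe,false\n"
)
cases = load_cases(sample)
toy_before = lambda a, b: 0.67 if "xal" in a else 0.1
toy_after = lambda a, b: 0.74 if "xal" in a else 0.1

flips = decision_flips(cases, toy_before, toy_after)
for flip in flips:
    print(flip["name1"], "|", flip["name2"], "| fixed:", flip["fixed"])
```

Any flip with `fixed` set to `False` would be a regression worth inspecting, in particular negatives that share a particle and now score above the threshold.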
