PMM donor selection is miscalibrated when the predicted metric is (nearly) tied across cases, in both `matchindex()` and `matcher()`

## Summary

This started from a report by Maggie Zhou (PhD student, University College London) that, in an intercept-only PMM model (no predictors, so every missing case shares the same predicted value), `mice`'s default PMM implementation (`matchindex()`, introduced in #236) repeatedly draws imputations from the *same small donor pool* within one completed dataset, unlike e.g. Stata's PMM. That observation is correct and reproducible (see reprex 1 below). While investigating it, I found that the situation is more nuanced than "the new fast matcher broke it": both the current default `matchindex()` **and** the legacy `matcher()` (`use.matcher = TRUE`) are miscalibrated when predicted values are tied — but in opposite directions, and for different underlying reasons.

- **`matchindex()`** (current default): unbiased point estimates, but **SE is roughly 2-3x too large** (over-conservative, CI coverage ~100% instead of 95%).
- **`matcher()`** (legacy, deprecated): **severely biased** point estimates and **SE far too small** (CI coverage ~0% instead of 95%) — this looks like an independent, more serious bug.

## Reprex 1 — donor pool reuse within one dataset (`matchindex()`)

```r
library(mice)
packageVersion("mice")
#> [1] '3.19.8'

set.seed(1)
n <- 200
y <- rnorm(n)
y[1:100] <- NA  # 50% missing

d <- data.frame(y = y, dummy = 1)         # dummy col: mice needs >= 2 columns
pred <- make.predictorMatrix(d)
pred["y", "dummy"] <- 0                   # no predictors used for y -> intercept-only

imp <- mice(d, method = "pmm", predictorMatrix = pred,
            m = 1, maxit = 1, donors = 5, print = FALSE)
imputed_vals <- complete(imp)$y[is.na(y)]
length(unique(imputed_vals))
#> [1] 5
```

All 100 missing values in this single completed dataset are filled from just 5 distinct donor values, because `matchindex()` shuffles the donor order **once per call** (see `src/matchindex.cpp`, step 1, before the loop over target cases), so every target case with the same predicted value resolves to the same tie-broken neighbourhood.

## Reprex 2 — calibration check against known truth

Data and missingness are fixed once (MCAR, random 50% of 200 draws from `rnorm`); only imputation randomness is repeated over 400 `mice()` calls, `m = 20`, `donors = 5`, intercept-only model as above.

```r
true mu (full data):    0.0236
donor-only mean:        0.0326

matchindex() [default]:
  mean estimate:                      0.0305   (close to true mu — unbiased)
  SD of estimate across reruns:       0.0541
  mean reported SE:                   0.267    (~5x larger than the SD!)
  95% CI coverage of true mu:         1.000    (nominal: 0.95)

matcher() [legacy, use.matcher = TRUE]:
  mean estimate:                      0.246    (true mu = 0.0236 — badly biased)
  SD of estimate across reruns:       0.006
  mean reported SE:                   0.0711
  95% CI coverage of true mu:         0.000    (nominal: 0.95)
```

Full script:

```r
library(mice)

set.seed(2024)
n <- 200
y_full <- rnorm(n)
mu_true <- mean(y_full)
mis_idx <- sample(n, 100)  # MCAR
y <- y_full
y[mis_idx] <- NA

d <- data.frame(y = y, dummy = 1)
pred <- make.predictorMatrix(d)
pred["y", "dummy"] <- 0

run_mice <- function(use.matcher) {
  imp <- mice(d, method = "pmm", predictorMatrix = pred,
              m = 20, maxit = 1, donors = 5, print = FALSE,
              use.matcher = use.matcher)
  s <- summary(pool(with(imp, lm(y ~ 1))))
  c(estimate = s$estimate, std.error = s$std.error)
}

R <- 400
res_new <- t(replicate(R, run_mice(FALSE)))
res_old <- t(replicate(R, run_mice(TRUE)))

report <- function(name, res) {
  cover <- mean(res[,"estimate"] - 1.96*res[,"std.error"] < mu_true &
                mu_true < res[,"estimate"] + 1.96*res[,"std.error"])
  cat(name, "\n")
  cat("  mean estimate:", round(mean(res[,"estimate"]), 4),
      " (true mu =", round(mu_true, 4), ")\n")
  cat("  SD of estimate across reruns:", round(sd(res[,"estimate"]), 4), "\n")
  cat("  mean reported SE:             ", round(mean(res[,"std.error"]), 4), "\n")
  cat("  95% CI coverage of true mu:   ", round(cover, 3), "\n\n")
}
report("matchindex() [default]", res_new)
report("matcher() [legacy]", res_old)
```

## Root causes

### `matchindex()`: ties broken once per call, not once per target

In `src/matchindex.cpp`:

```cpp
// 1. Shuffle records to remove effects of ties
IntegerVector ishuf = sample(n1);
...
// then loop over ALL n0 target cases using this single shuffled/sorted order
for (int i = 0; i < n0; i++) { ... }
```

The shuffle happens once, outside the loop over target cases, so every target with the same (or a nearby) predicted value ends up choosing from the same fixed window of donors within one call to `matchindex()` (i.e. within one imputed dataset). Because the window itself still varies *across* calls (different `m`), this inflates *between*-imputation variance `b` relative to what independent per-target draws would give, which is why the reported SE ends up too large rather than too small. Point estimates stay unbiased because the fixed window is still a (pre-shuffled) random subset of donors.

### `matcher()`: tie-breaking noise collapses to exactly zero when donors are tied

In `src/match.cpp`:

```cpp
NumericVector mm = range(obs);
double small = (mm[1] - mm[0]) / 65536;
...
for(int i = 0; i < n0; i++) {
    d2 = runif(n1, 0, small);   // noise scaled by the RANGE of obs
    ...
}
```

When all (or many) donors share the same predicted value, `range(obs)` is 0 (or very small), so `small` becomes exactly 0 and `runif(n1, 0, 0)` returns all zeros — the tie-breaking noise vanishes completely. Without noise, `std::nth_element` plus the subsequent index walk deterministically resolves to the first `k` donors **in their original array order**, for every target and every call. This can be seen directly:

```r
set.seed(1)
idx <- mice:::matcher(rep(0, 100), rep(0, 20), k = 5)
range(idx)
#> [1] 1 5
```

Since the array order of `obs` (the donors) is arbitrary with respect to the missingness mechanism, always drawing from positions 1-5 introduces a systematic selection bias whenever there's a meaningful difference between "the first few donors in the data" and "a representative sample of donors" — exactly what we see in reprex 2.

## Why this matters

- `matchindex()` has been the default PMM engine since mice 3.11.8 (2020), so any analysis using `method = "pmm"` with tied or near-tied predicted values (no predictors, few categorical predictors, heavily rounded/discretized outcomes, small sample per covariate pattern, etc.) may have **over-conservatively wide** confidence intervals from Rubin's Rules.
- `matcher()` is deprecated but still reachable via `use.matcher = TRUE`, and its bias/under-coverage in the same setting is far more severe and dangerous (silently wrong point estimates with falsely narrow CIs).

## Environment

```r
sessionInfo()
```
R version 4.5.2 (2025-10-31), mice 3.19.8, macOS.


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

PMM donor selection is miscalibrated when the predicted metric is (nearly) tied across cases, in both `matchindex()` and `matcher()` #757

Summary

Reprex 1 — donor pool reuse within one dataset (`matchindex()`)

Reprex 2 — calibration check against known truth

Root causes

`matchindex()`: ties broken once per call, not once per target

`matcher()`: tie-breaking noise collapses to exactly zero when donors are tied

Why this matters

Environment

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Uh oh!

PMM donor selection is miscalibrated when the predicted metric is (nearly) tied across cases, in both matchindex() and matcher() #757

Description

Summary

Reprex 1 — donor pool reuse within one dataset (matchindex())

Reprex 2 — calibration check against known truth

Root causes

matchindex(): ties broken once per call, not once per target

matcher(): tie-breaking noise collapses to exactly zero when donors are tied

Why this matters

Environment

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions

PMM donor selection is miscalibrated when the predicted metric is (nearly) tied across cases, in both `matchindex()` and `matcher()` #757

Reprex 1 — donor pool reuse within one dataset (`matchindex()`)

`matchindex()`: ties broken once per call, not once per target

`matcher()`: tie-breaking noise collapses to exactly zero when donors are tied