Skip to content

PMM donor selection is miscalibrated when the predicted metric is (nearly) tied across cases, in both matchindex() and matcher() #757

Description

@stefvanbuuren

Summary

This started from a report by Maggie Zhou (PhD student, University College London) that, in an intercept-only PMM model (no predictors, so every missing case shares the same predicted value), mice's default PMM implementation (matchindex(), introduced in #236) repeatedly draws imputations from the same small donor pool within one completed dataset, unlike e.g. Stata's PMM. That observation is correct and reproducible (see reprex 1 below). While investigating it, I found that the situation is more nuanced than "the new fast matcher broke it": both the current default matchindex() and the legacy matcher() (use.matcher = TRUE) are miscalibrated when predicted values are tied — but in opposite directions, and for different underlying reasons.

  • matchindex() (current default): unbiased point estimates, but SE is roughly 2-3x too large (over-conservative, CI coverage ~100% instead of 95%).
  • matcher() (legacy, deprecated): severely biased point estimates and SE far too small (CI coverage ~0% instead of 95%) — this looks like an independent, more serious bug.

Reprex 1 — donor pool reuse within one dataset (matchindex())

library(mice)
packageVersion("mice")
#> [1] '3.19.8'

set.seed(1)
n <- 200
y <- rnorm(n)
y[1:100] <- NA  # 50% missing

d <- data.frame(y = y, dummy = 1)         # dummy col: mice needs >= 2 columns
pred <- make.predictorMatrix(d)
pred["y", "dummy"] <- 0                   # no predictors used for y -> intercept-only

imp <- mice(d, method = "pmm", predictorMatrix = pred,
            m = 1, maxit = 1, donors = 5, print = FALSE)
imputed_vals <- complete(imp)$y[is.na(y)]
length(unique(imputed_vals))
#> [1] 5

All 100 missing values in this single completed dataset are filled from just 5 distinct donor values, because matchindex() shuffles the donor order once per call (see src/matchindex.cpp, step 1, before the loop over target cases), so every target case with the same predicted value resolves to the same tie-broken neighbourhood.

Reprex 2 — calibration check against known truth

Data and missingness are fixed once (MCAR, random 50% of 200 draws from rnorm); only imputation randomness is repeated over 400 mice() calls, m = 20, donors = 5, intercept-only model as above.

true mu (full data):    0.0236
donor-only mean:        0.0326

matchindex() [default]:
  mean estimate:                      0.0305   (close to true muunbiased)
  SD of estimate across reruns:       0.0541
  mean reported SE:                   0.267    (~5x larger than the SD!)
  95% CI coverage of true mu:         1.000    (nominal: 0.95)

matcher() [legacy, use.matcher = TRUE]:
  mean estimate:                      0.246    (true mu = 0.0236badly biased)
  SD of estimate across reruns:       0.006
  mean reported SE:                   0.0711
  95% CI coverage of true mu:         0.000    (nominal: 0.95)

Full script:

library(mice)

set.seed(2024)
n <- 200
y_full <- rnorm(n)
mu_true <- mean(y_full)
mis_idx <- sample(n, 100)  # MCAR
y <- y_full
y[mis_idx] <- NA

d <- data.frame(y = y, dummy = 1)
pred <- make.predictorMatrix(d)
pred["y", "dummy"] <- 0

run_mice <- function(use.matcher) {
  imp <- mice(d, method = "pmm", predictorMatrix = pred,
              m = 20, maxit = 1, donors = 5, print = FALSE,
              use.matcher = use.matcher)
  s <- summary(pool(with(imp, lm(y ~ 1))))
  c(estimate = s$estimate, std.error = s$std.error)
}

R <- 400
res_new <- t(replicate(R, run_mice(FALSE)))
res_old <- t(replicate(R, run_mice(TRUE)))

report <- function(name, res) {
  cover <- mean(res[,"estimate"] - 1.96*res[,"std.error"] < mu_true &
                mu_true < res[,"estimate"] + 1.96*res[,"std.error"])
  cat(name, "\n")
  cat("  mean estimate:", round(mean(res[,"estimate"]), 4),
      " (true mu =", round(mu_true, 4), ")\n")
  cat("  SD of estimate across reruns:", round(sd(res[,"estimate"]), 4), "\n")
  cat("  mean reported SE:             ", round(mean(res[,"std.error"]), 4), "\n")
  cat("  95% CI coverage of true mu:   ", round(cover, 3), "\n\n")
}
report("matchindex() [default]", res_new)
report("matcher() [legacy]", res_old)

Root causes

matchindex(): ties broken once per call, not once per target

In src/matchindex.cpp:

// 1. Shuffle records to remove effects of ties
IntegerVector ishuf = sample(n1);
...
// then loop over ALL n0 target cases using this single shuffled/sorted order
for (int i = 0; i < n0; i++) { ... }

The shuffle happens once, outside the loop over target cases, so every target with the same (or a nearby) predicted value ends up choosing from the same fixed window of donors within one call to matchindex() (i.e. within one imputed dataset). Because the window itself still varies across calls (different m), this inflates between-imputation variance b relative to what independent per-target draws would give, which is why the reported SE ends up too large rather than too small. Point estimates stay unbiased because the fixed window is still a (pre-shuffled) random subset of donors.

matcher(): tie-breaking noise collapses to exactly zero when donors are tied

In src/match.cpp:

NumericVector mm = range(obs);
double small = (mm[1] - mm[0]) / 65536;
...
for(int i = 0; i < n0; i++) {
    d2 = runif(n1, 0, small);   // noise scaled by the RANGE of obs
    ...
}

When all (or many) donors share the same predicted value, range(obs) is 0 (or very small), so small becomes exactly 0 and runif(n1, 0, 0) returns all zeros — the tie-breaking noise vanishes completely. Without noise, std::nth_element plus the subsequent index walk deterministically resolves to the first k donors in their original array order, for every target and every call. This can be seen directly:

set.seed(1)
idx <- mice:::matcher(rep(0, 100), rep(0, 20), k = 5)
range(idx)
#> [1] 1 5

Since the array order of obs (the donors) is arbitrary with respect to the missingness mechanism, always drawing from positions 1-5 introduces a systematic selection bias whenever there's a meaningful difference between "the first few donors in the data" and "a representative sample of donors" — exactly what we see in reprex 2.

Why this matters

  • matchindex() has been the default PMM engine since mice 3.11.8 (2020), so any analysis using method = "pmm" with tied or near-tied predicted values (no predictors, few categorical predictors, heavily rounded/discretized outcomes, small sample per covariate pattern, etc.) may have over-conservatively wide confidence intervals from Rubin's Rules.
  • matcher() is deprecated but still reachable via use.matcher = TRUE, and its bias/under-coverage in the same setting is far more severe and dangerous (silently wrong point estimates with falsely narrow CIs).

Environment

sessionInfo()

R version 4.5.2 (2025-10-31), mice 3.19.8, macOS.

Metadata

Metadata

Assignees

No one assigned

    Labels

    Type

    No type

    Fields

    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions