Summary
This started from a report by Maggie Zhou (PhD student, University College London) that, in an intercept-only PMM model (no predictors, so every missing case shares the same predicted value), mice's default PMM implementation (matchindex(), introduced in #236) repeatedly draws imputations from the same small donor pool within one completed dataset, unlike e.g. Stata's PMM. That observation is correct and reproducible (see reprex 1 below). While investigating it, I found that the situation is more nuanced than "the new fast matcher broke it": both the current default matchindex() and the legacy matcher() (use.matcher = TRUE) are miscalibrated when predicted values are tied — but in opposite directions, and for different underlying reasons.
matchindex() (current default): unbiased point estimates, but SE is roughly 2-3x too large (over-conservative, CI coverage ~100% instead of 95%).
matcher() (legacy, deprecated): severely biased point estimates and SE far too small (CI coverage ~0% instead of 95%) — this looks like an independent, more serious bug.
Reprex 1 — donor pool reuse within one dataset (matchindex())
library(mice)
packageVersion("mice")
#> [1] '3.19.8'
set.seed(1)
n <- 200
y <- rnorm(n)
y[1:100] <- NA # 50% missing
d <- data.frame(y = y, dummy = 1) # dummy col: mice needs >= 2 columns
pred <- make.predictorMatrix(d)
pred["y", "dummy"] <- 0 # no predictors used for y -> intercept-only
imp <- mice(d, method = "pmm", predictorMatrix = pred,
m = 1, maxit = 1, donors = 5, print = FALSE)
imputed_vals <- complete(imp)$y[is.na(y)]
length(unique(imputed_vals))
#> [1] 5
All 100 missing values in this single completed dataset are filled from just 5 distinct donor values, because matchindex() shuffles the donor order once per call (see src/matchindex.cpp, step 1, before the loop over target cases), so every target case with the same predicted value resolves to the same tie-broken neighbourhood.
Reprex 2 — calibration check against known truth
Data and missingness are fixed once (MCAR, random 50% of 200 draws from rnorm); only imputation randomness is repeated over 400 mice() calls, m = 20, donors = 5, intercept-only model as above.
true mu (full data): 0.0236
donor-only mean: 0.0326
matchindex() [default]:
mean estimate: 0.0305 (close to true mu — unbiased)
SD of estimate across reruns: 0.0541
mean reported SE: 0.267 (~5x larger than the SD!)
95% CI coverage of true mu: 1.000 (nominal: 0.95)
matcher() [legacy, use.matcher = TRUE]:
mean estimate: 0.246 (true mu = 0.0236 — badly biased)
SD of estimate across reruns: 0.006
mean reported SE: 0.0711
95% CI coverage of true mu: 0.000 (nominal: 0.95)
Full script:
library(mice)
set.seed(2024)
n <- 200
y_full <- rnorm(n)
mu_true <- mean(y_full)
mis_idx <- sample(n, 100) # MCAR
y <- y_full
y[mis_idx] <- NA
d <- data.frame(y = y, dummy = 1)
pred <- make.predictorMatrix(d)
pred["y", "dummy"] <- 0
run_mice <- function(use.matcher) {
imp <- mice(d, method = "pmm", predictorMatrix = pred,
m = 20, maxit = 1, donors = 5, print = FALSE,
use.matcher = use.matcher)
s <- summary(pool(with(imp, lm(y ~ 1))))
c(estimate = s$estimate, std.error = s$std.error)
}
R <- 400
res_new <- t(replicate(R, run_mice(FALSE)))
res_old <- t(replicate(R, run_mice(TRUE)))
report <- function(name, res) {
cover <- mean(res[,"estimate"] - 1.96*res[,"std.error"] < mu_true &
mu_true < res[,"estimate"] + 1.96*res[,"std.error"])
cat(name, "\n")
cat(" mean estimate:", round(mean(res[,"estimate"]), 4),
" (true mu =", round(mu_true, 4), ")\n")
cat(" SD of estimate across reruns:", round(sd(res[,"estimate"]), 4), "\n")
cat(" mean reported SE: ", round(mean(res[,"std.error"]), 4), "\n")
cat(" 95% CI coverage of true mu: ", round(cover, 3), "\n\n")
}
report("matchindex() [default]", res_new)
report("matcher() [legacy]", res_old)
Root causes
matchindex(): ties broken once per call, not once per target
In src/matchindex.cpp:
// 1. Shuffle records to remove effects of ties
IntegerVector ishuf = sample(n1);
...
// then loop over ALL n0 target cases using this single shuffled/sorted order
for (int i = 0; i < n0; i++) { ... }
The shuffle happens once, outside the loop over target cases, so every target with the same (or a nearby) predicted value ends up choosing from the same fixed window of donors within one call to matchindex() (i.e. within one imputed dataset). Because the window itself still varies across calls (different m), this inflates between-imputation variance b relative to what independent per-target draws would give, which is why the reported SE ends up too large rather than too small. Point estimates stay unbiased because the fixed window is still a (pre-shuffled) random subset of donors.
matcher(): tie-breaking noise collapses to exactly zero when donors are tied
In src/match.cpp:
NumericVector mm = range(obs);
double small = (mm[1] - mm[0]) / 65536;
...
for(int i = 0; i < n0; i++) {
d2 = runif(n1, 0, small); // noise scaled by the RANGE of obs
...
}
When all (or many) donors share the same predicted value, range(obs) is 0 (or very small), so small becomes exactly 0 and runif(n1, 0, 0) returns all zeros — the tie-breaking noise vanishes completely. Without noise, std::nth_element plus the subsequent index walk deterministically resolves to the first k donors in their original array order, for every target and every call. This can be seen directly:
set.seed(1)
idx <- mice:::matcher(rep(0, 100), rep(0, 20), k = 5)
range(idx)
#> [1] 1 5
Since the array order of obs (the donors) is arbitrary with respect to the missingness mechanism, always drawing from positions 1-5 introduces a systematic selection bias whenever there's a meaningful difference between "the first few donors in the data" and "a representative sample of donors" — exactly what we see in reprex 2.
Why this matters
matchindex() has been the default PMM engine since mice 3.11.8 (2020), so any analysis using method = "pmm" with tied or near-tied predicted values (no predictors, few categorical predictors, heavily rounded/discretized outcomes, small sample per covariate pattern, etc.) may have over-conservatively wide confidence intervals from Rubin's Rules.
matcher() is deprecated but still reachable via use.matcher = TRUE, and its bias/under-coverage in the same setting is far more severe and dangerous (silently wrong point estimates with falsely narrow CIs).
Environment
R version 4.5.2 (2025-10-31), mice 3.19.8, macOS.
Summary
This started from a report by Maggie Zhou (PhD student, University College London) that, in an intercept-only PMM model (no predictors, so every missing case shares the same predicted value),
mice's default PMM implementation (matchindex(), introduced in #236) repeatedly draws imputations from the same small donor pool within one completed dataset, unlike e.g. Stata's PMM. That observation is correct and reproducible (see reprex 1 below). While investigating it, I found that the situation is more nuanced than "the new fast matcher broke it": both the current defaultmatchindex()and the legacymatcher()(use.matcher = TRUE) are miscalibrated when predicted values are tied — but in opposite directions, and for different underlying reasons.matchindex()(current default): unbiased point estimates, but SE is roughly 2-3x too large (over-conservative, CI coverage ~100% instead of 95%).matcher()(legacy, deprecated): severely biased point estimates and SE far too small (CI coverage ~0% instead of 95%) — this looks like an independent, more serious bug.Reprex 1 — donor pool reuse within one dataset (
matchindex())All 100 missing values in this single completed dataset are filled from just 5 distinct donor values, because
matchindex()shuffles the donor order once per call (seesrc/matchindex.cpp, step 1, before the loop over target cases), so every target case with the same predicted value resolves to the same tie-broken neighbourhood.Reprex 2 — calibration check against known truth
Data and missingness are fixed once (MCAR, random 50% of 200 draws from
rnorm); only imputation randomness is repeated over 400mice()calls,m = 20,donors = 5, intercept-only model as above.Full script:
Root causes
matchindex(): ties broken once per call, not once per targetIn
src/matchindex.cpp:The shuffle happens once, outside the loop over target cases, so every target with the same (or a nearby) predicted value ends up choosing from the same fixed window of donors within one call to
matchindex()(i.e. within one imputed dataset). Because the window itself still varies across calls (differentm), this inflates between-imputation variancebrelative to what independent per-target draws would give, which is why the reported SE ends up too large rather than too small. Point estimates stay unbiased because the fixed window is still a (pre-shuffled) random subset of donors.matcher(): tie-breaking noise collapses to exactly zero when donors are tiedIn
src/match.cpp:When all (or many) donors share the same predicted value,
range(obs)is 0 (or very small), sosmallbecomes exactly 0 andrunif(n1, 0, 0)returns all zeros — the tie-breaking noise vanishes completely. Without noise,std::nth_elementplus the subsequent index walk deterministically resolves to the firstkdonors in their original array order, for every target and every call. This can be seen directly:Since the array order of
obs(the donors) is arbitrary with respect to the missingness mechanism, always drawing from positions 1-5 introduces a systematic selection bias whenever there's a meaningful difference between "the first few donors in the data" and "a representative sample of donors" — exactly what we see in reprex 2.Why this matters
matchindex()has been the default PMM engine since mice 3.11.8 (2020), so any analysis usingmethod = "pmm"with tied or near-tied predicted values (no predictors, few categorical predictors, heavily rounded/discretized outcomes, small sample per covariate pattern, etc.) may have over-conservatively wide confidence intervals from Rubin's Rules.matcher()is deprecated but still reachable viause.matcher = TRUE, and its bias/under-coverage in the same setting is far more severe and dangerous (silently wrong point estimates with falsely narrow CIs).Environment
R version 4.5.2 (2025-10-31), mice 3.19.8, macOS.