Fix preprocessing batch metadata leakage across exact-multiple epoch boundaries by ibro45 · Pull Request #10 · Zyphra/zuna

ibro45 · 2026-04-13T20:02:47Z

Problem Observed

I was running ZUNA on about 200 scans. When the run finished, 2 were missing, which led me to investigate the cause.

Summary

Fix a preprocessing cache bug that could write the wrong metadata["original_filename"] into output .pt files when a source recording ended on an exact multiple of epochs_per_file (64). Because pt_to_fif() groups inference outputs by metadata["original_filename"], this could cause chunks from different recordings to be reconstructed together and saved under the wrong filename, overwriting other outputs.

Root cause

The preprocessing batch cache in src/zuna/preprocessing/batch.py keeps per-file state in the global _epoch_cache, including:

data_list
positions_list
metadata
channel_names
file_counter
pt_file_counter

When a file produced an exact multiple of 64 epochs:

_save_pt_from_cache() saved the full 64-epoch PT file(s) and drained data_list.
_flush_remaining_cache() then saw len(_epoch_cache["data_list"]) == 0 and returned early.
That early return did not clear stale per-file metadata or channel_names.
The next source file entered _add_epochs_to_cache(), which only replaced metadata when _epoch_cache["metadata"] is None.
As a result, PT files for the new source file could be saved with the previous file’s metadata, especially the previous original_filename.

This corruption originated in preprocessing output itself, and later stages preserved the bad metadata.

User impact

This is a data integrity bug, not a cosmetic metadata issue.

Wrong original_filename values in PT metadata cause pt_to_fif() to:

group chunks under the wrong source recording,
reconstruct files under the wrong output name,
and overwrite other reconstructed outputs when multiple groups collapse to the same filename.

In practice, that means preprocessing can silently produce PT files whose embedded source identity does not match their actual content, and reconstruction can then emit fewer FIF files than the number of distinct inputs.
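To make the output-count effect concrete, here is a small illustration of grouping by original_filename. It does not reproduce the real pt_to_fif() implementation; the PT file names and the true_source field are hypothetical and exist only to track which recording each chunk really came from.

```python
from collections import defaultdict

# Hypothetical PT metadata for two recordings, where the second recording
# inherited the first one's original_filename.
pt_files = [
    {"pt_name": "chunk_0.pt", "true_source": "a.fif", "original_filename": "a.fif"},
    {"pt_name": "chunk_1.pt", "true_source": "b.fif", "original_filename": "a.fif"},
    {"pt_name": "chunk_2.pt", "true_source": "b.fif", "original_filename": "a.fif"},
]

# Group chunks by the filename stored in their metadata.
groups = defaultdict(list)
for pt in pt_files:
    groups[pt["original_filename"]].append(pt["pt_name"])

n_inputs = len({pt["true_source"] for pt in pt_files})
n_outputs = len(groups)
print(f"{n_inputs} inputs, {n_outputs} reconstructed output(s)")
```

Two distinct inputs collapse into a single group, so only one reconstructed file would be written and b.fif would never appear in the output.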

Fix

This change keeps the fix in ZUNA preprocessing and preserves existing behavior except for the stale cache lifecycle.

Changes:

Added a helper to clear all per-file cache state.
Changed _flush_remaining_cache() to fully clear cache state even when there is no remainder to save.
- This is the key fix for the exact-multiple-of-64 case.
Added defensive invariants in _add_epochs_to_cache():
- raise if a new file_counter arrives while buffered epochs from the previous file still exist,
- raise if cached metadata does not match the current source file.
Initialized file_counter to None so cache ownership is explicit.
Preserved PT chunk numbering per source file while ensuring metadata and channel state cannot leak across file boundaries.

This applies to both n_jobs=1 and n_jobs>1, since the affected batching helpers are used in either execution mode.

Tests

Added focused regression tests in tests/test_preprocessing_batch.py:

test_exact_multiple_boundary_uses_next_file_metadata
- covers file A with exactly 64 epochs followed by file B with a remainder,
- verifies file B PT metadata references file B, not file A.
test_multiple_exact_multiple_files_do_not_cascade_stale_metadata
- covers several sequential files that each end on exact 64-epoch boundaries,
- verifies stale metadata does not propagate across multiple file transitions.
test_switching_files_with_buffered_epochs_raises
- verifies the new defensive invariant that cross-file transitions with unsaved buffered epochs fail loudly instead of silently corrupting outputs.
test_pt_to_fif_groups_outputs_by_original_filename_after_exact_boundary
- generates PT files through the batching helpers and runs pt_to_fif(),
- verifies reconstruction still produces distinct FIF outputs grouped by the correct original_filename.

Risk / rollout notes

Risk is low and localized to preprocessing batch cache management.

Behavioral change:

ZUNA now raises a RuntimeError on cross-file cache transitions that should never occur. Previously these transitions silently reused stale metadata and produced corrupted outputs.

The change does not alter PT naming, batch sizing, or reconstruction logic beyond ensuring that the metadata attached to each PT file belongs to the correct source recording.

Fix preprocessing batch metadata leakage

1e9bbc2

ibro45 wants to merge 1 commit into Zyphra:main from ibro45:preprocessing-cache-metadata-leak
