Fix preprocessing batch metadata leakage across exact-multiple epoch boundaries #10

Open
ibro45 wants to merge 1 commit into Zyphra:main from ibro45:preprocessing-cache-metadata-leak
ibro45 commented Apr 13, 2026

Problem Observed

I was running ZUNA on about 200 scans. When the run completed, 2 outputs were missing, which led me to investigate the cause.

Summary

Fix a preprocessing cache bug that could write the wrong metadata["original_filename"] into output .pt files when a source recording ended on an exact multiple of epochs_per_file (64). Because pt_to_fif() groups inference outputs by metadata["original_filename"], this could cause chunks from different recordings to be reconstructed together and saved under the wrong filename, overwriting other outputs.

Root cause

The preprocessing batch cache in src/zuna/preprocessing/batch.py keeps per-file state in the global _epoch_cache, including:

  • data_list
  • positions_list
  • metadata
  • channel_names
  • file_counter
  • pt_file_counter

When a file produced an exact multiple of 64 epochs:

  1. _save_pt_from_cache() saved the full 64-epoch PT file(s) and drained data_list.
  2. _flush_remaining_cache() then saw len(_epoch_cache["data_list"]) == 0 and returned early.
  3. That early return did not clear stale per-file metadata or channel_names.
  4. The next source file entered _add_epochs_to_cache(), which only replaced metadata when _epoch_cache["metadata"] is None.
  5. As a result, PT files for the new source file could be saved with the previous file’s metadata, especially the previous original_filename.

This corruption originated in preprocessing output itself, and later stages preserved the bad metadata.
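
The sequence above can be reproduced with a toy model of the cache. This is a minimal sketch, not the code in src/zuna/preprocessing/batch.py: the function names, signatures, and the reduced two-field cache are assumptions made for illustration, and it assumes the non-early-return path of the flush does reset the metadata.

```python
EPOCHS_PER_FILE = 64

# Toy stand-in for the global cache; the real cache holds more fields.
cache = {"data_list": [], "metadata": None}
saved = []  # (original_filename, n_epochs) for each simulated PT file


def add_epochs(epochs, metadata):
    # Buggy behaviour: metadata is only stored when the slot is empty.
    if cache["metadata"] is None:
        cache["metadata"] = metadata
    cache["data_list"].extend(epochs)


def save_full_chunks():
    # Write every complete 64-epoch chunk and drain it from the buffer.
    while len(cache["data_list"]) >= EPOCHS_PER_FILE:
        del cache["data_list"][:EPOCHS_PER_FILE]
        saved.append((cache["metadata"]["original_filename"], EPOCHS_PER_FILE))


def flush_remaining_buggy():
    # The early return leaves cache["metadata"] pointing at the finished file.
    if len(cache["data_list"]) == 0:
        return
    remainder = len(cache["data_list"])
    saved.append((cache["metadata"]["original_filename"], remainder))
    cache["data_list"].clear()
    cache["metadata"] = None  # assumed: only this path resets metadata


def process(name, n_epochs):
    add_epochs([0.0] * n_epochs, {"original_filename": name})
    save_full_chunks()
    flush_remaining_buggy()


process("A.fif", 64)  # exact multiple of 64: nothing left to flush
process("B.fif", 70)  # one full chunk plus a 6-epoch remainder
print(saved)
```

In this toy model all three simulated PT files are recorded under "A.fif", including the two that hold file B's epochs.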

User impact

This is a data integrity bug, not a cosmetic metadata issue.

Wrong original_filename values in PT metadata cause pt_to_fif() to:

  • group chunks under the wrong source recording,
  • reconstruct files under the wrong output name,
  • and overwrite other reconstructed outputs when multiple groups collapse to the same filename.

In practice, that means preprocessing can silently produce PT files whose embedded source identity does not match their actual content, and reconstruction can then emit fewer FIF files than the number of distinct inputs.
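
To illustrate why the output count drops, here is a small sketch of grouping by metadata["original_filename"]. It does not call pt_to_fif(); the grouping helper and the PT file names are hypothetical and only mimic the grouping key described above.

```python
from collections import defaultdict


def group_by_original_filename(pt_files):
    # Hypothetical helper: collect PT file names under their metadata key.
    groups = defaultdict(list)
    for pt_name, metadata in pt_files:
        groups[metadata["original_filename"]].append(pt_name)
    return dict(groups)


# Three recordings: A has 64 epochs, B has 70, C has 10.
correct = [
    ("a_0.pt", {"original_filename": "A.fif"}),
    ("b_0.pt", {"original_filename": "B.fif"}),
    ("b_1.pt", {"original_filename": "B.fif"}),
    ("c_0.pt", {"original_filename": "C.fif"}),
]

# Same PT files, but B's chunks carry A's stale metadata.
leaked = [
    ("a_0.pt", {"original_filename": "A.fif"}),
    ("b_0.pt", {"original_filename": "A.fif"}),
    ("b_1.pt", {"original_filename": "A.fif"}),
    ("c_0.pt", {"original_filename": "C.fif"}),
]

correct_groups = group_by_original_filename(correct)
leaked_groups = group_by_original_filename(leaked)
print(len(correct_groups), len(leaked_groups))
```

With correct metadata the three inputs form three groups. With the leaked metadata they collapse into two, so one reconstructed recording never appears under its own name.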

Fix

This change keeps the fix in ZUNA preprocessing and preserves existing behavior except for the stale cache lifecycle.

Changes:

  • Added a helper to clear all per-file cache state.
  • Changed _flush_remaining_cache() to fully clear cache state even when there is no remainder to save.
    • This is the key fix for the exact-multiple-of-64 case.
  • Added defensive invariants in _add_epochs_to_cache():
    • raise if a new file_counter arrives while buffered epochs from the previous file still exist,
    • raise if cached metadata does not match the current source file.
  • Initialized file_counter to None so cache ownership is explicit.
  • Preserved PT chunk numbering per source file while ensuring metadata and channel state cannot leak across file boundaries.

This applies to both n_jobs=1 and n_jobs>1, since the affected batching helpers are used in either execution mode.
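
The shape of the fix can be sketched on a toy cache as follows. This is an illustration under assumptions, not the patch itself: the helper name clear_per_file_state, the function signatures, the error messages, and the three-field cache are all hypothetical.

```python
EPOCHS_PER_FILE = 64

# Toy cache; file_counter starts as None so ownership is explicit.
cache = {"data_list": [], "metadata": None, "file_counter": None}
saved = []  # (original_filename, n_epochs) for each simulated PT file


def clear_per_file_state():
    # Hypothetical helper: reset everything that belongs to one source file.
    cache["data_list"].clear()
    cache["metadata"] = None
    cache["file_counter"] = None


def add_epochs(epochs, metadata, file_counter):
    # Invariant 1: a new file must not arrive while old epochs are buffered.
    if cache["file_counter"] not in (None, file_counter) and cache["data_list"]:
        raise RuntimeError("buffered epochs from a previous file still exist")
    # Invariant 2: cached metadata must describe the current source file.
    cached = cache["metadata"]
    if cached is not None and cached["original_filename"] != metadata["original_filename"]:
        raise RuntimeError("cached metadata does not match the current file")
    cache["file_counter"] = file_counter
    cache["metadata"] = metadata
    cache["data_list"].extend(epochs)


def save_full_chunks():
    while len(cache["data_list"]) >= EPOCHS_PER_FILE:
        del cache["data_list"][:EPOCHS_PER_FILE]
        saved.append((cache["metadata"]["original_filename"], EPOCHS_PER_FILE))


def flush_remaining():
    # Save a remainder if there is one, then always clear per-file state.
    if cache["data_list"]:
        saved.append((cache["metadata"]["original_filename"], len(cache["data_list"])))
    clear_per_file_state()


def process(name, n_epochs, file_counter):
    add_epochs([0.0] * n_epochs, {"original_filename": name}, file_counter)
    save_full_chunks()
    flush_remaining()


process("A.fif", 64, 0)  # exact multiple: flush still clears the cache
process("B.fif", 70, 1)
print(saved)

# Switching files without flushing now fails loudly.
error = None
add_epochs([0.0] * 3, {"original_filename": "C.fif"}, 2)
try:
    add_epochs([0.0], {"original_filename": "D.fif"}, 3)
except RuntimeError as exc:
    error = str(exc)
print(error)
```

In this sketch file B's chunks are recorded under "B.fif", and the unflushed switch from C to D raises instead of silently reusing C's metadata.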

Tests

Added focused regression tests in tests/test_preprocessing_batch.py:

  • test_exact_multiple_boundary_uses_next_file_metadata

    • covers file A with exactly 64 epochs followed by file B with a remainder,
    • verifies file B PT metadata references file B, not file A.
  • test_multiple_exact_multiple_files_do_not_cascade_stale_metadata

    • covers several sequential files that each end on exact 64-epoch boundaries,
    • verifies stale metadata does not propagate across multiple file transitions.
  • test_switching_files_with_buffered_epochs_raises

    • verifies the new defensive invariant that cross-file transitions with unsaved buffered epochs fail loudly instead of silently corrupting outputs.
  • test_pt_to_fif_groups_outputs_by_original_filename_after_exact_boundary

    • generates PT files through the batching helpers and runs pt_to_fif(),
    • verifies reconstruction still produces distinct FIF outputs grouped by the correct original_filename.

Risk / rollout notes

Risk is low and localized to preprocessing batch cache management.

Behavioral change:

  • ZUNA now raises a RuntimeError for impossible cross-file cache transitions that previously would have silently reused stale metadata and produced corrupted outputs.

The change does not alter PT naming, batch sizing, or reconstruction logic beyond ensuring that the metadata attached to each PT file belongs to the correct source recording.
