Fix preprocessing batch metadata leakage across exact-multiple epoch boundaries #10
Open
ibro45 wants to merge 1 commit into Zyphra:main from
Problem Observed
I was running ZUNA on about 200 scans. When it completed, 2 were missing from the output, which led me to investigate the cause.
Summary
Fix a preprocessing cache bug that could write the wrong `metadata["original_filename"]` into output `.pt` files when a source recording ended on an exact multiple of `epochs_per_file` (64). Because `pt_to_fif()` groups inference outputs by `metadata["original_filename"]`, this could cause chunks from different recordings to be reconstructed together and saved under the wrong filename, overwriting other outputs.
Root cause
The preprocessing batch cache in `src/zuna/preprocessing/batch.py` keeps per-file state in the global `_epoch_cache`, including:
- `data_list`
- `positions_list`
- `metadata`
- `channel_names`
- `file_counter`
- `pt_file_counter`
When a file produced an exact multiple of 64 epochs:
1. `_save_pt_from_cache()` saved the full 64-epoch PT file(s) and drained `data_list`.
2. `_flush_remaining_cache()` then saw `len(_epoch_cache["data_list"]) == 0` and returned early.
3. That early return did not reset `metadata` or `channel_names`.
4. The next file then went through `_add_epochs_to_cache()`, which only replaced metadata when `_epoch_cache["metadata"] is None`.
5. The next file's PT outputs therefore carried the previous file's `original_filename`.
This corruption originated in preprocessing output itself, and later stages preserved the bad metadata.
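The sequence above can be reproduced with a small toy model. This is a sketch under stated assumptions, not the actual code in `src/zuna/preprocessing/batch.py`: the `cache` dict, `add_epochs()`, `flush_remaining_buggy()`, and `process()` are hypothetical stand-ins that only mimic the behavior described in this PR.

```python
EPOCHS_PER_FILE = 64  # same value as epochs_per_file in the description

# Toy stand-in for the global cache (assumption: the real cache has more fields).
cache = {"data_list": [], "metadata": None}
written = []  # one (original_filename, n_epochs) entry per simulated PT file


def add_epochs(epochs, metadata):
    # Mimics the described behavior: metadata is only set when it is None.
    if cache["metadata"] is None:
        cache["metadata"] = metadata
    cache["data_list"].extend(epochs)
    # Save every full chunk, draining data_list as we go.
    while len(cache["data_list"]) >= EPOCHS_PER_FILE:
        del cache["data_list"][:EPOCHS_PER_FILE]
        written.append((cache["metadata"]["original_filename"], EPOCHS_PER_FILE))


def flush_remaining_buggy():
    # Early return when nothing is buffered: metadata is never reset.
    if len(cache["data_list"]) == 0:
        return
    written.append((cache["metadata"]["original_filename"], len(cache["data_list"])))
    cache["data_list"].clear()
    cache["metadata"] = None


def process(filename, n_epochs):
    add_epochs(list(range(n_epochs)), {"original_filename": filename})
    flush_remaining_buggy()


process("a.fif", 64)  # exact multiple of 64, so the flush returns early
process("b.fif", 10)  # inherits the stale metadata from a.fif
print(written)  # [('a.fif', 64), ('a.fif', 10)]
```

The second entry holds epochs from `b.fif` but is labeled `a.fif`, which is the mislabeling this PR addresses.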
User impact
This is a data integrity bug, not a cosmetic metadata issue.
Wrong `original_filename` values in PT metadata cause `pt_to_fif()` to:
- reconstruct chunks from different recordings together,
- save them under the wrong filename,
- overwrite other outputs.
In practice, that means preprocessing can silently produce PT files whose embedded source identity does not match their actual content, and reconstruction can then emit fewer FIF files than the number of distinct inputs.
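To illustrate why the output count drops, here is a minimal sketch of grouping by stored filename. It does not call `pt_to_fif()`; the PT file names and the `pt_records` list are hypothetical, and the grouping is only an assumed approximation of what the description says `pt_to_fif()` does.

```python
from collections import defaultdict

# Hypothetical (pt_file, stored original_filename) pairs after the bug.
pt_records = [
    ("pt_0000.pt", "a.fif"),
    ("pt_0001.pt", "a.fif"),  # content came from b.fif but carries a.fif's name
    ("pt_0002.pt", "c.fif"),
]
true_sources = {"a.fif", "b.fif", "c.fif"}

# Group PT files by the filename stored in their metadata.
groups = defaultdict(list)
for pt_file, original_filename in pt_records:
    groups[original_filename].append(pt_file)

# Any source that never appears as a group key yields no reconstructed output.
missing = sorted(true_sources - set(groups))
print(len(groups), missing)  # 2 ['b.fif']
```

Three inputs produce only two groups, which matches the symptom of a few scans going missing from a larger run.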
Fix
This change keeps the fix in ZUNA preprocessing and preserves existing behavior except for the stale cache lifecycle.
Changes:
- Update `_flush_remaining_cache()` to fully clear cache state even when there is no remainder to save.
- Harden `_add_epochs_to_cache()`:
  - raise `RuntimeError` if a new `file_counter` arrives while buffered epochs from the previous file still exist,
  - set `file_counter` to `None` so cache ownership is explicit.
This applies to both `n_jobs=1` and `n_jobs>1`, since the affected batching helpers are used in either execution mode.
Tests
Added focused regression tests in `tests/test_preprocessing_batch.py`:
- `test_exact_multiple_boundary_uses_next_file_metadata`
- `test_multiple_exact_multiple_files_do_not_cascade_stale_metadata`
- `test_switching_files_with_buffered_epochs_raises`
- `test_pt_to_fif_groups_outputs_by_original_filename_after_exact_boundary`
The last test covers `pt_to_fif()` grouping by `original_filename`.
Risk / rollout notes
Risk is low and localized to preprocessing batch cache management.
Behavioral change:
- Preprocessing now raises `RuntimeError` for impossible cross-file cache transitions that previously would have silently reused stale metadata and produced corrupted outputs.
The change does not alter PT naming, batch sizing, or reconstruction logic beyond ensuring that the metadata attached to each PT file belongs to the correct source recording.
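As a rough illustration of the intended lifecycle after this change, the toy model below always resets cache state on flush and refuses to switch files while epochs are still buffered. `ToyEpochCache` and its methods are hypothetical names for this sketch only; the real helpers in ZUNA may be structured differently.

```python
EPOCHS_PER_FILE = 64


class ToyEpochCache:
    """Hypothetical stand-in for the cache lifecycle described in this PR."""

    def __init__(self):
        self.data_list = []
        self.metadata = None
        self.file_counter = None  # None means no file currently owns the cache
        self.written = []

    def add_epochs(self, epochs, metadata, file_counter):
        owner_changed = self.file_counter is not None and self.file_counter != file_counter
        if owner_changed and self.data_list:
            # Guard: never mix buffered epochs from one file with another file.
            raise RuntimeError("buffered epochs from a previous file are still cached")
        if self.metadata is None or owner_changed:
            self.metadata = metadata
        self.file_counter = file_counter
        self.data_list.extend(epochs)
        while len(self.data_list) >= EPOCHS_PER_FILE:
            del self.data_list[:EPOCHS_PER_FILE]
            self.written.append((self.metadata["original_filename"], EPOCHS_PER_FILE))

    def flush_remaining(self):
        if self.data_list:
            self.written.append((self.metadata["original_filename"], len(self.data_list)))
        # Always reset, even when there was no remainder to save.
        self.data_list = []
        self.metadata = None
        self.file_counter = None


cache = ToyEpochCache()
cache.add_epochs(list(range(64)), {"original_filename": "a.fif"}, file_counter=0)
cache.flush_remaining()
cache.add_epochs(list(range(10)), {"original_filename": "b.fif"}, file_counter=1)
cache.flush_remaining()
print(cache.written)  # [('a.fif', 64), ('b.fif', 10)]

guard = ToyEpochCache()
guard.add_epochs(list(range(5)), {"original_filename": "a.fif"}, file_counter=0)
try:
    guard.add_epochs(list(range(5)), {"original_filename": "b.fif"}, file_counter=1)
    raised = False
except RuntimeError:
    raised = True
print(raised)  # True
```

Raising instead of silently adopting the new file's metadata trades a hard failure for the guarantee that no PT file is written with mixed ownership.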