Eliminate double-reads of parquet files in load_samples and remove_empty_parquets #3

Merged: dprim7 merged 2 commits into hh4b-refactor from copilot/improve-file-loading-efficiency on Feb 20, 2026
Copilot AI commented Feb 20, 2026

In load_samples, each parquet file was read twice: once with all columns and rows just to check whether it was empty, then again with the actual filters and columns. Similarly, remove_empty_parquets loaded the entire file just to check its row count.

Changes

  • load_samples: Read each parquet file once with the target filters and columns, then check if the result is empty:

    # Before: two reads per file (a full read, then a filtered one)
    if not pd.read_parquet(parquet_file).empty:
        df_sample = pd.read_parquet(parquet_file, filters=filters, columns=load_columns)
    
    # After: one read
    df_sample = pd.read_parquet(parquet_file, filters=filters, columns=load_columns)
    if not df_sample.empty:
        ...
  • Broken except block in load_samples: The except block was an exact copy of the try block (with a trailing continue), so on failure it would retry the identical operation before skipping anyway. Simplified to warn + continue.

  • remove_empty_parquets: Use columns=[] to read only row-count metadata instead of loading the full file:

    # Before: loads every column just to count rows
    if not len(pd.read_parquet(file_path)):
        ...
    # After: requests no columns
    if not len(pd.read_parquet(file_path, columns=[])):
        ...
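
The "read once, then warn + continue" shape described in the first two bullets could look roughly like the sketch below. This is not the code from the PR: the function name `load_nonempty_frames` and the injected `read_one` callable are hypothetical, introduced only so the loop can be shown (and exercised) without real files. Catching a broad `Exception` is also an assumption, since the description does not say which exception the original handler catches.

```python
import warnings
from typing import Any, Callable, Iterable, List


def load_nonempty_frames(
    parquet_files: Iterable[str],
    read_one: Callable[[str], Any],
) -> List[Any]:
    """Read each file once; skip files that fail to load or come back empty.

    `read_one` is a hypothetical stand-in for a call such as
    pd.read_parquet(path, filters=filters, columns=load_columns).
    """
    frames = []
    for parquet_file in parquet_files:
        try:
            # Single read, with the real filters/columns already applied.
            df_sample = read_one(parquet_file)
        except Exception as exc:  # assumption: the real code may catch something narrower
            # Warn and move on; retrying the identical call would fail the same way.
            warnings.warn(f"Could not read {parquet_file}: {exc}")
            continue
        # The emptiness check reuses the frame that was just loaded.
        if not df_sample.empty:
            frames.append(df_sample)
    return frames


# Possible usage with pandas (filters and load_columns as in load_samples):
#   import pandas as pd
#   from functools import partial
#   reader = partial(pd.read_parquet, filters=filters, columns=load_columns)
#   frames = load_nonempty_frames(parquet_paths, reader)
```

Passing the reader in as a callable is purely a convenience of this sketch; in the PR the `pd.read_parquet` call sits directly inside the loop.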

…oad_samples and remove_empty_parquets

Co-authored-by: dprim7 <38478631+dprim7@users.noreply.github.com>
Copilot AI changed the title from "[WIP] Refactor file loading for improved efficiency" to "Eliminate double-reads of parquet files in load_samples and remove_empty_parquets" Feb 20, 2026
Copilot AI requested a review from dprim7 February 20, 2026 19:59
Owner

dprim7 left a comment

@copilot change target branch to hh4b-refactor

Author

Copilot AI commented Feb 20, 2026

@copilot change target branch to hh4b-refactor

I'm unable to change a PR's base branch programmatically with my available tools. You can update the base branch to hh4b-refactor directly from the GitHub PR page — click "Edit" next to the PR title, or use the base branch dropdown near the merge button.

Copilot AI requested a review from dprim7 February 20, 2026 21:58
dprim7 changed the base branch from main to hh4b-refactor February 20, 2026 21:59
dprim7 marked this pull request as ready for review February 20, 2026 22:00
dprim7 merged commit f115b91 into hh4b-refactor Feb 20, 2026
8 checks passed