Skip to content

Improve reader metadata and ingestion reporting#1754

Merged
PrzemyslawKlys merged 8 commits intomasterfrom
codex/reader-improvements
Apr 8, 2026
Merged

Improve reader metadata and ingestion reporting#1754
PrzemyslawKlys merged 8 commits intomasterfrom
codex/reader-improvements

Conversation

@PrzemyslawKlys
Copy link
Copy Markdown
Member

Summary

  • improve the reader adapters so direct CSV, JSON, XML, HTML, ZIP, and EPUB reads emit consistent source metadata, chunk hashes, token estimates, and clearer logical paths
  • fix reader edge cases around HTML/XHTML registration, CSV header normalization, JSON/XML path fidelity, archive virtual source identity, and warning provenance
  • tighten folder ingestion and reporting so warnings are deduplicated, progress events stay consistent, and skipped-file counts align with emitted results

Why

The reader projects had several small but compounding consistency gaps. Different adapters produced different metadata, archive-backed readers could collide on source identity or under-report warning provenance, and folder ingestion summaries/progress callbacks could drift from the actual emitted chunks.

Impact

These changes make downstream indexing, deduplication, and ingestion reporting more reliable without changing the public reader API shape. They also expand XHTML handling and add broader regression coverage across modular adapters and folder/document orchestration.

Validation

  • dotnet test OfficeIMO.Tests/OfficeIMO.Tests.csproj --framework net8.0 --filter "ReaderDocumentReaderTests|ReaderZipModularTests|ReaderEpubModularTests|ReaderRegistryTests|ReaderHtmlModularTests|ReaderTextModularTests|ReaderCsvModularTests|ReaderJsonModularTests|ReaderXmlModularTests"

@claude
Copy link
Copy Markdown

claude bot commented Apr 8, 2026

Claude finished @PrzemyslawKlys's task —— View job


Code Review Complete ✅

Reviewing this PR that improves reader metadata and ingestion reporting across multiple adapters.

Todo List:

  • Read repository guidelines (CLAUDE.md)
  • Review core reader changes and metadata improvements
  • Analyze path canonicalization issues identified by automated review
  • Examine test coverage additions
  • Check for security concerns and performance implications
  • Provide comprehensive feedback

Review Summary

This is a well-structured improvement to the OfficeIMO reader system that addresses consistency gaps across different adapter types. The changes enhance metadata reliability without breaking the public API.

Key Findings

Strengths

  • Comprehensive test coverage: Excellent addition of modular tests across all adapters
  • Metadata consistency: Good work standardizing source metadata, chunk hashes, and token estimates
  • Security-conscious: Proper path traversal protection in ZIP handling
  • Performance considerations: Reasonable token estimation logic ((length + 3) / 4)
  • Error handling: Robust try-catch blocks with fallback behaviors

⚠️ Critical Issues (Automated Review Confirmed)

  1. ZIP Path Canonicalization (OfficeIMO.Reader.Zip/DocumentReaderZipExtensions.cs:529)

    • Issue: Virtual paths built from raw archivePath without normalization
    • Impact: Same content from equivalent paths generates different SourceId/ChunkHash values
    • Example: ./docs/archive.zip vs /full/path/docs/archive.zip produce different hashes
    • Solution: Apply NormalizePathForVirtualPath() to archive path before building virtual paths
  2. EPUB Path Canonicalization (OfficeIMO.Reader.Epub/DocumentReaderEpubExtensions.cs:227)

    • Issue: Chapter SourceId uses raw source.Path instead of normalized path
    • Impact: Same EPUB via different filesystem paths generates inconsistent chapter IDs
    • Solution: Canonicalize source path before building virtual paths

🔍 Code Quality Observations

Good Practices:

  • Unix realpath() integration for proper canonicalization (DocumentReaderZipExtensions.cs:667)
  • Cross-platform path handling with directory separator normalization
  • Proper resource cleanup with IntPtr and finally blocks
  • Consistent error handling patterns

Minor Concerns:

  • P/Invoke usage introduces platform dependency but is well-isolated
  • Some duplication in BuildSourceId implementations across adapters (acceptable for modularity)

📊 Test Coverage Assessment

  • Strong regression coverage for edge cases (malformed files, limits, warnings)
  • Good validation of metadata consistency
  • Archive traversal security tests are comprehensive

Recommendations

  1. High Priority: Fix the path canonicalization issues in ZIP and EPUB readers to ensure stable source identity
  2. Medium Priority: Consider extracting common path normalization logic to reduce duplication
  3. Low Priority: Document the platform-specific realpath() behavior for maintainability

Security Review

No security concerns identified - the path traversal protections and input validation are appropriate.

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 23b8998f59

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

if (chunk == null) throw new ArgumentNullException(nameof(chunk));

chunk.Location.Path = virtualPath;
chunk.SourceId = BuildSourceId(virtualPath);
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Canonicalize ZIP path before hashing virtual source IDs

When ZIP chunks are emitted, SourceId is derived from virtualPath, but virtualPath is built from the raw archivePath string provided by the caller. For file-based reads this means equivalent paths (for example ./docs/archive.zip vs /workspace/OfficeIMO/docs/archive.zip) produce different SourceId/ChunkHash values for the same entry content, which breaks deduplication and stable incremental ingestion semantics that other path-based readers preserve via full-path normalization.

Useful? React with 👍 / 👎.

if (string.IsNullOrWhiteSpace(virtualPath)) return;

chunk.Location.Path = virtualPath;
chunk.SourceId = BuildSourceId(virtualPath);
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Canonicalize EPUB path before hashing chapter source IDs

Chapter chunks overwrite SourceId using BuildSourceId(virtualPath), but virtualPath is computed from source.Path, which for file reads is the raw input path rather than a normalized full path. As a result, reading the same EPUB through different equivalent filesystem paths generates different chapter SourceId/ChunkHash values, undermining stable source identity and cache/dedup behavior.

Useful? React with 👍 / 👎.

@PrzemyslawKlys PrzemyslawKlys merged commit 137a78b into master Apr 8, 2026
10 checks passed
@PrzemyslawKlys PrzemyslawKlys deleted the codex/reader-improvements branch April 8, 2026 20:40
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant