Improve reader metadata and ingestion reporting by PrzemyslawKlys · Pull Request #1754 · EvotecIT/OfficeIMO

PrzemyslawKlys · 2026-04-08T10:33:33Z

Summary

improve the reader adapters so direct CSV, JSON, XML, HTML, ZIP, and EPUB reads emit consistent source metadata, chunk hashes, token estimates, and clearer logical paths
fix reader edge cases around HTML/XHTML registration, CSV header normalization, JSON/XML path fidelity, archive virtual source identity, and warning provenance
tighten folder ingestion and reporting so warnings are deduplicated, progress events stay consistent, and skipped-file counts align with emitted results

Why

The reader projects had several small but compounding consistency gaps. Different adapters produced different metadata, archive-backed readers could collide on source identity or under-report warning provenance, and folder ingestion summaries/progress callbacks could drift from the actual emitted chunks.

Impact

These changes make downstream indexing, deduplication, and ingestion reporting more reliable without changing the public reader API shape. They also expand XHTML handling and add broader regression coverage across modular adapters and folder/document orchestration.

Validation

dotnet test OfficeIMO.Tests/OfficeIMO.Tests.csproj --framework net8.0 --filter "ReaderDocumentReaderTests|ReaderZipModularTests|ReaderEpubModularTests|ReaderRegistryTests|ReaderHtmlModularTests|ReaderTextModularTests|ReaderCsvModularTests|ReaderJsonModularTests|ReaderXmlModularTests"

claude · 2026-04-08T10:33:49Z

Claude finished @PrzemyslawKlys's task —— View job

Code Review Complete ✅

Reviewing this PR that improves reader metadata and ingestion reporting across multiple adapters.

Todo List:

Read repository guidelines (CLAUDE.md)
Review core reader changes and metadata improvements
Analyze path canonicalization issues identified by automated review
Examine test coverage additions
Check for security concerns and performance implications
Provide comprehensive feedback

Review Summary

This is a well-structured improvement to the OfficeIMO reader system that addresses consistency gaps across different adapter types. The changes enhance metadata reliability without breaking the public API.

Key Findings

✅ Strengths

Comprehensive test coverage: Excellent addition of modular tests across all adapters
Metadata consistency: Good work standardizing source metadata, chunk hashes, and token estimates
Security-conscious: Proper path traversal protection in ZIP handling
Performance considerations: Reasonable token estimation logic ((length + 3) / 4)
Error handling: Robust try-catch blocks with fallback behaviors

⚠️ Critical Issues (Automated Review Confirmed)

ZIP Path Canonicalization (OfficeIMO.Reader.Zip/DocumentReaderZipExtensions.cs:529)
- Issue: Virtual paths built from raw archivePath without normalization
- Impact: Same content from equivalent paths generates different SourceId/ChunkHash values
- Example: ./docs/archive.zip vs /full/path/docs/archive.zip produce different hashes
- Solution: Apply NormalizePathForVirtualPath() to archive path before building virtual paths
EPUB Path Canonicalization (OfficeIMO.Reader.Epub/DocumentReaderEpubExtensions.cs:227)
- Issue: Chapter SourceId uses raw source.Path instead of normalized path
- Impact: Same EPUB via different filesystem paths generates inconsistent chapter IDs
- Solution: Canonicalize source path before building virtual paths

🔍 Code Quality Observations

Good Practices:

Unix realpath() integration for proper canonicalization (DocumentReaderZipExtensions.cs:667)
Cross-platform path handling with directory separator normalization
Proper resource cleanup with IntPtr and finally blocks
Consistent error handling patterns

Minor Concerns:

P/Invoke usage introduces platform dependency but is well-isolated
Some duplication in BuildSourceId implementations across adapters (acceptable for modularity)

📊 Test Coverage Assessment

Strong regression coverage for edge cases (malformed files, limits, warnings)
Good validation of metadata consistency
Archive traversal security tests are comprehensive

Recommendations

High Priority: Fix the path canonicalization issues in ZIP and EPUB readers to ensure stable source identity
Medium Priority: Consider extracting common path normalization logic to reduce duplication
Low Priority: Document the platform-specific realpath() behavior for maintainability

Security Review

✅ No security concerns identified - the path traversal protections and input validation are appropriate.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 23b8998f59

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-04-08T12:15:41Z

OfficeIMO.Reader.Zip/DocumentReaderZipExtensions.cs

+        if (chunk == null) throw new ArgumentNullException(nameof(chunk));
+
+        chunk.Location.Path = virtualPath;
+        chunk.SourceId = BuildSourceId(virtualPath);


Canonicalize ZIP path before hashing virtual source IDs

When ZIP chunks are emitted, SourceId is derived from virtualPath, but virtualPath is built from the raw archivePath string provided by the caller. For file-based reads this means equivalent paths (for example ./docs/archive.zip vs /workspace/OfficeIMO/docs/archive.zip) produce different SourceId/ChunkHash values for the same entry content, which breaks deduplication and stable incremental ingestion semantics that other path-based readers preserve via full-path normalization.

Useful? React with 👍 / 👎.

chatgpt-codex-connector · 2026-04-08T12:15:41Z

OfficeIMO.Reader.Epub/DocumentReaderEpubExtensions.cs

+        if (string.IsNullOrWhiteSpace(virtualPath)) return;
+
+        chunk.Location.Path = virtualPath;
+        chunk.SourceId = BuildSourceId(virtualPath);


Canonicalize EPUB path before hashing chapter source IDs

Chapter chunks overwrite SourceId using BuildSourceId(virtualPath), but virtualPath is computed from source.Path, which for file reads is the raw input path rather than a normalized full path. As a result, reading the same EPUB through different equivalent filesystem paths generates different chapter SourceId/ChunkHash values, undermining stable source identity and cache/dedup behavior.

Useful? React with 👍 / 👎.

Improve reader metadata and ingestion reporting

a02f654

PrzemyslawKlys added 4 commits April 8, 2026 13:57

Avoid hashing size-skipped folder inputs

2a75b65

Warn when document chunks are truncated

d20c4ea

Preserve schema for header-only CSV files

a273f26

Delay single-file hashes until parse succeeds

23b8998

chatgpt-codex-connector bot reviewed Apr 8, 2026

View reviewed changes

PrzemyslawKlys added 3 commits April 8, 2026 16:45

Fix reader archive source canonicalization

67fd5a1

Fix legacy target compatibility in reader adapters

64980a4

Fix reader archive canonicalization on Unix

21a0c03

PrzemyslawKlys merged commit 137a78b into master Apr 8, 2026
10 checks passed

PrzemyslawKlys deleted the codex/reader-improvements branch April 8, 2026 20:40

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Improve reader metadata and ingestion reporting#1754

Improve reader metadata and ingestion reporting#1754
PrzemyslawKlys merged 8 commits intomasterfrom
codex/reader-improvements

PrzemyslawKlys commented Apr 8, 2026

Uh oh!

claude bot commented Apr 8, 2026 •

edited

Loading

Uh oh!

chatgpt-codex-connector bot left a comment

Uh oh!

chatgpt-codex-connector bot Apr 8, 2026

Uh oh!

chatgpt-codex-connector bot Apr 8, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

PrzemyslawKlys commented Apr 8, 2026

Summary

Why

Impact

Validation

Uh oh!

claude bot commented Apr 8, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Code Review Complete ✅

Todo List:

Review Summary

Key Findings

✅ Strengths

⚠️ Critical Issues (Automated Review Confirmed)

🔍 Code Quality Observations

📊 Test Coverage Assessment

Recommendations

Security Review

Uh oh!

chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector bot Apr 8, 2026

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector bot Apr 8, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

claude bot commented Apr 8, 2026 •

edited

Loading