Improve reader metadata and ingestion reporting #1754
Claude finished @PrzemyslawKlys's task
Code Review Complete ✅
Reviewing this PR that improves reader metadata and ingestion reporting across multiple adapters.
Review Summary
This is a well-structured improvement to the OfficeIMO reader system that addresses consistency gaps across different adapter types. The changes enhance metadata reliability without breaking the public API.
Key Findings
✅ Strengths
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 23b8998f59
    if (chunk == null) throw new ArgumentNullException(nameof(chunk));

    chunk.Location.Path = virtualPath;
    chunk.SourceId = BuildSourceId(virtualPath);
Canonicalize ZIP path before hashing virtual source IDs
When ZIP chunks are emitted, SourceId is derived from virtualPath, but virtualPath is built from the raw archivePath string provided by the caller. For file-based reads this means equivalent paths (for example ./docs/archive.zip vs /workspace/OfficeIMO/docs/archive.zip) produce different SourceId/ChunkHash values for the same entry content, which breaks deduplication and stable incremental ingestion semantics that other path-based readers preserve via full-path normalization.
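One way to address this, sketched under assumptions: resolve the archive path with `Path.GetFullPath` before composing the virtual path that gets hashed. The class name `ZipSourceIdSketch`, the helper `BuildVirtualPath`, the `::` separator, and the SHA-256 body of `BuildSourceId` are illustrative guesses; the PR's actual `BuildSourceId` implementation is not shown in this thread.

```csharp
using System;
using System.IO;
using System.Security.Cryptography;
using System.Text;

public static class ZipSourceIdSketch
{
    // Hypothetical helper: resolve relative segments so equivalent paths
    // collapse to the same absolute string before hashing.
    public static string BuildVirtualPath(string archivePath, string entryName)
    {
        if (string.IsNullOrWhiteSpace(archivePath))
            throw new ArgumentException("Archive path is required.", nameof(archivePath));
        if (entryName == null)
            throw new ArgumentNullException(nameof(entryName));

        string fullArchivePath = Path.GetFullPath(archivePath);
        return fullArchivePath + "::" + entryName; // separator is an assumption
    }

    // Stand-in for the PR's BuildSourceId: hex-encoded SHA-256 of the virtual path.
    public static string BuildSourceId(string virtualPath)
    {
        if (virtualPath == null)
            throw new ArgumentNullException(nameof(virtualPath));

        using (SHA256 sha = SHA256.Create())
        {
            byte[] hash = sha.ComputeHash(Encoding.UTF8.GetBytes(virtualPath));
            return BitConverter.ToString(hash).Replace("-", string.Empty).ToLowerInvariant();
        }
    }
}
```

With this shape, `./docs/archive.zip` and the absolute path to the same file yield the same ID when both are resolved from the same working directory. `Path.GetFullPath` does not resolve symbolic links or normalize casing on case-insensitive file systems, so those cases would still need separate handling.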
    if (string.IsNullOrWhiteSpace(virtualPath)) return;

    chunk.Location.Path = virtualPath;
    chunk.SourceId = BuildSourceId(virtualPath);
Canonicalize EPUB path before hashing chapter source IDs
Chapter chunks overwrite SourceId using BuildSourceId(virtualPath), but virtualPath is computed from source.Path, which for file reads is the raw input path rather than a normalized full path. As a result, reading the same EPUB through different equivalent filesystem paths generates different chapter SourceId/ChunkHash values, undermining stable source identity and cache/dedup behavior.
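As a rough illustration of the EPUB case, the chapter's virtual path could be composed from a normalized source path plus the chapter's entry name inside the package. `EpubVirtualPathSketch`, `BuildChapterVirtualPath`, and the `::` separator are hypothetical names chosen for this sketch; only `source.Path`, `virtualPath`, and `BuildSourceId` come from the comment itself.

```csharp
using System;
using System.IO;

public static class EpubVirtualPathSketch
{
    // Hypothetical helper: builds a chapter virtual path whose container
    // prefix does not depend on how the caller spelled the file path.
    public static string BuildChapterVirtualPath(string sourcePath, string chapterEntry)
    {
        if (string.IsNullOrWhiteSpace(sourcePath))
            throw new ArgumentException("Source path is required.", nameof(sourcePath));
        if (string.IsNullOrWhiteSpace(chapterEntry))
            throw new ArgumentException("Chapter entry is required.", nameof(chapterEntry));

        // Normalize the container path once so every chapter shares the same prefix.
        string fullSourcePath = Path.GetFullPath(sourcePath);

        // Fold backslashes and a leading slash so equivalent entry spellings match.
        string entry = chapterEntry.Replace('\\', '/').TrimStart('/');

        return fullSourcePath + "::" + entry; // separator is an assumption
    }
}
```

Passing the result to `BuildSourceId` would then keep chapter IDs stable across equivalent ways of addressing the same EPUB file, at least for paths that differ only in relative segments.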
Summary
Why
The reader projects had several small but compounding consistency gaps. Different adapters produced different metadata, archive-backed readers could collide on source identity or under-report warning provenance, and folder ingestion summaries/progress callbacks could drift from the actual emitted chunks.
Impact
These changes make downstream indexing, deduplication, and ingestion reporting more reliable without changing the public reader API shape. They also expand XHTML handling and add broader regression coverage across modular adapters and folder/document orchestration.
Validation
dotnet test OfficeIMO.Tests/OfficeIMO.Tests.csproj --framework net8.0 --filter "ReaderDocumentReaderTests|ReaderZipModularTests|ReaderEpubModularTests|ReaderRegistryTests|ReaderHtmlModularTests|ReaderTextModularTests|ReaderCsvModularTests|ReaderJsonModularTests|ReaderXmlModularTests"