feat: add document structure detection cascade for TOC #34
Replace the novel-only chapter heuristics with an evidence cascade that
identifies sections, subsections, and front/back matter across varied
PDFs — research papers especially, whose layouts vary widely.
Detectors, ranked by reliability and merged with confidence + source:
1. Tagged structure tree (/StructTreeRoot) — parses H1–H6/Title via
getStructTree() + marked-content text, tying each tag to its glyphs.
2. Embedded outline / bookmarks (unchanged).
3. Typography + numbering/lexicon grammar — the universal fallback for
untagged PDFs: heading lines stand out by size/weight, by a section
number ("3.", "4.1."), or by a known section name. The numbering
signal is what survives when headings share the body size and pdf.js
has hidden their bold weight (common in journal styles).
4. Keyword chapter detection / page markers — last resort.
Font size and bold weight are now threaded through line reconstruction.
Titles resolve from the tagged tree, then valid metadata, then the
largest line on page 1, then the filename — so producer junk like a
leftover "gr1.eps" in /Title no longer wins.
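A minimal sketch of that title fallback order, for illustration only. The `TitleSources` shape, the `isPlausibleTitle` junk filter, and the name `resolveTitleSketch` are assumptions made for this example, not the code in this PR.

```typescript
// Hypothetical input shape: field names are illustrative, not the PR's types.
interface TitleSources {
  taggedTitle?: string;        // Title element from the tagged structure tree
  metadataTitle?: string;      // the PDF /Title metadata entry
  largestLineOnPage1?: string; // typography fallback
  filename: string;            // last resort
}

// Assumed junk filter: rejects empty or very short strings, bare file names
// left behind by producers (e.g. "gr1.eps"), and "Untitled" placeholders.
function isPlausibleTitle(s: string | undefined): s is string {
  if (!s) return false;
  const t = s.trim();
  if (t.length < 3) return false;
  if (/\.(eps|pdf|docx?|tex|dvi|indd)$/i.test(t)) return false;
  if (/^untitled/i.test(t)) return false;
  return true;
}

function resolveTitleSketch(src: TitleSources): string {
  // Try sources in the order described: tagged tree, metadata, largest line.
  const candidates = [src.taggedTitle, src.metadataTitle, src.largestLineOnPage1];
  for (const c of candidates) {
    if (isPlausibleTitle(c)) return c.trim();
  }
  // Filename fallback: strip any directory prefix and the final extension.
  const base = src.filename.split(/[\\/]/).pop() ?? src.filename;
  return base.replace(/\.[^.]+$/, "");
}
```

With a junk metadata value such as "gr1.eps", the sketch skips it and falls through to the largest line on page 1, which is the behavior the message describes.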
A new kind-aware StructureNode model (level, kind, number, confidence,
source) is stored alongside the flat outline and surfaced in the
contents panel; the flat outline is still derived for back-compat.
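One possible shape for the node model and the derived flat outline, as a sketch. The fields level, kind, number, confidence, and source come from the message above; `title`, `page`, the source labels, and the `toFlatOutline` helper are assumptions added for illustration.

```typescript
// Kinds as listed in the PR summary; source labels are assumed names.
type NodeKind = "section" | "subsection" | "frontmatter" | "reference" | "appendix";
type NodeSource = "struct-tree" | "outline" | "typography" | "keyword";

interface StructureNodeSketch {
  title: string;      // assumed field
  page: number;       // assumed field
  level: number;
  kind: NodeKind;
  number?: string;    // e.g. "4.1" for numbered sections
  confidence: number; // assumed to be in the range 0..1
  source: NodeSource;
}

// Assumed shape of the legacy flat outline entries.
interface FlatOutlineItem {
  title: string;
  page: number;
  level: number;
}

// Derive the flat outline by dropping the extra metadata and
// normalizing the level to an integer >= 1.
function toFlatOutline(nodes: StructureNodeSketch[]): FlatOutlineItem[] {
  return nodes.map((n) => ({
    title: n.title,
    page: n.page,
    level: Math.max(1, Math.floor(n.level) || 1),
  }));
}
```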
https://claude.ai/code/session_01KMkammT4QJ4cKPq7phS98f
The unstructured fallback used strict priority (typography first, then the keyword/chapter detector), so in a novel without tags or bookmarks the new typography pass could preempt the well-tuned chapter detector with noise. Merge the two instead: union their candidates, page-ordered and de-duplicated, so a novel contributes its "Chapter N"/large-title headings and a paper its numbered sections, and each genre still gets its headings when only one detector fires.
https://claude.ai/code/session_01KMkammT4QJ4cKPq7phS98f
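The union-and-dedupe step could look roughly like the sketch below. The `Candidate` type, the de-duplication key (page plus normalized title), and the choice to let keyword hits win ties are assumptions for the example; the PR's merge may differ.

```typescript
// Hypothetical candidate shape for the two unstructured detectors.
interface Candidate {
  title: string;
  page: number;
  source: "typography" | "keyword";
}

function mergeCandidates(typography: Candidate[], keyword: Candidate[]): Candidate[] {
  const seen = new Set<string>();
  const merged: Candidate[] = [];
  // Keyword candidates go first so that, on a duplicate, the chapter
  // detector's entry is the one that is kept (an assumed tie-break).
  for (const c of [...keyword, ...typography]) {
    const normalized = c.title.trim().toLowerCase().replace(/\s+/g, " ");
    const key = `${c.page}:${normalized}`;
    if (seen.has(key)) continue;
    seen.add(key);
    merged.push(c);
  }
  // Array.prototype.sort is stable, so candidates on the same page keep
  // their insertion order; a real merge would also use vertical position.
  return merged.sort((a, b) => a.page - b.page);
}
```

If only one detector produces candidates, the result is simply that detector's list in page order.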
📝 Walkthrough
This PR enriches PDF document structure extraction by implementing a cascading detection system that probes for tagged structure trees, embedded outlines, and typography-based signals, then unifies them into a kind-aware structured table of contents. Text line reconstruction is enhanced with font-size and bold metrics to support heading detection.
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~50 minutes
🧹 Nitpick comments (1)
src/components/Reader.tsx (1)
1219-1219: Consider adding a defensive guard for `item.level`.
The padding calculation assumes `item.level >= 1`. While the fallback at line 650 guarantees a minimum level of 1, and the structure detection system should also provide valid levels, adding a guard would prevent negative padding if malformed data arrives.
🛡️ Defensive guard suggestion
  <span className="truncate flex items-center gap-2"
-   style={{ paddingLeft: `${(item.level - 1) * 12}px` }}
+   style={{ paddingLeft: `${Math.max(0, (item.level - 1) * 12)}px` }}
  >
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Nitpick comments:
In `@src/components/Reader.tsx`:
- Line 1219: The paddingLeft calculation directly uses item.level and can
produce negative values if malformed data arrives; update the inline style in
the Reader component where style={{ paddingLeft: `${(item.level - 1) * 12}px` }}
to defensively clamp or default the level (e.g., coerce to a number and use
Math.max(level, 1) or fallback to 1) before computing (level - 1) * 12 so
padding never becomes negative; locate the usage of item.level in the Reader.tsx
render and replace it with the clamped/defaulted value.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: fe542543-acc3-4619-a671-e1edb0f65c14
📒 Files selected for processing (5)
- src/components/App.tsx
- src/components/Reader.tsx
- src/lib/pdf-extract.ts
- src/lib/pdf-structure.ts
- src/lib/reader-store.ts
Guard the contents-panel padding with Math.max so malformed level data can never yield negative indentation (CodeRabbit nitpick), and add docstrings to the remaining structure-detection helpers. https://claude.ai/code/session_01KMkammT4QJ4cKPq7phS98f
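For reference, the guard can be expressed as a small helper. This is a sketch: the name `indentPx` and its tolerance for `null`/`undefined` are assumptions, while the 12px step and the `Math.max` clamp follow the review suggestion.

```typescript
// Hypothetical helper: computes left padding for an outline entry and
// never returns a negative value, even for malformed levels.
function indentPx(level: number | null | undefined, step = 12): number {
  const n = Number(level);
  // Treat NaN/undefined as level 1 (no indentation).
  const safe = Number.isFinite(n) ? Math.floor(n) : 1;
  return Math.max(0, (safe - 1) * step);
}
```

In the component this would be used as `style={{ paddingLeft: `${indentPx(item.level)}px` }}`.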
Summary
Implement a robust document structure detection system that cascades through multiple evidence sources (tagged structure tree, embedded outline, typography analysis, and keyword detection) to build a rich, kind-aware table of contents. This replaces the flat outline-only approach with a system that can identify section levels, numbering, and semantic kinds (section, subsection, frontmatter, reference, appendix).
Key Changes
New module `pdf-structure.ts`: Pure logic layer (DOM-free, pdf.js-free) for structure detection
- `classifyHeading()`: Grammar-based heading classification (numbering, lexicon matching, caption detection)
- `extractStructHeadings()`: Extract H1–H6/Title from tagged structure trees via marked-content mapping
- `finalizeStructHeadings()`: Normalize structure tree headings and lift document titles
- `detectHeadingsByTypography()`: Universal fallback using font size, weight, and numbering heuristics
- `buildStructure()`: Cascade merger that selects the strongest available evidence source
- `resolveTitle()`: Multi-source title resolution (tagged > metadata > typography > filename)
Enhanced `pdf-extract.ts`:
- `getStructTree()` and marked-content text mapping
- `StructureNode[]` alongside flat outline for backward compatibility
Updated `Reader.tsx`:
Type updates:
- `StructureNode` interface with level, kind, confidence, and source metadata
- `structure` field to `ExtractedDoc` and `CachedDoc`
- `pdf-structure.ts` for consumer use
Implementation Details
The cascade ranks evidence by reliability: tagged structure tree, then embedded outline, then typography analysis, then keyword detection.
Typography detection works across untagged PDFs by combining three independent signals: size (larger than body text), weight (bold), and grammar (numbered sections or known lexicon names like "Abstract", "References"). The combination covers research papers whose headings share the body size but are numbered, and novels whose large titles lack numbers.
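A rough sketch of how the three signals might be combined for a single line. The 1.15 size ratio, the 80-character limit, the numbering regex, and the lexicon below are illustrative assumptions, not the values used in `pdf-structure.ts`.

```typescript
// Hypothetical line shape carrying the font metrics threaded through
// line reconstruction.
interface LineInfo {
  text: string;
  fontSize: number;
  bold: boolean;
}

// Matches "3. Methods", "4.1. Setup", "2 Background": short numeric
// components followed by a capitalized word (assumed grammar).
const SECTION_NUMBER = /^\d{1,2}(?:\.\d{1,2})*\.?\s+[A-Z]/;

// Assumed lexicon of common section names.
const SECTION_NAMES = new Set([
  "abstract", "introduction", "related work", "methods", "results",
  "discussion", "conclusion", "acknowledgments", "references", "appendix",
]);

interface HeadingSignals {
  larger: boolean;
  bold: boolean;
  grammar: boolean;
}

function headingSignals(line: LineInfo, bodySize: number): HeadingSignals {
  const text = line.text.trim();
  return {
    larger: line.fontSize >= bodySize * 1.15,
    bold: line.bold,
    grammar: SECTION_NUMBER.test(text) || SECTION_NAMES.has(text.toLowerCase()),
  };
}

function looksLikeHeading(line: LineInfo, bodySize: number): boolean {
  const text = line.text.trim();
  // Headings are assumed to be short and not to end like a sentence.
  if (text.length === 0 || text.length > 80 || /[.,;]$/.test(text)) return false;
  const s = headingSignals(line, bodySize);
  // Any one signal is enough in this sketch; a real detector would
  // likely weight them into a confidence score.
  return s.larger || s.bold || s.grammar;
}
```

A numbered line at body size ("3. Methods") passes on grammar alone, and an unnumbered large title passes on size alone, which mirrors the two cases described above.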
The system gracefully degrades: when a strong signal exists (tagged structure), weaker ones are ignored; when it doesn't, the cascade falls through to universal heuristics that work on any PDF.
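The fall-through behavior can be sketched as a simple priority selection. The `Evidence` and `CascadeResult` types and the function name `selectEvidence` are hypothetical, and this version only picks a source; the PR's `buildStructure()` may also merge results and attach confidence.

```typescript
// Hypothetical shapes for the detector outputs.
interface Heading {
  title: string;
  page: number;
  level: number;
}

type EvidenceSource = "struct-tree" | "outline" | "heuristics" | "none";

interface Evidence {
  structTree: Heading[];
  outline: Heading[];
  typography: Heading[];
  keyword: Heading[];
}

interface CascadeResult {
  source: EvidenceSource;
  headings: Heading[];
}

function selectEvidence(e: Evidence): CascadeResult {
  // Strong signals win outright; weaker ones are ignored when present.
  if (e.structTree.length > 0) return { source: "struct-tree", headings: e.structTree };
  if (e.outline.length > 0) return { source: "outline", headings: e.outline };
  // Otherwise fall through to the heuristics, combined in page order
  // (de-duplication is omitted here for brevity).
  const heuristics = [...e.typography, ...e.keyword].sort((a, b) => a.page - b.page);
  if (heuristics.length > 0) return { source: "heuristics", headings: heuristics };
  return { source: "none", headings: [] };
}
```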
https://claude.ai/code/session_01KMkammT4QJ4cKPq7phS98f