
feat: add document structure detection cascade for TOC #34

Merged
punyamsingh merged 3 commits into main from claude/pdf-structure-parsing-3wr10t on Jun 10, 2026

@punyamsingh punyamsingh commented Jun 10, 2026

Owner

Summary

Implement a robust document structure detection system that cascades through multiple evidence sources (tagged structure tree, embedded outline, typography analysis, and keyword detection) to build a rich, kind-aware table of contents. This replaces the flat outline-only approach with a system that can identify section levels, numbering, and semantic kinds (section, subsection, frontmatter, reference, appendix).

Key Changes

  • New module pdf-structure.ts: Pure logic layer (DOM-free, pdf.js-free) for structure detection

    • classifyHeading(): Grammar-based heading classification (numbering, lexicon matching, caption detection)
    • extractStructHeadings(): Extract H1–H6/Title from tagged structure trees via marked-content mapping
    • finalizeStructHeadings(): Normalize structure tree headings and lift document titles
    • detectHeadingsByTypography(): Universal fallback using font size, weight, and numbering heuristics
    • buildStructure(): Cascade merger that selects the strongest available evidence source
    • resolveTitle(): Multi-source title resolution (tagged > metadata > typography > filename)
    • Supporting utilities for outline conversion, candidate filtering, and deduplication
  • Enhanced pdf-extract.ts:

    • Extract tagged structure tree from PDF via getStructTree() and marked-content text mapping
    • Compute font size and bold detection from text transforms and font styles
    • Pass typography candidates through the structure cascade
    • Return rich StructureNode[] alongside flat outline for backward compatibility
    • Improved title resolution using the new cascade system
  • Updated Reader.tsx:

    • Prefer kind-aware structure for TOC display when available
    • Fall back to flat outline for cached docs (pre-structure)
    • Display semantic kind badges (e.g., "References", "Appendix") in the contents panel
    • Adjust styling for nested levels and kind indicators
  • Type updates:

    • Add StructureNode interface with level, kind, confidence, and source metadata
    • Add structure field to ExtractedDoc and CachedDoc
    • Export structure types from pdf-structure.ts for consumer use
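The grammar-based classification listed above for classifyHeading() can be pictured roughly as follows. This is a minimal sketch, not the code in pdf-structure.ts: the function name `classifyHeadingSketch`, the regular expressions, and the small lexicon are all assumptions chosen for illustration.

```typescript
// Hypothetical sketch: names, regexes, and the lexicon are illustrative only.
type HeadingKind = "section" | "subsection" | "frontmatter" | "reference" | "appendix";

interface HeadingClass {
  kind: HeadingKind;
  number: string | undefined; // e.g. "4.1" when the line starts with a section number
  inLexicon: boolean;
}

// Assumed lexicon of well-known section names.
const LEXICON = new Map<string, HeadingKind>([
  ["abstract", "frontmatter"],
  ["introduction", "section"],
  ["conclusion", "section"],
  ["references", "reference"],
  ["bibliography", "reference"],
]);

// Captions such as "Figure 3: ..." or "Table 2 ..." are not headings.
const CAPTION = /^(figure|fig\.|table)\s+\d+/i;
// "3. Results", "4.1. Methods", "2 Background": short dotted numbers, then text.
const NUMBERED_HEADING = /^(\d{1,2}(?:\.\d{1,2})*)\.?\s+(\S.*)$/;

function classifyHeadingSketch(text: string): HeadingClass | null {
  const line = text.trim();
  if (line === "" || CAPTION.test(line)) return null;

  const m = NUMBERED_HEADING.exec(line);
  const number = m ? m[1] : undefined;
  const label = (m ? m[2] : line).toLowerCase();
  const lexKind: HeadingKind | undefined =
    LEXICON.get(label) ?? (label.startsWith("appendix") ? "appendix" : undefined);

  // Without a number or a lexicon hit there is no grammar signal at all.
  if (number === undefined && lexKind === undefined) return null;

  const depth = number ? number.split(".").length : 1;
  const kind = lexKind ?? (depth > 1 ? "subsection" : "section");
  return { kind, number, inLexicon: lexKind !== undefined };
}
```

Returning `null` for lines with no grammar signal leaves the decision to the size and weight signals, which a caller would check separately.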

Implementation Details

The cascade ranks evidence by reliability:

  1. Tagged structure tree (0.97 confidence) — PDF explicitly labels H1/H2/H3
  2. Embedded outline (0.9 confidence) — Bookmarks with hierarchy
  3. Typography + keyword detection (0.55–0.7 confidence) — Font size/weight and numbering
  4. Fallback (0.5 confidence) — Legacy chapter detection for novels

Typography detection stays reliable on untagged PDFs by combining three independent signals: size (larger than body text), weight (bold), and grammar (numbered sections or known lexicon names like "Abstract", "References"). Because a heading can stand out on any one of them, this handles research papers whose headings share the body size but are numbered, as well as novels whose large titles lack numbers.

The system gracefully degrades: when a strong signal exists (tagged structure), weaker ones are ignored; when it doesn't, the cascade falls through to universal heuristics that work on any PDF.
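As an illustration of the ranking above, the selection step could look something like this. It is a simplified sketch under stated assumptions: `pickStrongest`, `Candidate`, and `SketchNode` are hypothetical names rather than the PR's buildStructure() API, the confidences are the ones quoted in the list, and the typography tier uses the lower bound of its quoted range. It also keeps strict priority between the last two tiers, whereas a later commit in this PR merges them.

```typescript
// Hypothetical shapes; the PR's real StructureNode carries more fields.
type EvidenceSource = "struct-tree" | "outline" | "typography" | "chapters";

interface Candidate {
  title: string;
  page: number;
  level: number;
}

interface SketchNode extends Candidate {
  source: EvidenceSource;
  confidence: number;
}

// Ranked from most to least reliable, with the confidences quoted in the PR.
const RANKED: Array<{ source: EvidenceSource; confidence: number }> = [
  { source: "struct-tree", confidence: 0.97 },
  { source: "outline", confidence: 0.9 },
  { source: "typography", confidence: 0.55 }, // PR quotes 0.55–0.7; lower bound assumed
  { source: "chapters", confidence: 0.5 },
];

// Return the nodes of the first (strongest) source that produced anything.
function pickStrongest(
  evidence: Partial<Record<EvidenceSource, Candidate[]>>,
): SketchNode[] {
  for (const { source, confidence } of RANKED) {
    const found = evidence[source] ?? [];
    if (found.length > 0) {
      return found.map((c) => ({ ...c, source, confidence }));
    }
  }
  return []; // no detector fired
}
```

Stamping each node with its source and confidence lets a consumer such as the contents panel decide how much to trust an entry without knowing which detector produced it.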

https://claude.ai/code/session_01KMkammT4QJ4cKPq7phS98f

Summary by CodeRabbit

  • New Features
    • Enhanced PDF parsing now extracts and preserves document structure metadata for improved table of contents generation.
    • Table of contents now intelligently detects headings through multiple methods, including embedded structure data and typography analysis.
    • Document metadata resolution improved for more accurate title and author attribution.

claude added 2 commits June 10, 2026 18:59
Replace the novel-only chapter heuristics with an evidence cascade that
identifies sections, subsections, and front/back matter across varied
PDFs — research papers especially, whose layouts vary widely.

Detectors, ranked by reliability and merged with confidence + source:

1. Tagged structure tree (/StructTreeRoot) — parses H1–H6/Title via
   getStructTree() + marked-content text, tying each tag to its glyphs.
2. Embedded outline / bookmarks (unchanged).
3. Typography + numbering/lexicon grammar — the universal fallback for
   untagged PDFs: heading lines stand out by size/weight, by a section
   number ("3.", "4.1."), or by a known section name. The numbering
   signal is what survives when headings share the body size and pdf.js
   has hidden their bold weight (common in journal styles).
4. Keyword chapter detection / page markers — last resort.

Font size and bold weight are now threaded through line reconstruction.
Titles resolve from the tagged tree, then valid metadata, then the
largest line on page 1, then the filename — so producer junk like a
leftover "gr1.eps" in /Title no longer wins.

A new kind-aware StructureNode model (level, kind, number, confidence,
source) is stored alongside the flat outline and surfaced in the
contents panel; the flat outline is still derived for back-compat.

https://claude.ai/code/session_01KMkammT4QJ4cKPq7phS98f
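The title order described in the commit message above (tagged tree, then valid metadata, then the largest line on page 1, then the filename) might be sketched as below. The names `TitleSources`, `looksLikeJunk`, and `resolveTitleSketch` are hypothetical, and the junk test is an assumed heuristic; the PR's resolveTitle() may validate titles differently.

```typescript
// Hypothetical input shape for illustration.
interface TitleSources {
  taggedTitle?: string;        // from the tagged structure tree
  metadataTitle?: string;      // from the PDF's /Title metadata
  largestPageOneLine?: string; // typography-based guess
  filename: string;
}

// Assumed validity check: reject very short strings, "untitled",
// and values that look like leftover file names (e.g. "gr1.eps").
function looksLikeJunk(title: string): boolean {
  const t = title.trim();
  return (
    t.length < 3 ||
    /^untitled$/i.test(t) ||
    /\.(eps|pdf|docx?|tex|dvi|indd)$/i.test(t)
  );
}

function resolveTitleSketch(s: TitleSources): string {
  const ranked = [s.taggedTitle, s.metadataTitle, s.largestPageOneLine];
  for (const candidate of ranked) {
    if (candidate !== undefined && !looksLikeJunk(candidate)) {
      return candidate.trim();
    }
  }
  // Last resort: the file name without its .pdf extension.
  return s.filename.replace(/\.pdf$/i, "");
}
```

With a metadata title of "gr1.eps", the sketch skips the metadata and falls through to the page-1 line, which is the behavior the commit message describes.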
The unstructured fallback used strict priority — typography, then the
keyword/chapter detector — so a novel without tags or bookmarks could
have the new typography pass preempt the well-tuned chapter detector
with noise. Merge the two instead: union their candidates, page-ordered
and de-duplicated, so a novel contributes its "Chapter N"/large-title
headings and a paper its numbered sections, and either genre keeps
standing if only one detector fires.

https://claude.ai/code/session_01KMkammT4QJ4cKPq7phS98f
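The merge described in this commit message (union the two detectors' candidates, order by page, drop duplicates) could be sketched as follows. `HeadingCandidate` and `mergeCandidates` are hypothetical names, and keying duplicates on page plus normalized title is an assumption; the PR may compare candidates differently.

```typescript
// Hypothetical candidate shape for illustration.
interface HeadingCandidate {
  title: string;
  page: number;
}

// Union both detectors' output, order by page, and drop repeats.
function mergeCandidates(
  typography: HeadingCandidate[],
  chapters: HeadingCandidate[],
): HeadingCandidate[] {
  // Array.prototype.sort is stable, so typography entries stay ahead of
  // chapter entries that share a page.
  const byPage = [...typography, ...chapters].sort((a, b) => a.page - b.page);

  const seen = new Set<string>();
  const merged: HeadingCandidate[] = [];
  for (const c of byPage) {
    // Assumed duplicate key: same page and same title, ignoring case and spacing.
    const key = `${c.page}:${c.title.trim().toLowerCase().replace(/\s+/g, " ")}`;
    if (seen.has(key)) continue;
    seen.add(key);
    merged.push(c);
  }
  return merged;
}
```

If only one detector fires, the other array is empty and the result is simply that detector's candidates in page order, which matches the "either genre keeps standing" intent.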
vercel Bot commented Jun 10, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

| Project | Deployment | Actions | Updated (UTC) |
|---|---|---|---|
| wereadpdf | Ready | Preview, Comment | Jun 10, 2026 7:45pm |

coderabbitai Bot commented Jun 10, 2026


ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 4e014f86-18e8-4b7e-b7ff-5b010ee3e7e4

📥 Commits

Reviewing files that changed from the base of the PR and between 02d9faf and a404e7e.

📒 Files selected for processing (3)
  • src/components/Reader.tsx
  • src/lib/pdf-extract.ts
  • src/lib/pdf-structure.ts
📝 Walkthrough

This PR enriches PDF document structure extraction by implementing a cascading detection system that probes for tagged structure trees, embedded outlines, and typography-based signals, then unifies them into a kind-aware structured table of contents. Text line reconstruction is enhanced with font-size and bold metrics to support heading detection.

Changes

PDF Document Structure Extraction and Display

| Layer | File(s) | Summary |
|---|---|---|
| Structure core model and heading classification | src/lib/pdf-structure.ts | Introduces StructureSource, StructureKind, and StructureNode types; adds classifyHeading to detect numbered headings, lexicon-based section names (Abstract/References/Appendix), and captions with metadata like kind, number, and inLexicon. |
| Tagged structure tree extraction | src/lib/pdf-structure.ts | Defines pdf.js structure-tree types (StructTreeNode, MarkedItem); implements extractStructHeadings to traverse structure tree and extract H1–H6/Title headings; finalizeStructHeadings normalizes levels and optionally promotes an early H1 to document title. |
| Typography-based heading detection | src/lib/pdf-structure.ts | Implements detectHeadingsByTypography to infer body text size and select heading candidates using size, bold, numbering, and lexicon signals; assigns hierarchy levels via numbering depth or size ranking. |
| Structure cascade and merging | src/lib/pdf-structure.ts | Implements cascade normalization, deduplication, and merging logic: converts candidates to StructureNodes, removes running-head/duplicates, refines typography candidates, converts outline indentation to candidates, and merges typography with legacy chapters in page order. |
| Structure output utilities and title resolution | src/lib/pdf-structure.ts | Adds structureToOutline to flatten structure back to legacy format; resolveTitle selects most trustworthy title from struct-tree, metadata, typography, or filename; largestLineTitle extracts typography-based title candidate from page-1 lines. |
| Font and size/bold enrichment in text extraction | src/lib/pdf-extract.ts | Introduces FontStyles and isBoldFont for bold detection; extends Line and PlacedItem interfaces with size and bold fields; refactors buildPageLines, collectItems, line ordering, and groupIntoLines to compute and propagate typography metrics through the extraction pipeline. |
| PDF extraction integration and structure wiring | src/lib/pdf-extract.ts | Expands imports for structure utilities and types; updates ExtractedDoc to include structure field; implements extractTaggedStructure to probe page-1 for struct-tree availability; builds structure from tagged, outline, typography, and chapter sources; derives legacy outline format; returns structure with title resolved via cascade. |
| Storage and persistence of structure | src/lib/reader-store.ts, src/components/App.tsx | Extends CachedDoc with optional structure: StructureNode[] field and imports StructureNode type; persists extracted structure during PDF import, with fallback to legacy outline for older cached documents. |
| Reader UI display of structured table of contents | src/components/Reader.tsx | Builds tocItems memo from cached document's structure (using explicit levels and section kinds) or fallback to outline with computed levels; renders TOC panel from filteredToc with level-based indentation and conditional uppercase "kind" badges for non-section-level items. |
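To make the "Typography-based heading detection" entry above more concrete, here is a minimal sketch of inferring the body size and picking heading candidates. Everything in it is an assumption for illustration: the `TextLine` shape, the 0.5pt bucketing, the 1.15 size ratio, and the length limits are not taken from detectHeadingsByTypography, and lexicon matching and level assignment are left out.

```typescript
// Hypothetical line shape; the PR's Line type has more fields.
interface TextLine {
  text: string;
  size: number; // font size in points
  bold: boolean;
  page: number;
}

// Assume body text is the size that covers the most characters.
function inferBodySize(lines: TextLine[]): number {
  const weight = new Map<number, number>();
  for (const l of lines) {
    const bucket = Math.round(l.size * 2) / 2; // group sizes into 0.5pt buckets
    weight.set(bucket, (weight.get(bucket) ?? 0) + l.text.length);
  }
  let best = 0;
  let bestWeight = -1;
  for (const [size, w] of weight) {
    if (w > bestWeight) {
      best = size;
      bestWeight = w;
    }
  }
  return best;
}

// Matches leading section numbers such as "3." or "4.1." followed by text.
const SECTION_NUMBER = /^\d{1,2}(?:\.\d{1,2})*\.?\s+\S/;

// A line is a candidate if it is larger than body text, numbered, or short and bold.
function headingCandidates(lines: TextLine[]): TextLine[] {
  const body = inferBodySize(lines);
  return lines.filter((l) => {
    const text = l.text.trim();
    if (text.length === 0 || text.length > 120) return false; // headings are short
    const larger = l.size >= body * 1.15; // assumed threshold
    const numbered = SECTION_NUMBER.test(text);
    const shortBold = l.bold && text.length < 80;
    return larger || numbered || shortBold;
  });
}
```

Weighting the size histogram by character count rather than line count keeps a page full of short captions or footnotes from being mistaken for the body size.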

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

Poem

📚 Structure blooms in the PDF's nested tree,
Bold lines and titles sing their hierarchy,
From tags and fonts, a cascade takes form—
Chapters align, each kind becomes norm,
Readers now wander the structured warm.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 71.43% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit's high-level summary is enabled. |
| Title check | ✅ Passed | The PR title accurately summarizes the main objective: implementing a document structure detection cascade system for table-of-contents generation. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |


@coderabbitai coderabbitai Bot left a comment


🧹 Nitpick comments (1)
src/components/Reader.tsx (1)

1219-1219: ⚡ Quick win

Consider adding a defensive guard for item.level.

The padding calculation assumes item.level >= 1. While the fallback at line 650 guarantees a minimum level of 1, and the structure detection system should also provide valid levels, adding a guard would prevent negative padding if malformed data arrives.

🛡️ Defensive guard suggestion
 <span
   className="truncate flex items-center gap-2"
-  style={{ paddingLeft: `${(item.level - 1) * 12}px` }}
+  style={{ paddingLeft: `${Math.max(0, (item.level - 1) * 12)}px` }}
 >
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/components/Reader.tsx` at line 1219, The paddingLeft calculation directly
uses item.level and can produce negative values if malformed data arrives;
update the inline style in the Reader component where style={{ paddingLeft:
`${(item.level - 1) * 12}px` }} to defensively clamp or default the level (e.g.,
coerce to a number and use Math.max(level, 1) or fallback to 1) before computing
(level - 1) * 12 so padding never becomes negative; locate the usage of
item.level in the Reader.tsx render and replace it with the clamped/defaulted
value.
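One way to read the suggestion above is as a small helper that coerces and clamps the level before computing the indent. This is a sketch only: `tocIndentPx` is a hypothetical name, and the merged change applies `Math.max` inline in Reader.tsx rather than through a helper.

```typescript
// Hypothetical helper: coerce the level to a number, clamp it to at least 1,
// then convert it to a left padding in pixels (12px per nesting step).
function tocIndentPx(level: unknown, stepPx = 12): number {
  const n = Number(level);
  const safeLevel = Number.isFinite(n) ? Math.max(1, Math.floor(n)) : 1;
  return (safeLevel - 1) * stepPx;
}

// Possible usage in JSX: style={{ paddingLeft: `${tocIndentPx(item.level)}px` }}
```

Clamping the level rather than the final pixel value also guards against missing or non-numeric levels, which would otherwise produce `NaNpx`.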

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: fe542543-acc3-4619-a671-e1edb0f65c14

📥 Commits

Reviewing files that changed from the base of the PR and between 1ab89cf and 02d9faf.

📒 Files selected for processing (5)
  • src/components/App.tsx
  • src/components/Reader.tsx
  • src/lib/pdf-extract.ts
  • src/lib/pdf-structure.ts
  • src/lib/reader-store.ts

Guard the contents-panel padding with Math.max so malformed level data
can never yield negative indentation (CodeRabbit nitpick), and add
docstrings to the remaining structure-detection helpers.

https://claude.ai/code/session_01KMkammT4QJ4cKPq7phS98f
@punyamsingh punyamsingh merged commit 279e642 into main Jun 10, 2026
4 checks passed

2 participants