feat: add document structure detection cascade for TOC by punyamsingh · Pull Request #34 · punyamsingh/WeReadPDF

punyamsingh · 2026-06-10T19:35:26Z

Summary

Implement a robust document structure detection system that cascades through multiple evidence sources (tagged structure tree, embedded outline, typography analysis, and keyword detection) to build a rich, kind-aware table of contents. This replaces the flat outline-only approach with a system that can identify section levels, numbering, and semantic kinds (section, subsection, frontmatter, reference, appendix).

Key Changes

New module pdf-structure.ts: Pure logic layer (DOM-free, pdf.js-free) for structure detection
- classifyHeading(): Grammar-based heading classification (numbering, lexicon matching, caption detection)
- extractStructHeadings(): Extract H1–H6/Title from tagged structure trees via marked-content mapping
- finalizeStructHeadings(): Normalize structure tree headings and lift document titles
- detectHeadingsByTypography(): Universal fallback using font size, weight, and numbering heuristics
- buildStructure(): Cascade merger that selects the strongest available evidence source
- resolveTitle(): Multi-source title resolution (tagged > metadata > typography > filename)
- Supporting utilities for outline conversion, candidate filtering, and deduplication
Enhanced pdf-extract.ts:
- Extract tagged structure tree from PDF via getStructTree() and marked-content text mapping
- Compute font size and bold detection from text transforms and font styles
- Pass typography candidates through the structure cascade
- Return rich StructureNode[] alongside flat outline for backward compatibility
- Improved title resolution using the new cascade system
Updated Reader.tsx:
- Prefer kind-aware structure for TOC display when available
- Fall back to flat outline for cached docs (pre-structure)
- Display semantic kind badges (e.g., "References", "Appendix") in the contents panel
- Adjust styling for nested levels and kind indicators
Type updates:
- Add StructureNode interface with level, kind, confidence, and source metadata
- Add structure field to ExtractedDoc and CachedDoc
- Export structure types from pdf-structure.ts for consumer use
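As a rough illustration of the type updates above, the node model might look like the sketch below. Only `level`, `kind`, `number`, `confidence`, and `source` are named in this PR; the union members, the `title` and `page` fields, and the `OutlineItem` shape are assumptions made for the example, not the actual definitions in pdf-structure.ts.

```typescript
// Hypothetical shapes: the union members and the title/page fields are assumed.
type StructureSource = "struct-tree" | "outline" | "typography" | "chapter";
type StructureKind = "section" | "subsection" | "frontmatter" | "reference" | "appendix";

interface StructureNode {
  title: string;        // assumed field
  page: number;         // assumed field, 1-based
  level: number;        // 1 = top level
  kind: StructureKind;
  number?: string;      // e.g. "4.1" when the heading is numbered
  confidence: number;   // 0..1, inherited from the evidence source
  source: StructureSource;
}

// Assumed legacy shape of the flat outline kept for backward compatibility.
interface OutlineItem {
  title: string;
  page: number;
}

// Flatten the rich structure back to the legacy outline format.
function structureToOutline(nodes: StructureNode[]): OutlineItem[] {
  return nodes.map(({ title, page }) => ({ title, page }));
}
```

Deriving the flat outline from the rich structure, rather than storing two independently computed lists, keeps the two views from drifting apart.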

Implementation Details

The cascade ranks evidence by reliability:

1. Tagged structure tree (0.97 confidence): the PDF explicitly labels H1/H2/H3
2. Embedded outline (0.9 confidence): bookmarks with hierarchy
3. Typography + keyword detection (0.55–0.7 confidence): font size/weight and numbering
4. Fallback (0.5 confidence): legacy chapter detection for novels
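The ranking above can be sketched as a simple "strongest non-empty source wins" selection. This is a minimal sketch, not the actual `buildStructure()`: the `Evidence` and `Candidate` shapes and the `pickStrongest` name are hypothetical, and only the confidence values come from this PR.

```typescript
// Hypothetical shapes for illustration; not the types from pdf-structure.ts.
interface Candidate {
  title: string;
  page: number;
}

interface Evidence {
  source: "struct-tree" | "outline" | "typography" | "chapter";
  confidence: number;
  candidates: Candidate[];
}

// Return the most reliable source that actually produced headings,
// or undefined when every detector came back empty.
function pickStrongest(evidence: Evidence[]): Evidence | undefined {
  return evidence
    .filter((e) => e.candidates.length > 0)          // empty sources cannot win
    .sort((a, b) => b.confidence - a.confidence)[0]; // highest confidence first
}

// An untagged PDF with bookmarks: the struct-tree source is empty, so the outline wins.
const chosen = pickStrongest([
  { source: "struct-tree", confidence: 0.97, candidates: [] },
  { source: "outline", confidence: 0.9, candidates: [{ title: "1 Introduction", page: 1 }] },
  { source: "typography", confidence: 0.7, candidates: [{ title: "1 Introduction", page: 1 }] },
]);
```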

Typography detection is robust across untagged PDFs by combining three independent signals: size (larger than body text), weight (bold), and grammar (numbered sections or known lexicon names like "Abstract", "References"). This survives research papers where headings share the body size but are numbered, and novels where large titles lack numbers.
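To make the three signals concrete, here is a minimal sketch of how size, weight, and grammar could be combined. It is not the PR's `detectHeadingsByTypography()`: the `Line` shape, the thresholds (1.15× body size, 120 and 60 characters), the numbering regex, and the four-word lexicon are all assumptions chosen for the example.

```typescript
// Hypothetical line shape; the PR only says lines carry size and bold.
interface Line {
  text: string;
  size: number;
  bold: boolean;
  page: number;
}

// Illustrative subset of section names; the real lexicon is not shown in the PR.
const LEXICON = ["abstract", "introduction", "references", "appendix"];
// Matches "3. Methods", "4.1. Setup", "2 Related Work"; rejects "2019 was ...".
const NUMBERED = /^\d{1,2}(\.\d{1,2})*\.?\s+[A-Z]/;

// Approximate the body size as the most frequent font size (rounded to 0.5).
function bodySize(lines: Line[]): number {
  const counts = new Map<number, number>();
  lines.forEach((l) => {
    const s = Math.round(l.size * 2) / 2;
    counts.set(s, (counts.get(s) ?? 0) + 1);
  });
  let best = 0;
  let bestCount = -1;
  counts.forEach((count, size) => {
    if (count > bestCount) {
      best = size;
      bestCount = count;
    }
  });
  return best;
}

// A line is a heading candidate when any one of the three signals fires.
function detectHeadings(lines: Line[]): Line[] {
  const body = bodySize(lines);
  return lines.filter((l) => {
    const text = l.text.trim();
    if (text.length === 0 || text.length > 120) return false;
    const larger = l.size >= body * 1.15;                    // size signal
    const strong = l.bold && text.length < 60;               // weight signal
    const grammar =
      NUMBERED.test(text) || LEXICON.indexOf(text.toLowerCase()) !== -1; // grammar signal
    return larger || strong || grammar;
  });
}
```

Because the signals are OR-ed, a numbered heading set at body size still qualifies through the grammar signal alone, which is the research-paper case described above.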

The system gracefully degrades: when a strong signal exists (tagged structure), weaker ones are ignored; when it doesn't, the cascade falls through to universal heuristics that work on any PDF.

https://claude.ai/code/session_01KMkammT4QJ4cKPq7phS98f

Summary by CodeRabbit

New Features
- Enhanced PDF parsing now extracts and preserves document structure metadata for improved table of contents generation.
- Table of contents now intelligently detects headings through multiple methods, including embedded structure data and typography analysis.
- Document metadata resolution improved for more accurate title and author attribution.

Replace the novel-only chapter heuristics with an evidence cascade that identifies sections, subsections, and front/back matter across varied PDFs, research papers especially, whose layouts vary widely.

Detectors, ranked by reliability and merged with confidence + source:

1. Tagged structure tree (/StructTreeRoot): parses H1–H6/Title via getStructTree() + marked-content text, tying each tag to its glyphs.
2. Embedded outline / bookmarks (unchanged).
3. Typography + numbering/lexicon grammar: the universal fallback for untagged PDFs. Heading lines stand out by size/weight, by a section number ("3.", "4.1."), or by a known section name. The numbering signal is what survives when headings share the body size and pdf.js has hidden their bold weight (common in journal styles).
4. Keyword chapter detection / page markers: last resort.

Font size and bold weight are now threaded through line reconstruction. Titles resolve from the tagged tree, then valid metadata, then the largest line on page 1, then the filename, so producer junk like a leftover "gr1.eps" in /Title no longer wins.

A new kind-aware StructureNode model (level, kind, number, confidence, source) is stored alongside the flat outline and surfaced in the contents panel; the flat outline is still derived for back-compat.

https://claude.ai/code/session_01KMkammT4QJ4cKPq7phS98f
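The title resolution order described in this commit (tagged tree, then valid metadata, then the largest page-1 line, then the filename) could look roughly like the following. The `TitleSources` shape and the `isPlausibleTitle` rules are assumptions for illustration; the PR does not spell out what counts as "valid" metadata.

```typescript
// Hypothetical bundle of title candidates, one per evidence source.
interface TitleSources {
  tagged?: string;      // Title/H1 from the tagged structure tree
  metadata?: string;    // /Title from the document info dictionary
  largestLine?: string; // largest line on page 1
  filename: string;     // always available
}

// Assumed validity rules: reject empty strings and producer leftovers
// that look like file names (e.g. "gr1.eps") or placeholders.
function isPlausibleTitle(s: string | undefined): s is string {
  if (!s) return false;
  const t = s.trim();
  if (t.length < 3) return false;
  if (/\.(eps|pdf|docx?|tex|dvi|indd)$/i.test(t)) return false;
  if (/^untitled$/i.test(t)) return false;
  return true;
}

// Walk the sources in priority order and return the first plausible one.
function resolveTitle(src: TitleSources): string {
  for (const candidate of [src.tagged, src.metadata, src.largestLine]) {
    if (isPlausibleTitle(candidate)) return candidate.trim();
  }
  return src.filename.replace(/\.pdf$/i, "");
}
```

With these rules, a /Title of "gr1.eps" is skipped and the page-1 typography candidate is used instead.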

The unstructured fallback used strict priority (typography, then the keyword/chapter detector), so in a novel without tags or bookmarks the new typography pass could preempt the well-tuned chapter detector with noise. Merge the two instead: union their candidates, page-ordered and de-duplicated, so a novel contributes its "Chapter N"/large-title headings and a paper its numbered sections, and either genre still produces headings when only one detector fires.

https://claude.ai/code/session_01KMkammT4QJ4cKPq7phS98f
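The merge described in this commit can be sketched as a union followed by a page sort and a de-duplication pass. The `Heading` shape, the `mergeCandidates` name, and the de-duplication key (page plus case-folded title) are assumptions; the actual key used in pdf-structure.ts is not shown in this PR.

```typescript
// Hypothetical candidate shape shared by both detectors.
interface Heading {
  title: string;
  page: number;
  source: "typography" | "chapter";
}

// Assumed identity of a heading: same page and same title, ignoring case and spacing.
function dedupeKey(h: Heading): string {
  return h.page + ":" + h.title.toLowerCase().replace(/\s+/g, " ").trim();
}

// Union both candidate lists, order by page, and keep the first of each duplicate.
// Chapters are listed first so the tuned detector wins when both found the same line
// (this relies on Array.prototype.sort being stable, as in ES2019+ engines).
function mergeCandidates(typography: Heading[], chapters: Heading[]): Heading[] {
  const seen = new Set<string>();
  return chapters
    .concat(typography)
    .sort((a, b) => a.page - b.page)
    .filter((h) => {
      const key = dedupeKey(h);
      if (seen.has(key)) return false;
      seen.add(key);
      return true;
    });
}
```

Note that this sketch orders by page only; ordering headings within a page would need a vertical position, which is omitted here.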

vercel · 2026-06-10T19:35:33Z

The latest updates on your projects. Learn more about Vercel for GitHub.

Project	Deployment	Actions	Updated (UTC)
wereadpdf	Ready	Preview, Comment	Jun 10, 2026 7:45pm

coderabbitai · 2026-06-10T19:35:38Z

Warning

Review limit reached

@punyamsingh, we couldn't start this review because you've reached your PR review rate limit.

More reviews will be available in 51 minutes. Learn how PR review limits work.

Your organization has run out of usage credits. Purchase more in the billing tab.

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 4e014f86-18e8-4b7e-b7ff-5b010ee3e7e4

📥 Commits

Reviewing files that changed from the base of the PR and between 02d9faf and a404e7e.

📒 Files selected for processing (3)

src/components/Reader.tsx
src/lib/pdf-extract.ts
src/lib/pdf-structure.ts

📝 Walkthrough

This PR enriches PDF document structure extraction by implementing a cascading detection system that probes for tagged structure trees, embedded outlines, and typography-based signals, then unifies them into a kind-aware structured table of contents. Text line reconstruction is enhanced with font-size and bold metrics to support heading detection.

Changes

PDF Document Structure Extraction and Display

Layer / File(s)	Summary
Structure core model and heading classification `src/lib/pdf-structure.ts`	Introduces `StructureSource`, `StructureKind`, and `StructureNode` types; adds `classifyHeading` to detect numbered headings, lexicon-based section names (Abstract/References/Appendix), and captions with metadata like `kind`, `number`, and `inLexicon`.
Tagged structure tree extraction `src/lib/pdf-structure.ts`	Defines pdf.js structure-tree types (`StructTreeNode`, `MarkedItem`); implements `extractStructHeadings` to traverse structure tree and extract H1–H6/Title headings; `finalizeStructHeadings` normalizes levels and optionally promotes an early H1 to document title.
Typography-based heading detection `src/lib/pdf-structure.ts`	Implements `detectHeadingsByTypography` to infer body text size and select heading candidates using size, bold, numbering, and lexicon signals; assigns hierarchy levels via numbering depth or size ranking.
Structure cascade and merging `src/lib/pdf-structure.ts`	Implements cascade normalization, deduplication, and merging logic: converts candidates to `StructureNode`s, removes running-head/duplicates, refines typography candidates, converts outline indentation to candidates, and merges typography with legacy chapters in page order.
Structure output utilities and title resolution `src/lib/pdf-structure.ts`	Adds `structureToOutline` to flatten structure back to legacy format; `resolveTitle` selects most trustworthy title from struct-tree, metadata, typography, or filename; `largestLineTitle` extracts typography-based title candidate from page-1 lines.
Font and size/bold enrichment in text extraction `src/lib/pdf-extract.ts`	Introduces `FontStyles` and `isBoldFont` for bold detection; extends `Line` and `PlacedItem` interfaces with `size` and `bold` fields; refactors `buildPageLines`, `collectItems`, line ordering, and `groupIntoLines` to compute and propagate typography metrics through the extraction pipeline.
PDF extraction integration and structure wiring `src/lib/pdf-extract.ts`	Expands imports for structure utilities and types; updates `ExtractedDoc` to include `structure` field; implements `extractTaggedStructure` to probe page-1 for struct-tree availability; builds structure from tagged, outline, typography, and chapter sources; derives legacy `outline` format; returns structure with title resolved via cascade.
Storage and persistence of structure `src/lib/reader-store.ts`, `src/components/App.tsx`	Extends `CachedDoc` with optional `structure: StructureNode[]` field and imports `StructureNode` type; persists extracted structure during PDF import, with fallback to legacy `outline` for older cached documents.
Reader UI display of structured table of contents `src/components/Reader.tsx`	Builds `tocItems` memo from cached document's `structure` (using explicit levels and section kinds) or fallback to `outline` with computed levels; renders TOC panel from `filteredToc` with level-based indentation and conditional uppercase "kind" badges for non-section-level items.
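For the tagged structure tree row above, a traversal that pulls H1–H6/Title nodes out of a structure tree might look like the sketch below. The `StructTreeNode` and `StructContent` interfaces are simplified local stand-ins for what pdf.js returns from `getStructTree()`, and `textById` is an assumed map from marked-content id to the text collected for that id; neither is copied from the PR's code.

```typescript
// Simplified stand-ins for the pdf.js structure-tree shapes (assumed).
interface StructContent {
  type: "content";
  id: string; // marked-content id
}

interface StructTreeNode {
  role: string; // e.g. "Root", "Sect", "H1", "P", "Title"
  children: Array<StructTreeNode | StructContent>;
}

interface StructHeading {
  level: number;
  text: string;
  isTitle: boolean;
}

const HEADING_ROLE = /^H([1-6])$/;

// Concatenate the text of every content leaf under a node.
function collectText(node: StructTreeNode | StructContent, textById: Map<string, string>): string {
  if ("type" in node) return textById.get(node.id) ?? "";
  return node.children
    .map((child) => collectText(child, textById))
    .join(" ")
    .replace(/\s+/g, " ")
    .trim();
}

// Depth-first walk that records H1–H6 and Title nodes in document order.
function extractStructHeadings(root: StructTreeNode, textById: Map<string, string>): StructHeading[] {
  const out: StructHeading[] = [];
  const visit = (node: StructTreeNode | StructContent): void => {
    if ("type" in node) return; // content leaf, nothing to classify
    const match = HEADING_ROLE.exec(node.role);
    if (match || node.role === "Title") {
      const text = collectText(node, textById);
      if (text) out.push({ level: match ? Number(match[1]) : 1, text, isTitle: !match });
      return; // do not descend into a heading
    }
    node.children.forEach(visit);
  };
  visit(root);
  return out;
}
```

Keeping `isTitle` separate from `level` lets a later normalization step decide whether a Title node should be lifted out as the document title rather than shown as a section.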

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

Poem

📚 Structure blooms in the PDF's nested tree,
Bold lines and titles sing their hierarchy,
From tags and fonts, a cascade takes form—
Chapters align, each kind becomes norm,
Readers now wander the structured warm.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 71.43% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The PR title accurately summarizes the main objective: implementing a document structure detection cascade system for table-of-contents generation.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

coderabbitai

🧹 Nitpick comments (1)

src/components/Reader.tsx (1)
1219-1219: ⚡ Quick win

Consider adding a defensive guard for item.level.

The padding calculation assumes item.level >= 1. While the fallback at line 650 guarantees a minimum level of 1, and the structure detection system should also provide valid levels, adding a guard would prevent negative padding if malformed data arrives.
🛡️ Defensive guard suggestion
 <span
   className="truncate flex items-center gap-2"
-  style={{ paddingLeft: `${(item.level - 1) * 12}px` }}
+  style={{ paddingLeft: `${Math.max(0, (item.level - 1) * 12)}px` }}
 >
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/components/Reader.tsx` at line 1219, The paddingLeft calculation directly
uses item.level and can produce negative values if malformed data arrives;
update the inline style in the Reader component where style={{ paddingLeft:
`${(item.level - 1) * 12}px` }} to defensively clamp or default the level (e.g.,
coerce to a number and use Math.max(level, 1) or fallback to 1) before computing
(level - 1) * 12 so padding never becomes negative; locate the usage of
item.level in the Reader.tsx render and replace it with the clamped/defaulted
value.
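A slightly stricter variant of the suggested guard, pulled out into a helper so it can be tested on its own. The `tocPaddingPx` name and the coercion rules are assumptions; the 12px step comes from the diff above.

```typescript
// Compute left padding for a TOC row, tolerating malformed level data.
// Non-numeric, NaN, or infinite levels fall back to level 1 (no indent).
function tocPaddingPx(level: unknown, step: number = 12): number {
  const n = typeof level === "number" && isFinite(level) ? Math.floor(level) : 1;
  return (Math.max(1, n) - 1) * step;
}
```

In the component this would be used as `style={{ paddingLeft: tocPaddingPx(item.level) + "px" }}`. Clamping the level rather than the final pixel value also keeps levels below 1 from indenting differently than level 1.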

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: fe542543-acc3-4619-a671-e1edb0f65c14

📥 Commits

Reviewing files that changed from the base of the PR and between 1ab89cf and 02d9faf.

📒 Files selected for processing (5)

src/components/App.tsx
src/components/Reader.tsx
src/lib/pdf-extract.ts
src/lib/pdf-structure.ts
src/lib/reader-store.ts

Guard the contents-panel padding with Math.max so malformed level data can never yield negative indentation (CodeRabbit nitpick), and add docstrings to the remaining structure-detection helpers.

https://claude.ai/code/session_01KMkammT4QJ4cKPq7phS98f

claude added 2 commits June 10, 2026 18:59

coderabbitai Bot reviewed Jun 10, 2026

View reviewed changes

vercel Bot deployed to Preview June 10, 2026 19:45 View deployment

punyamsingh merged commit 279e642 into main Jun 10, 2026
4 checks passed

punyamsingh merged 3 commits into main from claude/pdf-structure-parsing-3wr10t
