19 changes: 9 additions & 10 deletions README.md
@@ -95,8 +95,8 @@ transform_markdown(
     ignore_pdf_errors=False,  # Optional: continue on PDF rendering errors
     ignore_ocr_errors=False,  # Optional: continue on OCR recognition errors
     generate_plot=False,  # Optional: generate visualization charts
-    toc_mode=TocExtractionMode.NO_TOC_PAGE,  # Optional: TOC extraction mode
     toc_llm=None,  # Optional: LLM instance for enhanced TOC extraction
+    toc_assumed=False,  # Optional: whether to assume TOC pages exist (default: False)
 )
 ```

@@ -118,8 +118,8 @@ transform_epub(
     ignore_pdf_errors=False,  # Optional: continue on PDF rendering errors
     ignore_ocr_errors=False,  # Optional: continue on OCR recognition errors
     generate_plot=False,  # Optional: generate visualization charts
-    toc_mode=TocExtractionMode.AUTO_DETECT,  # Optional: TOC extraction mode
     toc_llm=None,  # Optional: LLM instance for enhanced TOC extraction
+    toc_assumed=True,  # Optional: whether to assume TOC pages exist (default: True for EPUB)
     book_meta=BookMeta(
         title="Book Title",
         authors=["Author 1", "Author 2"],
@@ -208,20 +208,19 @@ The `inline_latex` parameter (EPUB only, default: `True`) controls whether to pr

 ### Table of Contents Detection

-The `toc_mode` parameter controls how pdf-craft extracts table of contents information:
+The `toc_assumed` parameter controls how pdf-craft handles table of contents extraction:

-- `TocExtractionMode.NO_TOC_PAGE` (default for Markdown): Generates TOC based on document headings only, without detecting TOC pages
-- `TocExtractionMode.AUTO_DETECT` (default for EPUB): Detects TOC pages using statistical analysis and extracts chapter structure
-- `TocExtractionMode.LLM_ENHANCED`: Detects TOC pages and uses LLM to extract hierarchical chapter structure with improved accuracy. **Requires `toc_llm` parameter to be configured.**
+- `False` (default for Markdown): Assumes no TOC pages exist. The conversion generates TOC based on document headings only, without detecting or processing TOC pages.
+- `True` (default for EPUB): Assumes TOC pages exist. The conversion uses statistical analysis to detect TOC pages and extract chapter structure.

-For books with complex chapter hierarchies, `LLM_ENHANCED` mode provides the most accurate results.
+For books with complex chapter hierarchies, you can configure the optional `toc_llm` parameter to enable LLM-powered chapter title analysis, which provides more accurate TOC hierarchy detection.

 #### LLM-Enhanced TOC Extraction

 To use LLM-enhanced TOC extraction, you need to configure an LLM instance:

 ```python
-from pdf_craft import transform_epub, BookMeta, LLM, TocExtractionMode
+from pdf_craft import transform_epub, BookMeta, LLM

 # Configure LLM for TOC extraction
 toc_llm = LLM(
@@ -237,8 +236,8 @@ toc_llm = LLM(
 transform_epub(
     pdf_path="input.pdf",
     epub_path="output.epub",
-    toc_mode=TocExtractionMode.LLM_ENHANCED,
-    toc_llm=toc_llm,
+    toc_assumed=True,  # Enable TOC detection
+    toc_llm=toc_llm,  # Enable LLM-powered chapter title analysis
     book_meta=BookMeta(
         title="Book Title",
         authors=["Author"],
19 changes: 9 additions & 10 deletions README_zh-CN.md
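@@ -95,8 +95,8 @@ transform_markdown(
     ignore_pdf_errors=False,  # Optional: continue processing when a PDF rendering error occurs
     ignore_ocr_errors=False,  # Optional: continue processing when an OCR recognition error occurs
     generate_plot=False,  # Optional: generate visualization charts
-    toc_mode=TocExtractionMode.NO_TOC_PAGE,  # Optional: TOC extraction mode
     toc_llm=None,  # Optional: LLM instance for enhanced TOC extraction
+    toc_assumed=False,  # Optional: whether to assume TOC pages exist (default: False)
 )
 ```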

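@@ -118,8 +118,8 @@ transform_epub(
     ignore_pdf_errors=False,  # Optional: continue processing when a PDF rendering error occurs
     ignore_ocr_errors=False,  # Optional: continue processing when an OCR recognition error occurs
     generate_plot=False,  # Optional: generate visualization charts
-    toc_mode=TocExtractionMode.AUTO_DETECT,  # Optional: TOC extraction mode
     toc_llm=None,  # Optional: LLM instance for enhanced TOC extraction
+    toc_assumed=True,  # Optional: whether to assume TOC pages exist (EPUB default: True)
     book_meta=BookMeta(
         title="Book Title",
         authors=["Author 1", "Author 2"],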
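@@ -208,20 +208,19 @@ transform_markdown(

 ### Table of Contents Detection

-The `toc_mode` parameter controls how pdf-craft extracts table of contents information:
+The `toc_assumed` parameter controls how pdf-craft handles table of contents extraction:

-- `TocExtractionMode.NO_TOC_PAGE` (default for Markdown): Generates the TOC from document headings only, without detecting TOC pages
-- `TocExtractionMode.AUTO_DETECT` (default for EPUB): Detects TOC pages using statistical analysis and extracts the chapter structure
-- `TocExtractionMode.LLM_ENHANCED`: Detects TOC pages and uses an LLM to extract a hierarchical chapter structure with higher accuracy. **Requires the `toc_llm` parameter to be configured.**
+- `False` (default for Markdown): Assumes no TOC pages exist. The conversion generates the TOC from document headings only, without detecting or processing TOC pages.
+- `True` (default for EPUB): Assumes TOC pages exist. The conversion uses statistical analysis to detect TOC pages and extract the chapter structure.

-For books with complex chapter hierarchies, `LLM_ENHANCED` mode provides the most accurate results.
+For books with complex chapter hierarchies, you can configure the optional `toc_llm` parameter to enable LLM-powered chapter title analysis, which provides more accurate TOC hierarchy detection.

 #### LLM-Enhanced TOC Extraction

 To use LLM-enhanced TOC extraction, you need to configure an LLM instance:

 ```python
-from pdf_craft import transform_epub, BookMeta, LLM, TocExtractionMode
+from pdf_craft import transform_epub, BookMeta, LLM

 # Configure the LLM used for TOC extraction
 toc_llm = LLM(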
@@ -237,8 +236,8 @@ toc_llm = LLM(
 transform_epub(
     pdf_path="input.pdf",
     epub_path="output.epub",
-    toc_mode=TocExtractionMode.LLM_ENHANCED,
-    toc_llm=toc_llm,
+    toc_assumed=True,  # Enable TOC detection
+    toc_llm=toc_llm,  # Enable LLM-powered chapter title analysis
     book_meta=BookMeta(
         title="Book Title",
         authors=["Author"],
110 changes: 110 additions & 0 deletions docs/changelog/v1.0.10.md
@@ -0,0 +1,110 @@
This release simplifies the table of contents (TOC) extraction API by replacing enum-based modes with a boolean flag, while adding LLM-powered chapter title analysis capabilities for improved TOC hierarchy detection.

## What's Changed

### Breaking Changes

* **Simplified TOC API**: Replaced `TocExtractionMode` enum with a simpler `toc_assumed` boolean parameter in https://github.com/oomol-lab/pdf-craft/pull/341
  - Removed `toc_mode` parameter from `transform_markdown()` and `transform_epub()` functions
  - Removed `TocExtractionMode` from public API exports
  - Introduced `toc_assumed` boolean flag to control TOC detection behavior

### Features

* **LLM-Powered Chapter Title Analysis**: Added support for LLM-based analysis of chapter titles to enhance TOC extraction accuracy in https://github.com/oomol-lab/pdf-craft/pull/341
  - Automatically analyzes chapter title hierarchies when `toc_llm` is configured
  - Provides more accurate chapter level detection for complex book structures
  - Intelligently falls back to standard analysis when LLM is unavailable or encounters errors
Comment on lines +14 to +17

⚠️ Potential issue | 🟡 Minor

Minor grammar fix: use hyphenated compound adjective.

On line 16, "chapter level detection" should be "chapter-level detection" when used as a compound adjective modifying "detection".

📝 Suggested fix
-  - Provides more accurate chapter level detection for complex book structures
+  - Provides more accurate chapter-level detection for complex book structures
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
 * **LLM-Powered Chapter Title Analysis**: Added support for LLM-based analysis of chapter titles to enhance TOC extraction accuracy in https://github.com/oomol-lab/pdf-craft/pull/341
   - Automatically analyzes chapter title hierarchies when `toc_llm` is configured
-  - Provides more accurate chapter level detection for complex book structures
+  - Provides more accurate chapter-level detection for complex book structures
   - Intelligently falls back to standard analysis when LLM is unavailable or encounters errors
🧰 Tools
🪛 LanguageTool

[grammar] ~16-~16: Use a hyphen to join words.
Context: ...gured - Provides more accurate chapter level detection for complex book structu...

(QB_NEW_EN_HYPHEN)

🤖 Prompt for AI Agents
In `@docs/changelog/v1.0.10.md` around lines 14 - 17, Update the sentence
"Provides more accurate chapter level detection for complex book structures" to
use a hyphenated compound adjective: change it to "Provides more accurate
chapter-level detection for complex book structures" (locate the exact string in
the changelog entry under the LLM-Powered Chapter Title Analysis bullet).


### Improvements

* **Enhanced Error Handling**: Added robust error handling for LLM-based analysis with automatic recovery mechanisms in https://github.com/oomol-lab/pdf-craft/pull/341
  - Better error diagnostics for LLM analysis failures
  - Graceful degradation when LLM analysis fails, ensuring conversion continues successfully
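
The graceful-degradation behavior described above can be pictured with a small, self-contained sketch. This is not pdf-craft's implementation: `analyse_title_levels`, `standard_analyser`, and `llm_analyser` are hypothetical names introduced only for illustration, and the sketch assumes an analyser is any callable that returns one heading level per title.

```python
import logging
from typing import Callable, List, Optional, Sequence

logger = logging.getLogger(__name__)

# Hypothetical type: takes chapter titles, returns one heading level per title.
LevelAnalyser = Callable[[Sequence[str]], List[int]]


def analyse_title_levels(
    titles: Sequence[str],
    standard_analyser: LevelAnalyser,
    llm_analyser: Optional[LevelAnalyser] = None,
) -> List[int]:
    """Return one heading level per title, preferring the LLM when it works."""
    if llm_analyser is not None:
        try:
            levels = llm_analyser(titles)
            if len(levels) == len(titles):
                return list(levels)
            # A reply of the wrong length is treated like a failure.
            logger.warning(
                "LLM returned %d levels for %d titles; falling back",
                len(levels),
                len(titles),
            )
        except Exception:
            # Any LLM failure is logged with its traceback, not raised,
            # so the conversion can continue with the standard analysis.
            logger.exception("LLM title analysis failed; falling back")
    return list(standard_analyser(titles))
```

Catching a broad `Exception` is a deliberate choice in this sketch: the LLM step is optional, so a failure there should degrade the result rather than abort the whole conversion.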

## Migration Guide

If you were using `toc_mode` in previous versions, update your code as follows:

### Previous API (v1.0.9 and earlier)

```python
from pdf_craft import transform_epub, transform_markdown, TocExtractionMode

# For Markdown conversion
transform_markdown(
    pdf_path="input.pdf",
    markdown_path="output.md",
    toc_mode=TocExtractionMode.NO_TOC_PAGE,  # Old parameter
)

# For EPUB conversion
transform_epub(
    pdf_path="input.pdf",
    epub_path="output.epub",
    toc_mode=TocExtractionMode.AUTO_DETECT,  # Old parameter
)
```

### New API (v1.0.10)

```python
from pdf_craft import transform_epub, transform_markdown

# For Markdown conversion (assumes no TOC pages by default)
transform_markdown(
    pdf_path="input.pdf",
    markdown_path="output.md",
    toc_assumed=False,  # New boolean parameter (default: False)
)

# For EPUB conversion (assumes TOC pages exist by default)
transform_epub(
    pdf_path="input.pdf",
    epub_path="output.epub",
    toc_assumed=True,  # New boolean parameter (default: True)
)
```

### Migration Mapping

| Old `toc_mode` Value | New `toc_assumed` Value |
|---------------------|------------------------|
| `TocExtractionMode.NO_TOC_PAGE` | `False` |
| `TocExtractionMode.AUTO_DETECT` | `True` |
| `TocExtractionMode.LLM_ENHANCED` | `True` (with `toc_llm` configured) |
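
For codebases with many call sites, the mapping above could be captured in a small helper. The sketch below is only an illustration: `migrate_toc_kwargs` is a hypothetical function, not part of pdf-craft, and it takes the old enum member's name as a string so that it does not depend on the removed `TocExtractionMode` import.

```python
from typing import Any, Dict, Optional

# Old enum member names mapped to the new boolean, per the migration table.
_OLD_MODE_TO_TOC_ASSUMED = {
    "NO_TOC_PAGE": False,
    "AUTO_DETECT": True,
    "LLM_ENHANCED": True,
}


def migrate_toc_kwargs(
    old_mode_name: str, toc_llm: Optional[Any] = None
) -> Dict[str, Any]:
    """Translate an old `toc_mode` member name into v1.0.10 keyword arguments.

    Hypothetical helper for illustration; not part of pdf-craft.
    """
    if old_mode_name not in _OLD_MODE_TO_TOC_ASSUMED:
        raise ValueError(f"unknown toc_mode: {old_mode_name!r}")
    if old_mode_name == "LLM_ENHANCED" and toc_llm is None:
        # The old mode required an LLM, so keep that requirement explicit.
        raise ValueError("LLM_ENHANCED requires a configured toc_llm")
    kwargs: Dict[str, Any] = {
        "toc_assumed": _OLD_MODE_TO_TOC_ASSUMED[old_mode_name]
    }
    if toc_llm is not None:
        kwargs["toc_llm"] = toc_llm
    return kwargs
```

A call site could then pass the result through, for example `transform_epub(pdf_path=..., epub_path=..., **migrate_toc_kwargs("AUTO_DETECT"))`.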

## LLM-Enhanced TOC Extraction

To use LLM-powered chapter title analysis:

```python
from pdf_craft import transform_epub, BookMeta, LLM

# Configure LLM for TOC enhancement
toc_llm = LLM(
    key="your-api-key",
    url="https://api.openai.com/v1",
    model="gpt-4",
    token_encoding="cl100k_base",
)

transform_epub(
    pdf_path="input.pdf",
    epub_path="output.epub",
    toc_assumed=True,  # Enable TOC detection
    toc_llm=toc_llm,  # Enable LLM-powered analysis
    book_meta=BookMeta(
        title="Book Title",
        authors=["Author"],
    ),
)
```

## Notes

- The `toc_assumed` parameter defaults to `False` for Markdown conversion and `True` for EPUB conversion. These match the previous defaults (`NO_TOC_PAGE` and `AUTO_DETECT`), so calls that never passed `toc_mode` behave as before
- LLM-powered chapter title analysis is optional and automatically falls back to standard analysis if not configured or if errors occur
- The new API is simpler and more intuitive, reducing the cognitive load of choosing between multiple enum values
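
The per-format defaults in the first note can be summarized as a lookup. This is a sketch under assumptions rather than a description of pdf-craft's internals (the library may simply declare the defaults in each function signature): `resolve_toc_assumed` and the format names `"markdown"` and `"epub"` are hypothetical and used only for illustration.

```python
from typing import Optional

# Defaults stated in the notes: Markdown assumes no TOC pages, EPUB assumes them.
_DEFAULT_TOC_ASSUMED = {"markdown": False, "epub": True}


def resolve_toc_assumed(
    output_format: str, toc_assumed: Optional[bool] = None
) -> bool:
    """Return the effective `toc_assumed` value for a given output format.

    Hypothetical helper for illustration; not part of pdf-craft.
    """
    if output_format not in _DEFAULT_TOC_ASSUMED:
        raise ValueError(f"unknown output format: {output_format!r}")
    if toc_assumed is not None:
        # An explicit value always overrides the per-format default.
        return toc_assumed
    return _DEFAULT_TOC_ASSUMED[output_format]
```

For example, `resolve_toc_assumed("epub")` yields `True`, while `resolve_toc_assumed("epub", toc_assumed=False)` keeps the caller's explicit `False`.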

**Full Changelog**: https://github.com/oomol-lab/pdf-craft/compare/v1.0.9...v1.0.10
2 changes: 1 addition & 1 deletion pyproject.toml
@@ -4,7 +4,7 @@ build-backend = "poetry.core.masonry.api"

 [tool.poetry]
 name = "pdf-craft"
-version = "1.0.9"
+version = "1.0.10"
 description = "PDF craft can convert PDF files into various other formats. This project will focus on processing PDF files of scanned books."
 license = "MIT"
 authors = ["Tao Zeyu <i@taozeyu.com>"]