fix(pdf): support CJK characters in PDF snapshots (#103) #160

Open
m719 wants to merge 1 commit into main from 103-fix-cjk-characters-render-as-mojibake-in-pdf-snapshot

m719 commented Jun 18, 2026 (Contributor)

Changes:

  • Add cjk-font.ts: regex-based detection, fetch + base64 conversion with chunk-based processing, promise lock with GC after use
  • Update pdf-generator.ts: integrate CJK font, add setFont helper that forces 'normal' style for CJK, add timeouts to image loaders, strip <script> tags, clean up dead code
  • Update extraction.ts: CJK support in fallback generators, restore courier for code blocks when not in CJK mode
  • Copy fonts to build output via vite config
  • Add unit tests for needsCJKFont and sanitizeFilename
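
For reviewers who want a feel for the regex-based detection, here is a minimal sketch. It is not the code in cjk-font.ts: the function name `needsCJKFontSketch` and the exact Unicode ranges are assumptions chosen for illustration, and the real `needsCJKFont` may cover more or fewer blocks.

```typescript
// Hypothetical sketch, not the PR's implementation.
// Ranges covered (assumed): Hiragana + Katakana (U+3040-30FF),
// CJK Extension A (U+3400-4DBF), CJK Unified Ideographs (U+4E00-9FFF),
// Hangul Syllables (U+AC00-D7AF), CJK Compatibility Ideographs (U+F900-FAFF).
const CJK_PATTERN =
  /[\u3040-\u30ff\u3400-\u4dbf\u4e00-\u9fff\uac00-\ud7af\uf900-\ufaff]/;

// Returns true when at least one character falls inside the ranges above.
// The regex has no `g` flag, so repeated calls carry no lastIndex state.
function needsCJKFontSketch(text: string): boolean {
  return CJK_PATTERN.test(text);
}
```

A caller would run this once over the extracted text and only then decide whether to fetch the font, which is what keeps Latin-only pages free of the extra load.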

Works on both Chrome and Firefox (MV3).

Proposed changes

  • Bundle Noto Sans CJK font (Git LFS) and register it with jsPDF on demand
  • Lazy-load only when CJK characters are detected in extracted content
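
The "promise lock" mentioned above can be sketched as follows, assuming it means that concurrent PDF generations share a single in-flight font load and that the cached reference is dropped afterwards so the large base64 string can be garbage-collected. `withPromiseLock` is a hypothetical helper name, not something taken from the diff.

```typescript
// Hypothetical helper: wraps an async loader so that overlapping calls
// reuse the same pending promise instead of starting a second load.
function withPromiseLock<T>(load: () => Promise<T>): () => Promise<T> {
  let inFlight: Promise<T> | null = null;

  return () => {
    if (inFlight !== null) {
      return inFlight; // a load is already running, share it
    }
    const pending = load();
    inFlight = pending;

    // Once the load settles (success or failure), drop the reference so the
    // result is not kept alive by this closure.
    const release = (): void => {
      if (inFlight === pending) {
        inFlight = null;
      }
    };
    pending.then(release, release);

    return pending;
  };
}
```

One consequence of releasing the lock after each load is that a later snapshot will load the font again; that trades repeated work for lower steady-state memory.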

Related issues

How to test this PR

  1. Install the extension from this branch
  2. Navigate to a page with CJK content and generate a PDF snapshot
  3. Verify characters render correctly (not mojibake)

Test pages:

  1. Also test a Latin-only page to confirm no regression (font should NOT be loaded)

Checklist

  • The PR title follows the Conventional Commits convention type(scope?): description (#issue)
  • I signed my commits
  • This PR is linked to an issue
  • I consider the submitted work as finished
  • I tested the code for its functionality
  • I added/updated the relevant documentation
  • Where necessary, I refactored code to improve the overall quality

Further comments

The font file is ~19 MB (TTF) tracked via Git LFS. It adds ~5 MB compressed to the extension package. It is only fetched at runtime when CJK characters are detected, so Latin-only PDFs have zero overhead.
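
A note on the base64 step for a file of this size: base64 expands data by a factor of 4/3, so a ~19 MB TTF becomes roughly a 25 MB string in memory. Converting it in one `String.fromCharCode` call can exceed engine argument limits, which is presumably why the conversion is chunk-based. The sketch below shows one common way to do that; `bytesToBase64` and the 32 KiB chunk size are assumptions, not the code in cjk-font.ts, and it relies on a global `btoa` (browsers and recent Node versions).

```typescript
// Hypothetical sketch of chunk-based base64 conversion.
// Builds a binary string in fixed-size slices so that no single
// String.fromCharCode call receives millions of arguments.
function bytesToBase64(bytes: Uint8Array, chunkSize: number = 0x8000): string {
  let binary = "";
  for (let offset = 0; offset < bytes.length; offset += chunkSize) {
    // subarray creates a view, so no bytes are copied here
    const chunk = bytes.subarray(offset, offset + chunkSize);
    binary += String.fromCharCode.apply(null, Array.from(chunk));
  }
  // btoa encodes the accumulated binary string in one pass
  return btoa(binary);
}
```

In the extension this would typically be fed with `new Uint8Array(await response.arrayBuffer())` from the font fetch.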

Alternatives considered:

  • Host the font on our internal CDN instead of bundling it, which would reduce extension package size but add a network dependency at runtime.
  • Use html2canvas for Firefox instead of bundling the font. On Chrome we could use native Page.printToPDF (which respects system fonts), and fall back to html2canvas on Firefox. This would avoid storing the font entirely, but the Firefox output would be image-based with no selectable text.

CJK characters (Chinese, Japanese, Korean) rendered as mojibake in generated PDFs because jsPDF's built-in fonts only cover Latin-1.

Bundle Noto Sans CJK (TTF, tracked via Git LFS) and lazy-load it only when CJK characters are detected in the content.

Copilot AI review requested due to automatic review settings June 18, 2026 09:58
@m719 m719 linked an issue Jun 18, 2026 that may be closed by this pull request
@m719 m719 self-assigned this Jun 18, 2026

Copilot AI left a comment

Pull request overview

This PR addresses issue #103 by adding on-demand CJK font support to the jsPDF-based PDF snapshot generation so Japanese/Chinese/Korean text renders correctly across browsers (Chrome/Firefox MV3), while avoiding overhead for Latin-only pages.

Changes:

  • Add a new CJK font utility (cjk-font.ts) that detects CJK characters and registers a bundled Noto Sans CJK font with jsPDF only when needed.
  • Integrate CJK font selection into both the shared PDF generator and content-script fallback generators; export and reuse sanitizeFilename.
  • Ensure the font file is included in build output via Vite, and add unit tests for needsCJKFont and sanitizeFilename.
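
To illustrate the font registration and the style-forcing helper described in this PR, here is a hedged sketch. `addFileToVFS`, `addFont` and `setFont` are existing jsPDF methods; everything else (the `FontTarget` interface, the registered name `NotoSansCJK`, and the functions `registerCJKFont` and `applyFont`) is hypothetical and may not match pdf-generator.ts. The assumption behind forcing 'normal' is that only the Regular weight is bundled, so no bold or italic variant exists to select.

```typescript
// Minimal slice of the jsPDF surface used below, declared locally so the
// sketch does not depend on the jspdf package.
interface FontTarget {
  addFileToVFS(fileName: string, base64Data: string): unknown;
  addFont(fileName: string, fontName: string, fontStyle: string): unknown;
  setFont(fontName: string, fontStyle?: string): unknown;
}

type FontStyle = "normal" | "bold" | "italic" | "bolditalic";

const CJK_FONT_FILE = "NotoSansCJK-Regular.ttf"; // file name from this PR
const CJK_FONT_NAME = "NotoSansCJK"; // hypothetical registered name

// Registers the base64-encoded TTF with the document under the 'normal' style.
function registerCJKFont(doc: FontTarget, base64Font: string): void {
  doc.addFileToVFS(CJK_FONT_FILE, base64Font);
  doc.addFont(CJK_FONT_FILE, CJK_FONT_NAME, "normal");
}

// Selects the font for the next text run. In CJK mode the requested style is
// ignored and 'normal' is used, because only that style was registered.
function applyFont(
  doc: FontTarget,
  useCJK: boolean,
  latinFont: string,
  style: FontStyle
): void {
  if (useCJK) {
    doc.setFont(CJK_FONT_NAME, "normal");
  } else {
    doc.setFont(latinFont, style);
  }
}
```

With this shape, a code block would call `applyFont(doc, cjkMode, "courier", "normal")`, which matches the described behavior of keeping courier for code when not in CJK mode.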

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| vite.config.ts | Copies bundled fonts into the dist/<browser>/assets/fonts output so they can be fetched at runtime. |
| tests/unit/pdf-generator.test.ts | Adds unit tests covering sanitizeFilename behavior (including CJK titles). |
| tests/unit/cjk-font.test.ts | Adds unit tests validating needsCJKFont across multiple Unicode ranges and non-CJK text. |
| src/shared/extraction/pdf-generator.ts | Loads/registers the CJK font on demand, uses a font helper for style selection, adds timeouts for image helpers, strips dangerous elements, exports sanitizeFilename, and removes unused native-PDF code paths. |
| src/shared/extraction/cjk-font.ts | Implements CJK detection + on-demand font fetch/base64 conversion + jsPDF registration with a promise lock. |
| src/content/extraction.ts | Updates fallback PDF generators to use the same CJK font logic and shared sanitizeFilename. |
| src/assets/fonts/NotoSansCJK-Regular.ttf | Adds the Noto Sans CJK font tracked via Git LFS (pointer file in Git). |
| .gitattributes | Configures *.ttf files to be tracked via Git LFS. |
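
The summary above mentions timeouts for the image helpers. One way such a timeout could work is sketched below, assuming the intent is that a slow or hanging image load resolves to a fallback value instead of blocking PDF generation. `withTimeout` is a hypothetical name and the actual helpers in pdf-generator.ts may be structured differently.

```typescript
// Hypothetical sketch: resolve with `fallback` if `task` has not settled
// within `ms` milliseconds, or if it rejects. Never rejects itself.
function withTimeout<T>(task: Promise<T>, ms: number, fallback: T): Promise<T> {
  return new Promise<T>((resolve) => {
    const timer = setTimeout(() => resolve(fallback), ms);

    const finish = (value: T): void => {
      clearTimeout(timer); // stop the timer once the task has settled
      resolve(value); // a second resolve after a timeout is a no-op
    };

    task.then(finish, () => finish(fallback));
  });
}
```

For an image loader, the fallback would typically be `null`, letting the generator skip the image and continue with the rest of the page.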

Comment thread .gitattributes

@@ -0,0 +1 @@
+*.ttf filter=lfs diff=lfs merge=lfs -text

Comment on lines +9 to +10

+import { jsPDF } from 'jspdf';
+import { loggers } from '../utils/logger';

Comment on lines 49 to 65

 /**
- * Generate PDF from extracted content
- * Tries native print first, falls back to jsPDF
+ * Generate PDF from extracted content using jsPDF
  */
 export async function generatePDF(
   content: ExtractedContent,
   options: PDFOptions = {}
 ): Promise<PDFGenerationResult | null> {
   const opts = { ...DEFAULT_OPTIONS, ...options };

   // Try jsPDF generation (most reliable for extensions)
   const jspdfResult = await generateWithJsPDF(content, opts);
   if (jspdfResult) {
     return jspdfResult;
   }

-  log.error('[PDFGenerator] All PDF generation methods failed');
+  log.error('[PDFGenerator] PDF generation failed');
   return null;
 }

codecov Bot commented Jun 18, 2026

Codecov Report

❌ Patch coverage is 6.72269% with 111 lines in your changes missing coverage. Please review.
✅ Project coverage is 33.39%. Comparing base (f8d57da) to head (6cf8a43).

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #160      +/-   ##
==========================================
- Coverage   33.40%   33.39%   -0.02%     
==========================================
  Files          92       93       +1     
  Lines       16556    16602      +46     
  Branches     5238     5440     +202     
==========================================
+ Hits         5531     5544      +13     
- Misses      11025    11058      +33     

| Flag | Coverage Δ |
| --- | --- |
| e2e | 33.39% <6.72%> (-0.02%) ⬇️ |
| integration-openaev | 33.39% <6.72%> (-0.02%) ⬇️ |
| integration-opencti | 33.39% <6.72%> (-0.02%) ⬇️ |
| unittests | 33.39% <6.72%> (-0.02%) ⬇️ |

Development

Successfully merging this pull request may close these issues.

fix: CJK characters render as mojibake in PDF snapshot

2 participants