feat: add CJK predefined CMap decoders (Shift-JIS and UCS-2 BE) by Detective-XH · Pull Request #69 · ledongthuc/pdf

Detective-XH · 2026-06-01T08:55:11Z

Problem

getEncoder() has no handler for several widely used CJK predefined CMaps.
PDFs that declare one of the following encodings fall through to the default
branch, which yields U+FFFD replacement characters instead of readable text:

90ms-RKSJ-H, 90ms-RKSJ-V, 90pv-RKSJ-H: Japanese Shift-JIS CMaps
  (Adobe-Japan1, used by virtually every Japanese PDF tool)
UniGB-UCS2-H/V, UniCNS-UCS2-H/V, UniJIS-UCS2-H/V, UniKS-UCS2-H/V:
  the four Adobe Uni*-UCS2 CMap families (Simplified Chinese, Traditional
  Chinese, Japanese, Korean)
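
For illustration only, the grouping above can be written as a name lookup. This is a sketch, not the code in page.go: cmapFamily and the family labels "shift-jis" and "ucs2-be" are hypothetical names, and the real getEncoder() presumably returns an encoder rather than a string.

```go
package main

import "fmt"

// cmapFamily is a hypothetical helper (not part of ledongthuc/pdf).
// It classifies a predefined CMap name into one of the two families
// this PR covers, or returns "" for any other name.
func cmapFamily(name string) string {
	switch name {
	case "90ms-RKSJ-H", "90ms-RKSJ-V", "90pv-RKSJ-H":
		return "shift-jis"
	case "UniGB-UCS2-H", "UniGB-UCS2-V",
		"UniCNS-UCS2-H", "UniCNS-UCS2-V",
		"UniJIS-UCS2-H", "UniJIS-UCS2-V",
		"UniKS-UCS2-H", "UniKS-UCS2-V":
		return "ucs2-be"
	default:
		return ""
	}
}

func main() {
	// "Identity-H" stands in for a name outside the two families.
	for _, n := range []string{"90ms-RKSJ-H", "UniKS-UCS2-V", "Identity-H"} {
		fmt.Printf("%s -> %q\n", n, cmapFamily(n))
	}
}
```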

Fix

Two small, focused encoder types are added to page.go:

multibyteCMapEncoder (Shift-JIS)
Wraps golang.org/x/text/encoding/japanese.ShiftJIS to decode the raw
content-stream bytes. x/text is already an indirect dependency of this
module, so no new transitive dependency is introduced.

ucs2BEEncoder (Uni*-UCS2-H/V)
Reads successive 2-byte big-endian values directly as Unicode BMP code points.
Requires no additional import: in the Uni*-UCS2-* CMaps every character code
is a 2-byte UCS-2 value, i.e. a single BMP code point, so plain uint16
arithmetic is correct and sufficient.
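
A minimal sketch of the byte-pair decoding described above, assuming a trailing odd byte is simply dropped (the PR only says that case is tested, not how it is handled). decodeUCS2BE is a hypothetical stand-in, not the actual ucs2BEEncoder method.

```go
package main

import "fmt"

// decodeUCS2BE is a hypothetical stand-in for the ucs2BEEncoder logic.
// Each big-endian byte pair is taken as one BMP code point.
// Assumption: a trailing odd byte is ignored.
func decodeUCS2BE(raw []byte) string {
	runes := make([]rune, 0, len(raw)/2)
	for i := 0; i+1 < len(raw); i += 2 {
		code := uint16(raw[i])<<8 | uint16(raw[i+1])
		runes = append(runes, rune(code))
	}
	return string(runes)
}

func main() {
	// 0x4E2D and 0x6587 are U+4E2D and U+6587.
	fmt.Println(decodeUCS2BE([]byte{0x4E, 0x2D, 0x65, 0x87}))
	// 0x3042 is U+3042; the extra 0xFF byte has no partner and is dropped.
	fmt.Println(decodeUCS2BE([]byte{0x30, 0x42, 0xFF}))
}
```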

Tests

page_cjk_test.go (new file) covers both encoders:

TestUCS2BEEncoder — 7 sub-tests: Simplified Chinese, Traditional Chinese,
Japanese hiragana, Korean hangul, ASCII round-trip, empty input, trailing
odd byte.
TestMultibyteCMapEncoder_ShiftJIS — 3 sub-tests: katakana, mixed
kanji/hiragana, empty input.

All 10 sub-tests pass (go test -race ./...).

Relation to existing PRs

PR #56 adds UniGB-UCS2-H with a ucs2Encoder using encoding/binary +
unicode/utf16. This PR extends that idea to all eight Uni*-UCS2 variants
and uses a simpler implementation (no standard-library imports beyond those
already present) while also adding the unrelated Shift-JIS family. If #56 is
preferred, this PR can be scoped to the Shift-JIS encoder only.
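
To make the comparison concrete, here is a sketch of the two approaches side by side. Neither function is taken from #56 or from this PR; decodeViaUTF16 and decodeDirect are hypothetical names that only illustrate "encoding/binary + unicode/utf16" versus plain uint16 arithmetic. The two agree on BMP input and differ only when the bytes contain surrogate pairs, which UCS-2 data is not expected to contain.

```go
package main

import (
	"encoding/binary"
	"fmt"
	"unicode/utf16"
)

// decodeViaUTF16 sketches the encoding/binary + unicode/utf16 approach.
// utf16.Decode combines surrogate pairs into supplementary code points.
func decodeViaUTF16(raw []byte) string {
	units := make([]uint16, 0, len(raw)/2)
	for i := 0; i+1 < len(raw); i += 2 {
		units = append(units, binary.BigEndian.Uint16(raw[i:i+2]))
	}
	return string(utf16.Decode(units))
}

// decodeDirect sketches the plain-arithmetic approach: one rune per pair.
// Surrogate values are not valid runes, so Go emits U+FFFD for each.
func decodeDirect(raw []byte) string {
	runes := make([]rune, 0, len(raw)/2)
	for i := 0; i+1 < len(raw); i += 2 {
		runes = append(runes, rune(uint16(raw[i])<<8|uint16(raw[i+1])))
	}
	return string(runes)
}

func main() {
	bmp := []byte{0xD5, 0x5C, 0xAE, 0x00}  // U+D55C U+AE00 (hangul)
	pair := []byte{0xD8, 0x3D, 0xDE, 0x00} // UTF-16 surrogate pair for U+1F600
	fmt.Println(decodeViaUTF16(bmp) == decodeDirect(bmp))   // same result on BMP input
	fmt.Println(decodeViaUTF16(pair) == decodeDirect(pair)) // results differ on surrogates
}
```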

getEncoder() now handles two families of predefined CMaps that previously fell through to the default case and returned U+FFFD replacement characters:

Shift-JIS (90ms-RKSJ-H, 90ms-RKSJ-V, 90pv-RKSJ-H): multibyteCMapEncoder wraps golang.org/x/text/encoding/japanese.ShiftJIS. Raw PDF content-stream bytes are Shift-JIS encoded; x/text does the charset conversion to UTF-8.

UCS-2 big-endian (UniGB-UCS2-H/V, UniCNS-UCS2-H/V, UniJIS-UCS2-H/V, UniKS-UCS2-H/V): ucs2BEEncoder reads successive 2-byte big-endian code points directly as Unicode runes. No new dependency -- BMP-only uint16 arithmetic is correct for all Uni*-UCS2-* CMaps.

golang.org/x/text was already an indirect dep; japanese.ShiftJIS adds no new transitive dependencies.

Tests in page_cjk_test.go cover both encoders with Simplified Chinese, Traditional Chinese, Japanese, and Korean inputs plus edge cases.

Detective-XH · 2026-06-01T09:48:27Z

Closing in favour of a more complete PR that covers all five CJK predefined CMap families (Shift-JIS, UCS-2 BE, GBK, Big5-ETen, UHC/KSCms) in one set.

Detective-XH closed this Jun 1, 2026

Detective-XH mentioned this pull request Jun 1, 2026

feat: add predefined CMap decoders for Shift-JIS, UCS-2 BE, GBK, Big5-ETen, and UHC #70

Closed


Detective-XH deleted the feature/cjk-predefined-cmap-decoders branch June 1, 2026 22:01

Detective-XH wants to merge 1 commit into ledongthuc:master from Detective-XH:feature/cjk-predefined-cmap-decoders
