feat: add CJK predefined CMap decoders (Shift-JIS and UCS-2 BE) #69

Closed
Detective-XH wants to merge 1 commit into ledongthuc:master from Detective-XH:feature/cjk-predefined-cmap-decoders

@Detective-XH

Problem

getEncoder() has no handler for several widely-used CJK predefined CMaps.
Text in PDFs that declare one of the following encodings falls through to the
default branch and is returned as U+FFFD replacement characters instead of
readable text:

  • 90ms-RKSJ-H, 90ms-RKSJ-V, 90pv-RKSJ-H — Japanese Shift-JIS CMaps
    (Adobe-Japan1, used by virtually every Japanese PDF tool)
  • UniGB-UCS2-H/V, UniCNS-UCS2-H/V, UniJIS-UCS2-H/V, UniKS-UCS2-H/V
    — the four Adobe Uni*-UCS2 CMap families (Simplified Chinese, Traditional
    Chinese, Japanese, Korean)
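
A minimal sketch of how the CMap names listed above could be grouped by the decoder they need. The `cmapFamily` type and `classifyCMap` function are hypothetical names used only for illustration; the real getEncoder() in page.go returns encoder values, and its signature is not shown in this PR.

```go
package main

import "fmt"

// cmapFamily is a hypothetical label for the decoder a predefined CMap needs.
type cmapFamily int

const (
	familyUnknown  cmapFamily = iota // not one of the two families in this sketch
	familyShiftJIS                   // 90ms-RKSJ-H/V and 90pv-RKSJ-H
	familyUCS2BE                     // the eight Uni*-UCS2-H/V names
)

// classifyCMap maps a predefined CMap name to its decoder family.
func classifyCMap(name string) cmapFamily {
	switch name {
	case "90ms-RKSJ-H", "90ms-RKSJ-V", "90pv-RKSJ-H":
		return familyShiftJIS
	case "UniGB-UCS2-H", "UniGB-UCS2-V",
		"UniCNS-UCS2-H", "UniCNS-UCS2-V",
		"UniJIS-UCS2-H", "UniJIS-UCS2-V",
		"UniKS-UCS2-H", "UniKS-UCS2-V":
		return familyUCS2BE
	}
	return familyUnknown
}

func main() {
	for _, name := range []string{"90ms-RKSJ-H", "UniKS-UCS2-V", "NotAPredefinedCMap"} {
		fmt.Println(name, classifyCMap(name))
	}
}
```

Matching on exact names keeps the dispatch explicit: a name outside both lists still reaches the fallback path unchanged.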

Fix

Two small, focused encoder types are added to page.go:

multibyteCMapEncoder (Shift-JIS)
Wraps golang.org/x/text/encoding/japanese.ShiftJIS to decode the raw
content-stream bytes. x/text is already an indirect dependency of this
module, so no new transitive dependency is introduced.
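
To illustrate why Shift-JIS needs a multibyte decoder rather than a byte-to-rune table, here is a standard-library-only sketch that splits a byte stream into 1-byte and 2-byte codes. It does not convert anything to Unicode (the PR delegates that to x/text), and `isShiftJISLead` and `splitShiftJIS` are hypothetical helper names, not part of this PR. The lead-byte ranges used are 0x81-0x9F and 0xE0-0xFC.

```go
package main

import "fmt"

// isShiftJISLead reports whether b starts a 2-byte Shift-JIS code.
func isShiftJISLead(b byte) bool {
	return (b >= 0x81 && b <= 0x9F) || (b >= 0xE0 && b <= 0xFC)
}

// splitShiftJIS cuts raw into character codes of one or two bytes each.
// A lead byte with nothing after it is kept as a 1-byte code; that choice
// is an assumption of this sketch, not behaviour taken from the PR.
func splitShiftJIS(raw []byte) [][]byte {
	var codes [][]byte
	for i := 0; i < len(raw); {
		n := 1
		if isShiftJISLead(raw[i]) && i+1 < len(raw) {
			n = 2
		}
		codes = append(codes, raw[i:i+n])
		i += n
	}
	return codes
}

func main() {
	// ASCII 'A', katakana TE (0x83 0x65), half-width katakana A (0xB1).
	for _, code := range splitShiftJIS([]byte{0x41, 0x83, 0x65, 0xB1}) {
		fmt.Printf("% X\n", code)
	}
}
```

Note that the trail byte 0x65 is also the ASCII code for 'e', so decoding the stream one byte at a time would misread it; the decoder has to track lead bytes.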

ucs2BEEncoder (Uni*-UCS2-H/V)
Reads successive 2-byte big-endian values directly as Unicode BMP code points.
Requires no additional import: in the Uni*-UCS2-* CMaps every character code
is a single 2-byte BMP code point, so plain uint16 arithmetic is correct and
sufficient.
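
The 2-byte big-endian read can be sketched as follows. `decodeUCS2BE` is a hypothetical standalone function, not the PR's ucs2BEEncoder type, and dropping a trailing odd byte is an assumption: the PR tests that case but does not state the expected result.

```go
package main

import "fmt"

// decodeUCS2BE interprets raw as a sequence of 2-byte big-endian BMP code
// points and returns them as a UTF-8 string. A trailing odd byte is ignored
// (an assumption of this sketch).
func decodeUCS2BE(raw []byte) string {
	runes := make([]rune, 0, len(raw)/2)
	for i := 0; i+1 < len(raw); i += 2 {
		// High byte first, then low byte: big-endian order.
		runes = append(runes, rune(uint16(raw[i])<<8|uint16(raw[i+1])))
	}
	return string(runes)
}

func main() {
	// U+4E2D U+6587 followed by ASCII 'A' stored as 0x00 0x41.
	fmt.Println(decodeUCS2BE([]byte{0x4E, 0x2D, 0x65, 0x87, 0x00, 0x41}))
}
```

Because no surrogate pairing is attempted, a value in the surrogate range becomes U+FFFD when the rune slice is converted to a string, which is consistent with the BMP-only premise above.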

Tests

page_cjk_test.go (new file) covers both encoders:

  • TestUCS2BEEncoder — 7 sub-tests: Simplified Chinese, Traditional Chinese,
    Japanese hiragana, Korean hangul, ASCII round-trip, empty input, trailing
    odd byte.
  • TestMultibyteCMapEncoder_ShiftJIS — 3 sub-tests: katakana, mixed
    kanji/hiragana, empty input.

All 10 sub-tests pass (go test -race ./...).

Relation to existing PRs

PR #56 adds UniGB-UCS2-H with a ucs2Encoder using encoding/binary +
unicode/utf16. This PR extends that idea to all eight Uni*-UCS2 variants
and uses a simpler implementation (no standard-library imports beyond those
already present) while also adding the unrelated Shift-JIS family. If #56 is
preferred, this PR can be scoped to the Shift-JIS encoder only.

getEncoder() now handles two families of predefined CMaps that previously
fell through to the default case and returned U+FFFD replacement characters:

Shift-JIS (90ms-RKSJ-H, 90ms-RKSJ-V, 90pv-RKSJ-H):
  multibyteCMapEncoder wraps golang.org/x/text/encoding/japanese.ShiftJIS.
  Raw PDF content-stream bytes are Shift-JIS encoded; x/text does the
  charset conversion to UTF-8.

UCS-2 big-endian (UniGB-UCS2-H/V, UniCNS-UCS2-H/V, UniJIS-UCS2-H/V,
  UniKS-UCS2-H/V):
  ucs2BEEncoder reads successive 2-byte big-endian code points directly as
  Unicode runes. No new dependency -- BMP-only uint16 arithmetic is correct
  for all Uni*-UCS2-* CMaps.

golang.org/x/text was already an indirect dep; japanese.ShiftJIS adds no
new transitive dependencies.

Tests in page_cjk_test.go cover both encoders with Simplified Chinese,
Traditional Chinese, Japanese, and Korean inputs plus edge cases.
@Detective-XH

Author

Closing in favour of a more complete PR that covers all five CJK predefined CMap families (Shift-JIS, UCS-2 BE, GBK, Big5-ETen, UHC/KSCms) in one set.

Detective-XH deleted the feature/cjk-predefined-cmap-decoders branch on June 1, 2026 22:01