feat: add CJK predefined CMap decoders (Shift-JIS and UCS-2 BE) #69
Closed
Detective-XH wants to merge 1 commit into
getEncoder() now handles two families of predefined CMaps that previously fell through to the default case and returned U+FFFD replacement characters:

Shift-JIS (90ms-RKSJ-H, 90ms-RKSJ-V, 90pv-RKSJ-H): multibyteCMapEncoder wraps golang.org/x/text/encoding/japanese.ShiftJIS. Raw PDF content-stream bytes are Shift-JIS encoded; x/text does the charset conversion to UTF-8.

UCS-2 big-endian (UniGB-UCS2-H/V, UniCNS-UCS2-H/V, UniJIS-UCS2-H/V, UniKS-UCS2-H/V): ucs2BEEncoder reads successive 2-byte big-endian code points directly as Unicode runes. No new dependency: BMP-only uint16 arithmetic is correct for all Uni*-UCS2-* CMaps.

golang.org/x/text was already an indirect dep; japanese.ShiftJIS adds no new transitive dependencies.

Tests in page_cjk_test.go cover both encoders with Simplified Chinese, Traditional Chinese, Japanese, and Korean inputs plus edge cases.
Author: Closing in favour of a more complete PR that covers all five CJK predefined CMap families (Shift-JIS, UCS-2 BE, GBK, Big5-ETen, UHC/KSCms) in one set.
Problem
getEncoder() has no handler for several widely used CJK predefined CMaps. PDFs that declare one of the following encodings fall through to the default branch and return U+FFFD replacement characters instead of readable text:
- 90ms-RKSJ-H, 90ms-RKSJ-V, 90pv-RKSJ-H: Japanese Shift-JIS CMaps (Adobe-Japan1, widely used by Japanese PDF tools)
- UniGB-UCS2-H/V, UniCNS-UCS2-H/V, UniJIS-UCS2-H/V, UniKS-UCS2-H/V: the four Adobe Uni*-UCS2 CMap families (Simplified Chinese, Traditional Chinese, Japanese, Korean)
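To make the grouping concrete, here is a minimal sketch of how the CMap names listed above could be sorted into decoder families. The helper name cmapFamily and its return strings are hypothetical and are not taken from the PR; the actual getEncoder() switch may be organised differently.

```go
package main

import "fmt"

// cmapFamily is a hypothetical helper (not part of the PR). It classifies a
// predefined CMap name into the decoder family it would need.
func cmapFamily(name string) string {
	switch name {
	case "90ms-RKSJ-H", "90ms-RKSJ-V", "90pv-RKSJ-H":
		return "shift-jis"
	case "UniGB-UCS2-H", "UniGB-UCS2-V",
		"UniCNS-UCS2-H", "UniCNS-UCS2-V",
		"UniJIS-UCS2-H", "UniJIS-UCS2-V",
		"UniKS-UCS2-H", "UniKS-UCS2-V":
		return "ucs2-be"
	default:
		// Anything else is outside the scope of this PR.
		return "unsupported"
	}
}

func main() {
	for _, n := range []string{"90ms-RKSJ-H", "UniKS-UCS2-V", "GBK-EUC-H"} {
		fmt.Printf("%s -> %s\n", n, cmapFamily(n))
	}
}
```

Matching on the exact names keeps the default branch as the single fallback for every CMap that is not explicitly handled.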
Fix
Two small, focused encoder types added to page.go:

multibyteCMapEncoder (Shift-JIS)
Wraps golang.org/x/text/encoding/japanese.ShiftJIS to decode the raw content-stream bytes. x/text is already an indirect dependency of this module, so no new transitive dependency is introduced.
ucs2BEEncoder (Uni*-UCS2-H/V)
Reads successive 2-byte big-endian values directly as Unicode BMP code points. Requires no additional import: the Uni*-UCS2-* CMaps map each glyph selector to a single BMP code point, making plain uint16 arithmetic correct and sufficient.
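The 2-byte big-endian read described above can be sketched as follows. The name decodeUCS2BE is hypothetical, and emitting U+FFFD for a trailing odd byte is an assumption; the PR description only says that case is tested, not how it is handled.

```go
package main

import (
	"fmt"
	"strings"
)

// decodeUCS2BE is a hypothetical stand-in for the PR's ucs2BEEncoder.
// Each pair of bytes is combined big-endian into one BMP code point.
func decodeUCS2BE(raw []byte) string {
	var sb strings.Builder
	for i := 0; i+1 < len(raw); i += 2 {
		cp := uint16(raw[i])<<8 | uint16(raw[i+1])
		sb.WriteRune(rune(cp))
	}
	if len(raw)%2 == 1 {
		// Assumption: an incomplete trailing byte becomes U+FFFD.
		sb.WriteRune('\uFFFD')
	}
	return sb.String()
}

func main() {
	fmt.Println(decodeUCS2BE([]byte{0x4E, 0x2D, 0x65, 0x87})) // 中文
	fmt.Println(decodeUCS2BE([]byte{0x30, 0x42}))             // あ
	fmt.Println(decodeUCS2BE([]byte{0xD5, 0x5C}))             // 한
	fmt.Println(decodeUCS2BE([]byte{0x00, 0x41}))             // A
}
```

Because every value fits in a uint16, no surrogate-pair handling is needed for these CMaps, which is why the description calls plain uint16 arithmetic sufficient.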
Tests
page_cjk_test.go (new file) covers both encoders:
- TestUCS2BEEncoder: 7 sub-tests (Simplified Chinese, Traditional Chinese, Japanese hiragana, Korean hangul, ASCII round-trip, empty input, trailing odd byte).
- TestMultibyteCMapEncoder_ShiftJIS: 3 sub-tests (katakana, mixed kanji/hiragana, empty input).
All 10 tests pass (go test -race ./...).
Relation to existing PRs
PR #56 adds UniGB-UCS2-H with a ucs2Encoder using encoding/binary + unicode/utf16. This PR extends that idea to all eight Uni*-UCS2 variants and uses a simpler implementation (no standard-library imports beyond those already present) while also adding the unrelated Shift-JIS family. If #56 is preferred, this PR can be scoped to the Shift-JIS encoder only.