PDF stage 1.3 (part B): predefined Unicode CMaps for Type0 fonts #534
Merged
Handle the predefined Unicode /Encoding CMaps of composite (Type0) fonts —
the Uni*-UCS2/UTF16/UTF32 families — whose character codes already are
Unicode (big-endian), so they are decoded directly with no data tables.
- pdf_cid.{hpp,cpp}: translate_predefined_cmap() decodes UCS2/UTF16 codes as
UTF-16BE (incl. surrogate pairs) and UTF32 as 4-byte big-endian; returns
nullopt for Identity-H/V and the legacy CJK code->CID CMaps.
- parse_composite_font records the Type0 /Encoding name
(Font::cid_encoding_name); Font::to_unicode routes a composite font that
lacks a /ToUnicode through the predefined-CMap path, else "no Unicode".
- Tests: PdfCid.* (inline) and
DocumentParser.composite_font_predefined_unicode_cmap.
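The decoding described in the bullets above can be sketched as follows. This is a minimal illustration, not the code in pdf_cid.cpp: the names classify and decode_uni_cmap, the suffix-based name matching, the choice to return UTF-32 code points, and the U+FFFD substitution for malformed input are all assumptions made for the example.

```cpp
#include <cstddef>
#include <cstdint>
#include <optional>
#include <string>
#include <string_view>

// Hypothetical classification of a predefined CMap name by its code-unit family.
enum class UniKind { Ucs2OrUtf16, Utf32 };

// Returns the family for names such as "UniJIS-UTF16-H", or nullopt for
// anything that is not a Uni* CMap (Identity-H/V, legacy CJK CMaps).
std::optional<UniKind> classify(std::string_view name) {
    if (name.substr(0, 3) != "Uni") return std::nullopt;
    if (name.find("-UCS2-") != std::string_view::npos ||
        name.find("-UTF16-") != std::string_view::npos) {
        return UniKind::Ucs2OrUtf16;
    }
    if (name.find("-UTF32-") != std::string_view::npos) return UniKind::Utf32;
    return std::nullopt;
}

// Decodes the raw bytes of a shown string into Unicode code points.
// Assumption: malformed units become U+FFFD and trailing partial units are dropped.
std::optional<std::u32string> decode_uni_cmap(std::string_view cmap_name,
                                              std::string_view bytes) {
    const auto kind = classify(cmap_name);
    if (!kind) return std::nullopt;

    constexpr char32_t kReplacement = 0xFFFD;
    const auto byte = [&](std::size_t i) {
        return static_cast<std::uint32_t>(static_cast<unsigned char>(bytes[i]));
    };
    const auto is_surrogate = [](std::uint32_t v) { return v >= 0xD800 && v <= 0xDFFF; };

    std::u32string out;
    if (*kind == UniKind::Utf32) {
        // Each code is one 4-byte big-endian scalar value.
        for (std::size_t i = 0; i + 4 <= bytes.size(); i += 4) {
            const std::uint32_t cp =
                (byte(i) << 24) | (byte(i + 1) << 16) | (byte(i + 2) << 8) | byte(i + 3);
            const bool valid = cp <= 0x10FFFF && !is_surrogate(cp);
            out.push_back(valid ? static_cast<char32_t>(cp) : kReplacement);
        }
        return out;
    }

    // UCS2/UTF16: 2-byte big-endian units, with surrogate pairs combined.
    for (std::size_t i = 0; i + 2 <= bytes.size(); i += 2) {
        const std::uint32_t unit = (byte(i) << 8) | byte(i + 1);
        if (unit >= 0xD800 && unit <= 0xDBFF && i + 4 <= bytes.size()) {
            const std::uint32_t low = (byte(i + 2) << 8) | byte(i + 3);
            if (low >= 0xDC00 && low <= 0xDFFF) {
                out.push_back(static_cast<char32_t>(
                    0x10000 + ((unit - 0xD800) << 10) + (low - 0xDC00)));
                i += 2;  // the low surrogate was consumed as well
                continue;
            }
        }
        out.push_back(is_surrogate(unit) ? kReplacement : static_cast<char32_t>(unit));
    }
    return out;
}
```

A caller would pass the recorded /Encoding name and the string bytes, and treat a nullopt result as "this CMap needs a different path".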
The legacy CJK CMaps (RKSJ/EUC/Big5/GBK/KSC) need per-collection
CID->Unicode tables and are deferred (the data is large). The generator
scaffolding is landed: tools/pdf/generate_cid_data.py fetches Adobe's
cmap-resources (git-ignored input, pinned) and emits block-encoded tables;
how to store them compactly (measured ~3.3 MB plain / ~855 KB zlib+base64)
and the C++ lookup are the remaining work. A .gitignore guard prevents a
stray generator run from committing the multi-MB output.
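For the deferred C++ lookup, one possible shape is a binary search over sorted runs of consecutive CIDs. The CidRange layout, the cid_to_unicode helper, and the kDemoTable values are hypothetical: the source does not specify how the block-encoded tables will be stored, and the demo entries are illustrative rather than real Adobe data.

```cpp
#include <algorithm>
#include <array>
#include <cstdint>
#include <iterator>
#include <optional>

// Hypothetical layout: one entry per run of consecutive CIDs that map to
// consecutive Unicode code points.
struct CidRange {
    std::uint16_t first_cid;
    std::uint16_t count;
    char32_t first_unicode;
};

// Illustrative values only, not taken from any real character collection.
// Entries must be sorted by first_cid and must not overlap.
constexpr std::array<CidRange, 3> kDemoTable{{
    {1, 95, 0x0020},
    {200, 10, 0x3041},
    {1000, 4, 0x4E00},
}};

// Looks up a CID; returns nullopt when the CID falls outside every run.
template <typename Table>
std::optional<char32_t> cid_to_unicode(const Table& table, std::uint16_t cid) {
    // Find the first run starting after `cid`; the candidate is the run before it.
    const auto it = std::upper_bound(
        std::begin(table), std::end(table), cid,
        [](std::uint16_t value, const CidRange& r) { return value < r.first_cid; });
    if (it == std::begin(table)) return std::nullopt;

    const CidRange& run = *std::prev(it);
    const int offset = cid - run.first_cid;  // non-negative by construction
    if (offset >= run.count) return std::nullopt;
    return static_cast<char32_t>(run.first_unicode + static_cast<char32_t>(offset));
}
```

Storing runs rather than one entry per CID is what keeps such a table small when long stretches of a collection map to contiguous code points; whether that matches the generator's actual output format is left open here.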
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Adapt generate_encoding_data.py to fetch glyphlist.txt from a pinned commit of adobe-type-tools/agl-aglfn (git-ignored on disk), mirroring how generate_cid_data.py fetches the CMap resources. The three base encodings are PDF-spec data with no canonical download and stay vendored. Since the AGL is no longer redistributed in the repo, carry its BSD-3-Clause attribution in the generated source banner (regenerated; output otherwise unchanged). Docs updated accordingly.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Stacked on #533 (part A), which is stacked on #532 (stage 1.2). Retarget down the stack as each merges.
Stage 1.3, part B: predefined CMaps (Uni* slice)

A composite (Type0) font names a predefined CMap as its /Encoding. These split by data weight; this PR lands the data-free half.

Done: predefined Unicode CMaps (Uni*-UCS2/UTF16/UTF32)

Their character codes already are Unicode (big-endian), so they're decoded directly with no data tables, covering the bulk of modern CJK PDFs.
- pdf_cid.{hpp,cpp}: translate_predefined_cmap() decodes UCS2/UTF16 codes as UTF-16BE (incl. surrogate pairs) and UTF32 codes as 4-byte big-endian; it returns nullopt for Identity-H/V and the legacy CJK CMaps.
- parse_composite_font records the Type0 /Encoding name (Font::cid_encoding_name); Font::to_unicode routes a composite font lacking /ToUnicode through the predefined-CMap path, else "no Unicode".
- Tests: PdfCid.* (inline) + DocumentParser.composite_font_predefined_unicode_cmap. All 76 PDF tests pass.

Deferred: legacy CJK CMaps (RKSJ/EUC/Big5/GBK/KSC)

These map code→CID, so they also need per-collection CID→Unicode tables, which is a large amount of data. The generator/fetch scaffolding is landed:

- tools/pdf/generate_cid_data.py fetches Adobe's cmap-resources (git-ignored input, pinned commit), parses the CMaps (resolving usecmap), inverts the Uni* CMaps for CID→Unicode, and emits block-encoded range arrays.
- Remaining work: how to store the tables compactly (measured ~3.3 MB plain / ~855 KB zlib+base64) and the C++ lookup.
- A .gitignore guard prevents a stray run from committing the multi-MB output.

Tooling: Adobe Glyph List now downloaded + pinned
Brings generate_encoding_data.py in line with the new generate_cid_data.py fetch pattern:

- glyphlist.txt is no longer vendored but downloaded from a pinned commit of adobe-type-tools/agl-aglfn (git-ignored on disk, reused after first run). Verified byte-identical to the previously vendored copy.
- The three base encodings (standard/win_ansi/mac_roman) are PDF-spec transcriptions with no canonical download, so they stay vendored.

Notes

- Docs updated: AGENTS.md, tools/pdf/README.md, THIRD_PARTY_LICENSES.md.

🤖 Generated with Claude Code