
PDF stage 1.3 (part B): predefined Unicode CMaps for Type0 fonts #534

Merged
andiwand merged 5 commits into main from pdf-composite-cid-fonts-cjk on Jun 15, 2026

@andiwand andiwand commented Jun 14, 2026

Member

Stacked on #533 (part A), which is stacked on #532 (stage 1.2). Retarget down the stack as each merges.

Stage 1.3, part B — predefined CMaps (Uni* slice)

A composite (Type0) font names a predefined CMap as its /Encoding. The predefined CMaps fall into two groups by how much data they need to decode; this PR lands the group that needs none.

Done — predefined Unicode CMaps (Uni*-UCS2/UTF16/UTF32)

Their character codes already are Unicode (big-endian), so they're decoded directly with no data tables — covering the bulk of modern CJK PDFs.

  • pdf_cid.{hpp,cpp}: translate_predefined_cmap() — UCS2/UTF16 → UTF-16BE (incl. surrogate pairs), UTF32 → 4-byte BE; nullopt for Identity-H/V and legacy CJK CMaps.
  • parse_composite_font records the Type0 /Encoding name (Font::cid_encoding_name); Font::to_unicode routes a composite font lacking /ToUnicode through the predefined-CMap path and falls back to "no Unicode" when that path yields nothing.
  • Tests: PdfCid.* (inline) + DocumentParser.composite_font_predefined_unicode_cmap. All 76 PDF tests pass.
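
The decoding rule above can be sketched in a few lines of Python. This is only an illustration of the described behaviour, not the C++ in pdf_cid.cpp: the function reuses the name translate_predefined_cmap for readability, and the name parsing (looking for a UCS2/UTF16/UTF32 component in a Uni*-…-H/V name) is an assumption about how the family can be recognised.

```python
from typing import Optional

# Encoding forms handled without data tables, mapped to Python codecs.
# UCS2 is decoded as UTF-16BE as well, as the PR description states.
_FORMS = {"UCS2": "utf-16-be", "UTF16": "utf-16-be", "UTF32": "utf-32-be"}


def translate_predefined_cmap(name: str, codes: bytes) -> Optional[str]:
    """Decode character codes of a predefined Uni* CMap, else return None.

    Illustrative sketch only; the real function is C++ and returns nullopt.
    """
    parts = name.split("-")
    # Predefined names end in the writing mode, e.g. "UniJIS-UTF16-H".
    if len(parts) < 3 or parts[-1] not in ("H", "V"):
        return None
    # Identity-H/V and legacy CJK CMaps (RKSJ, EUC, ...) do not start with "Uni".
    if not parts[0].startswith("Uni"):
        return None
    # Find the encoding form; variants such as "-HW-" may sit in between.
    codec = next((_FORMS[p] for p in parts[1:-1] if p in _FORMS), None)
    if codec is None:
        return None
    try:
        return codes.decode(codec)  # surrogate pairs are combined by the codec
    except UnicodeDecodeError:
        return None  # truncated input or a lone surrogate


if __name__ == "__main__":
    print(translate_predefined_cmap("UniJIS-UTF16-H", bytes.fromhex("65e5672c")))
```

Returning None rather than raising keeps the fallback to "no Unicode" in the caller, which matches the nullopt behaviour described for Identity-H/V and the legacy CMaps.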

Deferred — legacy CJK CMaps (RKSJ/EUC/Big5/GBK/KSC)

These map code→CID, so they also need per-collection CID→Unicode tables — large data. The generator/fetch scaffolding is landed:

  • tools/pdf/generate_cid_data.py fetches Adobe's cmap-resources (git-ignored input, pinned commit), parses the CMaps (resolving usecmap), inverts the Uni* CMaps for CID→Unicode, and emits block-encoded range arrays.
  • Measured output: ~3.3 MB plain block-encoded C++, ~855 KB as zlib+base64. The storage decision + the C++ lookup are the remaining work. A .gitignore guard prevents a stray run from committing the multi-MB output.
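
As a rough picture of what the generator does with a parsed Uni* CMap, the sketch below inverts a code→CID mapping and collapses it into runs, then packs the runs with zlib+base64. The exact block encoding used by generate_cid_data.py is not spelled out here, so the (start CID, start code point, length) layout, the "keep the lowest code point" tie-break, and all function names are assumptions for illustration.

```python
import base64
import struct
import zlib
from typing import Dict, List, Tuple

Range = Tuple[int, int, int]  # (start_cid, start_unicode, length), assumed layout


def invert_to_cid_unicode(code_to_cid: Dict[int, int]) -> Dict[int, int]:
    """Invert a Uni* CMap (Unicode code -> CID) into CID -> Unicode."""
    cid_to_unicode: Dict[int, int] = {}
    for code in sorted(code_to_cid):
        # Several code points may share a CID; keep the lowest (an assumption).
        cid_to_unicode.setdefault(code_to_cid[code], code)
    return cid_to_unicode


def to_ranges(cid_to_unicode: Dict[int, int]) -> List[Range]:
    """Collapse consecutive CID/code-point pairs into runs."""
    ranges: List[Range] = []
    for cid in sorted(cid_to_unicode):
        uni = cid_to_unicode[cid]
        if ranges:
            start_cid, start_uni, length = ranges[-1]
            if cid == start_cid + length and uni == start_uni + length:
                ranges[-1] = (start_cid, start_uni, length + 1)
                continue
        ranges.append((cid, uni, 1))
    return ranges


def pack(ranges: List[Range]) -> str:
    """Serialise runs as big-endian uint32 triples, then zlib + base64."""
    raw = b"".join(struct.pack(">III", *r) for r in ranges)
    return base64.b64encode(zlib.compress(raw, 9)).decode("ascii")


def unpack(text: str) -> List[Range]:
    """Inverse of pack(), as a lookup side would need it."""
    raw = zlib.decompress(base64.b64decode(text))
    return [struct.unpack_from(">III", raw, i) for i in range(0, len(raw), 12)]
```

Whether the compressed form is worth the decompression step at load time is exactly the storage decision that remains open.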

Tooling — Adobe Glyph List now downloaded + pinned

Brings generate_encoding_data.py in line with the new generate_cid_data.py fetch pattern: glyphlist.txt is no longer vendored but downloaded from a pinned commit of adobe-type-tools/agl-aglfn (git-ignored on disk, reused after first run). Verified byte-identical to the previously vendored copy.

  • The three base encodings (standard/win_ansi/mac_roman) are PDF-spec transcriptions with no canonical download, so they stay vendored.
  • Since the AGL is no longer redistributed in the repo, its BSD-3-Clause attribution now rides in the generated source banner (regenerated; output otherwise unchanged).
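
The fetch pattern shared by the two generators (download once from a pinned commit, keep the file git-ignored on disk, reuse it afterwards) might look roughly like this. The helper names, the optional SHA-256 check, and the raw.githubusercontent.com URL layout are assumptions for illustration; the actual pinned commit is not reproduced here.

```python
import hashlib
import urllib.request
from pathlib import Path
from typing import Callable, Optional


def pinned_url(repo: str, commit: str, path: str) -> str:
    """Build a raw-file URL for a specific commit (assumed URL layout)."""
    return f"https://raw.githubusercontent.com/{repo}/{commit}/{path}"


def _download(url: str) -> bytes:
    with urllib.request.urlopen(url) as response:
        return response.read()


def fetch_pinned(
    url: str,
    cache_path: Path,
    sha256: Optional[str] = None,
    download: Callable[[str], bytes] = _download,
) -> bytes:
    """Return the pinned file, downloading it only if it is not cached."""
    fresh = not cache_path.exists()
    data = download(url) if fresh else cache_path.read_bytes()
    # Optional integrity check, e.g. against the previously vendored copy.
    if sha256 is not None and hashlib.sha256(data).hexdigest() != sha256:
        raise ValueError(f"checksum mismatch for {url}")
    if fresh:
        # Write only after verification so a bad download is never cached.
        cache_path.parent.mkdir(parents=True, exist_ok=True)
        cache_path.write_bytes(data)
    return data
```

Passing the downloader as a parameter keeps the function testable offline, which matters for a generator that is meant to be re-run without network access after the first fetch.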

Notes

  • No CJK fixture in the corpus, so this is validated with synthetic inline mini-PDFs (the module convention).
  • Docs updated: AGENTS.md, tools/pdf/README.md, THIRD_PARTY_LICENSES.md.

🤖 Generated with Claude Code

@andiwand andiwand force-pushed the pdf-composite-cid-fonts branch from 27c674a to 8320665 Compare June 14, 2026 20:28
@andiwand andiwand force-pushed the pdf-composite-cid-fonts-cjk branch 2 times, most recently from 65f4ac0 to 0124223 Compare June 14, 2026 20:48
@andiwand andiwand marked this pull request as ready for review June 14, 2026 21:20

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: bf136a1d4f

Comment thread tools/pdf/THIRD_PARTY_LICENSES.md
Base automatically changed from pdf-composite-cid-fonts to main June 14, 2026 21:53
andiwand and others added 3 commits June 15, 2026 01:15
Handle the predefined Unicode /Encoding CMaps of composite (Type0) fonts —
the Uni*-UCS2/UTF16/UTF32 families — whose character codes already are
Unicode (big-endian), so they are decoded directly with no data tables.

- pdf_cid.{hpp,cpp}: translate_predefined_cmap() decodes UCS2/UTF16 codes as
  UTF-16BE (incl. surrogate pairs) and UTF32 as 4-byte big-endian; returns
  nullopt for Identity-H/V and the legacy CJK code->CID CMaps.
- parse_composite_font records the Type0 /Encoding name
  (Font::cid_encoding_name); Font::to_unicode routes a composite font that
  lacks a /ToUnicode through the predefined-CMap path, else "no Unicode".
- Tests: PdfCid.* (inline) and
  DocumentParser.composite_font_predefined_unicode_cmap.

The legacy CJK CMaps (RKSJ/EUC/Big5/GBK/KSC) need per-collection
CID->Unicode tables and are deferred (the data is large). The generator
scaffolding is landed: tools/pdf/generate_cid_data.py fetches Adobe's
cmap-resources (git-ignored input, pinned) and emits block-encoded tables;
how to store them compactly (measured ~3.3 MB plain / ~855 KB zlib+base64)
and the C++ lookup are the remaining work. A .gitignore guard prevents a
stray generator run from committing the multi-MB output.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Adapt generate_encoding_data.py to fetch glyphlist.txt from a pinned
commit of adobe-type-tools/agl-aglfn (git-ignored on disk), mirroring how
generate_cid_data.py fetches the CMap resources. The three base encodings
are PDF-spec data with no canonical download and stay vendored.

Since the AGL is no longer redistributed in the repo, carry its
BSD-3-Clause attribution in the generated source banner (regenerated;
output otherwise unchanged). Docs updated accordingly.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@andiwand andiwand force-pushed the pdf-composite-cid-fonts-cjk branch from bf136a1 to 1d9079a Compare June 14, 2026 23:17
@andiwand andiwand merged commit 5da1f4e into main Jun 15, 2026
11 checks passed
@andiwand andiwand deleted the pdf-composite-cid-fonts-cjk branch June 15, 2026 07:41