PDF stage 1.3 (part B): predefined Unicode CMaps for Type0 fonts by andiwand · Pull Request #534 · opendocument-app/OpenDocument.core

andiwand · 2026-06-14T20:23:03Z

Stacked on #533 (part A), which is stacked on #532 (stage 1.2). Retarget down the stack as each merges.

Stage 1.3, part B — predefined CMaps (Uni* slice)

A composite (Type0) font can name a predefined CMap as its /Encoding. The predefined CMaps fall into two groups by how much data they need to reach Unicode; this PR lands the group that needs none.

Done — predefined Unicode CMaps (`Uni*-UCS2/UTF16/UTF32`)

Their character codes already are Unicode (big-endian), so they're decoded directly with no data tables — covering the bulk of modern CJK PDFs.

- pdf_cid.{hpp,cpp}: translate_predefined_cmap() decodes UCS2/UTF16 codes as UTF-16BE (incl. surrogate pairs) and UTF32 codes as 4-byte big-endian; it returns nullopt for Identity-H/V and the legacy CJK CMaps.
- parse_composite_font records the Type0 /Encoding name (Font::cid_encoding_name); Font::to_unicode routes a composite font that lacks /ToUnicode through the predefined-CMap path and falls back to "no Unicode" when that path does not apply.
- Tests: PdfCid.* (inline) + DocumentParser.composite_font_predefined_unicode_cmap. All 76 PDF tests pass.
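The decoding rule above can be sketched in a few lines. This is an illustrative Python sketch, not the PR's C++ code from pdf_cid.cpp: the whole-string signature, the name parsing, and the choice to return None on malformed input are assumptions made for the example.

```python
from typing import Optional


def translate_predefined_cmap(name: str, codes: bytes) -> Optional[str]:
    """Decode bytes shown with a predefined Uni* CMap, or return None.

    Sketch only: mirrors the behaviour described in the PR text,
    not the actual C++ signature.
    """
    parts = name.split("-")
    # Predefined CMap names end in the writing mode, H or V.
    if len(parts) < 3 or parts[-1] not in ("H", "V"):
        return None
    # Only the Uni* families carry Unicode directly as character codes.
    if not parts[0].startswith("Uni"):
        return None
    if "UCS2" in parts or "UTF16" in parts:
        codec = "utf-16-be"  # also handles surrogate pairs
    elif "UTF32" in parts:
        codec = "utf-32-be"  # 4 bytes per code, big-endian
    else:
        return None
    try:
        return codes.decode(codec)
    except UnicodeDecodeError:
        # Truncated input or a lone surrogate: treat as "no Unicode".
        return None
```

For instance, `translate_predefined_cmap("UniJIS-UTF16-H", bytes.fromhex("65e5672c"))` decodes two 2-byte codes, while `Identity-H` and a legacy name such as `90ms-RKSJ-H` fall through to None, matching the nullopt case.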

Deferred — legacy CJK CMaps (RKSJ/EUC/Big5/GBK/KSC)

These map code→CID, so they also need per-collection CID→Unicode tables — large data. The generator/fetch scaffolding is landed:

- tools/pdf/generate_cid_data.py fetches Adobe's cmap-resources (git-ignored input, pinned commit), parses the CMaps (resolving usecmap), inverts the Uni* CMaps for CID→Unicode, and emits block-encoded range arrays.
- Measured output: ~3.3 MB as plain block-encoded C++, ~855 KB as zlib+base64. Deciding how to store the tables and writing the C++ lookup are the remaining work. A .gitignore guard prevents a stray run from committing the multi-MB output.
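The inversion step can be illustrated with a small sketch. It assumes a Uni* CMap has already been parsed into a Unicode→CID dictionary. The function names, the tie-break (keep the lowest code point when several map to one CID), and the `(first_cid, last_cid, first_unicode)` range layout are hypothetical choices for the example and may differ from what generate_cid_data.py emits.

```python
from typing import Dict, List, Tuple


def invert_to_cid_unicode(code_to_cid: Dict[int, int]) -> Dict[int, int]:
    """Invert Unicode->CID into CID->Unicode.

    Assumption: when several code points share a CID, keep the lowest.
    """
    cid_to_unicode: Dict[int, int] = {}
    for code_point in sorted(code_to_cid):
        cid_to_unicode.setdefault(code_to_cid[code_point], code_point)
    return cid_to_unicode


def collapse_ranges(cid_to_unicode: Dict[int, int]) -> List[Tuple[int, int, int]]:
    """Merge runs where CID and code point both advance by one.

    Each entry is (first_cid, last_cid, first_unicode).
    """
    ranges: List[Tuple[int, int, int]] = []
    for cid in sorted(cid_to_unicode):
        code_point = cid_to_unicode[cid]
        if ranges:
            first, last, start = ranges[-1]
            if cid == last + 1 and code_point == start + (cid - first):
                ranges[-1] = (first, cid, start)  # extend the current run
                continue
        ranges.append((cid, cid, code_point))  # start a new run
    return ranges
```

Range encoding of this kind pays off because CJK character collections tend to assign long runs of consecutive CIDs to consecutive code points, so many entries collapse into a single triple.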

Tooling — Adobe Glyph List now downloaded + pinned

Brings generate_encoding_data.py in line with the new generate_cid_data.py fetch pattern: glyphlist.txt is no longer vendored but downloaded from a pinned commit of adobe-type-tools/agl-aglfn (git-ignored on disk, reused after first run). Verified byte-identical to the previously vendored copy.

- The three base encodings (standard/win_ansi/mac_roman) are PDF-spec transcriptions with no canonical download, so they stay vendored.
- Since the AGL is no longer redistributed in the repo, its BSD-3-Clause attribution now rides in the generated source banner (regenerated; output otherwise unchanged).
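A download-once, reuse-afterwards fetch of a pinned file might look like the sketch below. It is not the code in generate_encoding_data.py: `PINNED_COMMIT` is a placeholder, `fetch_pinned` and its parameters are hypothetical, the raw-URL layout is assumed, and the optional SHA-256 check is an addition for the example.

```python
import hashlib
import urllib.request
from pathlib import Path
from typing import Callable, Optional

# Placeholder only; the real pinned commit lives in the generator script.
PINNED_COMMIT = "0000000000000000000000000000000000000000"
GLYPHLIST_URL = (
    "https://raw.githubusercontent.com/adobe-type-tools/agl-aglfn/"
    f"{PINNED_COMMIT}/glyphlist.txt"
)


def _download(url: str) -> bytes:
    """Fetch the raw bytes at url."""
    with urllib.request.urlopen(url) as response:
        return response.read()


def fetch_pinned(
    url: str,
    cache: Path,
    sha256: Optional[str] = None,
    download: Callable[[str], bytes] = _download,
) -> bytes:
    """Return the file contents, downloading only if cache is missing."""
    cached = cache.exists()
    data = cache.read_bytes() if cached else download(url)
    # Optional integrity check (an assumption, not described in the PR).
    if sha256 is not None and hashlib.sha256(data).hexdigest() != sha256:
        raise ValueError(f"checksum mismatch for {url}")
    if not cached:
        cache.parent.mkdir(parents=True, exist_ok=True)
        cache.write_bytes(data)  # git-ignored on disk, reused next run
    return data
```

Passing the downloader as a parameter keeps the function testable offline; a generator would call it as `fetch_pinned(GLYPHLIST_URL, Path("input/glyphlist.txt"))`, where the cache path is again only an example.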

Notes

- No CJK fixture in the corpus, so this is validated with synthetic inline mini-PDFs (the module convention).
- Docs updated: AGENTS.md, tools/pdf/README.md, THIRD_PARTY_LICENSES.md.

🤖 Generated with Claude Code

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: bf136a1d4f


Handle the predefined Unicode /Encoding CMaps of composite (Type0) fonts (the Uni*-UCS2/UTF16/UTF32 families), whose character codes already are Unicode (big-endian), so they are decoded directly with no data tables.

- pdf_cid.{hpp,cpp}: translate_predefined_cmap() decodes UCS2/UTF16 codes as UTF-16BE (incl. surrogate pairs) and UTF32 as 4-byte big-endian; returns nullopt for Identity-H/V and the legacy CJK code->CID CMaps.
- parse_composite_font records the Type0 /Encoding name (Font::cid_encoding_name); Font::to_unicode routes a composite font that lacks a /ToUnicode through the predefined-CMap path, else "no Unicode".
- Tests: PdfCid.* (inline) and DocumentParser.composite_font_predefined_unicode_cmap.

The legacy CJK CMaps (RKSJ/EUC/Big5/GBK/KSC) need per-collection CID->Unicode tables and are deferred (the data is large). The generator scaffolding is landed: tools/pdf/generate_cid_data.py fetches Adobe's cmap-resources (git-ignored input, pinned) and emits block-encoded tables; how to store them compactly (measured ~3.3 MB plain / ~855 KB zlib+base64) and the C++ lookup are the remaining work. A .gitignore guard prevents a stray generator run from committing the multi-MB output.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Adapt generate_encoding_data.py to fetch glyphlist.txt from a pinned commit of adobe-type-tools/agl-aglfn (git-ignored on disk), mirroring how generate_cid_data.py fetches the CMap resources. The three base encodings are PDF-spec data with no canonical download and stay vendored.

Since the AGL is no longer redistributed in the repo, carry its BSD-3-Clause attribution in the generated source banner (regenerated; output otherwise unchanged). Docs updated accordingly.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

andiwand force-pushed the pdf-composite-cid-fonts branch from 27c674a to 8320665 Compare June 14, 2026 20:28

andiwand force-pushed the pdf-composite-cid-fonts-cjk branch 2 times, most recently from 65f4ac0 to 0124223 Compare June 14, 2026 20:48

andiwand marked this pull request as ready for review June 14, 2026 21:20

chatgpt-codex-connector Bot reviewed Jun 14, 2026

Comment thread tools/pdf/THIRD_PARTY_LICENSES.md

andiwand mentioned this pull request Jun 14, 2026

PDF: stage-1 roadmap wrap-up — summary + drop sub-stage tags from code #536

Merged

Base automatically changed from pdf-composite-cid-fonts to main June 14, 2026 21:53

andiwand and others added 3 commits June 15, 2026 01:15

format

cbba325

andiwand force-pushed the pdf-composite-cid-fonts-cjk branch from bf136a1 to 1d9079a Compare June 14, 2026 23:17

andiwand added 2 commits June 15, 2026 01:18

cleanup gitignore

974f585

update ref

adfafff

andiwand merged commit 5da1f4e into main Jun 15, 2026
11 checks passed

andiwand deleted the pdf-composite-cid-fonts-cjk branch June 15, 2026 07:41

andiwand merged 5 commits into main from pdf-composite-cid-fonts-cjk
