PDF stage 1.3 (part A): composite (Type0) fonts #533
Merged
Recognize composite (Type0) fonts and drive their extraction through the existing multi-byte /ToUnicode path (stage 1.1), which covers the whole local corpus (every Type0 font is Identity-H + /ToUnicode).

- parse_font detects /Subtype /Type0, walks /DescendantFonts[0] and records the descendant CIDFont's /CIDSystemInfo /Registry and /Ordering on Font, and keeps the Type0 /Encoding (a code -> CID CMap) out of the simple-font parse_encoding path, so Identity-H no longer trips the "unsupported /Encoding name" warning.
- Font gains composite/cid_registry/cid_ordering; Font::to_unicode returns "no Unicode" for a composite font lacking a /ToUnicode rather than mis-splitting its multi-byte codes through the single-byte identity fallback.
- Tests: composite_font_with_to_unicode and composite_font_without_to_unicode_yields_no_unicode.

Predefined CJK CMaps and the CID -> Unicode tables (part B) are deferred: they are the heavy data chunk and the corpus has no CJK fixture to validate against.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
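To make the `Font::to_unicode` change concrete, here is a minimal sketch of the no-/ToUnicode fallback decision. `FontSketch` and `to_unicode_without_cmap` are hypothetical names for illustration only, and the real method's signature and return type may well differ.

```cpp
#include <optional>
#include <string>
#include <string_view>

// Hypothetical stand-in for Font; only the flag this decision needs.
struct FontSketch {
  bool composite{false};
};

/// Fallback for a font that has no /ToUnicode CMap (assumed shape, not the PR's code).
/// Returns std::nullopt to signal "no Unicode".
std::optional<std::u32string> to_unicode_without_cmap(const FontSketch& font,
                                                      std::string_view bytes) {
  if (font.composite) {
    // Composite codes are multi-byte. Feeding them to the single-byte path
    // would turn the bytes 00 41 into U+0000 U+0041 instead of one character.
    return std::nullopt;
  }
  // Simple font: single-byte identity fallback, one code point per byte.
  std::u32string text;
  for (const char byte : bytes) {
    text.push_back(static_cast<char32_t>(static_cast<unsigned char>(byte)));
  }
  return text;
}
```

Returning an empty optional rather than an empty string keeps "no mapping available" distinguishable from "mapped to nothing", which is the distinction the commit message describes.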
Force-pushed from 27c674a to 8320665.
andiwand commented on Jun 14, 2026
- parse_composite_font takes Font& instead of Font*
- the composite_font_mini_pdf test helper uses a /// doc comment

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
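A rough sketch of the shape this gives `parse_composite_font`, assuming a toy dictionary model. `FontDict`, `CidFontDict` and `CidSystemInfo` are hypothetical stand-ins for the real PDF object types, `Font` carries only the three new fields, and the parameter order is a guess.

```cpp
#include <optional>
#include <string>
#include <vector>

// Hypothetical, simplified views of the PDF dictionaries involved.
struct CidSystemInfo {
  std::string registry;
  std::string ordering;
};
struct CidFontDict {
  std::string subtype;
  std::optional<CidSystemInfo> cid_system_info;
};
struct FontDict {
  std::string subtype;
  std::string encoding;  // for Type0 this names a code -> CID CMap
  std::vector<CidFontDict> descendant_fonts;
};

// Only the fields this PR adds to Font.
struct Font {
  bool composite{false};
  std::string cid_registry;
  std::string cid_ordering;
};

/// Fills the composite-font fields of `font` from a Type0 font dictionary.
/// Taking Font& (not Font*) documents that the target is never null.
void parse_composite_font(const FontDict& dict, Font& font) {
  font.composite = true;
  if (dict.descendant_fonts.empty()) {
    return;  // malformed Type0 font: keep the flag, leave the rest empty
  }
  const CidFontDict& descendant = dict.descendant_fonts.front();
  if (descendant.cid_system_info.has_value()) {
    font.cid_registry = descendant.cid_system_info->registry;
    font.cid_ordering = descendant.cid_system_info->ordering;
  }
}

/// Dispatches on /Subtype; the Type0 /Encoding is deliberately not passed
/// to the simple-font encoding parser.
Font parse_font(const FontDict& dict) {
  Font font;
  if (dict.subtype == "Type0") {
    parse_composite_font(dict, font);
  }
  return font;
}
```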
Force-pushed from ef8a1f6 to 5fefb7f.
Stacked on #532 (stage 1.2). Targets `pdf-encoding-to-unicode`; retarget to `main` once #532 merges.

Stage 1.3, part A: composite (Type0/CID) fonts

The roadmap's stage 1.3 is composite fonts: `Identity-H/V` + predefined CJK CMaps, mapping code → CID → Unicode. Scanning the corpus reshaped the work: every Type0 font we have is `/Identity-H` and carries a `/ToUnicode` CMap, which the stage-1.1 multi-byte CMap path already handles. So this PR is the structural landing (part A); the heavy predefined-CJK-CMap data (part B) is deferred until there's a CJK fixture to validate it against.

Changes

- `parse_font` detects `/Subtype /Type0`, walks `/DescendantFonts[0]` and records the descendant CIDFont's `/CIDSystemInfo` `/Registry` and `/Ordering` on `Font`, and keeps the Type0 `/Encoding` (a code → CID CMap) out of the simple-font `parse_encoding` path, so `Identity-H` no longer trips the "unsupported /Encoding name" warning.
- `Font` gains `composite`/`cid_registry`/`cid_ordering`.
- `Font::to_unicode`: a composite font without a `/ToUnicode` now returns "no Unicode" instead of mis-splitting its multi-byte codes into byte-garbage through the single-byte identity fallback (sets up stage 1.5).
- Tests: `DocumentParser.composite_font_with_to_unicode` and `…_without_to_unicode_yields_no_unicode` (inline mini-PDFs).

Notes

- Every Type0 font in the corpus has a `/ToUnicode`, so behavior is unchanged for the corpus; the new path only affects composite fonts that lack a `/ToUnicode`.

Part B (follow-up) needs `cmap-resources` + CID → Unicode tables as generated C++ (like the AGL in 1.2).
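For orientation, a sketch of what the code → CID → Unicode chain could look like once part B lands. Everything here is an assumption: the names `CidUnicodeEntry`, `kSampleTable`, `identity_h_cids` and `cid_to_unicode` are invented, and the table rows are placeholders, not data taken from `cmap-resources`. The only fact relied on is that `Identity-H` reads codes as two bytes, big-endian, and uses the code itself as the CID.

```cpp
#include <algorithm>
#include <array>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <string_view>
#include <vector>

// One row of a hypothetical generated CID -> Unicode table.
struct CidUnicodeEntry {
  std::uint16_t cid;
  char32_t unicode;
};

// Placeholder rows, sorted by CID. A real table would be generated
// per registry/ordering rather than written by hand.
constexpr std::array<CidUnicodeEntry, 3> kSampleTable{{
    {1, U' '},
    {34, U'A'},
    {35, U'B'},
}};

// Identity-H: every code is two bytes, big-endian, and CID == code.
// A trailing odd byte is dropped in this sketch.
std::vector<std::uint16_t> identity_h_cids(std::string_view bytes) {
  std::vector<std::uint16_t> cids;
  for (std::size_t i = 0; i + 1 < bytes.size(); i += 2) {
    const auto hi = static_cast<unsigned char>(bytes[i]);
    const auto lo = static_cast<unsigned char>(bytes[i + 1]);
    cids.push_back(static_cast<std::uint16_t>((hi << 8) | lo));
  }
  return cids;
}

// Binary search over the sorted table; std::nullopt means "no Unicode".
std::optional<char32_t> cid_to_unicode(std::uint16_t cid) {
  const auto it = std::lower_bound(
      kSampleTable.begin(), kSampleTable.end(), cid,
      [](const CidUnicodeEntry& entry, std::uint16_t value) {
        return entry.cid < value;
      });
  if (it == kSampleTable.end() || it->cid != cid) {
    return std::nullopt;
  }
  return it->unicode;
}
```

A sorted `constexpr` array with binary search keeps the generated data in read-only storage with no startup cost. Whether part B uses that layout or something range-compressed is an open design choice.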