diff --git a/src/odr/internal/pdf/AGENTS.md b/src/odr/internal/pdf/AGENTS.md index 531312e1..2451b1b3 100644 --- a/src/odr/internal/pdf/AGENTS.md +++ b/src/odr/internal/pdf/AGENTS.md @@ -104,7 +104,7 @@ not production-quality — the HTML path still contains debug `std::cout` output part B), since those character codes already are Unicode (big-endian); any other case (`Identity-H/V`, or the legacy CJK code→CID CMaps) yields "no Unicode" (not byte-garbage) until the legacy CID → Unicode tables (the - deferred half of part B) or the embedded font program (stage 1.4) land. + deferred half of part B) or the embedded font program (stage 3) land. - **Content streams**: the full graphics-operator vocabulary is tokenized; `GraphicsState` executes a subset (state stack `q`/`Q`, matrices `cm`/`Tm`, line parameters, text state `Tc`/`Tw`/`Tz`/`TL`/`Tf`/`Tr`/`Ts`, text @@ -315,156 +315,34 @@ and last-resort cross-reference recovery for broken files. Remaining odds and ends are folded into *Other known gaps* below; the staged renderer work now builds on a parser that opens the common corpus. -## Stage 1 — text extraction: the code → Unicode chain - -PDF strings are **character codes**; per font, walk this chain and record -per-code Unicode (or "unknown", which stage 3 handles). The stage is **too -large for one change** — it bundles work of very different size and dependency, -so it is split into the sub-stages below. They are independently useful and -ordered by corpus frequency; each is its own branch/PR off this roadmap. Sub- -stages 1.1 and 1.2 have landed, as has **1.3 part A** (Type0 structure + -`Identity-H/V` + `/ToUnicode`-driven extraction) and the predefined-Unicode-CMap -slice of **1.3 part B** (`Uni*-UCS2/UTF16/UTF32`, no data tables); **the legacy -CJK CMaps + CID → Unicode tables are the remaining deferred work**; 1.4 is -blocked on stage 3 and stays deferred until then. - -### 1.1 — `ToUnicode` CMap: multi-byte codes, `bfrange`, multi-char targets — **done** - -The narrowest, most self-contained chunk: it only extends the existing `CMap` -(`pdf_cmap.{hpp,cpp}`) and its parser (`pdf_cmap_parser.cpp`), with no new data -tables and no new font plumbing. Today the map is single-byte -(`std::unordered_map`), `read_bfchar` warns on multi-byte glyphs -/ multi-char targets via `std::cerr`, and `read_bfrange` / `read_codespacerange` -are empty `// TODO` stubs. - -Scope of this one change: -- **Code keys become multi-byte.** Key the map on the full character code, not a - single `char` (e.g. a `std::uint32_t` code + the byte length, or a - `std::string` code). `codespacerange` defines the code byte-lengths; record the - ranges so `translate_string` can chunk a string into codes of the right width - (most `ToUnicode` CMaps are fixed 1- or 2-byte, but the ranges are authoritative). -- **`bfrange`.** Both forms: ` ` (increment the last UTF-16 unit of - `dst` across the code range, per spec only the low byte) and - ` [ … ]` (array of per-code destinations). -- **Multi-character targets.** A code may map to a *string* of UTF-16 units - (ligatures, e.g. `fi` → `f` `i`); `map_bfchar`'s value type widens from - `char16_t` to `std::u16string` (or a small-string equivalent). -- **`translate_string`** chunks the input by the codespace widths, looks up each - code, and falls back to the identity/​byte value (today's behaviour) on a miss - — keeping the "unknown" path for later sub-stages to refine. -- Replace the two `std::cerr` warnings with the module's `Logger` (thread one in, - as `DocumentParser` already does), or drop them once the cases are handled. - -Tests (assertion-based, inline CMap strings — no fixtures): a 2-byte -`codespacerange`; `bfrange` increment form and array form; a multi-char `bfchar` -target; mixed widths; a miss falling back to identity. This matches the existing -inline-string test convention for the module. - -Out of scope for 1.1: anything needing `/Encoding`, the AGL, predefined CMaps, -or font-file reading — those are 1.2–1.4. - -### 1.2 — simple-font encoding → Unicode — **done** - -`/Encoding` base (WinAnsi/MacRoman/Standard) + `/Differences` → glyph names → -Unicode via the Adobe Glyph List (incl. `uniXXXX`/`uXXXXXX` names). Carries the -data weight of the three base-encoding tables **and** the full AGL (~4,300 -entries) — a generated data file. Own branch. - -The chain, per simple font that has no `ToUnicode` CMap: a 1-byte code → glyph -name (base encoding, overlaid with `/Differences`) → Unicode (AGL lookup, or the -algorithmic `uniXXXX`/`uXXXXXX` forms). `translate_string` walks a code string -byte by byte; an unmapped name yields "no Unicode" (empty), left for stage 1.5. - -**Data as committed generated source.** `tools/pdf/generate_encoding_data.py` -emits `pdf_encoding_data.{hpp,cpp}` (the three full base-encoding tables + the -AGL as a name-sorted array for binary search); the build only compiles the -result, so there is no build-time codegen dependency. All source data is -vendored next to the script as `.txt` files (the base encodings plus -[Adobe's AGL](https://github.com/adobe-type-tools/agl-aglfn)); re-run the script -with no arguments to regenerate. See [`tools/pdf/README.md`](../../../../tools/pdf/README.md) -for the data files and their provenance/licensing. - -Landed: -- `pdf_encoding.{hpp,cpp}`: `BaseEncoding` (Standard/WinAnsi/MacRoman), - `base_encoding_table` / `base_encoding_from_name`, `glyph_name_to_unicode` - (AGL + `uniXXXX`/`uXXXXXX`), and the `Encoding` class (base + `/Differences` → - `translate_string`). -- `pdf_encoding_data.{hpp,cpp}`: the full Annex D tables + the full AGL (4,281 - entries), generated. -- `/Encoding` parsing wired into `parse_font` (`parse_encoding`): a base name, or - a dictionary with `/BaseEncoding` + `/Differences`. Stored on `Font::encoding` - (a `std::optional`). -- `Font::to_unicode` picks the path — `ToUnicode` CMap when present (via - `CMap::empty`), else the `/Encoding`, else identity — and the HTML text path - (`html/pdf_file.cpp`) calls it instead of the CMap directly, so simple fonts - with only an `/Encoding` now extract text. - -Remaining (1.2 deferrals): -- Symbolic fonts / the "built-in encoding" default (no `/BaseEncoding`) need the - font program — defer to stage 1.4; for now StandardEncoding is the default base. -- An unmapped glyph name still yields "no Unicode" (empty) — refined in 1.5. - -### 1.3 — composite (Type0/CID) fonts — **functionally done; legacy CJK CMaps deferred** - -**Status.** Everything composite fonts need for the corpus seen is in place: -Type0 structure, `Identity-H/V`, `/ToUnicode`-driven extraction (part A), and the -predefined Unicode CMaps (part B, `Uni*-UCS2/UTF16/UTF32`). The **one** remaining -piece is the legacy CJK code→CID CMaps (RKSJ/EUC/Big5/GBK/KSC) plus their -`CID → Unicode` tables — large external data, no fixture in the corpus, and an -uncommon case in modern PDFs. It is treated as a documented deferral (a follow-up -on the landed generator scaffolding), not a blocker: 1.3 can be considered closed -for the renderer's purposes, with the legacy tables an optional add-on. - -`Identity-H/V` plus the predefined CMaps (CJK); map CID → Unicode via the CID -system info where defined. The predefined CMaps are large external data sets — -the heaviest data chunk of the stage. Split into two parts because the data -weight is concentrated in part B, and the whole local corpus (every Type0 font -is `Identity-H` + `/ToUnicode`) is covered by part A alone. - -**Part A — Type0 structure + `Identity-H/V` + `/ToUnicode` — done.** `parse_font` -detects `/Subtype /Type0`, walks `/DescendantFonts[0]` to record the descendant -CIDFont's `/CIDSystemInfo` `/Registry`/`/Ordering` on `Font` (`composite`, -`cid_registry`, `cid_ordering`), and keeps the Type0 `/Encoding` (a code → CID -CMap, not a glyph-name encoding) out of `parse_encoding` — so `Identity-H` no -longer trips the "unsupported /Encoding name" warning. Extraction runs through -the existing multi-byte `/ToUnicode` path (stage 1.1); `Font::to_unicode` -returns "no Unicode" for a composite font lacking a `/ToUnicode` rather than -mis-splitting its multi-byte codes through the single-byte identity fallback. -Tests: `DocumentParser.composite_font_with_to_unicode` / -`…_without_to_unicode_yields_no_unicode`. - -**Part B — predefined CMaps.** A composite font names a predefined CMap as its -`/Encoding`. These split by data weight: - -- **Unicode CMaps (`Uni*-UCS2/UTF16/UTF32`) — done.** Their character codes - already *are* Unicode (big-endian), so `pdf_cid.cpp` decodes them directly with - **no data tables**: `Font::to_unicode` records the Type0 `/Encoding` name - (`Font::cid_encoding_name`) and, lacking a `/ToUnicode`, routes it through - `translate_predefined_cmap` (UCS2/UTF16 → 2-byte UTF-16BE incl. surrogate - pairs; UTF32 → 4-byte). Tests: `PdfCid.*` and - `DocumentParser.composite_font_predefined_unicode_cmap`. This covers the bulk - of modern CJK PDFs. -- **Legacy CJK CMaps (RKSJ/EUC/Big5/GBK/KSC) — deferred.** These map - `code → CID`, so they need the per-collection `CID → Unicode` table too - (selected by the recorded `/Registry`+`/Ordering`). The data is large: - `tools/pdf/generate_cid_data.py` already fetches Adobe's `cmap-resources` - (git-ignored input) and emits block-encoded range arrays, measured at ~3.3 MB - of committed C++ (~855 KB if zlib+base64-compressed). The decision on how to - store it compactly — and the C++ lookup over it — is the remaining work. **No - CJK fixture in the corpus**, so this is validated with synthetic inline - mini-PDFs. - -### 1.4 — embedded-font fallback — **deferred (needs stage 3)** - -Reverse the TrueType `cmap`; read glyph names from Type1/CFF charstrings. -Explicitly depends on stage 3's font *reading*, so it cannot start until that -machinery exists. - -### 1.5 — "no Unicode" runs + `/ActualText` - -Nothing applies → mark the run "no Unicode" for stage 3's re-encoding. -`/ActualText` (tagged PDFs, ligatures) overrides the whole chain for extraction. -Small, but rides on the run/state plumbing the sub-stages above introduce. +## Stage 1 — text extraction: the code → Unicode chain — **done** + +**Goal.** PDF strings are **character codes**; per font, walk code → Unicode and +record it per code (or "no Unicode", which stage 3 re-encodes). The work was +split into sub-stages by data weight and dependency; 1.1–1.3 have landed and +cover the local corpus and the bulk of real-world PDFs. The mechanics live in +*Fonts / text mapping* under *What works*; this is the summary. + +**Achieved** +- **1.1 — `ToUnicode` CMap.** Multi-byte codes (codespace-driven chunking), both + `bfrange` forms, multi-character (ligature) targets. +- **1.2 — simple-font `/Encoding`.** Standard/WinAnsi/MacRoman base encodings + + `/Differences` → glyph name → Unicode via the generated Adobe Glyph List (incl. + the algorithmic `uniXXXX`/`uXXXXXX` forms). +- **1.3 — composite (Type0/CID) fonts.** Type0 structure + `/CIDSystemInfo`, + `Identity-H/V`, `/ToUnicode`-driven extraction, and the predefined Unicode + CMaps (`Uni*-UCS2/UTF16/UTF32`, decoded directly with no data tables). + +**Deferred — relocated to later stages.** None blocks the renderer and no corpus +fixture needs them yet: +- **Legacy CJK code→CID CMaps** (RKSJ/EUC/Big5/GBK/KSC) + their `CID → Unicode` + tables — large external data; the generator/fetch scaffolding is landed + (`tools/pdf/generate_cid_data.py`), the storage decision and C++ lookup remain. + Tracked under *Other known gaps* (CMap coverage). +- **Embedded-font reverse map** (was 1.4) — needs the font reading machinery; + folded into **stage 3**. +- **"No Unicode" run marking + `/ActualText`** (was 1.5) — rides on the run/state + plumbing introduced by **stage 2**, where it now lives. ## Stage 2 — text positioning & metrics @@ -488,6 +366,10 @@ Independent of Unicode work; fixes layout even with today's partial CMaps. - HTML mapping decision: per-run spans with CSS `transform` (cheap, breaks on heavy kerning) vs. per-glyph positioning (exact, verbose) — likely per-run with a kerning threshold that splits runs, like pdf2htmlEX. +- **Extraction refinements** (was stage 1.5, rides on the run plumbing above): + mark a run "no Unicode" when the code → Unicode chain yields nothing, so stage 3 + can re-encode it; honour `/ActualText` (tagged PDFs, ligatures) as an extraction + override of the whole chain. ## Stage 3 — fonts in HTML @@ -499,8 +381,11 @@ cost of a notoriously heavy build. No trimmed off-the-shelf alternative does what we need (FreeType/stb_truetype are read-only; hb-subset can only subset along the *existing* `cmap`, so it cannot inject the PUA mappings below). Expected ~5–8k lines of focused C++ — on the order of an `oldms/` module. -Reading (SFNT tables, CFF charsets) is the easy part and is needed by stage 1.4 -anyway. +Reading (SFNT tables, CFF charsets) is the easy part and also yields the +**embedded-font reverse map** (was stage 1.4): a font with no usable +`/ToUnicode` or `/Encoding` gets code → Unicode from its embedded program — the +TrueType `cmap` reversed, or Type1/CFF charstring glyph names through the AGL — +closing the last gap in the stage-1 extraction chain. **Architecture: IR for facts, pass-through for glyphs.** No glyph-level font IR: decompiling and recompiling outlines is the FontForge model — loses hinting, @@ -510,7 +395,7 @@ byte-for-byte; even Type1 → Type2 charstrings is a direct sibling-format translation. What *is* shared: a thin `FontProgram`-style interface — per-flavor readers producing the facts every consumer needs (glyph count, glyph → Unicode, advance widths, units-per-em, name, bbox, symbolic flag) with raw bytes kept -alongside. Stage 1.4 reads Unicode from it, the OTF wrap synthesizes +alongside. The embedded-font reverse map (above) reads Unicode from it, the OTF wrap synthesizes `head`/`hhea`/`hmtx`/`OS/2` from it, the re-encoder assigns PUA code points from its glyph count. @@ -633,8 +518,10 @@ tree, little else. recognized and extract through their `/ToUnicode` (stage 1.3 part A) or, when absent, a predefined Unicode `/Encoding` (`Uni*-UCS2/UTF16/UTF32`, stage 1.3 part B). Still open: the legacy CJK code→CID CMaps (RKSJ/EUC/Big5/GBK/KSC) and - their CID → Unicode tables (the deferred half of part B), and embedded-font - reverse maps (stage 1.4); symbolic fonts with a built-in encoding default to - StandardEncoding until 1.4. + their CID → Unicode tables (large external data; the generator scaffolding in + `tools/pdf/generate_cid_data.py` is landed, the storage decision and lookup + remain), and embedded-font reverse maps (stage 3); symbolic fonts with a + built-in encoding default to StandardEncoding until the font program is read + (stage 3). - **Annotations** are collected but their content is not interpreted (stage 5). - Revisit the reference-by-lookahead parsing and `read_stream(-1)` fallback. diff --git a/src/odr/internal/pdf/pdf_cid.hpp b/src/odr/internal/pdf/pdf_cid.hpp index 1edd60f6..d2d88603 100644 --- a/src/odr/internal/pdf/pdf_cid.hpp +++ b/src/odr/internal/pdf/pdf_cid.hpp @@ -10,13 +10,13 @@ namespace odr::internal::pdf { /// (a composite font's `/Encoding` named in the PDF, e.g. `UniGB-UCS2-H`), /// returning the UTF-8 text. /// -/// Stage 1.3 (part B) supports the predefined **Unicode** CMaps — the +/// Supports the predefined **Unicode** CMaps — the /// `Uni*-UCS2`, `Uni*-UTF16` and `Uni*-UTF32` families — whose character codes /// already *are* Unicode (big-endian), so they are decoded directly with no /// data tables. Returns `nullopt` for the legacy CJK code→CID CMaps /// (RKSJ/EUC/Big5/GBK/KSC) and for `Identity-H/V`, which need CID→Unicode -/// tables (the legacy half of part B, deferred — see -/// `tools/pdf/generate_cid_data.py`) or the embedded font program (stage 1.4); +/// tables (the legacy CMaps, deferred — see +/// `tools/pdf/generate_cid_data.py`) or the embedded font program (stage 3); /// the caller then treats the run as "no Unicode". [[nodiscard]] std::optional translate_predefined_cmap(std::string_view name, const std::string &codes); diff --git a/src/odr/internal/pdf/pdf_cmap.cpp b/src/odr/internal/pdf/pdf_cmap.cpp index afad6f40..af03e080 100644 --- a/src/odr/internal/pdf/pdf_cmap.cpp +++ b/src/odr/internal/pdf/pdf_cmap.cpp @@ -46,8 +46,8 @@ std::string CMap::translate_string(const std::string &codes) const { } // Unknown code: fall back to its numeric value as a single UTF-16 unit - // (identity for single-byte codes). Stage 1.5 will refine the handling of - // these "no Unicode" runs. + // (identity for single-byte codes). These "no Unicode" runs are left for + // later re-encoding. std::uint32_t value = 0; for (const char c : code) { value = (value << 8) | static_cast(c); diff --git a/src/odr/internal/pdf/pdf_document.cpp b/src/odr/internal/pdf/pdf_document.cpp index 234d9464..1376975e 100644 --- a/src/odr/internal/pdf/pdf_document.cpp +++ b/src/odr/internal/pdf/pdf_document.cpp @@ -37,12 +37,12 @@ std::string Font::to_unicode(const std::string &codes) const { if (composite) { // A composite (Type0) font with no `ToUnicode` CMap. A predefined Unicode // `/Encoding` (the `Uni*-UCS2/UTF16/UTF32` CMaps) carries Unicode directly - // in its codes, so decode it (stage 1.3 part B). Otherwise code -> CID is + // in its codes, so decode it. Otherwise code -> CID is // known (identity for `Identity-H/V`) but CID -> Unicode needs a predefined // CID -> Unicode table (the legacy CMaps, deferred) or the embedded font - // program (stage 1.4): emit "no Unicode" rather than mis-splitting the + // program (stage 3): emit "no Unicode" rather than mis-splitting the // multi-byte codes into byte-sized garbage through the identity fallback - // below. Stage 1.5 will mark these runs for re-encoding. + // below. Stage 2 will mark these runs for re-encoding. if (!cid_encoding_name.empty()) { if (std::optional unicode = translate_predefined_cmap(cid_encoding_name, codes)) { diff --git a/src/odr/internal/pdf/pdf_document_element.hpp b/src/odr/internal/pdf/pdf_document_element.hpp index 20c41bc0..00ac8462 100644 --- a/src/odr/internal/pdf/pdf_document_element.hpp +++ b/src/odr/internal/pdf/pdf_document_element.hpp @@ -80,27 +80,26 @@ struct Font final : Element { /// fallback used when no `ToUnicode` CMap is present. std::optional encoding; - /// True for composite (Type0) fonts (stage 1.3). Their character codes are + /// True for composite (Type0) fonts. Their character codes are /// multi-byte and select CIDs via the Type0 `/Encoding` CMap; `/ToUnicode` is /// the code -> Unicode path. Code -> CID via predefined CJK CMaps and the - /// CID -> Unicode tables are stage 1.3 (part B); embedded-font reverse maps - /// are stage 1.4. + /// CID -> Unicode tables are deferred; embedded-font reverse maps + /// are stage 3. bool composite{false}; /// The descendant CIDFont's `/CIDSystemInfo` `/Registry` and `/Ordering` /// (e.g. `Adobe` / `Identity` or `Adobe` / `Japan1`). Recorded for the - /// predefined CID -> Unicode table selection of stage 1.3 (part B); empty for + /// predefined CID -> Unicode table selection; empty for /// simple fonts. std::string cid_registry; std::string cid_ordering; /// The composite font's `/Encoding` when it is a *predefined* CMap name (e.g. /// `Identity-H`, `UniGB-UCS2-H`); empty for an embedded CMap stream. Drives - /// the predefined Unicode-CMap extraction path (stage 1.3 part B). + /// the predefined Unicode-CMap extraction path. std::string cid_encoding_name; /// Translate a string of character codes to Unicode: the `ToUnicode` CMap - /// when present (authoritative), else, for a composite font, "no Unicode" - /// (stage 1.3 part B / 1.4 territory), else the simple-font `/Encoding`, else - /// identity bytes. + /// when present (authoritative), else, for a composite font, "no Unicode", + /// else the simple-font `/Encoding`, else identity bytes. [[nodiscard]] std::string to_unicode(const std::string &codes) const; }; diff --git a/src/odr/internal/pdf/pdf_document_parser.cpp b/src/odr/internal/pdf/pdf_document_parser.cpp index d45d059f..89202107 100644 --- a/src/odr/internal/pdf/pdf_document_parser.cpp +++ b/src/odr/internal/pdf/pdf_document_parser.cpp @@ -137,7 +137,7 @@ std::optional parse_encoding(DocumentParser &parser, const Dictionary &dictionary = resolved.as_dictionary(); // No `/BaseEncoding` means "the font's built-in encoding"; that needs the - // font program (stage 1.4). Default to StandardEncoding for now, which is the + // font program (stage 3). Default to StandardEncoding for now, which is the // right base for the non-symbolic Latin fonts this stage targets. auto base = BaseEncoding::standard; if (dictionary.has_key("BaseEncoding")) { @@ -179,9 +179,9 @@ std::optional parse_encoding(DocumentParser &parser, /// Parse a composite (Type0) font's descendant CIDFont (`/DescendantFonts` is a /// one-element array of the CIDFont): records the `/CIDSystemInfo` -/// `/Registry`/`/Ordering` used to pick a predefined CID -> Unicode table in -/// stage 1.3 (part B). The Type0 `/Encoding` (code -> CID) is `Identity-H/V` or -/// a predefined CJK CMap; only `/ToUnicode` is used for extraction in part A. +/// `/Registry`/`/Ordering` used to pick a predefined CID -> Unicode table. +/// The Type0 `/Encoding` (code -> CID) is `Identity-H/V` or a predefined CJK +/// CMap; only `/ToUnicode` is used for extraction. void parse_composite_font(DocumentParser &parser, const Dictionary &dictionary, Font &font) { font.composite = true; @@ -256,8 +256,7 @@ Font *parse_font(DocumentParser &parser, const ObjectReference &reference, if (is_type0) { // Composite (Type0) font: the `/Encoding` is a code -> CID CMap, not a // simple-font glyph-name encoding, so it must not go through - // `parse_encoding`. Extraction relies on `/ToUnicode` (parsed above) in - // stage 1.3 part A. + // `parse_encoding`. Extraction relies on `/ToUnicode` (parsed above). parse_composite_font(parser, dictionary, *font); } else if (dictionary.has_key("Encoding")) { // Simple-font `/Encoding`: a base-encoding name, or a dictionary with diff --git a/src/odr/internal/pdf/pdf_encoding.hpp b/src/odr/internal/pdf/pdf_encoding.hpp index 3272d8a6..18a28534 100644 --- a/src/odr/internal/pdf/pdf_encoding.hpp +++ b/src/odr/internal/pdf/pdf_encoding.hpp @@ -30,7 +30,7 @@ base_encoding_from_name(std::string_view name); /// Glyph name -> Unicode (UTF-16) via the Adobe Glyph List, plus the /// algorithmic `uniXXXX` / `uXXXXXX` forms (ISO 32000-1 9.10.2 / the AGL /// specification). Returns an empty string for a name with no mapping — the -/// caller treats that as "no Unicode" (refined in stage 1.5). +/// caller treats that as "no Unicode" (run marking refined in stage 2). [[nodiscard]] std::u16string glyph_name_to_unicode(std::string_view glyph_name); /// A simple font's `/Encoding`: a base encoding optionally overlaid with diff --git a/test/src/internal/pdf/pdf_document_parser.cpp b/test/src/internal/pdf/pdf_document_parser.cpp index f19532cf..db1f44db 100644 --- a/test/src/internal/pdf/pdf_document_parser.cpp +++ b/test/src/internal/pdf/pdf_document_parser.cpp @@ -249,7 +249,7 @@ const Font *first_page_font(const Document &document, const std::string &name) { // A composite (Type0) font is recognized, its descendant CIDFont's // `/CIDSystemInfo` recorded, and its `/ToUnicode` CMap drives extraction over -// 2-byte codes (stage 1.3). +// 2-byte codes. TEST(DocumentParser, composite_font_with_to_unicode) { const std::string pdf = composite_font_mini_pdf(true); DocumentParser parser(std::make_unique(pdf)); @@ -265,9 +265,9 @@ TEST(DocumentParser, composite_font_with_to_unicode) { } // A composite font without a `/ToUnicode` CMap cannot yet resolve CID -> -// Unicode (predefined CJK tables are stage 1.3 part B; embedded reverse maps -// stage 1.4), so extraction yields "no Unicode" rather than the byte-garbage -// the simple-font identity fallback would produce on multi-byte codes. +// Unicode (predefined CJK tables and embedded reverse maps are deferred), so +// extraction yields "no Unicode" rather than the byte-garbage the simple-font +// identity fallback would produce on multi-byte codes. TEST(DocumentParser, composite_font_without_to_unicode_yields_no_unicode) { const std::string pdf = composite_font_mini_pdf(false); DocumentParser parser(std::make_unique(pdf)); @@ -282,7 +282,7 @@ TEST(DocumentParser, composite_font_without_to_unicode_yields_no_unicode) { // A composite font whose `/Encoding` is a predefined Unicode CMap // (`Uni*-UCS2/UTF16/UTF32`) extracts directly from the codes (they are Unicode) -// even without a `/ToUnicode` CMap (stage 1.3 part B). +// even without a `/ToUnicode` CMap. TEST(DocumentParser, composite_font_predefined_unicode_cmap) { const std::string pdf = composite_font_mini_pdf(false, "UniGB-UCS2-H"); DocumentParser parser(std::make_unique(pdf));