201 changes: 44 additions & 157 deletions src/odr/internal/pdf/AGENTS.md
@@ -104,7 +104,7 @@ not production-quality — the HTML path still contains debug `std::cout` output
part B), since those character codes already are Unicode (big-endian); any
other case (`Identity-H/V`, or the legacy CJK code→CID CMaps) yields "no
Unicode" (not byte-garbage) until the legacy CID → Unicode tables (the
-deferred half of part B) or the embedded font program (stage 1.4) land.
+deferred half of part B) or the embedded font program (stage 3) land.
- **Content streams**: the full graphics-operator vocabulary is tokenized;
`GraphicsState` executes a subset (state stack `q`/`Q`, matrices `cm`/`Tm`,
line parameters, text state `Tc`/`Tw`/`Tz`/`TL`/`Tf`/`Tr`/`Ts`, text
@@ -315,156 +315,34 @@ and last-resort cross-reference recovery for broken files. Remaining odds and
ends are folded into *Other known gaps* below; the staged renderer work now
builds on a parser that opens the common corpus.

-## Stage 1 — text extraction: the code → Unicode chain
-
-PDF strings are **character codes**; per font, walk this chain and record
-per-code Unicode (or "unknown", which stage 3 handles). The stage is **too
-large for one change** — it bundles work of very different size and dependency,
-so it is split into the sub-stages below. They are independently useful and
-ordered by corpus frequency; each is its own branch/PR off this roadmap. Sub-
-stages 1.1 and 1.2 have landed, as has **1.3 part A** (Type0 structure +
-`Identity-H/V` + `/ToUnicode`-driven extraction) and the predefined-Unicode-CMap
-slice of **1.3 part B** (`Uni*-UCS2/UTF16/UTF32`, no data tables); **the legacy
-CJK CMaps + CID → Unicode tables are the remaining deferred work**; 1.4 is
-blocked on stage 3 and stays deferred until then.
-
-### 1.1 — `ToUnicode` CMap: multi-byte codes, `bfrange`, multi-char targets — **done**
-
-The narrowest, most self-contained chunk: it only extends the existing `CMap`
-(`pdf_cmap.{hpp,cpp}`) and its parser (`pdf_cmap_parser.cpp`), with no new data
-tables and no new font plumbing. Today the map is single-byte
-(`std::unordered_map<char, char16_t>`), `read_bfchar` warns on multi-byte glyphs
-/ multi-char targets via `std::cerr`, and `read_bfrange` / `read_codespacerange`
-are empty `// TODO` stubs.
-
-Scope of this one change:
-- **Code keys become multi-byte.** Key the map on the full character code, not a
-single `char` (e.g. a `std::uint32_t` code + the byte length, or a
-`std::string` code). `codespacerange` defines the code byte-lengths; record the
-ranges so `translate_string` can chunk a string into codes of the right width
-(most `ToUnicode` CMaps are fixed 1- or 2-byte, but the ranges are authoritative).
-- **`bfrange`.** Both forms: `<lo> <hi> <dst>` (increment the last UTF-16 unit of
-`dst` across the code range, per spec only the low byte) and
-`<lo> <hi> [ <dst0> <dst1> … ]` (array of per-code destinations).
-- **Multi-character targets.** A code may map to a *string* of UTF-16 units
-(ligatures, e.g. `fi` → `f` `i`); `map_bfchar`'s value type widens from
-`char16_t` to `std::u16string` (or a small-string equivalent).
-- **`translate_string`** chunks the input by the codespace widths, looks up each
-code, and falls back to the identity/byte value (today's behaviour) on a miss
-— keeping the "unknown" path for later sub-stages to refine.
-- Replace the two `std::cerr` warnings with the module's `Logger` (thread one in,
-as `DocumentParser` already does), or drop them once the cases are handled.
-
-Tests (assertion-based, inline CMap strings — no fixtures): a 2-byte
-`codespacerange`; `bfrange` increment form and array form; a multi-char `bfchar`
-target; mixed widths; a miss falling back to identity. This matches the existing
-inline-string test convention for the module.
-
-Out of scope for 1.1: anything needing `/Encoding`, the AGL, predefined CMaps,
-or font-file reading — those are 1.2–1.4.
-
-### 1.2 — simple-font encoding → Unicode — **done**
-
-`/Encoding` base (WinAnsi/MacRoman/Standard) + `/Differences` → glyph names →
-Unicode via the Adobe Glyph List (incl. `uniXXXX`/`uXXXXXX` names). Carries the
-data weight of the three base-encoding tables **and** the full AGL (~4,300
-entries) — a generated data file. Own branch.
-
-The chain, per simple font that has no `ToUnicode` CMap: a 1-byte code → glyph
-name (base encoding, overlaid with `/Differences`) → Unicode (AGL lookup, or the
-algorithmic `uniXXXX`/`uXXXXXX` forms). `translate_string` walks a code string
-byte by byte; an unmapped name yields "no Unicode" (empty), left for stage 1.5.
-
-**Data as committed generated source.** `tools/pdf/generate_encoding_data.py`
-emits `pdf_encoding_data.{hpp,cpp}` (the three full base-encoding tables + the
-AGL as a name-sorted array for binary search); the build only compiles the
-result, so there is no build-time codegen dependency. All source data is
-vendored next to the script as `.txt` files (the base encodings plus
-[Adobe's AGL](https://github.com/adobe-type-tools/agl-aglfn)); re-run the script
-with no arguments to regenerate. See [`tools/pdf/README.md`](../../../../tools/pdf/README.md)
-for the data files and their provenance/licensing.
-
-Landed:
-- `pdf_encoding.{hpp,cpp}`: `BaseEncoding` (Standard/WinAnsi/MacRoman),
-`base_encoding_table` / `base_encoding_from_name`, `glyph_name_to_unicode`
-(AGL + `uniXXXX`/`uXXXXXX`), and the `Encoding` class (base + `/Differences` →
-`translate_string`).
-- `pdf_encoding_data.{hpp,cpp}`: the full Annex D tables + the full AGL (4,281
-entries), generated.
-- `/Encoding` parsing wired into `parse_font` (`parse_encoding`): a base name, or
-a dictionary with `/BaseEncoding` + `/Differences`. Stored on `Font::encoding`
-(a `std::optional<Encoding>`).
-- `Font::to_unicode` picks the path — `ToUnicode` CMap when present (via
-`CMap::empty`), else the `/Encoding`, else identity — and the HTML text path
-(`html/pdf_file.cpp`) calls it instead of the CMap directly, so simple fonts
-with only an `/Encoding` now extract text.
-
-Remaining (1.2 deferrals):
-- Symbolic fonts / the "built-in encoding" default (no `/BaseEncoding`) need the
-font program — defer to stage 1.4; for now StandardEncoding is the default base.
-- An unmapped glyph name still yields "no Unicode" (empty) — refined in 1.5.
-
-### 1.3 — composite (Type0/CID) fonts — **functionally done; legacy CJK CMaps deferred**
-
-**Status.** Everything composite fonts need for the corpus seen is in place:
-Type0 structure, `Identity-H/V`, `/ToUnicode`-driven extraction (part A), and the
-predefined Unicode CMaps (part B, `Uni*-UCS2/UTF16/UTF32`). The **one** remaining
-piece is the legacy CJK code→CID CMaps (RKSJ/EUC/Big5/GBK/KSC) plus their
-`CID → Unicode` tables — large external data, no fixture in the corpus, and an
-uncommon case in modern PDFs. It is treated as a documented deferral (a follow-up
-on the landed generator scaffolding), not a blocker: 1.3 can be considered closed
-for the renderer's purposes, with the legacy tables an optional add-on.
-
-`Identity-H/V` plus the predefined CMaps (CJK); map CID → Unicode via the CID
-system info where defined. The predefined CMaps are large external data sets —
-the heaviest data chunk of the stage. Split into two parts because the data
-weight is concentrated in part B, and the whole local corpus (every Type0 font
-is `Identity-H` + `/ToUnicode`) is covered by part A alone.
-
-**Part A — Type0 structure + `Identity-H/V` + `/ToUnicode` — done.** `parse_font`
-detects `/Subtype /Type0`, walks `/DescendantFonts[0]` to record the descendant
-CIDFont's `/CIDSystemInfo` `/Registry`/`/Ordering` on `Font` (`composite`,
-`cid_registry`, `cid_ordering`), and keeps the Type0 `/Encoding` (a code → CID
-CMap, not a glyph-name encoding) out of `parse_encoding` — so `Identity-H` no
-longer trips the "unsupported /Encoding name" warning. Extraction runs through
-the existing multi-byte `/ToUnicode` path (stage 1.1); `Font::to_unicode`
-returns "no Unicode" for a composite font lacking a `/ToUnicode` rather than
-mis-splitting its multi-byte codes through the single-byte identity fallback.
-Tests: `DocumentParser.composite_font_with_to_unicode` /
-`…_without_to_unicode_yields_no_unicode`.
-
-**Part B — predefined CMaps.** A composite font names a predefined CMap as its
-`/Encoding`. These split by data weight:
-
-- **Unicode CMaps (`Uni*-UCS2/UTF16/UTF32`) — done.** Their character codes
-already *are* Unicode (big-endian), so `pdf_cid.cpp` decodes them directly with
-**no data tables**: `Font::to_unicode` records the Type0 `/Encoding` name
-(`Font::cid_encoding_name`) and, lacking a `/ToUnicode`, routes it through
-`translate_predefined_cmap` (UCS2/UTF16 → 2-byte UTF-16BE incl. surrogate
-pairs; UTF32 → 4-byte). Tests: `PdfCid.*` and
-`DocumentParser.composite_font_predefined_unicode_cmap`. This covers the bulk
-of modern CJK PDFs.
-- **Legacy CJK CMaps (RKSJ/EUC/Big5/GBK/KSC) — deferred.** These map
-`code → CID`, so they need the per-collection `CID → Unicode` table too
-(selected by the recorded `/Registry`+`/Ordering`). The data is large:
-`tools/pdf/generate_cid_data.py` already fetches Adobe's `cmap-resources`
-(git-ignored input) and emits block-encoded range arrays, measured at ~3.3 MB
-of committed C++ (~855 KB if zlib+base64-compressed). The decision on how to
-store it compactly — and the C++ lookup over it — is the remaining work. **No
-CJK fixture in the corpus**, so this is validated with synthetic inline
-mini-PDFs.
-
-### 1.4 — embedded-font fallback — **deferred (needs stage 3)**
-
-Reverse the TrueType `cmap`; read glyph names from Type1/CFF charstrings.
-Explicitly depends on stage 3's font *reading*, so it cannot start until that
-machinery exists.
-
-### 1.5 — "no Unicode" runs + `/ActualText`
-
-Nothing applies → mark the run "no Unicode" for stage 3's re-encoding.
-`/ActualText` (tagged PDFs, ligatures) overrides the whole chain for extraction.
-Small, but rides on the run/state plumbing the sub-stages above introduce.
+## Stage 1 — text extraction: the code → Unicode chain — **done**
+
+**Goal.** PDF strings are **character codes**; per font, walk code → Unicode and
+record it per code (or "no Unicode", which stage 3 re-encodes). The work was
+split into sub-stages by data weight and dependency; 1.1–1.3 have landed and
+cover the local corpus and the bulk of real-world PDFs. The mechanics live in
+*Fonts / text mapping* under *What works*; this is the summary.
+
+**Achieved**
+- **1.1 — `ToUnicode` CMap.** Multi-byte codes (codespace-driven chunking), both
+`bfrange` forms, multi-character (ligature) targets.
+- **1.2 — simple-font `/Encoding`.** Standard/WinAnsi/MacRoman base encodings +
+`/Differences` → glyph name → Unicode via the generated Adobe Glyph List (incl.
+the algorithmic `uniXXXX`/`uXXXXXX` forms).
+- **1.3 — composite (Type0/CID) fonts.** Type0 structure + `/CIDSystemInfo`,
+`Identity-H/V`, `/ToUnicode`-driven extraction, and the predefined Unicode
+CMaps (`Uni*-UCS2/UTF16/UTF32`, decoded directly with no data tables).
+
+**Deferred — relocated to later stages.** None blocks the renderer and no corpus
+fixture needs them yet:
+- **Legacy CJK code→CID CMaps** (RKSJ/EUC/Big5/GBK/KSC) + their `CID → Unicode`
+tables — large external data; the generator/fetch scaffolding is landed
+(`tools/pdf/generate_cid_data.py`), the storage decision and C++ lookup remain.
+Tracked under *Other known gaps* (CMap coverage).
+- **Embedded-font reverse map** (was 1.4) — needs the font reading machinery;
+folded into **stage 3**.
+- **"No Unicode" run marking + `/ActualText`** (was 1.5) — rides on the run/state
+plumbing introduced by **stage 2**, where it now lives.
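
The two `bfrange` forms summarized under 1.1 can be illustrated with a minimal sketch. The names `CodeMap`, `expand_bfrange_increment` and `expand_bfrange_array` are hypothetical and are not taken from `pdf_cmap.{hpp,cpp}`. The sketch assumes codes fit in a `std::uint32_t`, increments the whole last UTF-16 unit without enforcing the spec's low-byte restriction, and (also an assumption) lets a short destination array map only as many codes as it has entries.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for the CMap's code -> UTF-16 table.
using CodeMap = std::map<std::uint32_t, std::u16string>;

// Increment form `<lo> <hi> <dst>`: code `lo` maps to `dst`, and each
// following code maps to `dst` with its last UTF-16 unit incremented by one.
inline void expand_bfrange_increment(CodeMap &map, std::uint32_t lo,
                                     std::uint32_t hi,
                                     const std::u16string &dst) {
  assert(lo <= hi && !dst.empty());
  // 64-bit count so that hi == UINT32_MAX cannot wrap the loop counter.
  const std::uint64_t count = std::uint64_t{hi} - lo + 1;
  for (std::uint64_t i = 0; i < count; ++i) {
    std::u16string target = dst;
    target.back() = static_cast<char16_t>(target.back() + i);
    map[lo + static_cast<std::uint32_t>(i)] = std::move(target);
  }
}

// Array form `<lo> <hi> [ <dst0> <dst1> ... ]`: one destination per code.
inline void expand_bfrange_array(CodeMap &map, std::uint32_t lo,
                                 std::uint32_t hi,
                                 const std::vector<std::u16string> &dsts) {
  assert(lo <= hi);
  const std::uint64_t count = std::uint64_t{hi} - lo + 1;
  // Assumption of this sketch: extra codes beyond the array stay unmapped.
  for (std::uint64_t i = 0; i < count && i < dsts.size(); ++i) {
    map[lo + static_cast<std::uint32_t>(i)] = dsts[i];
  }
}
```

With these helpers, `expand_bfrange_increment(map, 0x41, 0x43, u"a")` maps the codes `0x41`, `0x42`, `0x43` to `a`, `b`, `c`, and the array form can carry multi-character (ligature) targets such as `u"fi"`.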

## Stage 2 — text positioning & metrics

@@ -488,6 +366,10 @@ Independent of Unicode work; fixes layout even with today's partial CMaps.
- HTML mapping decision: per-run spans with CSS `transform` (cheap, breaks on
heavy kerning) vs. per-glyph positioning (exact, verbose) — likely per-run
with a kerning threshold that splits runs, like pdf2htmlEX.
+- **Extraction refinements** (was stage 1.5, rides on the run plumbing above):
+mark a run "no Unicode" when the code → Unicode chain yields nothing, so stage 3
+can re-encode it; honour `/ActualText` (tagged PDFs, ligatures) as an extraction
+override of the whole chain.

## Stage 3 — fonts in HTML

@@ -499,8 +381,11 @@ cost of a notoriously heavy build. No trimmed off-the-shelf alternative does
what we need (FreeType/stb_truetype are read-only; hb-subset can only subset
along the *existing* `cmap`, so it cannot inject the PUA mappings below).
Expected ~5–8k lines of focused C++ — on the order of an `oldms/` module.
-Reading (SFNT tables, CFF charsets) is the easy part and is needed by stage 1.4
-anyway.
+Reading (SFNT tables, CFF charsets) is the easy part and also yields the
+**embedded-font reverse map** (was stage 1.4): a font with no usable
+`/ToUnicode` or `/Encoding` gets code → Unicode from its embedded program — the
+TrueType `cmap` reversed, or Type1/CFF charstring glyph names through the AGL —
+closing the last gap in the stage-1 extraction chain.

**Architecture: IR for facts, pass-through for glyphs.** No glyph-level font IR:
decompiling and recompiling outlines is the FontForge model — loses hinting,
@@ -510,7 +395,7 @@ byte-for-byte; even Type1 → Type2 charstrings is a direct sibling-format
translation. What *is* shared: a thin `FontProgram`-style interface — per-flavor
readers producing the facts every consumer needs (glyph count, glyph → Unicode,
advance widths, units-per-em, name, bbox, symbolic flag) with raw bytes kept
-alongside. Stage 1.4 reads Unicode from it, the OTF wrap synthesizes
+alongside. The embedded-font reverse map (above) reads Unicode from it, the OTF wrap synthesizes
`head`/`hhea`/`hmtx`/`OS/2` from it, the re-encoder assigns PUA code points from
its glyph count.

@@ -633,8 +518,10 @@ tree, little else.
recognized and extract through their `/ToUnicode` (stage 1.3 part A) or, when
absent, a predefined Unicode `/Encoding` (`Uni*-UCS2/UTF16/UTF32`, stage 1.3
part B). Still open: the legacy CJK code→CID CMaps (RKSJ/EUC/Big5/GBK/KSC) and
-their CID → Unicode tables (the deferred half of part B), and embedded-font
-reverse maps (stage 1.4); symbolic fonts with a built-in encoding default to
-StandardEncoding until 1.4.
+their CID → Unicode tables (large external data; the generator scaffolding in
+`tools/pdf/generate_cid_data.py` is landed, the storage decision and lookup
+remain), and embedded-font reverse maps (stage 3); symbolic fonts with a
+built-in encoding default to StandardEncoding until the font program is read
+(stage 3).
- **Annotations** are collected but their content is not interpreted (stage 5).
- Revisit the reference-by-lookahead parsing and `read_stream(-1)` fallback.
6 changes: 3 additions & 3 deletions src/odr/internal/pdf/pdf_cid.hpp
@@ -10,13 +10,13 @@ namespace odr::internal::pdf {
/// (a composite font's `/Encoding` named in the PDF, e.g. `UniGB-UCS2-H`),
/// returning the UTF-8 text.
///
-/// Stage 1.3 (part B) supports the predefined **Unicode** CMaps — the
+/// Supports the predefined **Unicode** CMaps — the
/// `Uni*-UCS2`, `Uni*-UTF16` and `Uni*-UTF32` families — whose character codes
/// already *are* Unicode (big-endian), so they are decoded directly with no
/// data tables. Returns `nullopt` for the legacy CJK code→CID CMaps
/// (RKSJ/EUC/Big5/GBK/KSC) and for `Identity-H/V`, which need CID→Unicode
-/// tables (the legacy half of part B, deferred — see
-/// `tools/pdf/generate_cid_data.py`) or the embedded font program (stage 1.4);
+/// tables (the legacy CMaps, deferred — see
+/// `tools/pdf/generate_cid_data.py`) or the embedded font program (stage 3);
/// the caller then treats the run as "no Unicode".
[[nodiscard]] std::optional<std::string>
translate_predefined_cmap(std::string_view name, const std::string &codes);
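
As a rough illustration of why the `Uni*-UCS2` and `Uni*-UTF16` families need no data tables, the sketch below decodes a UTF-16BE code string straight to UTF-8, including surrogate pairs. `decode_utf16be` and `append_utf8` are hypothetical names and do not reproduce `pdf_cid.cpp`; returning `nullopt` on an odd length or an unpaired surrogate is an assumption of this sketch, not documented behaviour of `translate_predefined_cmap`.

```cpp
#include <cstddef>
#include <cstdint>
#include <optional>
#include <string>

// Append one Unicode scalar value to `out` as UTF-8.
inline void append_utf8(std::string &out, std::uint32_t cp) {
  if (cp < 0x80) {
    out += static_cast<char>(cp);
  } else if (cp < 0x800) {
    out += static_cast<char>(0xC0 | (cp >> 6));
    out += static_cast<char>(0x80 | (cp & 0x3F));
  } else if (cp < 0x10000) {
    out += static_cast<char>(0xE0 | (cp >> 12));
    out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
    out += static_cast<char>(0x80 | (cp & 0x3F));
  } else {
    out += static_cast<char>(0xF0 | (cp >> 18));
    out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
    out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
    out += static_cast<char>(0x80 | (cp & 0x3F));
  }
}

// Decode 2-byte big-endian codes (UCS-2 / UTF-16BE) to UTF-8.
inline std::optional<std::string> decode_utf16be(const std::string &codes) {
  if (codes.size() % 2 != 0) {
    return std::nullopt; // assumption: odd-length input is rejected
  }
  const auto unit = [&codes](std::size_t i) -> std::uint32_t {
    const auto hi = static_cast<std::uint8_t>(codes[i]);
    const auto lo = static_cast<std::uint8_t>(codes[i + 1]);
    return (static_cast<std::uint32_t>(hi) << 8) | lo;
  };
  std::string out;
  for (std::size_t i = 0; i < codes.size(); i += 2) {
    std::uint32_t cp = unit(i);
    if (cp >= 0xD800 && cp <= 0xDBFF) {
      // High surrogate: it must be followed by a low surrogate.
      if (i + 4 > codes.size()) {
        return std::nullopt;
      }
      const std::uint32_t low = unit(i + 2);
      if (low < 0xDC00 || low > 0xDFFF) {
        return std::nullopt;
      }
      cp = 0x10000 + ((cp - 0xD800) << 10) + (low - 0xDC00);
      i += 2; // consume the low surrogate as well
    } else if (cp >= 0xDC00 && cp <= 0xDFFF) {
      return std::nullopt; // stray low surrogate
    }
    append_utf8(out, cp);
  }
  return out;
}
```

A UTF-32 variant would read four bytes per code and skip the surrogate handling; the legacy code→CID CMaps cannot be handled this way because their codes are not Unicode.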
4 changes: 2 additions & 2 deletions src/odr/internal/pdf/pdf_cmap.cpp
@@ -46,8 +46,8 @@ std::string CMap::translate_string(const std::string &codes) const {
}

// Unknown code: fall back to its numeric value as a single UTF-16 unit
-// (identity for single-byte codes). Stage 1.5 will refine the handling of
-// these "no Unicode" runs.
+// (identity for single-byte codes). These "no Unicode" runs are left for
+// later re-encoding.
std::uint32_t value = 0;
for (const char c : code) {
value = (value << 8) | static_cast<std::uint8_t>(c);
6 changes: 3 additions & 3 deletions src/odr/internal/pdf/pdf_document.cpp
@@ -37,12 +37,12 @@ std::string Font::to_unicode(const std::string &codes) const {
if (composite) {
// A composite (Type0) font with no `ToUnicode` CMap. A predefined Unicode
// `/Encoding` (the `Uni*-UCS2/UTF16/UTF32` CMaps) carries Unicode directly
-// in its codes, so decode it (stage 1.3 part B). Otherwise code -> CID is
+// in its codes, so decode it. Otherwise code -> CID is
// known (identity for `Identity-H/V`) but CID -> Unicode needs a predefined
// CID -> Unicode table (the legacy CMaps, deferred) or the embedded font
-// program (stage 1.4): emit "no Unicode" rather than mis-splitting the
+// program (stage 3): emit "no Unicode" rather than mis-splitting the
// multi-byte codes into byte-sized garbage through the identity fallback
-// below. Stage 1.5 will mark these runs for re-encoding.
+// below. Stage 2 will mark these runs for re-encoding.
if (!cid_encoding_name.empty()) {
if (std::optional<std::string> unicode =
translate_predefined_cmap(cid_encoding_name, codes)) {
15 changes: 7 additions & 8 deletions src/odr/internal/pdf/pdf_document_element.hpp
@@ -80,27 +80,26 @@ struct Font final : Element {
/// fallback used when no `ToUnicode` CMap is present.
std::optional<Encoding> encoding;

-/// True for composite (Type0) fonts (stage 1.3). Their character codes are
+/// True for composite (Type0) fonts. Their character codes are
/// multi-byte and select CIDs via the Type0 `/Encoding` CMap; `/ToUnicode` is
/// the code -> Unicode path. Code -> CID via predefined CJK CMaps and the
-/// CID -> Unicode tables are stage 1.3 (part B); embedded-font reverse maps
-/// are stage 1.4.
+/// CID -> Unicode tables are deferred; embedded-font reverse maps
+/// are stage 3.
bool composite{false};
/// The descendant CIDFont's `/CIDSystemInfo` `/Registry` and `/Ordering`
/// (e.g. `Adobe` / `Identity` or `Adobe` / `Japan1`). Recorded for the
-/// predefined CID -> Unicode table selection of stage 1.3 (part B); empty for
+/// predefined CID -> Unicode table selection; empty for
/// simple fonts.
std::string cid_registry;
std::string cid_ordering;
/// The composite font's `/Encoding` when it is a *predefined* CMap name (e.g.
/// `Identity-H`, `UniGB-UCS2-H`); empty for an embedded CMap stream. Drives
-/// the predefined Unicode-CMap extraction path (stage 1.3 part B).
+/// the predefined Unicode-CMap extraction path.
std::string cid_encoding_name;

/// Translate a string of character codes to Unicode: the `ToUnicode` CMap
-/// when present (authoritative), else, for a composite font, "no Unicode"
-/// (stage 1.3 part B / 1.4 territory), else the simple-font `/Encoding`, else
-/// identity bytes.
+/// when present (authoritative), else, for a composite font, "no Unicode",
+/// else the simple-font `/Encoding`, else identity bytes.
[[nodiscard]] std::string to_unicode(const std::string &codes) const;
};
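
The precedence described in the `to_unicode` doc comment can be sketched as a small dispatch. `FontSketch` and its members are hypothetical stand-ins, not the real `Font` struct; representing each translator as an optional `std::function` and "no Unicode" as an empty string are assumptions made only to keep the example self-contained.

```cpp
#include <functional>
#include <optional>
#include <string>

// Hypothetical stand-in for `Font`; an empty std::function means "absent".
struct FontSketch {
  std::function<std::string(const std::string &)> to_unicode_cmap;
  std::function<std::optional<std::string>(const std::string &)> predefined_cmap;
  std::function<std::string(const std::string &)> encoding;
  bool composite{false};

  std::string to_unicode(const std::string &codes) const {
    // 1. A `ToUnicode` CMap is authoritative when present.
    if (to_unicode_cmap) {
      return to_unicode_cmap(codes);
    }
    // 2. Composite font: try a predefined Unicode CMap, else "no Unicode".
    //    Never fall through to the single-byte paths below.
    if (composite) {
      if (predefined_cmap) {
        if (std::optional<std::string> unicode = predefined_cmap(codes)) {
          return *unicode;
        }
      }
      return {}; // assumption: "no Unicode" is modelled as an empty string
    }
    // 3. Simple font with an `/Encoding`.
    if (encoding) {
      return encoding(codes);
    }
    // 4. Last resort: identity bytes.
    return codes;
  }
};
```

The point of step 2 returning early is the one the comments in `pdf_document.cpp` make: a composite font's multi-byte codes must not be split byte by byte through the identity fallback.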
