Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
116 changes: 73 additions & 43 deletions src/odr/internal/pdf/AGENTS.md
Original file line number Diff line number Diff line change
Expand Up @@ -13,9 +13,9 @@ tables, cross-reference streams, object streams, hybrid files, with a
forward-scan recovery path for broken cross-references), build the page
tree with fonts and annotations, tokenize page content streams into graphics
operators, and emit a **proof-of-concept HTML rendering**: absolutely positioned
text spans, one per show operation, placed by the full text transform (CTM ×
text matrix, stage 2.1), pages sized from `MediaBox`. Encrypted files are
decrypted (RC4, AES-128, AES-256). No glyph advances yet (stage 2.2), no
text spans, one per shown segment, placed by the full text transform (CTM × text
matrix) and advanced by the parsed glyph widths (stages 2.1–2.2), pages sized
from `MediaBox`. Encrypted files are decrypted (RC4, AES-128, AES-256). No
graphics, no images, no font files. Experimental and not production-quality.

---
Expand Down Expand Up @@ -106,6 +106,13 @@ graphics, no images, no font files. Experimental and not production-quality.
other case (`Identity-H/V`, or the legacy CJK code→CID CMaps) yields "no
Unicode" (not byte-garbage) until the legacy CID → Unicode tables (the
deferred half of part B) or the embedded font program (stage 3) land.
- **Glyph metrics** (stage 2.2): a font's advance widths are parsed —
`/FirstChar` + `/Widths` + `/FontDescriptor` `/MissingWidth` for simple fonts,
`/W` + `/DW` (the descendant CIDFont, both `c [w…]` and `c_first c_last w`
forms) for composite fonts. `Font::advance_width(code)` returns the advance in
text-space units (glyph-space / 1000), falling back to `/MissingWidth` or `/DW`.
Codes outside the corpus are interpreted as CIDs for composite fonts (identity);
AFM widths for the non-embedded standard-14 fonts are stage 3.
- **Content streams**: the full graphics-operator vocabulary is tokenized;
`GraphicsState` executes a subset (state stack `q`/`Q`, matrices `cm`/`Tm`,
line parameters, text state `Tc`/`Tw`/`Tz`/`TL`/`Tf`/`Tr`/`Ts`, glyph metrics
Expand All @@ -114,15 +121,22 @@ graphics, no images, no font files. Experimental and not production-quality.
affine `Transform2D` values (`util/math_util.hpp`), with `BT` resetting them, `Td`/`TD`
/`T*` (and the line-move half of `'`/`"`) advancing `Tlm` → `Tm`. Unknown
operators are logged to stderr and skipped.
- **Text layout** (`pdf_page_text`, stage 2.1): `extract_text` runs the operator
parser + `GraphicsState` over a page's content and emits a renderer-agnostic
`TextElement` per show operation (`Tj`/`TJ`/`'`/`"`) — its text-space → user-
space transform (CTM × `Tm`, with horizontal scaling and rise folded in, font
size kept separate), the resolved font, size, spacing parameters, raw codes,
and the CMap-translated Unicode. Font lookup is lenient (unknown ref → warn,
raw codes). **Glyph advances are not yet applied** (stage 2.2): each show op
yields one element at its starting origin, `TJ`'s numeric adjustments are
dropped, and `Tc`/`Tw`/`Tz`-driven spacing is carried but not consumed.
- **Text layout** (`pdf_page_text`, stages 2.1–2.2): `extract_text` runs the
operator parser + `GraphicsState` over a page's content and emits a
renderer-agnostic `TextElement` per shown *segment* (one `Tj`/`'`/`"`, or one
string of a `TJ` array) — its text-space → user-space transform (CTM × `Tm`,
with horizontal scaling and rise folded in, font size kept separate), the
resolved font, size, spacing parameters, raw codes, the CMap-translated
Unicode, and the segment's per-code advances plus their total. Font lookup is
lenient (unknown ref → warn, raw codes). **Glyph advances are applied** (stage
2.2): after each segment the text matrix `Tm` advances by the glyph widths ×
font size plus char/word spacing (× horizontal scaling), and a `TJ` number
translates `Tm` by `−n/1000 × Tfs × Th` — so segments, `TJ` kerning and lines
land in the right place. The element carries the per-code advances directly, so
a renderer wanting per-glyph placement need not re-derive them from
`font->advance_width`. Still deferred: intra-segment glyph shaping (the browser
lays a segment out in a fallback font until stage 3) and vertical writing-mode
advances (stage 2.6).
- **HTML**: one `document.html` view; each page is a `div` sized from `MediaBox`
(points → inches). Each `TextElement` becomes an absolutely positioned `span`
carrying a CSS `transform` matrix (the placement transform mapped from PDF user
Expand All @@ -143,7 +157,7 @@ graphics, no images, no font files. Experimental and not production-quality.
| `pdf_document_parser.{hpp,cpp}` | `parse_document()`: xref/trailer chain → catalog → page tree; lazy object reads with cache; (deep) reference resolution |
| `pdf_encryption.{hpp,cpp}` | Standard security handler: `Authenticator` (parse `/Encrypt`, authenticate password → `Decryptor`) and `Decryptor` (decrypt strings/streams; RC4, AES-128, AES-256), plus a `standard_security` namespace of pure key/password algorithms for known-answer tests |
| `pdf_document.hpp` | `Document`: arena of `Element`s + `catalog` pointer |
| `pdf_document_element.hpp` | Element structs: `Catalog`, `Pages`, `Page`, `Annotation`, `Resources`, `Font` (incl. the `composite`/`cid_registry`/`cid_ordering` Type0 facts and `to_unicode`) |
| `pdf_document_element.hpp` | Element structs: `Catalog`, `Pages`, `Page`, `Annotation`, `Resources`, `Font` (incl. the `composite`/`cid_registry`/`cid_ordering` Type0 facts, the `/Widths`-`/W`/`/DW` glyph metrics + `advance_width`, and `to_unicode`) |
| `pdf_cmap.{hpp,cpp}` | `CMap`: 1-byte glyph → UTF-16 `bfchar` map + string translation |
| `pdf_cmap_parser.{hpp,cpp}` | `ToUnicode` CMap stream parser (`begincodespacerange`/`beginbfchar`/`beginbfrange`; only `bfchar` applied) |
| `pdf_encoding.{hpp,cpp}` | Simple-font `/Encoding` → Unicode: `BaseEncoding` tables, `/Differences` overlay (`Encoding`), glyph-name → Unicode via AGL + `uniXXXX`/`uXXXXXX` (stage 1.2) |
Expand All @@ -152,8 +166,8 @@ graphics, no images, no font files. Experimental and not production-quality.
| `util/math_util.hpp` | `util::math::Transform2D`: 2-D affine transform (PDF row-vector convention) — compose, point-apply, translation/scaling factories (stage 2.1) |
| `pdf_graphics_operator.hpp` | `GraphicsOperatorType` enum (full operator set) + `GraphicsOperator` (type + `Object` arguments) |
| `pdf_graphics_operator_parser.{hpp,cpp}` | Content-stream tokenizer: arguments then operator name |
| `pdf_graphics_state.{hpp,cpp}` | `GraphicsState`: stack of `State` (general/path/text/color), `execute(op)` for the modelled subset; CTM/`Tm`/`Tlm` as `Transform2D`, `text_placement_matrix()` for the text rendering transform sans font size |
| `pdf_page_text.{hpp,cpp}` | `extract_text`: run the content stream through `GraphicsState`, emit a `TextElement` (placed transform + font/size/spacing + codes + Unicode) per show operation (stage 2.1) |
| `pdf_graphics_state.{hpp,cpp}` | `GraphicsState`: stack of `State` (general/path/text/color), `execute(op)` for the modelled subset; CTM/`Tm`/`Tlm` as `Transform2D`, `text_placement_matrix()` for the text rendering transform sans font size, `advance_text()` for the post-glyph `Tm` advance |
| `pdf_page_text.{hpp,cpp}` | `extract_text`: run the content stream through `GraphicsState`, emit a `TextElement` (placed transform + font/size/spacing + codes + Unicode + per-code advances + total advance) per shown segment, advancing `Tm` by the glyph widths and `TJ` adjustments (stages 2.1–2.2) |
| `pdf_file.{hpp,cpp}` | `abstract::PdfFile` wrapper; probes encryption at construction and implements `password_encrypted()`/`decrypt()`, carrying the authenticated `Decryptor` (not the password) so rendering needs no re-derivation |

Consumers outside the module: `open_strategy.cpp` (detection / engine
Expand Down Expand Up @@ -189,13 +203,15 @@ selection) and `html/pdf_file.cpp` (`create_pdf_service`).
decoded through their `/Filter` chain (`read_decoded_stream`), concatenated
with a newline between streams.
6. **Lay out and emit.** `extract_text` runs `GraphicsOperatorParser` +
`GraphicsState` over the content and returns a `TextElement` per show
operation, each placed by `text_placement_matrix()` (CTM × `Tm`, with
horizontal scaling and rise folded in), its glyphs translated through the
font's CMap. The HTML layer maps each element to a positioned `span` with a
CSS `transform` (PDF user space → the page box in CSS pixels) and `font-size`
from the text state. Glyph advances are **not yet applied** (stage 2.2), so
shows without an explicit move overlap.
`GraphicsState` over the content and returns a `TextElement` per shown
segment, each placed by `text_placement_matrix()` (CTM × `Tm`, with horizontal
scaling and rise folded in), its glyphs translated through the font's CMap.
After each segment `Tm` is advanced by the glyph widths (`advance_width`) plus
char/word spacing, and `TJ` numbers translate `Tm` directly, so segments and
lines land correctly. The HTML layer maps each element to a positioned `span`
with a CSS `transform` (PDF user space → the page box in CSS pixels) and
`font-size` from the text state. Intra-segment glyph shaping is the browser's
until the embedded font lands (stage 3).

---

Expand Down Expand Up @@ -294,8 +310,10 @@ such PDFs look right, their text just isn't selectable until the tables land.
stream), plus composite-font coverage (a Type0 font over an `Identity-H`
descendant `CIDFontType2`: `composite`/`/CIDSystemInfo` recorded, 2-byte
`/ToUnicode` extraction, the no-`/ToUnicode` "no Unicode" fallback, and a
predefined `Uni*-UCS2-H` `/Encoding` extracting without a `/ToUnicode`).
End-to-end: the classic fixture
predefined `Uni*-UCS2-H` `/Encoding` extracting without a `/ToUnicode`), plus
glyph-metric coverage (the composite `/W`+`/DW` and a simple
`/FirstChar`/`/Widths`/`/MissingWidth` font, asserted through `advance_width`,
stage 2.2). End-to-end: the classic fixture
`odr-public/pdf/style-various-1.pdf`, plus decryption of
`odr-public/pdf/Casio_WVA-M650-7AJF.pdf` (RC4, empty password) and
`odr-private/pdf/encrypted_fontfile3_opentype.pdf` (AES-256; skipped when the
Expand Down Expand Up @@ -327,10 +345,13 @@ such PDFs look right, their text just isn't selectable until the tables land.
`Transform2D` point-apply (identity/translation/scaling), the ordered
(row-vector) composition, and compose-then-apply ≡ sequential apply (stage 2.1).
- `test/src/internal/pdf/pdf_page_text.cpp` — **assertion-based**, inline content
streams through `extract_text` (empty resources, so codes pass through as
`text`): `Td` translation, `Tm` scaling, `cm` CTM concatenation under `Tm`,
horizontal scaling and rise in the transform, `TJ` string concatenation, and
the `T*`/`'`/`"` line moves with their leading and spacing (stage 2.1).
streams through `extract_text`: `Td` translation, `Tm` scaling, `cm` CTM
concatenation under `Tm`, horizontal scaling and rise in the transform, and the
`T*`/`'`/`"` line moves with their leading and spacing (stage 2.1); plus
glyph-advance coverage with hand-built `Font`s — simple `/Widths` advancing a
following show, `TJ` emitting per string with the numeric adjustment applied,
char spacing, word spacing on the single-byte space, the composite 2-byte `/DW`
advance, and the `advance_width` fallbacks (stage 2.2).

No assertion-based coverage of the tokenizer (escapes, references, hex strings)
or the HTML output itself (the span emission / CSS transform mapping).
Expand Down Expand Up @@ -401,7 +422,7 @@ per-glyph positioning. **The core never commits to either**; this pushes the
run-vs-glyph question all the way down to rendering. (The earlier framing of an
up-front "HTML mapping decision" is dissolved into this.)

### 2.1 — transforms & the placed-text emission — **in progress**
### 2.1 — transforms & the placed-text emission — **done**

The geometry foundation plus the emission contract, *without* glyph advances:
- A 2-D affine `Transform2D` (`util/math_util.hpp`): compose, point-apply,
Expand All @@ -420,20 +441,29 @@ The geometry foundation plus the emission contract, *without* glyph advances:
state, the page y-axis flipped once per page. The text-path debug `std::cout`
(incl. the `"hi"` marker) is removed.

**Deliberately out of scope here (→ 2.2):** glyph advances (`/Widths`,
`/W`/`/DW`) and the *application* of char/word spacing and the `TJ` numeric
adjustments, so consecutive shows on a line without an explicit move still
overlap, and `TJ` renders its strings concatenated at one origin. Precise
baseline placement (needs font ascent metrics) is likewise deferred. Whether 2.2
folds into this PR or branches off is an open call once 2.1 lands.

### 2.2 — glyph advances & metrics

Parse `/Widths` + `/MissingWidth` (simple) and `/W` + `/DW` (CID); apply char/word
spacing, horizontal scaling and the `TJ` numeric adjustments to advance the text
matrix per glyph — so `TJ`, `'`, `"` land correctly, `Tj` runs space correctly,
and the emission can be subdivided per glyph (which makes the renderer's
per-glyph option exercisable).
**Deliberately out of scope here (→ 2.2):** glyph advances. (Precise baseline
placement — needs font ascent metrics — stays deferred past 2.2 too.)

### 2.2 — glyph advances & metrics — **done**

Glyph metrics and the per-glyph text-matrix advance, on top of 2.1's emission:
- **Width parsing** (`pdf_document_parser`, `Font`): `/FirstChar` + `/Widths` +
`/FontDescriptor` `/MissingWidth` (simple), `/W` + `/DW` (the descendant
CIDFont, both `c [w…]` and `c_first c_last w` forms). `Font::advance_width(code)`
returns the advance in text-space units, with the `/MissingWidth` / `/DW`
fallbacks; `code_byte_width()` is 1 (simple) / 2 (composite).
- **Advance application** (`extract_text`, `GraphicsState::advance_text`): a
`TextElement` is now emitted per shown *segment* (one `Tj`/`'`/`"`, or one
string of a `TJ` array); after each segment `Tm` advances by Σ(width × Tfs + Tc
[+ Tw for single-byte 0x20]) × Th, and a `TJ` number translates `Tm` by
`−n/1000 × Tfs × Th`. So `TJ`/`'`/`"` land correctly and `Tj` segments space
correctly. The element carries both its per-code advances and their total, so
per-glyph placement needs no re-derivation from `font->advance_width`, keeping
the run-vs-glyph choice in the renderer.

Out of scope (later): intra-segment glyph shaping (browser fallback until the
embedded font, stage 3), AFM widths for non-embedded standard-14 fonts (stage 3),
vertical writing-mode advances (stage 2.6).

### 2.3 — Form XObjects

Expand Down
14 changes: 14 additions & 0 deletions src/odr/internal/pdf/pdf_document.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -30,6 +30,20 @@ std::vector<Page *> Document::collect_pages() const {
return pages;
}

double Font::advance_width(const std::uint32_t code) const {
if (composite) {
if (const auto it = cid_widths.find(code); it != cid_widths.end()) {
return it->second / 1000.0;
}
return cid_default_width / 1000.0;
}
const long index = static_cast<long>(code) - first_char;
if (index >= 0 && index < static_cast<long>(widths.size())) {
return widths[static_cast<std::size_t>(index)] / 1000.0;
}
return missing_width / 1000.0;
}

std::string Font::to_unicode(const std::string &codes) const {
if (!cmap.empty()) {
return cmap.translate_string(codes);
Expand Down
22 changes: 22 additions & 0 deletions src/odr/internal/pdf/pdf_document_element.hpp
Original file line number Diff line number Diff line change
Expand Up @@ -97,6 +97,28 @@ struct Font final : Element {
/// the predefined Unicode-CMap extraction path.
std::string cid_encoding_name;

/// Simple-font glyph metrics (ISO 32000-1 9.2.4): `/Widths`, in glyph space
/// (1/1000 text-space units), indexed by `code - first_char`; codes outside
/// the range fall back to `missing_width` (`/MissingWidth`).
int first_char{0};
std::vector<double> widths;
double missing_width{0};

/// Composite-font glyph metrics (9.7.4.3): the `/DW` default width and the
/// `/W` map CID -> width, both glyph-space units. Codes are interpreted as
/// CIDs (identity), the common `Identity-H/V` case.
double cid_default_width{1000};
std::unordered_map<std::uint32_t, double> cid_widths;

/// Bytes per character code: 2 for composite (Type0) fonts (the
/// `Identity-H/V` and common CID case), 1 for simple fonts.
[[nodiscard]] int code_byte_width() const { return composite ? 2 : 1; }

/// Glyph advance width of a single character code, in text-space units
/// (glyph-space / 1000; multiply by the font size for user space). Falls back
/// to `/MissingWidth` (simple) or `/DW` (composite) for absent codes.
[[nodiscard]] double advance_width(std::uint32_t code) const;

/// Translate a string of character codes to Unicode: the `ToUnicode` CMap
/// when present (authoritative), else, for a composite font, "no Unicode",
/// else the simple-font `/Encoding`, else identity bytes.
Expand Down
Loading
Loading