PDF stage 2.1: text transforms & the placed-text emission #538

Draft
andiwand wants to merge 4 commits into main from pdf-text-transforms

@andiwand

Member

🤖 Generated with Claude Code

First slice of stage 2 (text positioning & metrics). It applies the full text
transform chain and introduces a renderer-agnostic placed-text emission — the
core emits placed text + geometry + Unicode, and the renderer owns the
per-run-vs-per-glyph mapping decision. Glyph advances are deliberately deferred to
stage 2.2 (which branches off this PR).

What's in here

  • pdf_geometry.hpp — a 2-D affine Matrix (compose, point-apply,
    translation/scaling factories), PDF row-vector convention.
  • GraphicsState — the CTM now concatenates on cm (it was being
    overwritten — a latent bug); Tm/Tlm are tracked as Matrix values with
    BT/Td/TD/T* and the line-move half of '/"; text_placement_matrix()
    folds in horizontal scaling and rise, keeping font size separate.
  • pdf_page_text (extract_text) — emits one TextElement per show
    operation (Tj/TJ/'/"), positioned by the placement matrix and carrying
    font/size/spacing/codes/Unicode. Lenient font lookup (unknown ref → warn).
  • html/pdf_file.cpp — maps each TextElement to an absolutely-positioned
    span with a CSS transform (PDF user space → page box in CSS px, glyphs kept
    upright); diagnostics are routed through Logger, and the debug std::cout
    and the "hi" marker are gone.
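
The cm fix and the placement matrix from the list above can be illustrated with a minimal sketch. `Point`, `Matrix`, and `concat_cm` below are hypothetical stand-ins, not the code in pdf_geometry.hpp or GraphicsState; only the math follows the PDF specification and the description above (row vectors, CTM' = M × CTM, and a placement matrix that leaves font size out).

```cpp
// Hypothetical stand-ins; the real types in pdf_geometry.hpp and
// GraphicsState may differ in names and layout.
struct Point {
  double x, y;
};

// PDF matrix [a b c d e f], applied to row vectors: [x' y' 1] = [x y 1] * M.
struct Matrix {
  double a, b, c, d, e, f;

  static Matrix identity() { return {1, 0, 0, 1, 0, 0}; }
  static Matrix translation(double tx, double ty) { return {1, 0, 0, 1, tx, ty}; }
  static Matrix scaling(double sx, double sy) { return {sx, 0, 0, sy, 0, 0}; }

  Point apply(Point p) const {
    return {a * p.x + c * p.y + e, b * p.x + d * p.y + f};
  }
};

// l * r applies l first, then r (row-vector convention).
inline Matrix operator*(const Matrix &l, const Matrix &r) {
  return {l.a * r.a + l.b * r.c,       l.a * r.b + l.b * r.d,
          l.c * r.a + l.d * r.c,       l.c * r.b + l.d * r.d,
          l.e * r.a + l.f * r.c + r.e, l.e * r.b + l.f * r.d + r.f};
}

// `cm` concatenates: CTM' = M * CTM. Assigning `ctm = m` instead would
// silently drop every transform set up earlier.
inline void concat_cm(Matrix &ctm, const Matrix &m) { ctm = m * ctm; }

// Placement matrix without font size: [Th 0 0 1 0 Trise] * Tm * CTM,
// where Th = Tz / 100. The argument order is an assumption of this sketch.
inline Matrix text_placement_matrix(const Matrix &tm, const Matrix &ctm,
                                    double horizontal_scaling_percent,
                                    double rise) {
  const Matrix text_state{horizontal_scaling_percent / 100.0, 0, 0, 1, 0, rise};
  return text_state * tm * ctm;
}
```

With `cm` translating by (10, 20) and a second `cm` scaling by 2, concatenation maps the point (1, 1) to (12, 22); overwriting the CTM would lose the translation and give (2, 2).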

Deliberately out of scope (→ stage 2.2)

Glyph advances (/Widths, /W//DW) and the application of char/word spacing
and TJ numeric adjustments — so consecutive shows on a line without an explicit
move still overlap, and TJ renders its strings concatenated at one origin.
Precise baseline placement (needs font ascent metrics) is likewise deferred — the
baseline currently sits at the span's box top.

Tests

  • test/src/internal/pdf/pdf_geometry.cpp: Matrix compose/apply, ordered
    composition.
  • test/src/internal/pdf/pdf_page_text.cpp: Td/Tm/cm/Tz/Ts, TJ
    concatenation, T*/'/" line moves; inline content streams.
  • Full suite green (461 tests), including the end-to-end HtmlOutputTests.
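
For readers unfamiliar with the operators these tests exercise, here is a minimal sketch of the line-move semantics as the PDF specification defines them. `TextState` and the function names are hypothetical and are not taken from GraphicsState; the sketch only shows how Tm and Tlm move together.

```cpp
// Hypothetical sketch; names do not come from the PR.
struct Matrix {
  double a, b, c, d, e, f;
};

// l * r applies l first, then r (PDF row-vector convention).
inline Matrix operator*(const Matrix &l, const Matrix &r) {
  return {l.a * r.a + l.b * r.c,       l.a * r.b + l.b * r.d,
          l.c * r.a + l.d * r.c,       l.c * r.b + l.d * r.d,
          l.e * r.a + l.f * r.c + r.e, l.e * r.b + l.f * r.d + r.f};
}

struct TextState {
  Matrix tm;       // text matrix
  Matrix tlm;      // text line matrix
  double leading;  // Tl
};

// BT: reset both matrices to identity.
inline void begin_text(TextState &s) {
  s.tm = s.tlm = Matrix{1, 0, 0, 1, 0, 0};
}

// Tm: set both matrices to the given value (replace, not concatenate).
inline void set_text_matrix(TextState &s, const Matrix &m) { s.tm = s.tlm = m; }

// Td: move to the start of the next line, offset from the start of the
// current line: Tlm' = translate(tx, ty) * Tlm, then Tm = Tlm'.
inline void move_text(TextState &s, double tx, double ty) {
  s.tlm = Matrix{1, 0, 0, 1, tx, ty} * s.tlm;
  s.tm = s.tlm;
}

// TD: like Td, but also sets the leading to -ty.
inline void move_text_set_leading(TextState &s, double tx, double ty) {
  s.leading = -ty;
  move_text(s, tx, ty);
}

// T*: same as `0 -Tl Td`. The line-move half of ' and " does the same
// before showing the string.
inline void next_line(TextState &s) { move_text(s, 0, -s.leading); }
```

Because Td offsets from Tlm rather than from Tm, a scaled text matrix scales the move as well: after `2 0 0 2 100 100 Tm`, `10 0 Td` puts the origin at x = 120, not 110.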

The roadmap in src/odr/internal/pdf/AGENTS.md is updated with the full stage-2
sub-stage split (2.1–2.6).

andiwand and others added 3 commits June 15, 2026 21:28
Apply the full text transform chain and introduce a renderer-agnostic
placed-text emission, the foundation for stage 2's positioning work.
Glyph advances are deliberately left to stage 2.2.

- pdf_geometry.hpp: a 2-D affine `Matrix` (compose, point-apply,
  translation/scaling factories), PDF row-vector convention.
- GraphicsState: the CTM now concatenates on `cm` (it was overwritten);
  the text matrix `Tm` and line matrix `Tlm` are tracked as `Matrix`
  values with `BT`/`Td`/`TD`/`T*` and the line-move half of `'`/`"`;
  `text_placement_matrix()` folds in horizontal scaling and rise, keeping
  font size separate so the run-vs-glyph mapping stays a renderer choice.
- pdf_page_text (`extract_text`): emit one `TextElement` per show
  operation, positioned by the placement matrix, carrying font/size/
  spacing/codes/Unicode. Lenient font lookup (unknown ref -> warn).
- html/pdf_file.cpp: map each `TextElement` to a positioned span with a
  CSS `transform` (PDF user space -> page box in CSS px, glyphs upright);
  route through `Logger`; drop the debug `std::cout` and the `"hi"`
  marker.

Tests: pdf_geometry.cpp (Matrix compose/apply) and pdf_page_text.cpp
(Td/Tm/cm/Tz/Ts, TJ concatenation, T*/'/" line moves), both inline.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@andiwand force-pushed the pdf-text-transforms branch from 7b46376 to e889169 on June 15, 2026 19:29
…th_util.hpp

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>