PDF stage 2.1: text transforms & the placed-text emission #538
Draft
andiwand wants to merge 4 commits into
Apply the full text transform chain and introduce a renderer-agnostic placed-text emission, the foundation for stage 2's positioning work. Glyph advances are deliberately left to stage 2.2.

- pdf_geometry.hpp: a 2-D affine `Matrix` (compose, point-apply, translation/scaling factories), PDF row-vector convention.
- GraphicsState: the CTM now concatenates on `cm` (it was overwritten); the text matrix `Tm` and line matrix `Tlm` are tracked as `Matrix` values with `BT`/`Td`/`TD`/`T*` and the line-move half of `'`/`"`; `text_placement_matrix()` folds in horizontal scaling and rise, keeping font size separate so the run-vs-glyph mapping stays a renderer choice.
- pdf_page_text (`extract_text`): emit one `TextElement` per show operation, positioned by the placement matrix, carrying font/size/spacing/codes/Unicode. Lenient font lookup (unknown ref -> warn).
- html/pdf_file.cpp: map each `TextElement` to a positioned span with a CSS `transform` (PDF user space -> page box in CSS px, glyphs upright); route through `Logger`; drop the debug `std::cout` and the `"hi"` marker.

Tests: pdf_geometry.cpp (Matrix compose/apply) and pdf_page_text.cpp (Td/Tm/cm/Tz/Ts, TJ concatenation, T*/'/" line moves), both inline.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…th_util.hpp Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
🤖 Generated with Claude Code
First slice of stage 2 (text positioning & metrics). It applies the full text
transform chain and introduces a renderer-agnostic placed-text emission — the
core emits placed text + geometry + Unicode, and the renderer owns the
per-run-vs-per-glyph mapping decision. Glyph advances are deliberately deferred to
stage 2.2 (which branches off this PR).
What's in here
- `pdf_geometry.hpp`: a 2-D affine `Matrix` (compose, point-apply, translation/scaling factories), PDF row-vector convention.
- `GraphicsState`: the CTM now concatenates on `cm` (it was being overwritten, a latent bug); `Tm`/`Tlm` are tracked as `Matrix` values with `BT`/`Td`/`TD`/`T*` and the line-move half of `'`/`"`; `text_placement_matrix()` folds in horizontal scaling and rise, keeping font size separate.
- `pdf_page_text` (`extract_text`): emits one `TextElement` per show operation (`Tj`/`TJ`/`'`/`"`), positioned by the placement matrix and carrying font/size/spacing/codes/Unicode. Lenient font lookup (unknown ref → warn).
- `html/pdf_file.cpp`: maps each `TextElement` to an absolutely-positioned span with a CSS `transform` (PDF user space → page box in CSS px, glyphs kept upright); routed through `Logger`; the debug `std::cout` and the `"hi"` marker are gone.
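
For reviewers who want the matrix conventions spelled out, here is a minimal sketch of the row-vector composition, the `cm` concatenation, and a placement matrix that leaves font size out. It is not the code in this PR: the field names, `concat`, `apply`, and the free function `text_placement` are hypothetical, and folding the CTM into the placement matrix is an assumption the description does not state.

```cpp
#include <cassert>

// Hypothetical stand-in for the PR's `Matrix`; field names are assumptions.
// Row-vector convention: [x' y' 1] = [x y 1] * M, where
//   M = | a b 0 |
//       | c d 0 |
//       | e f 1 |
struct Matrix {
  double a, b, c, d, e, f;

  static Matrix identity() { return {1, 0, 0, 1, 0, 0}; }
  static Matrix translation(double tx, double ty) { return {1, 0, 0, 1, tx, ty}; }
  static Matrix scaling(double sx, double sy) { return {sx, 0, 0, sy, 0, 0}; }
};

struct Point {
  double x, y;
};

// l * r applies l first, then r (row vectors multiply from the left).
inline Matrix operator*(const Matrix &l, const Matrix &r) {
  return {l.a * r.a + l.b * r.c,
          l.a * r.b + l.b * r.d,
          l.c * r.a + l.d * r.c,
          l.c * r.b + l.d * r.d,
          l.e * r.a + l.f * r.c + r.e,
          l.e * r.b + l.f * r.d + r.f};
}

inline Point apply(const Matrix &m, Point p) {
  return {p.x * m.a + p.y * m.c + m.e, p.x * m.b + p.y * m.d + m.f};
}

// Hypothetical slice of the graphics state.
struct GraphicsState {
  Matrix ctm = Matrix::identity();

  // `cm`: concatenate onto the CTM (new CTM = m * CTM) instead of overwriting it.
  void concat(const Matrix &m) { ctm = m * ctm; }
};

// Placement matrix without font size: [Th 0 0 1 0 rise] * Tm * CTM.
// h_scale is Tz / 100. Including the CTM here is an assumption.
inline Matrix text_placement(double h_scale, double rise, const Matrix &tm,
                             const Matrix &ctm) {
  return Matrix{h_scale, 0, 0, 1, 0, rise} * tm * ctm;
}
```

With this ordering, a `cm` translation followed by a `cm` scaling scales a point first and translates it second, which is the case an overwriting CTM gets wrong.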
Deliberately out of scope (→ stage 2.2)
- Glyph advances (`/Widths`, `/W`/`/DW`) and the application of char/word spacing and `TJ` numeric adjustments, so consecutive shows on a line without an explicit move still overlap, and `TJ` renders its strings concatenated at one origin.
- Precise baseline placement (needs font ascent metrics) is likewise deferred; the baseline currently sits at the span's box top.
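
To make the overlap concrete: the PDF specification advances the text matrix after each glyph by a horizontal displacement, and nothing in this PR applies it yet. The sketch below shows that displacement for horizontal writing mode as the specification defines it. The function name and parameter names are hypothetical, and this is not a preview of how stage 2.2 will be implemented.

```cpp
#include <cassert>

// Horizontal displacement after one glyph (horizontal writing mode):
//   tx = ((w0 - Tj / 1000) * Tfs + Tc + Tw) * Th
// w0            glyph width in text space (a /Widths entry divided by 1000
//               for a simple font)
// tj            TJ numeric adjustment in thousandths, 0 when there is none
// font_size     Tfs
// char_spacing  Tc
// word_spacing  Tw, passed as 0 unless the code is the single-byte code 32
// h_scale       Th, i.e. Tz / 100
inline double glyph_advance(double w0, double tj, double font_size,
                            double char_spacing, double word_spacing,
                            double h_scale) {
  return ((w0 - tj / 1000.0) * font_size + char_spacing + word_spacing) * h_scale;
}
```

Until this displacement is applied, every show operation is emitted at the origin its matrices give it, which is why back-to-back shows and the strings inside one `TJ` array land on top of each other.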
Tests
- `test/src/internal/pdf/pdf_geometry.cpp`: `Matrix` compose/apply, ordered composition.
- `test/src/internal/pdf/pdf_page_text.cpp`: `Td`/`Tm`/`cm`/`Tz`/`Ts`, `TJ` concatenation, `T*`/`'`/`"` line moves; inline content streams.
- `HtmlOutputTests`.

The roadmap in `src/odr/internal/pdf/AGENTS.md` is updated with the full stage-2 sub-stage split (2.1–2.6).
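
For context on what the line-move tests check, here is a minimal sketch of the operator semantics as the PDF specification defines them. `TextState` and its method names are hypothetical and do not mirror the PR's `GraphicsState` API. Only the matrix bookkeeping is shown, not the show half of `'` and `"`.

```cpp
#include <cassert>

// Minimal row-vector affine matrix: [x' y' 1] = [x y 1] * M.
struct Matrix {
  double a, b, c, d, e, f;
};

// l * r applies l first, then r.
inline Matrix operator*(const Matrix &l, const Matrix &r) {
  return {l.a * r.a + l.b * r.c,
          l.a * r.b + l.b * r.d,
          l.c * r.a + l.d * r.c,
          l.c * r.b + l.d * r.d,
          l.e * r.a + l.f * r.c + r.e,
          l.e * r.b + l.f * r.d + r.f};
}

// Hypothetical text-object state: text matrix, line matrix, leading.
struct TextState {
  Matrix tm{1, 0, 0, 1, 0, 0};
  Matrix tlm{1, 0, 0, 1, 0, 0};
  double leading{0};

  // BT: reset both matrices to identity.
  void begin_text() { tm = tlm = Matrix{1, 0, 0, 1, 0, 0}; }

  // Tm: set both matrices outright (no concatenation).
  void set_matrix(const Matrix &m) { tm = tlm = m; }

  // Td: move to the start of the next line, offset from the current line start.
  void move(double tx, double ty) {
    tlm = Matrix{1, 0, 0, 1, tx, ty} * tlm;
    tm = tlm;
  }

  // TD: like Td, but also sets the leading to -ty.
  void move_set_leading(double tx, double ty) {
    leading = -ty;
    move(tx, ty);
  }

  // T*: move down by the leading; also the line-move half of ' and ".
  void next_line() { move(0, -leading); }
};
```

Note that `Td` offsets are in the text space of the current line matrix, so with a `Tm` that scales by 2, a `TD` of `0 -14` moves the line origin down by 28 user-space units.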