PDF stage 2.2: glyph advances & metrics #539

Open: andiwand wants to merge 4 commits into main from pdf-glyph-advances
@andiwand

🤖 Generated with Claude Code

Stacked on #538 (stage 2.1) — base is pdf-text-transforms; retarget to main once 2.1 merges.

Second slice of stage 2. Parses font glyph widths and advances the text matrix
per glyph on top of 2.1's placed-text emission, so segments, TJ kerning and
lines land in the right place. Per the agreed architecture, the run-vs-glyph
choice stays in the renderer.

What's in here

  • Width parsing (pdf_document_parser, Font): /FirstChar + /Widths +
    /FontDescriptor /MissingWidth (simple); /W + /DW from the descendant
    CIDFont (both c [w…] and c_first c_last w forms, with a range guard).
    Font::advance_width(code) returns the advance in text-space units with the
    /MissingWidth / /DW fallbacks; code_byte_width() is 1 (simple) / 2
    (composite, the Identity-H/V case).
  • Advance application (extract_text, GraphicsState::advance_text): a
    TextElement is now emitted per shown segment (one Tj/'/", or one
    string of a TJ array). After each segment Tm advances by
    Σ(width × Tfs + Tc [+ Tw for single-byte 0x20]) × Th, and a TJ number
    translates Tm by −n/1000 × Tfs × Th. The element carries its total advance;
    a renderer wanting per-glyph placement re-derives per-code advances from
    font->advance_width.
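The advance rule above can be sketched as follows. This is a minimal illustration, not the code in this PR: `SketchFont`, `SketchTextState`, `segment_advance` and `tj_adjustment` are hypothetical names, and the sketch assumes widths are already stored in text-space units (glyph units / 1000) and that Th is the horizontal scale as a fraction (Tz / 100).

```cpp
#include <cstddef>
#include <cstdint>
#include <string_view>
#include <unordered_map>

// Hypothetical stand-in for the PR's Font; only what the formula needs.
struct SketchFont {
  std::unordered_map<std::uint32_t, double> widths;  // text-space units
  double default_width = 0.0;  // /MissingWidth or /DW, assumed pre-scaled
  int code_bytes = 1;          // 1 = simple, 2 = composite (Identity-H/V)

  double advance_width(std::uint32_t code) const {
    const auto it = widths.find(code);
    return it != widths.end() ? it->second : default_width;
  }
};

// Hypothetical subset of the text state.
struct SketchTextState {
  double font_size = 1.0;         // Tfs
  double char_spacing = 0.0;      // Tc
  double word_spacing = 0.0;      // Tw
  double horizontal_scale = 1.0;  // Th
};

// Total advance of one shown string:
// sum(width * Tfs + Tc [+ Tw for single-byte 0x20]) * Th.
double segment_advance(const SketchFont &font, const SketchTextState &ts,
                       std::string_view bytes) {
  const auto step = static_cast<std::size_t>(font.code_bytes);
  double total = 0.0;
  for (std::size_t i = 0; i + step <= bytes.size(); i += step) {
    // Assemble a big-endian character code from `step` bytes.
    std::uint32_t code = 0;
    for (std::size_t k = 0; k < step; ++k) {
      code = (code << 8) | static_cast<unsigned char>(bytes[i + k]);
    }
    double glyph = font.advance_width(code) * ts.font_size + ts.char_spacing;
    if (step == 1 && code == 0x20) {
      glyph += ts.word_spacing;  // word spacing only for single-byte space
    }
    total += glyph;
  }
  return total * ts.horizontal_scale;
}

// Translation applied to Tm for a number n inside a TJ array.
double tj_adjustment(const SketchTextState &ts, double n) {
  return -n / 1000.0 * ts.font_size * ts.horizontal_scale;
}
```

A renderer that wants per-glyph placement could run the same loop and record the running total before each code instead of only the final sum.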

Out of scope (later)

Intra-segment glyph shaping (until embedded fonts land in stage 3, the browser
lays each segment out in a fallback font), AFM widths for the non-embedded
standard-14 fonts (stage 3), and vertical writing-mode advances (stage 2.6).
Precise baseline placement, which needs ascent metrics, also remains deferred.

Tests

  • pdf_document_parser.cpp — composite /W+/DW and a simple
    /FirstChar//Widths//MissingWidth font, asserted through advance_width.
  • pdf_page_text.cpp — simple /Widths advancing a following show, TJ emitting
    per string with the numeric adjustment applied, char spacing, word spacing on
    the single-byte space, the composite 2-byte /DW advance, and the
    advance_width fallbacks.
  • Full suite green (469 tests), including the end-to-end HtmlOutputTests.
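
For reference, the two /W entry forms that the composite parser test exercises can be sketched as below. This is an illustrative sketch under stated assumptions, not the parser in this PR: `WItem` and `expand_w` are hypothetical, the array is modelled as a flat list of numbers and nested number lists, widths are divided by 1000 to give text-space units, and the range guard simply stops on a reversed, oversized or out-of-range entry (the real guard may behave differently).

```cpp
#include <cstddef>
#include <cstdint>
#include <map>
#include <variant>
#include <vector>

// Hypothetical model of one /W array element: a number or a nested array.
using WItem = std::variant<double, std::vector<double>>;

// Expands /W into a CID -> width map (text-space units).
// Handles both `c [w1 w2 ...]` and `c_first c_last w`.
std::map<std::uint32_t, double> expand_w(const std::vector<WItem> &w,
                                         double max_span = 65536.0) {
  std::map<std::uint32_t, double> out;
  std::size_t i = 0;
  while (i + 1 < w.size()) {
    const double *first = std::get_if<double>(&w[i]);
    if (first == nullptr || *first < 0.0 || *first > 65535.0) {
      break;  // malformed start code: stop (assumed policy)
    }
    const auto c = static_cast<std::uint32_t>(*first);
    if (const auto *list = std::get_if<std::vector<double>>(&w[i + 1])) {
      // Form 1: consecutive CIDs starting at c, one width each.
      for (std::size_t k = 0; k < list->size(); ++k) {
        out[c + static_cast<std::uint32_t>(k)] = (*list)[k] / 1000.0;
      }
      i += 2;
    } else {
      // Form 2: one width shared by every CID in [c_first, c_last].
      if (i + 2 >= w.size()) {
        break;
      }
      const double *last = std::get_if<double>(&w[i + 1]);
      const double *width = std::get_if<double>(&w[i + 2]);
      if (last == nullptr || width == nullptr) {
        break;
      }
      // Range guard: reject reversed or implausibly large ranges.
      if (*last < *first || *last - *first >= max_span) {
        break;
      }
      const auto c_last = static_cast<std::uint32_t>(*last);
      for (std::uint32_t cid = c; cid <= c_last; ++cid) {
        out[cid] = *width / 1000.0;
      }
      i += 3;
    }
  }
  return out;
}
```

CIDs absent from the resulting map would fall back to /DW, matching the fallback described for advance_width.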

@andiwand andiwand force-pushed the pdf-glyph-advances branch from fe34d47 to 8afe023 Compare June 15, 2026 19:29
@andiwand andiwand force-pushed the pdf-text-transforms branch from 7b46376 to e889169 Compare June 15, 2026 19:29
@andiwand andiwand force-pushed the pdf-glyph-advances branch 2 times, most recently from 04283f1 to 1b1ed2e Compare June 15, 2026 20:52
Base automatically changed from pdf-text-transforms to main June 15, 2026 21:35
andiwand and others added 3 commits June 16, 2026 00:18
Parse font glyph widths and advance the text matrix per glyph, on top of
2.1's placed-text emission, so segments, TJ kerning and lines land in the
right place.

- Font metrics (pdf_document_parser, Font): /FirstChar + /Widths +
  /FontDescriptor /MissingWidth (simple), /W + /DW (descendant CIDFont,
  both `c [w...]` and `c_first c_last w` forms). Font::advance_width(code)
  returns the advance in text-space units with the MissingWidth/DW
  fallbacks; code_byte_width() is 1 (simple) / 2 (composite).
- Advance application (extract_text, GraphicsState::advance_text): emit one
  TextElement per shown segment (one Tj/'/", or one string of a TJ array);
  after each, advance Tm by sum(width*Tfs + Tc [+ Tw for single-byte 0x20])
  * Th, and translate Tm by -n/1000*Tfs*Th for a TJ number. The element
  carries its total advance; per-glyph placement stays re-derivable from
  font->advance_width, keeping the run-vs-glyph choice in the renderer.

Out of scope (later): intra-segment glyph shaping (stage 3), AFM widths for
non-embedded standard-14 fonts (stage 3), vertical writing advances (2.6).

Tests: composite /W+/DW and simple /Widths+/MissingWidth parsing asserted
through advance_width; extract_text advance coverage (simple widths, TJ
adjustment, char/word spacing, composite /DW, advance_width fallbacks).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@andiwand andiwand force-pushed the pdf-glyph-advances branch from 9f6baa6 to 256b988 Compare June 15, 2026 22:26
@andiwand andiwand marked this pull request as ready for review June 15, 2026 22:26

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 256b988f1b

Comment thread src/odr/internal/html/pdf_file.cpp Outdated
const double height = page_box[3].as_real() - box_y0;

out.write_element_begin(
<<<<<<< HEAD

P1: Resolve the leftover merge conflict

This leaves merge-conflict markers in the committed source, so any build that compiles the PDF HTML service will fail before tests can run because the compiler sees <<<<<<</=======/>>>>>>> inside write_document. Pick the intended write_element_begin version and remove the markers.
