[BUG MODEL]: Mistral OCR-latest drops column of PDF at default DPI #581

Description

@Olliejp

Model

mistral-small-latest

Request Payload

OCR models tested: mistral-ocr-latest and mistral-ocr-4-0 (both affected).
Document: a digital (non-scanned) A4 invoice PDF with a real text layer, two-column header layout.

client.ocr.process(
    model="mistral-ocr-4-0",
    document={"type": "document_url", "document_url": "data:application/pdf;base64,<PDF>"},
    include_image_base64=False,
)

Toggling extract_header=True / extract_footer=True (or leaving them off) makes no difference.

Output

The API silently omits an entire content region: the top-right header column
(invoice number, dates, customer ID, subscription block). The omitted text is
NOT in markdown, NOT in header/footer, and is NOT captured as an image
region. It is simply absent. No error, no warning, no confidence signal.

The dropped text is present in the PDF text layer (extractable with any PDF text
tool), so the content is unambiguously in the document.
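Because the text layer is available, a caller-side check for this kind of silent drop can be sketched by comparing it against the returned markdown. This is only a rough sketch, not part of the Mistral SDK: `missing_tokens` is a hypothetical helper, and the text layer is assumed to have been extracted beforehand with whatever PDF text tool you prefer. The sample strings are made up for illustration and are not from the affected invoice.

```python
import re


def _tokens(text: str) -> list[str]:
    """Lowercase alphanumeric tokens, in order of appearance."""
    return re.findall(r"[a-z0-9]+", text.lower())


def missing_tokens(text_layer: str, ocr_markdown: str, min_len: int = 3) -> list[str]:
    """Return text-layer tokens that never appear in the OCR markdown.

    Hypothetical helper: tokens shorter than min_len are ignored to reduce
    noise, and each missing token is reported once.
    """
    seen = set(_tokens(ocr_markdown))
    missing: list[str] = []
    for tok in _tokens(text_layer):
        if len(tok) >= min_len and tok not in seen and tok not in missing:
            missing.append(tok)
    return missing


# Made-up example data, standing in for a real text layer and OCR result.
text_layer = "Invoice INV-0042 Customer ID 7731 Total 120.00"
ocr_markdown = "# Invoice\n\nTotal 120.00"
print(missing_tokens(text_layer, ocr_markdown))
```

A non-empty result does not prove a region was dropped (OCR may legitimately reformat text), but a large cluster of missing tokens is a cheap signal that something like the header column went missing.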

What I narrowed it down to: the document_url path renders the page internally
to 719x1018 px (dimensions.dpi = 87). I rasterized the same page at several
DPIs and submitted each as image_url:

| Input | Raster (px) | Region returned? |
|---|---|---|
| PDF via document_url | 719x1018 (87 dpi, internal) | NO |
| image_url PNG @ 72 dpi | 595x842 | yes |
| image_url PNG @ 87 dpi | 719x1018 | NO |
| image_url PNG @ 96 dpi | 794x1123 | yes |
| image_url PNG @ 150/200/300 dpi | up to 2480x3508 | yes |

The failure reproduces exactly at 719x1018, which is the raster the
document_url path produces. Both lower (72 dpi) and higher (96+ dpi) rasters
return the full content. This looks like a fragile layout/segmentation failure
at that specific raster size for a two-column header, and the PDF path renders
right into it.

Expected Behavior

The right-column content should be returned in markdown regardless of input
modality, the same way it is when the identical page is submitted as image_url
at any other resolution.

At minimum, dropping an entire detected region should not happen silently. There
is currently no parameter on ocr.process to control the internal render DPI
for the document_url path, so callers cannot work around this without
rasterizing PDFs to images themselves.
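For reference, the rasterize-it-yourself workaround looks roughly like the sketch below. Assumptions: PyMuPDF (`fitz`) is used for rendering, which is my choice and not something the API requires; the helper names `rasterize_page`, `image_url_document` and `ocr_pdf_page_as_image` are hypothetical; and `client` is an already constructed Mistral client. The default of 150 dpi is simply one of the resolutions that returned the full content for my document and is no guarantee for others.

```python
import base64


def rasterize_page(pdf_path: str, page_index: int = 0, dpi: int = 150) -> bytes:
    """Render one PDF page to PNG bytes (assumes PyMuPDF is installed)."""
    import fitz  # PyMuPDF; imported lazily so the other helpers work without it

    with fitz.open(pdf_path) as doc:
        pix = doc[page_index].get_pixmap(dpi=dpi)
        return pix.tobytes("png")


def image_url_document(png_bytes: bytes) -> dict:
    """Build an image_url document payload from raw PNG bytes."""
    encoded = base64.b64encode(png_bytes).decode("ascii")
    return {"type": "image_url", "image_url": f"data:image/png;base64,{encoded}"}


def ocr_pdf_page_as_image(client, pdf_path: str, page_index: int = 0,
                          dpi: int = 150, model: str = "mistral-ocr-latest"):
    """Bypass the document_url render path by submitting a self-made raster."""
    png = rasterize_page(pdf_path, page_index, dpi)
    return client.ocr.process(
        model=model,
        document=image_url_document(png),
        include_image_base64=False,
    )
```

This has to be repeated per page and loses the convenience of the document_url path, which is why a render-DPI parameter on the API side would be preferable.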

Additional Context

I'd rather not share the affected PDF publicly. If you need the original PDF to reproduce this issue, please give me a way to share it with you privately.

Suggested Solutions

No response
