Model
mistral-small-latest
Request Payload
OCR models tested: mistral-ocr-latest and mistral-ocr-4-0 (both affected).
Document: a digital (non-scanned) A4 invoice PDF with a real text layer, two-column header layout.
client.ocr.process(
    model="mistral-ocr-4-0",
    document={"type": "document_url", "document_url": "data:application/pdf;base64,<PDF>"},
    include_image_base64=False,
)
Toggling extract_header=True / extract_footer=True (or leaving them off) makes no difference.
Output
The API silently omits an entire content region: the top-right header column
(invoice number, dates, customer ID, subscription block). The omitted text is
NOT in markdown, NOT in header/footer, and is NOT captured as an image
region. It is simply absent. No error, no warning, no confidence signal.
The dropped text is present in the PDF text layer (extractable with any PDF text
tool), so the content is unambiguously in the document.
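
Since the response carries no signal that a region was dropped, one way to catch it on the caller side is to compare the PDF text layer against the returned markdown. The following is only a rough sketch: `missing_tokens` is a hypothetical helper, the tokenization is deliberately crude, and both input strings are assumed to have been obtained elsewhere (the text layer from any PDF text tool, the markdown from the OCR response).

```python
import re


def missing_tokens(text_layer: str, ocr_markdown: str, min_len: int = 3) -> list[str]:
    """Return text-layer tokens that never appear in the OCR markdown.

    Hypothetical helper: a crude, case-insensitive alphanumeric comparison,
    meant to flag candidate omissions, not to prove them.
    """
    def tokens(s: str) -> list[str]:
        return re.findall(r"[0-9A-Za-z]+", s.lower())

    seen = set(tokens(ocr_markdown))
    missing: list[str] = []
    for tok in tokens(text_layer):
        # Skip very short tokens, which are too noisy to be useful.
        if len(tok) >= min_len and tok not in seen and tok not in missing:
            missing.append(tok)
    return missing
```

A non-empty result for a digital PDF would suggest that something in the text layer did not make it into the OCR output and is worth inspecting by hand.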
What I narrowed it down to: the document_url path renders the page internally
to 719x1018 px (dimensions.dpi = 87). I rasterized the same page at several
DPIs and submitted each as image_url:
| Input | Raster (px) | Region returned? |
| --- | --- | --- |
| PDF via document_url | 719x1018 (87 dpi, internal) | NO |
| image_url PNG @ 72 dpi | 595x842 | yes |
| image_url PNG @ 87 dpi | 719x1018 | NO |
| image_url PNG @ 96 dpi | 794x1123 | yes |
| image_url PNG @ 150/200/300 dpi | up to 2480x3508 | yes |
The failure reproduces exactly at 719x1018, which is the raster the
document_url path produces. Both lower (72 dpi) and higher (96+ dpi) rasters
return the full content. This looks like a fragile layout/segmentation failure
at that specific raster size for a two-column header, and the PDF path renders
right into it.
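
The DPI sweep above can be sketched as a small harness. This is an illustration, not the exact script used for the table: `sweep_dpis`, `render_png`, `ocr_markdown` and `marker` are hypothetical names. The two callbacks are supplied by the caller, for example a PDF rasterizer and a wrapper around ocr.process with an image_url document, and `marker` is a string known to sit in the affected region.

```python
from typing import Callable, Dict, Iterable


def sweep_dpis(
    render_png: Callable[[int], bytes],
    ocr_markdown: Callable[[bytes], str],
    dpis: Iterable[int],
    marker: str,
) -> Dict[int, bool]:
    """Map each DPI to whether `marker` shows up in the OCR markdown.

    render_png(dpi) -> PNG bytes of the page rasterized at that DPI.
    ocr_markdown(png) -> markdown text returned by the OCR call.
    Both callbacks are assumptions supplied by the caller.
    """
    results: Dict[int, bool] = {}
    for dpi in dpis:
        png = render_png(dpi)
        markdown = ocr_markdown(png)
        results[dpi] = marker in markdown
    return results
```

Called with `dpis=[72, 87, 96, 150, 200, 300]` and a marker taken from the right-hand header column, the resulting dictionary corresponds to the last column of the table.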
Expected Behavior
The right-column content should be returned in markdown regardless of input
modality, the same way it is when the identical page is submitted as image_url
at any other resolution.
At minimum, dropping an entire detected region should not happen silently. There
is currently no parameter on ocr.process to control the internal render DPI
for the document_url path, so callers cannot work around this without
rasterizing PDFs to images themselves.
Additional Context
I would rather not share the affected PDF publicly, but if you need the original PDF to reproduce this issue, please give me a way to share it with you privately.
Suggested Solutions
No response