Description
When using ocr.process() with a PDF that contains tables where cell content includes pipe characters | (e.g., reference numbers like "A01 |2001|"), the Markdown output does not escape these characters. Since pipes are also used as Markdown table column delimiters, this breaks the table structure.
Steps to Reproduce
- Process a PDF containing a table where cells have | characters in their content
- The OCR correctly recognizes the text, but the Markdown output uses raw | without escaping
Expected Behavior
Pipe characters that are part of cell content should be escaped as \| in the Markdown output to distinguish them from column delimiters.
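A minimal sketch of the expected escaping, written as a standalone helper. The names `escape_cell` and `build_row` are hypothetical and are not part of the mistralai SDK; the sketch only illustrates the `\|` convention that Markdown table renderers use for literal pipes in a cell.

```python
def escape_cell(text: str) -> str:
    """Escape literal pipes so they are not read as column delimiters."""
    # Hypothetical helper: assumes the input holds raw, not yet escaped, cell text.
    return text.replace("|", "\\|")


def build_row(cells: list[str]) -> str:
    """Join already-separated cell values into one Markdown table row."""
    return "| " + " | ".join(escape_cell(cell) for cell in cells) + " |"


row = build_row(["A01 |2001|", "398.08", "EUR 34.86"])
print(row)  # | A01 \|2001\| | 398.08 | EUR 34.86 |
```

Escaping has to happen per cell, before the cells are joined, because the two kinds of pipe can no longer be told apart once the row has been assembled.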
Actual Behavior
Raw | in cell content creates extra columns, causing a mismatch between header column count and data row column count.
Example
Original document table:
| Unit | Area m² | €/m² | Monthly Rent | Reduction % | Days | Reduction |
| --- | --- | --- | --- | --- | --- | --- |
| A01 \|2001\| | 398.08 | EUR 34.86 | EUR 13,877.07 | 75% | 17 | EUR 5,801.07 |
| A02 \|2002\| | 140.72 | EUR 56.41 | EUR 7,938.49 | 75% | 17 | EUR 3,318.55 |
OCR Markdown output (broken — 8 columns in data, 7 in header):
| Unit | Area m² | €/m² | Monthly Rent | Reduction % | Days | Reduction |
| --- | --- | --- | --- | --- | --- | --- |
| A01 |2001 | 398.08 | EUR 34.86 | EUR 13,877.07 | 75% | 17 | EUR 5,801.07 |
| A02 |2002 | 140.72 | EUR 56.41 | EUR 7,938.49 | 75% | 17 | EUR 3,318.55 |
The pipe before 2001 is part of the cell content "A01 |2001|" but is interpreted as a column delimiter, so the Unit cell is split into "A01" and "2001".
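The column mismatch can be checked mechanically. The sketch below is a hypothetical helper (`split_row` is not an SDK function) that splits a row on unescaped pipes only, then compares cell counts for the header, the broken row from the OCR output, and the same row with its content pipes escaped.

```python
import re


def split_row(line: str) -> list[str]:
    """Split a Markdown table row into cells, ignoring escaped pipes (\\|)."""
    inner = line.strip()
    # Drop the outer delimiters; an escaped trailing pipe is cell content.
    if inner.startswith("|"):
        inner = inner[1:]
    if inner.endswith("|") and not inner.endswith("\\|"):
        inner = inner[:-1]
    # Split only on pipes that are not preceded by a backslash.
    return [cell.strip() for cell in re.split(r"(?<!\\)\|", inner)]


header = "| Unit | Area m² | €/m² | Monthly Rent | Reduction % | Days | Reduction |"
broken = "| A01 |2001 | 398.08 | EUR 34.86 | EUR 13,877.07 | 75% | 17 | EUR 5,801.07 |"
escaped = r"| A01 \|2001\| | 398.08 | EUR 34.86 | EUR 13,877.07 | 75% | 17 | EUR 5,801.07 |"

print(len(split_row(header)), len(split_row(broken)), len(split_row(escaped)))  # 7 8 7
```

A check like this can flag affected tables after the fact, but it cannot repair them, since the raw output no longer records which pipe belonged to the cell.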
Environment
- SDK: mistralai==2.5.0
- Model: mistral-ocr-latest
- API call: client.ocr.process() with include_blocks=True