OCR: Pipe characters | in table cell content are not escaped in Markdown output #583

Description

@patrickpenn

When using ocr.process() with a PDF that contains tables where cell content includes pipe characters | (e.g., reference numbers like "A01 |2001|"), the Markdown output does not escape these characters. Since pipes are also used as Markdown table column delimiters, this breaks the table structure.

Steps to Reproduce

  1. Call ocr.process() on a PDF containing a table whose cells have | characters in their content
  2. Inspect the Markdown output: the OCR recognizes the text correctly, but the table rows contain raw, unescaped | characters

Expected Behavior

Pipe characters that are part of cell content should be escaped as \| in the Markdown output to distinguish them from column delimiters.
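A minimal sketch of the expected escaping, assuming each cell is available as a raw, unescaped string before the row is rendered. The helper names `escape_cell` and `render_row` are hypothetical and are not part of the mistralai SDK.

```python
def escape_cell(text: str) -> str:
    """Escape pipes in cell content (assumes the input is raw, not yet escaped)."""
    return text.replace("|", "\\|")


def render_row(cells: list[str]) -> str:
    """Join escaped cells into a single Markdown table row."""
    return "| " + " | ".join(escape_cell(cell) for cell in cells) + " |"


# First data row from the example table in this issue (7 cells).
cells = ["A01 |2001|", "398.08", "EUR 34.86", "EUR 13,877.07", "75%", "17", "EUR 5,801.07"]
row = render_row(cells)
print(row)
# | A01 \|2001\| | 398.08 | EUR 34.86 | EUR 13,877.07 | 75% | 17 | EUR 5,801.07 |
```

With the pipes escaped, the row keeps 7 unescaped-delimiter-separated cells and matches the header.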

Actual Behavior

Raw | in cell content creates extra columns, causing a mismatch between header column count and data row column count.
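The mismatch can be detected on the client side with a small check like the one below. This is only a sketch: `count_columns` and `find_mismatched_rows` are hypothetical helpers, and the check assumes the table is a plain pipe-delimited Markdown table in which `\|` marks an escaped pipe.

```python
import re


def count_columns(row: str) -> int:
    """Count cells in a Markdown table row, ignoring escaped pipes (\\|)."""
    inner = row.strip()
    if inner.startswith("|"):
        inner = inner[1:]  # drop the leading delimiter
    if inner.endswith("|") and not inner.endswith("\\|"):
        inner = inner[:-1]  # drop the trailing delimiter
    return len(re.split(r"(?<!\\)\|", inner))


def find_mismatched_rows(table: str) -> list[tuple[int, int]]:
    """Return (line index, column count) for rows that differ from the header."""
    lines = [line for line in table.splitlines() if line.strip()]
    expected = count_columns(lines[0])
    return [
        (index, count_columns(line))
        for index, line in enumerate(lines)
        if count_columns(line) != expected
    ]


# OCR output copied from the example in this issue.
table = """\
|  Unit | Area m² | €/m² | Monthly Rent | Reduction % | Days | Reduction  |
| --- | --- | --- | --- | --- | --- | --- |
|  A01 |2001 | 398.08 | EUR 34.86 | EUR 13,877.07 | 75% | 17 | EUR 5,801.07  |
|  A02 |2002 | 140.72 | EUR 56.41 | EUR 7,938.49 | 75% | 17 | EUR 3,318.55  |
"""
print(find_mismatched_rows(table))
```

Such a check only flags the broken rows; it cannot reliably tell which of the pipes belonged to the cell content, so the escaping still has to happen where the Markdown is generated.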

Example

Original document table:

Unit          Area m²   €/m²        Monthly Rent    Reduction %   Days   Reduction
A01 |2001|    398.08    EUR 34.86   EUR 13,877.07   75%           17     EUR 5,801.07
A02 |2002|    140.72    EUR 56.41   EUR 7,938.49    75%           17     EUR 3,318.55

OCR Markdown output (broken: 8 columns in each data row, 7 in the header):

|  Unit | Area m² | €/m² | Monthly Rent | Reduction % | Days | Reduction  |
| --- | --- | --- | --- | --- | --- | --- |
|  A01 |2001 | 398.08 | EUR 34.86 | EUR 13,877.07 | 75% | 17 | EUR 5,801.07  |
|  A02 |2002 | 140.72 | EUR 56.41 | EUR 7,938.49 | 75% | 17 | EUR 3,318.55  |

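The text |2001| is part of the cell content "A01 |2001|", but its leading pipe is interpreted as a column delimiter and its trailing pipe is merged with the delimiter that follows, so "A01" and "2001" end up in separate columns.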

Environment

  • SDK: mistralai==2.5.0
  • Model: mistral-ocr-latest
  • API call: client.ocr.process() with include_blocks=True
