decode non-utf8 sql bytes as latin-1 not unicode-escape by alhudz · Pull Request #852 · andialbrecht/sqlparse

alhudz · 2026-06-09T10:40:40Z

Repro: sqlparse.parse(b"SELECT '\\x41', '\\n' \xff"), i.e. bytes input that contains literal backslash sequences (\x41, \n) and is not valid UTF-8 (a single stray \xff byte is enough to take the fallback branch).
Cause: the non-UTF-8 fallback in Lexer.get_tokens decodes via unicode-escape, which evaluates backslash escape sequences in the SQL bytes (\x41 becomes A, \n becomes a newline, plus octal and \u…). The parsed token stream then no longer matches the raw bytes the database receives, so anything inspecting or sanitising the SQL bytes sees a different statement from the one that runs.
Fix: decode the fallback as latin-1, which maps all 256 byte values one-to-one to code points without evaluating escapes. For input without backslashes the decoded text is identical to today's; only the escape reinterpretation is dropped.
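
For reviewers who want to see the difference without installing anything, here is a minimal stdlib-only sketch of the two decodings. `decode_with_fallback` is a hypothetical helper that only mirrors the try-UTF-8-then-fallback shape described above; it is not the code in Lexer.get_tokens.

```python
# Raw SQL bytes: literal backslash sequences plus one stray 0xFF byte,
# so the input is not valid UTF-8 and the fallback branch is taken.
raw = b"SELECT '\\x41', '\\n' \xff"


def decode_with_fallback(data: bytes, fallback: str) -> str:
    """Hypothetical helper: try UTF-8 first, then the given fallback codec."""
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        return data.decode(fallback)


old = decode_with_fallback(raw, "unicode-escape")
new = decode_with_fallback(raw, "latin-1")

# unicode-escape evaluates the escapes: \x41 becomes A, \n becomes a newline.
assert old == "SELECT 'A', '\n' \xff"

# latin-1 maps each byte to the code point of the same value, so the
# backslash sequences stay as literal text.
assert new == "SELECT '\\x41', '\\n' \xff"

# Only the latin-1 result round-trips to the original bytes.
assert new.encode("latin-1") == raw
assert old.encode("latin-1") != raw
```

Valid UTF-8 input never reaches the fallback in this sketch, so it decodes the same way with either codec.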

ran the tests (pytest)
all style issues addressed (ruff)
your changes are covered by tests
your changes are documented, if needed

codecov · 2026-06-09T10:41:44Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 97.27%. Comparing base (7334ac9) to head (0262a1a).
⚠️ Report is 102 commits behind head on master.

Additional details and impacted files

@@            Coverage Diff             @@
##           master     #852      +/-   ##
==========================================
+ Coverage   97.04%   97.27%   +0.22%     
==========================================
  Files          20       31      +11     
  Lines        1558     3664    +2106     
  Branches        0      328     +328     
==========================================
+ Hits         1512     3564    +2052     
- Misses         46       59      +13     
- Partials        0       41      +41


decode non-utf8 sql bytes as latin-1 not unicode-escape

0262a1a

alhudz wants to merge 1 commit into andialbrecht:master from alhudz:lexer-latin1-fallback
