Turn messy bank and credit-card statements into clean, normalized transaction JSON. Deterministic, rule-based, dependency-free, and easy to audit.
statement-normalizer reads the statement formats banks and cards actually
export, CSV, OFX/QFX, MT940, CAMT.053, QIF, and line-oriented text, and emits a
single normalized transaction schema with a consistent sign convention, parsed
dates, and proper decimal amounts. Given the same input it produces the same
output every time, which is exactly what you want when the data feeds
bookkeeping, reconciliation, or any financial workflow.
Every bank exports statements differently: some give you a single signed
Amount column, others split Debit / Credit, some wrap negatives in
parentheses, some hand you OFX/QFX or QIF, some only give you a PDF. Downstream
tools all want the same thing: a tidy list of transactions with a consistent
sign convention, parsed dates, and proper decimal amounts.
statement-normalizer does exactly that, and does it well. It is a focused,
well-tested building block you can drop into an import pipeline: it parses the
real-world export shapes, normalizes the sign convention, and de-duplicates
across overlapping statements, so the rest of your pipeline only ever sees one
clean schema.
- Parses the statement shapes banks and cards actually export:
- CSV — signed-amount or separate debit/credit columns, optional running balance, fuzzy header detection with a broad alias table covering the real-world headers from Chase, Bank of America, Wells Fargo, Amex, Capital One, Discover and many others.
- OFX / QFX — both the SGML (OFX 1.x) and XML (OFX 2.x) export styles, using a small built-in tokenizer (no external OFX dependency).
- MT940 — the SWIFT bank-statement message format (
.sta/.940), common for European business accounts. - CAMT.053 — ISO 20022
BankToCustomerStatementXML, the modern standard replacing MT940. Parsed with the standard-library XML parser, namespace-agnostic. - CAMT.052 — ISO 20022
BankToCustomerAccountReportXML, the intra-day / interim report banks send before the end-of-day CAMT.053 is cut. Same entry model as CAMT.053; auto-detected from the container element so the two XML shapes never get confused. - QIF — Quicken Interchange Format (
.qif), the long-lived plain-text export used by Quicken and many banks and personal-finance tools. Parsed with a small built-in tokenizer (no external QIF dependency). - Text — line-oriented statement text, e.g. the output of running a PDF
through a text extractor like
pdftotext.
| Format | Extensions | Direction source | Account id | Strong dedup key | Multi-currency |
|---|---|---|---|---|---|
| CSV (signed) | .csv |
sign of Amount (or --invert-amounts) |
– | content hash | per-row Currency col |
| CSV (debit/credit) | .csv |
which column is populated | – | content hash | per-row Currency col |
| OFX / QFX | .ofx .qfx |
signed TRNAMT |
ACCTID |
FITID |
CURDEF |
| MT940 | .sta .940 .mt940 |
D / C mark on :61: |
:25: |
content hash | :60F: currency |
| CAMT.053 | .xml (sniffed) |
CdtDbtInd (DBIT/CRDT) |
IBAN / Othr id |
AcctSvcrRef |
Amt@Ccy |
| CAMT.052 | .xml (sniffed) |
CdtDbtInd (DBIT/CRDT) |
IBAN / Othr id |
AcctSvcrRef |
Amt@Ccy |
| QIF | .qif |
sign of T / U amount |
– | content hash (N ref hint) |
--currency default |
| Text / PDF-text | .txt |
sign / column heuristic | – | content hash | --currency default |
All amounts are normalized to one sign convention (debits negative, credits positive) regardless of how the source expressed direction.
- Emits a single normalized
Transactionschema with a consistent sign convention (debits negative, credits positive), parseddate,Decimalamount, derivedtxn_type, optionalbalance,fitid,account_id, and currency. - De-duplicates transactions with deterministic heuristics:
- Bank-assigned
FITID(from OFX) is the authoritative key when present. - Otherwise a content hash over (date, signed amount, currency, canonical description), while preserving legitimately-repeated same-day charges.
- Bank-assigned
- Merges multiple statements (e.g. overlapping monthly exports) into one de-duplicated list.
- Ships a CLI and a small Python API.
- Binary PDF workflows. Pipe your PDF through a text extractor first, then feed the text in.
- Transaction categorization. Pair normalized output with your preferred categorization layer.
pip install statement-normalizerOr from a checkout:
pip install -e .Python 3.9+. No runtime dependencies.
# Normalize a single file (format auto-detected by extension/content)
statement-normalizer statement.csv --pretty
# Force a format
statement-normalizer --format ofx export.qfx
# MT940, CAMT.053, and QIF work the same way
statement-normalizer export.sta --stats
statement-normalizer --format camt053 statement.xml
statement-normalizer export.qif --pretty
# Merge overlapping monthly exports, de-duplicating across them
statement-normalizer jan.csv feb.csv --merge -o all.json
# Flat CSV instead of JSON
statement-normalizer statement.ofx --csv -o transactions.csv
# Print a summary (counts, totals, date range) to stderr while still emitting JSON
statement-normalizer statement.csv --stats
# Read from stdin ('-')
cat statement.csv | statement-normalizer --format csv -
# Issuers that report charges as positive / payments as negative (e.g. some
# credit-card exports) — flip the sign into the debit-negative convention
statement-normalizer amex.csv --invert-amounts
# Set a default currency when the source omits one
statement-normalizer eu_statement.csv --currency EUR| Flag | Effect |
|---|---|
--format {csv,ofx,text,mt940,camt053,qif} |
force input format (default: auto-detect) |
--csv |
emit a flat transactions CSV instead of JSON |
--stats |
print a counts/totals/date-range summary to stderr |
--merge |
merge all inputs into one cross-statement-deduped list |
--no-dedup |
disable de-duplication |
--invert-amounts |
flip the sign of CSV single-Amount-column values |
--currency CCY |
default currency when the source omits one |
--pretty |
pretty-print JSON |
-o FILE |
write to a file instead of stdout |
from statement_normalizer import normalize_file
statement = normalize_file("statement.csv")
print(statement.account_id, statement.currency)
for txn in statement.transactions:
print(txn.date, txn.amount, txn.txn_type.value, txn.description)
# JSON-serializable dict
import json
print(json.dumps(statement.to_dict(), indent=2))Merging multiple files with cross-statement dedup:
from statement_normalizer.normalize import normalize_many
txns = normalize_many(["jan.ofx", "feb.ofx"])The examples/ directory ships synthetic statements in the shapes
real banks and cards export, plus a runnable tour:
python examples/demo.py| File | Shape it demonstrates |
|---|---|
chase_checking.csv |
Chase Details/Posting Date/Description/Amount/Type/Balance |
bofa_checking.csv |
Bank of America Date/Description/Amount/Running Bal. |
wells_fargo_checking.csv |
Wells Fargo Date/Amount/*/Payee/Memo (blank Payee, text in Memo) |
amex_creditcard.csv |
Amex single Amount with charges positive (use --invert-amounts) |
capital_one_creditcard.csv |
Capital One split Debit/Credit columns |
discover_creditcard.csv |
Discover Trans. Date/Amount/Category (charges positive) |
sample.mt940 |
SWIFT MT940 (:25:/:60F:/:61:/:86:) |
sample.camt053.xml |
ISO 20022 CAMT.053 (<Ntry> / CdtDbtInd) |
sample.qif |
Quicken QIF (!Type:Bank, D/T/P/M/N/L fields) |
overlap_jan.csv, overlap_feb.csv |
two overlapping months for the dedup/merge demo |
Dedup/merge across overlapping months in one line:
statement-normalizer examples/overlap_jan.csv examples/overlap_feb.csv --merge --stats
# 12 raw rows across the two files -> 9 after the 3 overlapping rows collapse,
# while a legitimately-repeated same-merchant charge is preserved.Input CSV:
Transaction Date,Description,Amount
2024-02-01,GAS STATION 4471,(45.20)
2024-02-05,PAYMENT - THANK YOU,250.00Output JSON (abridged):
{
"account_id": null,
"currency": "USD",
"source_format": "csv",
"transaction_count": 2,
"transactions": [
{
"date": "2024-02-01",
"amount": "-45.20",
"description": "GAS STATION 4471",
"txn_type": "debit",
"currency": "USD",
"balance": null,
"fitid": null,
"account_id": null,
"source_format": "csv"
},
{
"date": "2024-02-05",
"amount": "250.00",
"description": "PAYMENT - THANK YOU",
"txn_type": "credit",
"currency": "USD",
"balance": null,
"fitid": null,
"account_id": null,
"source_format": "csv"
}
]
}amount is always signed: negative = money out (debit), positive = money
in (credit). txn_type is derived from the sign for convenience. Parsers
normalize to this convention regardless of how the source expressed direction
(separate debit columns, parentheses, trailing minus, etc.).
pip install -e ".[dev]"
pytest
python examples/demo.py # runnable tour over the synthetic examplesThe test suite runs entirely over synthetic sample statements committed under
tests/fixtures/ and examples/. No real account
data is used anywhere in this project; please keep it that way and only commit
synthetic fixtures.
CI runs the tests on Python 3.9–3.13 (plus a macOS/Windows spot-check) and builds the sdist + wheel on every push and pull request.