An evaluation harness for document classifiers that sort accounting
documents into a fixed taxonomy: W-2, 1099, invoice, bank statement, receipt
(plus unknown for honest abstention).
It gives you everything you need to measure a classifier and prove the number to someone else: a stable label taxonomy, a clean classifier interface, a hard, 200+-example synthetic eval set, confusion matrices, per-class error analysis, macro and micro averages, a leaderboard, and a CLI that scores a folder of documents. Drop in your own model behind one method and you get an honest, reproducible score in seconds. A dependency-free keyword baseline and a nearest-centroid baseline ship in the box so you always have something to beat.
The kit centers on reproducible evaluation: a fixed taxonomy, synthetic examples, classifier interfaces, scoring tools, and result reporting. The baseline classifier uses simple regex and keyword matching, while optional examples show how to plug in your own classifier behind the same interface.
Anyone can write a classifier. The hard part is knowing whether it's good, and
proving it to someone else. A score of 1.000 on a handful of clean fixtures
tells you nothing. This kit ships a deliberately hard synthetic set (noisy
OCR, ambiguous overlaps, adversarial decoys, out-of-scope documents that should
abstain) so the baseline does not ace it, and so your model's score means
something.
$ python -m doc_classifier_kit eval --set hard
Classifier: keyword-baseline
Examples: 234   Accuracy: 0.868
label            precision   recall       f1  support
-----------------------------------------------------
w2                   1.000    0.750    0.857       36
form_1099            0.800    1.000    0.889       36
invoice              0.767    0.917    0.835       36
bank_statement       1.000    0.972    0.986       36
receipt              0.923    1.000    0.960       36
-----------------------------------------------------
macro avg            0.898    0.928    0.905      234
That headroom is the whole point. Beat it with your model.
Python 3.10+.
git clone https://github.com/maxed-oss/doc-classifier-kit
cd doc-classifier-kit
pip install -e ".[dev]"

Zero runtime dependencies. [dev] only pulls in pytest.
# Score the keyword baseline on the hard set, with a confusion matrix and a
# per-class error analysis:
python -m doc_classifier_kit eval --set hard --confusion --errors
# Emit a JSON results object (for a results file / leaderboard submission):
python -m doc_classifier_kit eval --set hard --json > results.json
# Rank the built-in classifiers on the same set:
python -m doc_classifier_kit leaderboard --set hard
# Classify a folder of .txt files (no ground truth needed):
python -m doc_classifier_kit score ./examples/sample_docs

eval --confusion shows exactly where a classifier confuses one label for
another (the unknown column makes abstentions visible):
Confusion matrix (rows=truth, cols=prediction):
truth \ pred     w2  1099   inv  bank  rcpt   unk
w2               27     0     0     0     0     9
1099              0    36     0     0     0     0
inv               0     0    33     0     3     0
bank              0     0     1    35     0     0
rcpt              0     0     0     0    36     0
unk               0     9     9     0     0    36
eval --errors clusters the mistakes and shows examples alongside the rule that
fired, giving you per-rule error analysis you can act on:
Most common confusions (truth -> predicted):
unknown -> form_1099 x9
w2 -> unknown x9 [abstain]
unknown -> invoice x9
invoice -> receipt x3
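The clustering behind a listing like this can be approximated with a counter over (truth, predicted) pairs. The following is a minimal sketch, not the kit's implementation: the `pairs` data and the `most_common_confusions` helper are hypothetical names chosen for illustration.

```python
from collections import Counter

# Hypothetical (truth, predicted) pairs; in practice these would come from
# running a classifier over labeled examples.
pairs = [
    ("w2", "w2"),
    ("w2", "unknown"),
    ("w2", "unknown"),
    ("unknown", "invoice"),
    ("invoice", "receipt"),
    ("invoice", "invoice"),
]


def most_common_confusions(pairs):
    """Count only the mismatches and return them most frequent first."""
    errors = Counter((truth, pred) for truth, pred in pairs if truth != pred)
    return errors.most_common()


for (truth, pred), count in most_common_confusions(pairs):
    # Flag abstentions so they stand apart from wrong guesses.
    tag = "  [abstain]" if pred == "unknown" else ""
    print(f"{truth} -> {pred}  x{count}{tag}")
```

Correct predictions are filtered out before counting, so the listing contains only actionable confusions.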
See docs/RESULTS.md for the documented results format and
the leaderboard rules.
The kit ships two structurally different baselines so the leaderboard means something out of the box:
- KeywordClassifier scores hand-written regex/keyword rules per label.
- CentroidClassifier learns one bag-of-words term-frequency profile (a centroid) per label from labeled examples you give it, then classifies by cosine similarity. It uses only the standard library, no model downloads.
from doc_classifier_kit import evaluate, leaderboard
from doc_classifier_kit.classifiers import CentroidClassifier, KeywordClassifier
from doc_classifier_kit.datasets import hard_eval_set
examples = hard_eval_set()
reports = [
    evaluate(KeywordClassifier(), examples),
    evaluate(CentroidClassifier.fit(examples), examples),
]
print(leaderboard(reports))

leaderboard --set hard includes the centroid baseline automatically.
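To make the nearest-centroid idea concrete, here is a minimal standard-library sketch of bag-of-words centroids with cosine similarity. It is an illustration under stated assumptions, not the kit's CentroidClassifier: the tokenizer, the helper names (`term_frequencies`, `fit_centroids`, `classify`), and the toy training texts are all hypothetical.

```python
import math
import re
from collections import Counter


def term_frequencies(text):
    """Lowercase, split on non-alphanumerics, and count tokens."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def fit_centroids(labeled_texts):
    """Sum term counts per label. A sum differs from a mean only by a
    scale factor, which cosine similarity ignores."""
    centroids = {}
    for text, label in labeled_texts:
        centroids.setdefault(label, Counter()).update(term_frequencies(text))
    return centroids


def classify(text, centroids):
    """Return the label whose centroid is most similar to the text."""
    tf = term_frequencies(text)
    return max(centroids, key=lambda label: cosine(tf, centroids[label]))


# Toy, synthetic training texts (hypothetical).
training = [
    ("invoice number 1042 amount due net 30", "invoice"),
    ("invoice bill to remit payment amount due", "invoice"),
    ("receipt thank you for your purchase total paid cash", "receipt"),
    ("store receipt subtotal tax total paid visa", "receipt"),
]
centroids = fit_centroids(training)
print(classify("amount due on invoice 2210", centroids))  # prints "invoice"
```

The query shares "amount", "due", and "invoice" with the invoice centroid and no tokens with the receipt centroid, so the invoice label wins.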
Wrap any callable (a hosted LLM, a local model, a scikit-learn pipeline, a
heuristic) with the ByoModelClassifier adapter. It plugs into the same
interface and the same harness.
from doc_classifier_kit import Document, DocumentLabel, evaluate
from doc_classifier_kit.classifiers import ByoModelClassifier
from doc_classifier_kit.datasets import hard_eval_set
def my_model(text: str):
    # Call YOUR model here. This kit does not provide one.
    label = call_your_model(text)  # -> "invoice"
    return DocumentLabel(label), 0.93  # (label, confidence)
clf = ByoModelClassifier(my_model, name="my-model")
report = evaluate(clf, hard_eval_set())
print(report.format_table())
print(report.macro_f1, report.accuracy)

Your function may return any of:
- a DocumentLabel (or its string value, e.g. "invoice")
- a (label, confidence) tuple
- a (label, confidence, rationale) tuple
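One way an adapter could accept all three return shapes is to coerce them into a single record. The sketch below is hypothetical and does not show how ByoModelClassifier actually works: `Normalized` and `normalize_output` are illustrative names, labels are plain strings here, and using `None` for a missing confidence or rationale is an assumption.

```python
from typing import NamedTuple, Optional


class Normalized(NamedTuple):
    label: str
    confidence: Optional[float]  # None when the callable gave no confidence (assumption)
    rationale: Optional[str]     # None when the callable gave no rationale (assumption)


def normalize_output(result):
    """Coerce a bare label, a 2-tuple, or a 3-tuple into one shape."""
    if isinstance(result, tuple):
        if len(result) == 2:
            label, confidence = result
            return Normalized(str(label), float(confidence), None)
        if len(result) == 3:
            label, confidence, rationale = result
            return Normalized(str(label), float(confidence), str(rationale))
        raise ValueError(f"expected 2 or 3 items, got {len(result)}")
    return Normalized(str(result), None, None)


print(normalize_output("invoice"))
print(normalize_output(("invoice", 0.93)))
print(normalize_output(("invoice", 0.93, "mentions 'amount due'")))
```

Rejecting tuples of any other length keeps a malformed return value from being scored silently.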
examples/byo_llm_classifier.py is a
complete, runnable adapter that classifies with OpenAI or Anthropic
Claude using your API key, then puts it on the leaderboard next to the
baseline. The kit bundles no key and no model; "bring your own" is literal:
pip install openai # or: pip install anthropic
export OPENAI_API_KEY=sk-... # or: ANTHROPIC_API_KEY=sk-ant-...
python examples/byo_llm_classifier.py --provider openai --limit 40

Without a key (or without the SDK) it prints setup instructions and exits; it never fabricates results.
For full control, subclass Classifier:
from doc_classifier_kit import Classifier, Document, DocumentLabel, Prediction
class MyClassifier(Classifier):
    name = "my-classifier"

    def predict(self, document: Document) -> Prediction:
        # ... your logic ...
        return Prediction(DocumentLabel.RECEIPT, confidence=0.8)

The harness only talks to the Classifier interface, so it scores the baseline
and your model identically. Run several on the same set and rank them:
from doc_classifier_kit import evaluate, leaderboard
from doc_classifier_kit.classifiers import KeywordClassifier
from doc_classifier_kit.datasets import hard_eval_set
examples = hard_eval_set()
reports = [
    evaluate(KeywordClassifier(), examples),
    evaluate(my_classifier, examples),  # my_classifier: any Classifier instance, e.g. MyClassifier()
]
print(leaderboard(reports))

Metrics use standard definitions (precision, recall, F1 per label; both macro
averages across labels and micro averages pooled over documents, so you can read
the score the right way for a balanced or an imbalanced set). An unknown
prediction is treated as an
abstention: it counts as a miss for the true label but is never charged as a
false positive against any class, so abstaining honestly beats guessing wrong,
yet you still cannot game the board by always abstaining (you lose recall).
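The abstention rule can be checked by hand against the confusion matrix shown earlier. The sketch below recomputes per-class precision, recall, and F1, the macro averages, and accuracy from that matrix. It assumes, as the sample output suggests, that unknown is excluded from the scored classes and from the macro average. It is an independent recomputation, not the kit's evaluation code, and it does not cover the micro averages.

```python
LABELS = ["w2", "1099", "inv", "bank", "rcpt"]  # scored classes (unknown excluded, by assumption)
COLS = LABELS + ["unk"]

# Rows are truth, columns are prediction, in COLS order.
MATRIX = {
    "w2":   [27, 0, 0, 0, 0, 9],
    "1099": [0, 36, 0, 0, 0, 0],
    "inv":  [0, 0, 33, 0, 3, 0],
    "bank": [0, 0, 1, 35, 0, 0],
    "rcpt": [0, 0, 0, 0, 36, 0],
    "unk":  [0, 9, 9, 0, 0, 36],
}


def per_class(label):
    col = COLS.index(label)
    tp = MATRIX[label][col]
    predicted = sum(row[col] for row in MATRIX.values())  # TP + FP
    actual = sum(MATRIX[label])  # TP + FN; abstentions count as misses
    precision = tp / predicted if predicted else 0.0
    recall = tp / actual if actual else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


scores = {label: per_class(label) for label in LABELS}
macro = [sum(s[i] for s in scores.values()) / len(LABELS) for i in range(3)]
total = sum(sum(row) for row in MATRIX.values())
correct = sum(MATRIX[label][COLS.index(label)] for label in COLS)
accuracy = correct / total

for label, (p, r, f) in scores.items():
    print(f"{label:<6}{p:.3f}  {r:.3f}  {f:.3f}")
print(f"macro {macro[0]:.3f}  {macro[1]:.3f}  {macro[2]:.3f}")  # macro 0.898  0.928  0.905
print(f"accuracy {accuracy:.3f}")  # accuracy 0.868
```

The nine W-2 documents predicted as unknown lower W-2 recall to 0.750, but because unknown is not a scored class they are not charged as false positives to any label. The nine unknown documents predicted as 1099, by contrast, lower 1099 precision to 0.800.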
| set | builder | size | character |
|---|---|---|---|
| smoke | synthetic_eval_set() | 16 | clean, unambiguous; good for eyeballing the plumbing |
| hard | hard_eval_set() | 234 | messy, ambiguous, adversarial; measures a model |
Everything is synthetic and reproducible (hard is seeded). Swap in your own
(synthetic!) fixtures by building a list of LabeledExample:
from doc_classifier_kit import Document, DocumentLabel, LabeledExample
my_set = [
    LabeledExample(Document(text="..."), DocumentLabel.INVOICE),
    # ...
]

Keep evaluation data synthetic or appropriately handled; never commit real client documents to a public repository.
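If you add noisy fixtures of your own, seeding the noise keeps them reproducible in the same spirit as the seeded hard set. The sketch below is one possible approach, not the kit's generator: `add_ocr_noise`, the `OCR_SWAPS` table, and the `rate` parameter are hypothetical.

```python
import random

# Hypothetical look-alike substitutions of the kind OCR tends to produce.
OCR_SWAPS = {"0": "O", "1": "l", "5": "S", "8": "B"}


def add_ocr_noise(text, seed, rate=0.3):
    """Swap some characters for look-alikes, deterministically per seed."""
    rng = random.Random(seed)  # private generator; global random state is untouched
    out = []
    for ch in text:
        if ch in OCR_SWAPS and rng.random() < rate:
            out.append(OCR_SWAPS[ch])
        else:
            out.append(ch)
    return "".join(out)


clean = "INVOICE 10582 TOTAL 1805.00"
noisy_a = add_ocr_noise(clean, seed=7)
noisy_b = add_ocr_noise(clean, seed=7)
print(noisy_a == noisy_b)  # prints True: same seed, same fixture
```

Using a private `random.Random(seed)` instance rather than the module-level functions means other code that touches the global random state cannot change the generated fixtures.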
src/doc_classifier_kit/
  taxonomy.py            # DocumentLabel enum (the label vocabulary)
  types.py               # Document, Prediction, LabeledExample
  base.py                # Classifier ABC (the interface)
  evaluation.py          # evaluate(), EvaluationReport, confusion, leaderboard
  datasets.py            # synthetic_eval_set() + hard_eval_set()
  cli.py                 # eval / score / leaderboard commands
  classifiers/
    keyword.py           # KeywordClassifier (regex/keyword baseline)
    centroid.py          # CentroidClassifier (bag-of-words nearest-centroid)
    byo_model.py         # ByoModelClassifier (bring-your-own adapter)
examples/
  byo_llm_classifier.py  # runnable OpenAI/Claude BYO example (your key)
  sample_docs/           # synthetic .txt files for `score`
docs/RESULTS.md          # results format + leaderboard rules
tests/                   # pytest suite
Add a member to DocumentLabel in taxonomy.py, give it a human_name, and
(if you use the baseline) add rules for it in classifiers/keyword.py. The enum
is the single source of truth; never hard-code raw label strings elsewhere.
pytest

The suite includes a guard that the keyword baseline scores below 1.0 on the hard set; if the set ever becomes too easy, CI fails.
Apache-2.0. See NOTICE.
Issues and PRs welcome at https://github.com/maxed-oss/doc-classifier-kit. Three house rules:
- Keep it a harness. No model weights, no training code, no real datasets.
- Synthetic fixtures only. Never commit real documents or personal data.
- Don't make the eval set easier to win. New examples should be at least as hard; the baseline must stay below a perfect score.