doc-classifier-kit

An evaluation harness for document classifiers that sort accounting documents into a fixed taxonomy: W-2, 1099, invoice, bank statement, receipt (plus unknown for honest abstention).

It gives you everything you need to measure a classifier and prove the number to someone else: a stable label taxonomy, a clean classifier interface, a hard synthetic eval set of 234 examples, confusion matrices, per-class error analysis, macro and micro averages, a leaderboard, and a CLI that scores a folder of documents. Drop in your own model behind one method and you get an honest, reproducible score in seconds. A dependency-free keyword baseline and a nearest-centroid baseline ship in the box so you always have something to beat.

What it includes

The kit centers on reproducible evaluation: a fixed taxonomy, synthetic examples, classifier interfaces, scoring tools, and result reporting. The baseline classifier uses simple regex and keyword matching, while optional examples show how to plug in your own classifier behind the same interface.

Why the harness is the point

Anyone can write a classifier. The hard part is knowing whether it's good, and proving it to someone else. A score of 1.000 on a handful of clean fixtures tells you nothing. This kit ships a deliberately hard synthetic set (noisy OCR, ambiguous overlaps, adversarial decoys, out-of-scope documents that should abstain) so the baseline does not ace it, and so your model's score means something.

$ python -m doc_classifier_kit eval --set hard
Classifier: keyword-baseline
Examples: 234   Accuracy: 0.868

label             precision   recall      f1  support
-----------------------------------------------------
w2                    1.000    0.750   0.857       36
form_1099             0.800    1.000   0.889       36
invoice               0.767    0.917   0.835       36
bank_statement        1.000    0.972   0.986       36
receipt               0.923    1.000   0.960       36
-----------------------------------------------------
macro avg             0.898    0.928   0.905      234

That headroom is the whole point. Beat it with your model.

Install

Python 3.10+.

git clone https://github.com/maxed-oss/doc-classifier-kit
cd doc-classifier-kit
pip install -e ".[dev]"

Zero runtime dependencies. [dev] only pulls in pytest.

The CLI (measure anything)

# Score the keyword baseline on the hard set, with a confusion matrix and a
# per-class error analysis:
python -m doc_classifier_kit eval --set hard --confusion --errors

# Emit a JSON results object (for a results file / leaderboard submission):
python -m doc_classifier_kit eval --set hard --json > results.json

# Rank the built-in classifiers on the same set:
python -m doc_classifier_kit leaderboard --set hard

# Classify a folder of .txt files (no ground truth needed):
python -m doc_classifier_kit score ./examples/sample_docs

eval --confusion shows exactly where a classifier confuses one label for another (the unknown column makes abstentions visible):

Confusion matrix (rows=truth, cols=prediction):
truth \ pred      w2  1099   inv  bank  rcpt   unk
w2                27     0     0     0     0     9
1099               0    36     0     0     0     0
inv                0     0    33     0     3     0
bank               0     0     1    35     0     0
rcpt               0     0     0     0    36     0
unk                0     9     9     0     0    36

eval --errors clusters the mistakes and shows examples with the rule that fired, giving you a per-rule error analysis you can act on:

Most common confusions (truth -> predicted):
         unknown -> form_1099      x9
              w2 -> unknown        x9 [abstain]
         unknown -> invoice        x9
         invoice -> receipt        x3
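
A ranking like the one above is essentially a count of (truth, predicted) pairs with the correct predictions filtered out. The following is a minimal sketch of that idea, assuming plain string labels; `top_confusions` is a hypothetical helper name, not part of the kit, and it omits the "rule that fired" detail that the real report includes.

```python
from collections import Counter

def top_confusions(pairs, abstain_label="unknown"):
    """Rank (truth, predicted) mistakes by frequency, most common first.

    `pairs` is an iterable of (truth, predicted) label strings. Correct
    predictions are skipped. Hypothetical helper, not part of the kit.
    """
    counts = Counter((t, p) for t, p in pairs if t != p)
    ranked = []
    for (truth, pred), n in counts.most_common():
        # The last field flags mistakes that were abstentions.
        ranked.append((truth, pred, n, pred == abstain_label))
    return ranked

# Toy data reproducing two of the confusions listed above.
pairs = (
    [("w2", "w2")] * 27
    + [("w2", "unknown")] * 9
    + [("invoice", "invoice")] * 33
    + [("invoice", "receipt")] * 3
)

for truth, pred, n, abstained in top_confusions(pairs):
    tag = " [abstain]" if abstained else ""
    print(f"{truth:>16} -> {pred:<14} x{n}{tag}")
```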

See docs/RESULTS.md for the documented results format and the leaderboard rules.

Built-in baselines

The kit ships two structurally different baselines so the leaderboard means something out of the box:

KeywordClassifier scores hand-written regex/keyword rules per label.
CentroidClassifier learns one bag-of-words term-frequency profile (a centroid) per label from labeled examples you give it, then classifies by cosine similarity. It uses only the standard library, no model downloads.

from doc_classifier_kit import evaluate, leaderboard
from doc_classifier_kit.classifiers import CentroidClassifier, KeywordClassifier
from doc_classifier_kit.datasets import hard_eval_set

examples = hard_eval_set()
reports = [
    evaluate(KeywordClassifier(), examples),
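    # Caveat: this fits and scores the centroid on the same examples, so its
    # score is optimistic. Fit on a held-out split for a fair comparison.
    evaluate(CentroidClassifier.fit(examples), examples),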
]
print(leaderboard(reports))

leaderboard --set hard includes the centroid baseline automatically.
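
To make the nearest-centroid idea concrete, here is a minimal standard-library sketch of the technique described above: one averaged term-frequency vector per label, then classification by cosine similarity. It is not the kit's CentroidClassifier; the function names, the tokenizer, and the toy training texts are all assumptions made for illustration.

```python
import math
import re
from collections import Counter, defaultdict

def tokenize(text):
    """Lowercase and split into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(v * b.get(k, 0.0) for k, v in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

def fit_centroids(labeled_texts):
    """Average the term-frequency vectors of each label's documents."""
    sums = defaultdict(Counter)
    doc_counts = Counter()
    for text, label in labeled_texts:
        sums[label].update(tokenize(text))
        doc_counts[label] += 1
    return {
        label: {term: n / doc_counts[label] for term, n in tf.items()}
        for label, tf in sums.items()
    }

def predict(centroids, text):
    """Return (label, similarity) for the closest centroid."""
    vec = Counter(tokenize(text))
    return max(
        ((label, cosine(vec, c)) for label, c in centroids.items()),
        key=lambda pair: pair[1],
    )

# Toy, made-up training texts (two labels only).
train = [
    ("Invoice number 1042 amount due net 30", "invoice"),
    ("Invoice for services rendered, amount due on receipt of invoice", "invoice"),
    ("Statement period beginning balance ending balance deposits", "bank_statement"),
    ("Account statement: deposits, withdrawals, ending balance", "bank_statement"),
]
centroids = fit_centroids(train)
label, score = predict(centroids, "Please remit the amount due on this invoice")
print(label, round(score, 3))
```

This sketch always returns the nearest label. A classifier meant for the harness would probably want to return unknown below some similarity threshold, since abstentions are scored differently from wrong guesses.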

Bring your own model

Wrap any callable (a hosted LLM, a local model, a scikit-learn pipeline, a heuristic) with the ByoModelClassifier adapter. It plugs into the same interface and the same harness.

from doc_classifier_kit import DocumentLabel, evaluate
from doc_classifier_kit.classifiers import ByoModelClassifier
from doc_classifier_kit.datasets import hard_eval_set

def my_model(text: str):
    # Call YOUR model here. This kit does not provide one.
    label = call_your_model(text)        # -> "invoice"
    return DocumentLabel(label), 0.93    # (label, confidence)

clf = ByoModelClassifier(my_model, name="my-model")
report = evaluate(clf, hard_eval_set())
print(report.format_table())
print(report.macro_f1, report.accuracy)

Your function may return any of:

a DocumentLabel (or its string value, e.g. "invoice")
a (label, confidence) tuple
a (label, confidence, rationale) tuple
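
One way an adapter could accept all three shapes is to coerce them into a single result type before scoring. The sketch below illustrates that pattern only; `Label` and `Result` are stand-ins for the kit's DocumentLabel and Prediction, `normalize` is a hypothetical name, and defaulting a missing confidence to None is an assumption rather than the kit's documented behavior.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional

class Label(Enum):
    """Stand-in for the kit's DocumentLabel (subset of members)."""
    INVOICE = "invoice"
    RECEIPT = "receipt"
    UNKNOWN = "unknown"

@dataclass
class Result:
    """Stand-in for the kit's Prediction."""
    label: Label
    confidence: Optional[float] = None
    rationale: Optional[str] = None

def normalize(raw):
    """Coerce any of the three accepted return shapes into a Result."""
    if isinstance(raw, tuple):
        if not 2 <= len(raw) <= 3:
            raise ValueError(f"expected 2 or 3 items, got {len(raw)}")
        label, confidence, *rest = raw
        rationale = rest[0] if rest else None
    else:
        label, confidence, rationale = raw, None, None
    # Label("invoice") looks the member up by value; passing a Label
    # member returns that same member, so both forms are accepted.
    return Result(Label(label), confidence, rationale)

print(normalize("invoice"))
print(normalize((Label.RECEIPT, 0.8)))
print(normalize(("unknown", 0.2, "no strong signal")))
```

An unrecognized string raises ValueError from the enum lookup, which surfaces a model that invents labels instead of silently scoring it.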

Runnable BYO-LLM example (your key, your bill)

examples/byo_llm_classifier.py is a complete, runnable adapter that classifies with OpenAI or Anthropic Claude using your API key, then puts it on the leaderboard next to the baseline. The kit bundles no key and no model; "bring your own" is literal:

pip install openai                 # or: pip install anthropic
export OPENAI_API_KEY=sk-...        # or: ANTHROPIC_API_KEY=sk-ant-...
python examples/byo_llm_classifier.py --provider openai --limit 40

Without a key (or without the SDK) it prints setup instructions and exits; it never fabricates results.
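
A preflight check of this kind can be written with the standard library alone. The sketch below is an illustration of the pattern, not the code in examples/byo_llm_classifier.py: `missing_setup` and the `PROVIDERS` table are hypothetical, with the environment variable and package names taken from the setup commands above.

```python
import importlib.util
import os

# Provider -> (environment variable, SDK package). Hypothetical table.
PROVIDERS = {
    "openai": ("OPENAI_API_KEY", "openai"),
    "anthropic": ("ANTHROPIC_API_KEY", "anthropic"),
}

def missing_setup(provider, env=None):
    """Return a list of setup steps still needed; empty when ready to run."""
    env = os.environ if env is None else env
    key_var, package = PROVIDERS[provider]
    steps = []
    # find_spec returns None when the top-level package is not installed.
    if importlib.util.find_spec(package) is None:
        steps.append(f"pip install {package}")
    if not env.get(key_var):
        steps.append(f"export {key_var}=<your key>")
    return steps

# With an empty environment the key step is always reported.
for step in missing_setup("openai", env={}):
    print("setup needed:", step)
```

Running the check before any model call means a missing key produces instructions rather than a partial or invented result.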

Implement the interface directly

For full control, subclass Classifier:

from doc_classifier_kit import Classifier, Document, DocumentLabel, Prediction

class MyClassifier(Classifier):
    name = "my-classifier"

    def predict(self, document: Document) -> Prediction:
        # ... your logic ...
        return Prediction(DocumentLabel.RECEIPT, confidence=0.8)

Building a leaderboard

The harness only talks to the Classifier interface, so it scores the baseline and your model identically. Run several on the same set and rank them:

from doc_classifier_kit import evaluate, leaderboard
from doc_classifier_kit.classifiers import KeywordClassifier
from doc_classifier_kit.datasets import hard_eval_set

examples = hard_eval_set()
reports = [
    evaluate(KeywordClassifier(), examples),
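    # my_classifier is any Classifier instance you have built, for example
    # a Classifier subclass or a ByoModelClassifier wrapping your callable.
    evaluate(my_classifier, examples),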
]
print(leaderboard(reports))

Metrics use the standard definitions: precision, recall, and F1 per label, plus macro averages (the unweighted mean across labels) and micro averages (pooled over documents). Macro treats every label equally, so it is the one to read on an imbalanced set where rare labels matter; micro weights every document equally. An unknown prediction is treated as an abstention: it counts as a miss for the true label but is never charged as a false positive against any class. Abstaining honestly therefore beats guessing wrong, yet you still cannot game the board by always abstaining, because you lose recall.
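
These rules can be checked by hand against the confusion matrix printed in the CLI section. The sketch below recomputes the per-label and macro numbers from that matrix in plain Python. It is an independent illustration, not the kit's evaluation.py, and pooling the micro counts over only the five real labels is an assumption about how the kit does it.

```python
LABELS = ["w2", "form_1099", "invoice", "bank_statement", "receipt"]
UNKNOWN = "unknown"
COLUMNS = LABELS + [UNKNOWN]

# Rows are truth, columns are prediction, copied from the confusion matrix
# shown for `eval --set hard --confusion`.
MATRIX = {
    "w2":             [27, 0, 0, 0, 0, 9],
    "form_1099":      [0, 36, 0, 0, 0, 0],
    "invoice":        [0, 0, 33, 0, 3, 0],
    "bank_statement": [0, 0, 1, 35, 0, 0],
    "receipt":        [0, 0, 0, 0, 36, 0],
    UNKNOWN:          [0, 9, 9, 0, 0, 36],
}

def safe_div(n, d):
    return n / d if d else 0.0

def per_label(label):
    col = COLUMNS.index(label)
    tp = MATRIX[label][col]
    # False positives: other truths predicted as this label. An `unknown`
    # prediction never lands in a real label's column, so abstaining is
    # never charged here.
    fp = sum(MATRIX[t][col] for t in MATRIX if t != label)
    # False negatives: this truth predicted as anything else, abstentions
    # included, which is why always abstaining costs recall.
    fn = sum(MATRIX[label]) - tp
    p, r = safe_div(tp, tp + fp), safe_div(tp, tp + fn)
    return tp, fp, fn, p, r, safe_div(2 * p * r, p + r)

rows = {label: per_label(label) for label in LABELS}

# Macro average: unweighted mean of the per-label precision, recall, F1.
macro = [sum(r[i] for r in rows.values()) / len(LABELS) for i in (3, 4, 5)]

# Micro average: pool the counts first, then divide once.
tp, fp, fn = (sum(r[i] for r in rows.values()) for i in (0, 1, 2))
micro_p, micro_r = safe_div(tp, tp + fp), safe_div(tp, tp + fn)

for label, (_, _, _, p, r, f1) in rows.items():
    print(f"{label:<16} {p:.3f} {r:.3f} {f1:.3f}")
print(f"{'macro avg':<16} {macro[0]:.3f} {macro[1]:.3f} {macro[2]:.3f}")
print(f"{'micro avg':<16} {micro_p:.3f} {micro_r:.3f}")
```

The per-label and macro rows agree with the table printed by eval --set hard. For example, w2 keeps a precision of 1.000 despite nine abstentions, while its recall drops to 0.750.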

The eval sets

set     builder                size   character
------------------------------------------------------------------------------------
smoke   synthetic_eval_set()     16   clean, unambiguous; good for eyeballing the plumbing
hard    hard_eval_set()         234   messy, ambiguous, adversarial; measures a model

Everything is synthetic and reproducible (hard is seeded). Swap in your own (synthetic!) fixtures by building a list of LabeledExample:

from doc_classifier_kit import Document, DocumentLabel, LabeledExample

my_set = [
    LabeledExample(Document(text="..."), DocumentLabel.INVOICE),
    # ...
]

Keep evaluation data synthetic or appropriately handled. Never commit real client documents to a public repository.
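
If you generate your own noisy fixtures, seeding the random source keeps them reproducible from run to run. The sketch below shows one way to do that with OCR-style character swaps. The swap table, the function name `add_ocr_noise`, and the noise model itself are assumptions for illustration; the kit's hard set may be built differently.

```python
import random

# A few character confusions typical of OCR output (illustrative only).
OCR_SWAPS = {"0": "O", "1": "l", "5": "S", "o": "0", "l": "1"}

def add_ocr_noise(text, rate, seed):
    """Swap confusable characters with probability `rate`, reproducibly."""
    rng = random.Random(seed)  # local RNG, so global random state is untouched
    out = []
    for ch in text:
        if ch in OCR_SWAPS and rng.random() < rate:
            out.append(OCR_SWAPS[ch])
        else:
            out.append(ch)
    return "".join(out)

clean = "Invoice 1042: total due 150.00"
noisy_a = add_ocr_noise(clean, rate=0.5, seed=7)
noisy_b = add_ocr_noise(clean, rate=0.5, seed=7)
print(noisy_a)
print(noisy_a == noisy_b)  # same seed, same fixture
```

The noisy text can then be wrapped in a LabeledExample exactly as in the snippet above.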

Project layout

src/doc_classifier_kit/
  taxonomy.py            # DocumentLabel enum (the label vocabulary)
  types.py               # Document, Prediction, LabeledExample
  base.py                # Classifier ABC (the interface)
  evaluation.py          # evaluate(), EvaluationReport, confusion, leaderboard
  datasets.py            # synthetic_eval_set() + hard_eval_set()
  cli.py                 # eval / score / leaderboard commands
  classifiers/
    keyword.py           # KeywordClassifier (regex/keyword baseline)
    centroid.py          # CentroidClassifier (bag-of-words nearest-centroid)
    byo_model.py         # ByoModelClassifier (bring-your-own adapter)
examples/
  byo_llm_classifier.py  # runnable OpenAI/Claude BYO example (your key)
  sample_docs/           # synthetic .txt files for `score`
docs/RESULTS.md          # results format + leaderboard rules
tests/                   # pytest suite

Extending the taxonomy

Add a member to DocumentLabel in taxonomy.py, give it a human_name, and (if you use the baseline) add rules for it in classifiers/keyword.py. The enum is the single source of truth; never hard-code raw label strings elsewhere.
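
As a rough picture of what "add a member and give it a human_name" can look like, here is a self-contained sketch. `LabelSketch` is a stand-in, not the real DocumentLabel: the member values are inferred from the labels printed by the CLI, `PAY_STUB` is a hypothetical new member, and storing display names in a separate dictionary is an assumption about the implementation.

```python
from enum import Enum

class LabelSketch(Enum):
    """Stand-in for DocumentLabel with one hypothetical addition."""
    W2 = "w2"
    FORM_1099 = "form_1099"
    INVOICE = "invoice"
    BANK_STATEMENT = "bank_statement"
    RECEIPT = "receipt"
    UNKNOWN = "unknown"
    PAY_STUB = "pay_stub"  # hypothetical new member

    @property
    def human_name(self):
        return HUMAN_NAMES[self]

HUMAN_NAMES = {
    LabelSketch.W2: "W-2",
    LabelSketch.FORM_1099: "1099",
    LabelSketch.INVOICE: "Invoice",
    LabelSketch.BANK_STATEMENT: "Bank statement",
    LabelSketch.RECEIPT: "Receipt",
    LabelSketch.UNKNOWN: "Unknown",
    LabelSketch.PAY_STUB: "Pay stub",
}

# Guard: every member must have a display name, so a new label cannot be
# added without one.
missing = [member for member in LabelSketch if member not in HUMAN_NAMES]
assert not missing, f"labels without a human_name: {missing}"

# Parse raw model output once, at the boundary, instead of comparing
# strings throughout the code.
label = LabelSketch("pay_stub")
print(label, label.human_name)
```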

Running the tests

pytest

The suite includes a guard that the keyword baseline scores below 1.0 on the hard set: if the set ever becomes too easy, CI fails.

License

Apache-2.0. See NOTICE.

Contributing

Issues and PRs welcome at https://github.com/maxed-oss/doc-classifier-kit. Three house rules:

Keep it a harness. No model weights, no training code, no real datasets.
Synthetic fixtures only. Never commit real documents or personal data.
Don't make the eval set easier to win. New examples should be at least as hard; the baseline must stay below a perfect score.
