Maxed-OSS/doc-classifier-kit

doc-classifier-kit

An evaluation harness for document classifiers that sort accounting documents into a fixed taxonomy: W-2, 1099, invoice, bank statement, receipt (plus unknown for honest abstention).

It gives you everything you need to measure a classifier and prove the number to someone else: a stable label taxonomy, a clean classifier interface, a hard synthetic eval set of more than 200 examples, confusion matrices, per-class error analysis, macro and micro averages, a leaderboard, and a CLI that scores a folder of documents. Put your own model behind one method and you get an honest, reproducible score in seconds. A dependency-free keyword baseline and a nearest-centroid baseline ship in the box, so you always have something to beat.

What it includes

The kit centers on reproducible evaluation: a fixed taxonomy, synthetic examples, classifier interfaces, scoring tools, and result reporting. The keyword baseline uses simple regex and keyword matching, while optional examples show how to plug in your own classifier behind the same interface.

Why the harness is the point

Anyone can write a classifier. The hard part is knowing whether it's good, and proving it to someone else. A score of 1.000 on a handful of clean fixtures tells you nothing. This kit ships a deliberately hard synthetic set (noisy OCR, ambiguous overlaps, adversarial decoys, out-of-scope documents that should abstain) so the baseline does not ace it, and so your model's score means something.

$ python -m doc_classifier_kit eval --set hard
Classifier: keyword-baseline
Examples: 234   Accuracy: 0.868

label             precision   recall      f1  support
-----------------------------------------------------
w2                    1.000    0.750   0.857       36
form_1099             0.800    1.000   0.889       36
invoice               0.767    0.917   0.835       36
bank_statement        1.000    0.972   0.986       36
receipt               0.923    1.000   0.960       36
-----------------------------------------------------
macro avg             0.898    0.928   0.905      234

That headroom is the whole point. Beat it with your model.

Install

Python 3.10+.

git clone https://github.com/maxed-oss/doc-classifier-kit
cd doc-classifier-kit
pip install -e ".[dev]"

Zero runtime dependencies. [dev] only pulls in pytest.

The CLI (measure anything)

# Score the keyword baseline on the hard set, with a confusion matrix and a
# per-class error analysis:
python -m doc_classifier_kit eval --set hard --confusion --errors

# Emit a JSON results object (for a results file / leaderboard submission):
python -m doc_classifier_kit eval --set hard --json > results.json

# Rank the built-in classifiers on the same set:
python -m doc_classifier_kit leaderboard --set hard

# Classify a folder of .txt files (no ground truth needed):
python -m doc_classifier_kit score ./examples/sample_docs

eval --confusion shows exactly where a classifier confuses one label for another (the unknown column makes abstentions visible):

Confusion matrix (rows=truth, cols=prediction):
truth \ pred      w2  1099   inv  bank  rcpt   unk
w2                27     0     0     0     0     9
1099               0    36     0     0     0     0
inv                0     0    33     0     3     0
bank               0     0     1    35     0     0
rcpt               0     0     0     0    36     0
unk                0     9     9     0     0    36
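
If you want to check the report against the matrix, the per-label numbers fall straight out of row and column sums. The sketch below is not the kit's code; it copies the matrix printed above into a plain list and recomputes precision, recall, and accuracy by hand. The names LABELS, MATRIX, and per_label_scores are made up for illustration.

```python
# Confusion matrix copied from the output above (rows = truth, cols = prediction).
LABELS = ["w2", "1099", "inv", "bank", "rcpt", "unk"]
MATRIX = [
    [27, 0, 0, 0, 0, 9],
    [0, 36, 0, 0, 0, 0],
    [0, 0, 33, 0, 3, 0],
    [0, 0, 1, 35, 0, 0],
    [0, 0, 0, 0, 36, 0],
    [0, 9, 9, 0, 0, 36],
]


def per_label_scores(labels, matrix):
    """Return {label: (precision, recall)} rounded to three decimals."""
    scores = {}
    for i, label in enumerate(labels):
        true_pos = matrix[i][i]
        support = sum(matrix[i])                    # row sum: documents with this truth
        predicted = sum(row[i] for row in matrix)   # column sum: documents given this label
        precision = true_pos / predicted if predicted else 0.0
        recall = true_pos / support if support else 0.0
        scores[label] = (round(precision, 3), round(recall, 3))
    return scores


scores = per_label_scores(LABELS, MATRIX)
accuracy = sum(MATRIX[i][i] for i in range(len(LABELS))) / sum(map(sum, MATRIX))

for label in LABELS[:-1]:   # the report table lists the five real labels
    print(label, scores[label])
print("accuracy", round(accuracy, 3))
```

The nine unknown documents predicted as 1099 sit in the 1099 column, which is why that label's precision is 36/45 = 0.800 even though its recall is perfect.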

eval --errors clusters the mistakes and shows examples alongside the rule that fired, giving you a per-rule error analysis you can act on:

Most common confusions (truth -> predicted):
         unknown -> form_1099      x9
              w2 -> unknown        x9 [abstain]
         unknown -> invoice        x9
         invoice -> receipt        x3

See docs/RESULTS.md for the documented results format and the leaderboard rules.

Built-in baselines

The kit ships two structurally different baselines so the leaderboard means something out of the box:

  • KeywordClassifier scores hand-written regex/keyword rules per label.
  • CentroidClassifier learns one bag-of-words term-frequency profile (a centroid) per label from labeled examples you give it, then classifies by cosine similarity. It uses only the standard library and needs no model downloads.

Both plug into the same harness:

from doc_classifier_kit import evaluate, leaderboard
from doc_classifier_kit.classifiers import CentroidClassifier, KeywordClassifier
from doc_classifier_kit.datasets import hard_eval_set

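examples = hard_eval_set()
reports = [
    evaluate(KeywordClassifier(), examples),
    # Note: this fits and scores the centroid on the same examples, so its
    # score is optimistic; fit on a separate labeled set for a fair comparison.
    evaluate(CentroidClassifier.fit(examples), examples),
]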
print(leaderboard(reports))

leaderboard --set hard includes the centroid baseline automatically.
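
To make the nearest-centroid idea concrete, here is a minimal standard-library sketch of the technique described above: one bag-of-words profile per label, classified by cosine similarity. It is an illustration under assumptions, not the kit's CentroidClassifier; the tokenizer, the function names, and the tiny training set are all invented for the example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def fit_centroids(labeled_texts):
    """Sum term counts per label. Cosine similarity ignores vector length,
    so the summed counts point the same way as the averaged centroid."""
    centroids = {}
    for text, label in labeled_texts:
        centroids.setdefault(label, Counter()).update(tokenize(text))
    return centroids


def cosine(a, b):
    """Cosine similarity between two term-count vectors (Counters)."""
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def classify(text, centroids):
    """Return the most similar label and its similarity score."""
    vector = Counter(tokenize(text))
    best = max(centroids, key=lambda label: cosine(vector, centroids[label]))
    return best, cosine(vector, centroids[best])


train = [
    ("invoice number 1001 amount due net 30", "invoice"),
    ("invoice total amount due upon receipt of goods", "invoice"),
    ("bank statement beginning balance ending balance deposits", "bank_statement"),
    ("account statement balance withdrawals deposits", "bank_statement"),
]
centroids = fit_centroids(train)
label, score = classify("amount due on invoice 2002", centroids)
print(label, round(score, 3))
```

This sketch always returns some label. A real classifier would also need an abstention rule, for example returning unknown when the best similarity falls below a threshold.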

Bring your own model

Wrap any callable (a hosted LLM, a local model, a scikit-learn pipeline, a heuristic) with the ByoModelClassifier adapter. It plugs into the same interface and the same harness.

from doc_classifier_kit import Document, DocumentLabel, evaluate
from doc_classifier_kit.classifiers import ByoModelClassifier
from doc_classifier_kit.datasets import hard_eval_set

def my_model(text: str):
    # Call YOUR model here. This kit does not provide one.
    label = call_your_model(text)        # -> "invoice"
    return DocumentLabel(label), 0.93    # (label, confidence)

clf = ByoModelClassifier(my_model, name="my-model")
report = evaluate(clf, hard_eval_set())
print(report.format_table())
print(report.macro_f1, report.accuracy)

Your function may return any of:

  • a DocumentLabel (or its string value, e.g. "invoice")
  • a (label, confidence) tuple
  • a (label, confidence, rationale) tuple
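
One way to picture how an adapter can accept all three shapes is a small normalizing step that coerces each into the same triple. The sketch below is hypothetical and is not the kit's implementation: normalize_output is an invented name, labels are plain strings to keep it self-contained, and using None for a missing confidence is an assumption.

```python
def normalize_output(result):
    """Coerce a bare label, (label, confidence), or (label, confidence, rationale)
    into a uniform (label, confidence, rationale) triple."""
    if isinstance(result, tuple):
        if len(result) == 2:
            label, confidence = result
            return label, float(confidence), None
        if len(result) == 3:
            label, confidence, rationale = result
            return label, float(confidence), rationale
        raise ValueError(f"expected a 2- or 3-tuple, got {len(result)} items")
    # A bare label carries no confidence; None here is an assumed default.
    return result, None, None


print(normalize_output("invoice"))
print(normalize_output(("invoice", 0.93)))
print(normalize_output(("w2", 1, "box 1 wages found")))
```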

Runnable BYO-LLM example (your key, your bill)

examples/byo_llm_classifier.py is a complete, runnable adapter that classifies with OpenAI or Anthropic Claude using your API key, then puts it on the leaderboard next to the baseline. The kit bundles no key and no model; "bring your own" is literal:

pip install openai                 # or: pip install anthropic
export OPENAI_API_KEY=sk-...        # or: ANTHROPIC_API_KEY=sk-ant-...
python examples/byo_llm_classifier.py --provider openai --limit 40

Without a key (or without the SDK) it prints setup instructions and exits. It never fabricates results.
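
When you wrap an LLM, the reply is free text, so you need a strict step that maps it onto the taxonomy and abstains on anything unclear. The following is a sketch of that step only; it makes no API calls and is not taken from examples/byo_llm_classifier.py. The label strings are the ones shown in the eval output, and treating them as the enum values is an assumption (only "invoice" is confirmed above). parse_reply is a hypothetical helper.

```python
import re

# Label strings as printed in the eval report; assumed to match the enum values.
KNOWN_LABELS = ("w2", "form_1099", "invoice", "bank_statement", "receipt")


def parse_reply(reply: str) -> str:
    """Map a model's free-text reply to one label, abstaining on anything unclear."""
    # Lowercase and collapse every run of non-alphanumerics into one underscore.
    normalized = re.sub(r"[^a-z0-9]+", "_", reply.strip().lower()).strip("_")
    if normalized in KNOWN_LABELS:
        return normalized
    # Accept a reply that mentions exactly one known label; otherwise abstain.
    mentioned = [label for label in KNOWN_LABELS if label in normalized]
    return mentioned[0] if len(mentioned) == 1 else "unknown"


print(parse_reply(" Invoice. "))
print(parse_reply("Bank Statement"))
print(parse_reply("Either a receipt or an invoice"))
```

Abstaining on an ambiguous reply keeps the score honest: the harness treats unknown as a miss rather than a wrong guess.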

Implement the interface directly

For full control, subclass Classifier:

from doc_classifier_kit import Classifier, Document, DocumentLabel, Prediction

class MyClassifier(Classifier):
    name = "my-classifier"

    def predict(self, document: Document) -> Prediction:
        # ... your logic ...
        return Prediction(DocumentLabel.RECEIPT, confidence=0.8)

Building a leaderboard

The harness only talks to the Classifier interface, so it scores the baseline and your model identically. Run several on the same set and rank them:

from doc_classifier_kit import evaluate, leaderboard
from doc_classifier_kit.classifiers import KeywordClassifier
from doc_classifier_kit.datasets import hard_eval_set

examples = hard_eval_set()
reports = [
    evaluate(KeywordClassifier(), examples),
    # my_classifier is any Classifier instance, e.g. MyClassifier() or a ByoModelClassifier.
    evaluate(my_classifier, examples),
]
print(leaderboard(reports))

Metrics use the standard definitions: precision, recall, and F1 per label, plus macro averages (taken across labels) and micro averages (pooled over documents), so you can read the score the right way for a balanced or an imbalanced set. An unknown prediction is treated as an abstention: it counts as a miss for the true label but is never charged as a false positive against any class. Abstaining honestly therefore beats guessing wrong, yet you cannot game the board by always abstaining, because every abstention costs recall.
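
The abstention rule is easiest to see on a toy example. The sketch below is an independent illustration, not the kit's evaluate(): it computes macro F1 from (truth, prediction) pairs and compares a model that abstains on one hard W-2 with a model that guesses invoice instead. The function name, the toy data, and the convention of scoring an undefined precision as 0.0 are assumptions.

```python
def macro_f1(pairs, labels):
    """Macro-averaged F1 over `labels`; an "unknown" prediction is an abstention."""
    f1_scores = []
    for label in labels:
        true_pos = sum(1 for truth, pred in pairs if truth == label and pred == label)
        support = sum(1 for truth, _ in pairs if truth == label)
        # Abstentions never equal a real label, so they add no false positives.
        predicted = sum(1 for _, pred in pairs if pred == label)
        precision = true_pos / predicted if predicted else 0.0
        recall = true_pos / support if support else 0.0
        denom = precision + recall
        f1_scores.append(2 * precision * recall / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)


LABELS = ["w2", "invoice"]
correct = [("w2", "w2")] * 3 + [("invoice", "invoice")] * 4
abstained = correct + [("w2", "unknown")]   # hard W-2: the model abstains
guessed = correct + [("w2", "invoice")]     # hard W-2: the model guesses wrong
always_abstain = [(truth, "unknown") for truth, _ in guessed]

print(round(macro_f1(abstained, LABELS), 3))       # 0.929
print(round(macro_f1(guessed, LABELS), 3))         # 0.873
print(round(macro_f1(always_abstain, LABELS), 3))  # 0.0
```

Both models lose W-2 recall on the hard document, but only the wrong guess also drags down invoice precision. Always abstaining scores zero, because no label ever gets a true positive.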

The eval sets

set    builder               size  character
------------------------------------------------------------------------------------
smoke  synthetic_eval_set()    16  clean, unambiguous; good for eyeballing the plumbing
hard   hard_eval_set()        234  messy, ambiguous, adversarial; measures a model

Everything is synthetic and reproducible (hard is seeded). Swap in your own (synthetic!) fixtures by building a list of LabeledExample:

from doc_classifier_kit import Document, DocumentLabel, LabeledExample

my_set = [
    LabeledExample(Document(text="..."), DocumentLabel.INVOICE),
    # ...
]

Keep evaluation data synthetic or appropriately handled. Never commit real client documents to a public repository.
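
If you build your own noisy fixtures, seeding the noise keeps the set reproducible, so two runs score exactly the same documents. The sketch below shows one way to do that with a private random.Random instance. It is not how hard_eval_set() is built; OCR_SWAPS and add_ocr_noise are hypothetical names, and the substitution table is a made-up approximation of OCR misreads.

```python
import random

# Hypothetical look-alike substitutions that mimic common OCR misreads.
OCR_SWAPS = {"0": "O", "1": "l", "5": "S", "l": "1", "O": "0"}


def add_ocr_noise(text: str, rate: float, seed: int) -> str:
    """Swap a fraction of look-alike characters, deterministically for a given seed."""
    rng = random.Random(seed)   # private generator: no global state is touched
    noisy = []
    for char in text:
        if char in OCR_SWAPS and rng.random() < rate:
            noisy.append(OCR_SWAPS[char])
        else:
            noisy.append(char)
    return "".join(noisy)


clean = "INVOICE 1005  Total due: 150.00"
noisy_a = add_ocr_noise(clean, rate=0.5, seed=7)
noisy_b = add_ocr_noise(clean, rate=0.5, seed=7)
print(noisy_a == noisy_b)   # True: same seed, same fixture
```

Each noisy string can then be wrapped as LabeledExample(Document(text=...), DocumentLabel.INVOICE), exactly as in the list above.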

Project layout

src/doc_classifier_kit/
  taxonomy.py            # DocumentLabel enum (the label vocabulary)
  types.py               # Document, Prediction, LabeledExample
  base.py                # Classifier ABC (the interface)
  evaluation.py          # evaluate(), EvaluationReport, confusion, leaderboard
  datasets.py            # synthetic_eval_set() + hard_eval_set()
  cli.py                 # eval / score / leaderboard commands
  classifiers/
    keyword.py           # KeywordClassifier (regex/keyword baseline)
    centroid.py          # CentroidClassifier (bag-of-words nearest-centroid)
    byo_model.py         # ByoModelClassifier (bring-your-own adapter)
examples/
  byo_llm_classifier.py  # runnable OpenAI/Claude BYO example (your key)
  sample_docs/           # synthetic .txt files for `score`
docs/RESULTS.md          # results format + leaderboard rules
tests/                   # pytest suite

Extending the taxonomy

Add a member to DocumentLabel in taxonomy.py, give it a human_name, and (if you use the baseline) add rules for it in classifiers/keyword.py. The enum is the single source of truth; never hard-code raw label strings elsewhere.
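
As a rough picture of what such an extension involves, the sketch below adds a PURCHASE_ORDER member to a stand-in enum and registers keyword rules for it. Everything here is an assumption made for illustration: the real DocumentLabel, the way human_name is stored, and the rule format in classifiers/keyword.py may all look different.

```python
import re
from enum import Enum


class DocumentLabel(str, Enum):
    """Stand-in for the kit's enum; the member values are assumed."""
    INVOICE = "invoice"
    RECEIPT = "receipt"
    PURCHASE_ORDER = "purchase_order"   # the newly added member
    UNKNOWN = "unknown"

    @property
    def human_name(self) -> str:
        return HUMAN_NAMES[self]


# Hypothetical display names, looked up when human_name is read.
HUMAN_NAMES = {
    DocumentLabel.INVOICE: "Invoice",
    DocumentLabel.RECEIPT: "Receipt",
    DocumentLabel.PURCHASE_ORDER: "Purchase Order",
    DocumentLabel.UNKNOWN: "Unknown",
}

# Hypothetical rule table keyed by enum member, never by raw string.
RULES = {
    DocumentLabel.INVOICE: [re.compile(r"\binvoice\b", re.I)],
    DocumentLabel.RECEIPT: [re.compile(r"\breceipt\b", re.I)],
    DocumentLabel.PURCHASE_ORDER: [
        re.compile(r"\bpurchase order\b", re.I),
        re.compile(r"\bPO\s*#", re.I),
    ],
}


def classify(text: str) -> DocumentLabel:
    """Pick the label whose rules match most often; abstain when nothing fires."""
    hits = {
        label: sum(bool(pattern.search(text)) for pattern in patterns)
        for label, patterns in RULES.items()
    }
    best = max(hits, key=hits.get)
    return best if hits[best] > 0 else DocumentLabel.UNKNOWN


print(classify("PO # 4471 Purchase Order").human_name)
print(classify("lorem ipsum").human_name)
```

Keying the rule table by enum member means a typo in a label fails loudly as an AttributeError when the module loads, instead of silently creating a label no report knows about.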

Running the tests

pytest

The suite includes a guard that the keyword baseline scores below 1.0 on the hard set. If the set ever becomes too easy, CI fails.

License

Apache-2.0. See NOTICE.

Contributing

Issues and PRs welcome at https://github.com/maxed-oss/doc-classifier-kit. Three house rules:

  1. Keep it a harness. No model weights, no training code, no real datasets.
  2. Synthetic fixtures only. Never commit real documents or personal data.
  3. Don't make the eval set easier to win. New examples should be at least as hard; the baseline must stay below a perfect score.

About

Accounting document classification toolkit with fixed taxonomy, synthetic fixtures, scoring tools, and pluggable classifier interfaces.
