majmur404/CodeRAG

CodeRAG

7-stage local RAG engine over your codebase. Async orchestrator, sqlite-backed vector index, hybrid BM25+cosine retrieval with RRF fusion. ~3200 LOC.

Architecture

| Stage | Role | LOC |
| --- | --- | --- |
| Crawler | Walk repo respecting .gitignore, filter by language | 280 |
| Chunker | AST-aware chunks for Python, line-window for others | 320 |
| Embedder | Deterministic hash-sketch + TF-IDF, L2-normalized | 280 |
| VectorIndex | SQLite-backed BLOB vectors, FTS5 hybrid ranking | 340 |
| Retriever | Top-k via RRF score fusion (BM25 + cosine) | 280 |
| ContextPacker | Diversity-sampled context window with dedup | 260 |
| AnswerSynthesizer | Template synthesis with file:line citations | 240 |
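
As a rough illustration of what "AST-aware chunks for Python" can mean, the sketch below uses the standard ast module to emit one chunk per top-level function or class, keeping decorators attached. The Chunk dataclass and chunk_python_source function are hypothetical names for this example and are not taken from the project's source; the real Chunker may split differently (for example, by method or with size limits).

```python
import ast
from dataclasses import dataclass
from typing import List


@dataclass
class Chunk:
    # Hypothetical record; the project's real schema may differ.
    name: str
    start_line: int  # 1-based, inclusive
    end_line: int    # 1-based, inclusive
    text: str


def chunk_python_source(source: str) -> List[Chunk]:
    """Emit one chunk per top-level function or class definition."""
    tree = ast.parse(source)
    lines = source.splitlines()
    chunks: List[Chunk] = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            # Decorators sit above node.lineno, so start at the earliest one.
            start = min([node.lineno] + [d.lineno for d in node.decorator_list])
            end = node.end_lineno  # available on Python 3.8+
            text = "\n".join(lines[start - 1:end])
            chunks.append(Chunk(node.name, start, end, text))
    return chunks
```

Because each chunk carries its start and end lines, a later stage can cite results as file:line ranges without re-parsing the file.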

The RAGEngine facade owns the lifecycle: index(repo_path) builds the FTS5 and vector store, and query(question) runs the four read-side stages and returns a RAGAnswer with citations.
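
The following is a minimal sketch of how a facade could chain synchronous stages from async code with asyncio.to_thread. The run_stages function and the dict-based context are assumptions made for illustration; the project's actual Stage interface and RAGContext type are not shown in this README.

```python
import asyncio
from typing import Callable, Sequence

# Hypothetical stage signature: a plain synchronous callable that takes the
# running context and returns the updated context.
SyncStage = Callable[[dict], dict]


async def run_stages(stages: Sequence[SyncStage], context: dict) -> dict:
    """Run blocking stages in order without blocking the event loop."""
    for stage in stages:
        # asyncio.to_thread (Python 3.9+) runs the callable in a worker
        # thread and returns an awaitable for its result.
        context = await asyncio.to_thread(stage, context)
    return context
```

Stages still run one after another here; the benefit is that the event loop stays free for other tasks while a stage does blocking file or SQLite work.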

Features

  • Async orchestrator dispatches sync stages via asyncio.to_thread
  • Hybrid retrieval: BM25 (FTS5) + cosine (numpy on BLOB) fused via RRF
  • AST-aware chunking preserves function/class boundaries for Python
  • Incremental re-index by file mtime
  • Cited answers: every claim links back to file path + line range
  • Pluggable stages via Stage inheritance
  • Pure stdlib + numpy — no model downloads, no torch
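
For readers unfamiliar with RRF (reciprocal rank fusion), here is a small sketch of the standard formula: each document scores the sum of 1 / (k + rank) over every ranking it appears in. The function name rrf_fuse, the default k = 60, and the tie-break rule are assumptions for this example; the project reads its k constant from configuration.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def rrf_fuse(rankings: Sequence[Sequence[str]], k: int = 60, top_k: int = 8) -> List[str]:
    """Fuse several ranked id lists (best first) into one ranking."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # Only the rank matters, so BM25 and cosine scores never need
            # to be put on a common scale.
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id for determinism.
    ordered = sorted(scores, key=lambda d: (-scores[d], d))
    return ordered[:top_k]


bm25_hits = ["a", "b", "c"]    # illustrative ids, best first
cosine_hits = ["b", "c", "d"]
print(rrf_fuse([bm25_hits, cosine_hits]))  # ['b', 'c', 'a', 'd']
```

Documents that appear in both lists ("b" and "c") outrank those that appear in only one, even when the single-list hit was ranked first.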

Usage

pip install -r requirements.txt
python -m src.cli index ./my_repo --db ./my.db
python -m src.cli query "how does authentication work?" --top-k 8 --db ./my.db
python -m src.cli stats --db ./my.db

Token Consumption

During the design and implementation phase, this project consumed ~16M tokens/day across Hermes Agent, Claude Code, and Xiaomi MiMo V2.5 Pro for AST-aware chunking strategy iteration, RRF score-fusion design, hybrid retrieval tuning, and continuous test maintenance.

Testing

pytest tests/ -v

108 tests covering all 7 stages, RAGEngine dispatch, sqlite vector index, RRF fusion, AST chunker, schemas, and the language/text/hashing utilities. Real I/O, no mocks — every test runs against a fresh :memory: store or a tmp_path repo.
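
To give a flavor of a no-mocks test against a fresh :memory: store, the sketch below round-trips a vector through a SQLite BLOB column. The table layout and helper names are hypothetical, and the standard-library array module stands in for numpy to keep the example self-contained; the project's real schema lives in schema.sql.

```python
import sqlite3
from array import array
from typing import List


def pack_vector(vec: List[float]) -> bytes:
    """Serialize floats as a float32 BLOB (4 bytes per component)."""
    return array("f", vec).tobytes()


def unpack_vector(blob: bytes) -> List[float]:
    """Inverse of pack_vector."""
    out = array("f")
    out.frombytes(blob)
    return out.tolist()


def test_vector_blob_round_trip() -> None:
    conn = sqlite3.connect(":memory:")  # fresh store, nothing touches disk
    conn.execute("CREATE TABLE chunks (id INTEGER PRIMARY KEY, path TEXT, vec BLOB)")
    # Values chosen to be exactly representable in float32.
    conn.execute(
        "INSERT INTO chunks (path, vec) VALUES (?, ?)",
        ("auth.py", pack_vector([0.5, -0.25, 1.0])),
    )
    (blob,) = conn.execute(
        "SELECT vec FROM chunks WHERE path = ?", ("auth.py",)
    ).fetchone()
    assert unpack_vector(blob) == [0.5, -0.25, 1.0]
    conn.close()
```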

Project Structure

src/
├── stages/        # 7 stages + Stage base class
├── storage/       # sqlite_store + schema.sql
├── io/            # file reader + language detection
├── models/        # RAGContext + dataclass schemas
├── utils/         # config, logger, hashing, text, metrics
├── engine.py      # RAGEngine facade
└── cli.py         # click CLI

Configuration

config/default.yaml controls chunk size, embedding dim, top-k, RRF k constant, and per-stage timeouts. Override via --config or CODERAG_* env vars.
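
One plausible way to layer CODERAG_* environment variables over file defaults is sketched below. The key names, default values, and the apply_env_overrides helper are all assumptions for illustration; check config/default.yaml for the real keys and the project's config utility for its actual precedence rules.

```python
from typing import Any, Dict, Mapping

# Hypothetical keys and values; the real ones are defined in config/default.yaml.
DEFAULTS: Dict[str, Any] = {"chunk_size": 40, "embedding_dim": 256, "top_k": 8, "rrf_k": 60}


def apply_env_overrides(
    defaults: Mapping[str, Any],
    environ: Mapping[str, str],
    prefix: str = "CODERAG_",
) -> Dict[str, Any]:
    """Return defaults with any PREFIX + KEY environment values applied."""
    merged = dict(defaults)
    for key, default in defaults.items():
        raw = environ.get(prefix + key.upper())
        if raw is not None:
            # Coerce to the default's type so "16" becomes 16. This simple
            # rule suits int, float, and str values, but not booleans.
            merged[key] = type(default)(raw)
    return merged
```

In real use the second argument would be os.environ, for example apply_env_overrides(DEFAULTS, os.environ); passing a plain mapping keeps the function easy to test.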

License

MIT


Built with: Hermes Agent, MiMo + Claude series
