From ae88a4bb1952850d29f66ae24feaf079f0128137 Mon Sep 17 00:00:00 2001 From: Nemanja Date: Mon, 18 May 2026 09:14:54 +0200 Subject: [PATCH 1/6] feat: Add blog section --- .gitignore | 1 - blog.md | 304 +++++++++++++++++++++++++++++++++++++++++++++++++++++ 2 files changed, 304 insertions(+), 1 deletion(-) create mode 100644 blog.md diff --git a/.gitignore b/.gitignore index d9e067b..8fe139c 100644 --- a/.gitignore +++ b/.gitignore @@ -21,7 +21,6 @@ build/ *.iml .idea/ -blog.md # Local config (use config.example.yaml as a template) config.yaml diff --git a/blog.md b/blog.md new file mode 100644 index 0000000..0989d54 --- /dev/null +++ b/blog.md @@ -0,0 +1,304 @@ +# Blog Post Plan: How semcode Builds a RAG System for Code Search + +## Context + +This blog post explains the RAG (retrieval-augmented generation) pipeline behind +[**semcode**](https://github.com/GoodbyePlanet/semcode), an MCP server that does +semantic code search across your GitHub repositories. It covers both parts of the pipeline: the **ingestion** side — how +repositories are found, how code is parsed into symbols with Tree-sitter, how embedding inputs are constructed both +dense and sparse, and how +points land in Qdrant incrementally — and the **retrieval** side — how queries are encoded into both dense and sparse +vectors and fused server-side with RRF (Reciprocal Rank Fusion). Along the way we'll cover why a hybrid dense+sparse +approach beats either one alone for code, and why the *payload* stored next to each vector matters as much as the vector +itself. + +Audience: engineers familiar with RAG, embeddings, and vector DBs, curious about applying RAG to source code +specifically (not prose). + +--- + +## Section 1 — Why RAG for code is different from RAG for documents + +Most RAG systems are built around prose — PDFs, internal documentation, wikis... 
The content is natural language written +for humans, meaning is carried in sentences, and semantic search over plain text works well, and when you add second +stage retrieval (reranker), you get a system that can answer your questions with high confidence. +Software code is different: it's structured, symbolic, it's written for compilers and interpreters. Meaning is +distributed across structure, not sentences: + +- A function name (retryWithBackoff) carries intent +- The signature (attempts: int, delay_ms: int) carries contract +- The body carries implementation details +- Annotations (@Retryable, @CircuitBreaker) carry framework behavior +- The class it belongs to (OrderProcessingService) carries domain context + +None of that is a sentence. You can't chunk code by paragraph — you chunk by symbol (function, class, method). +Let's see how that is implemented in **semcode**. + +--- + +## Section 2 — From source files to Code Symbols - Tree-sitter parsing + +What is an AST? + +An Abstract Syntax Tree is a tree representation of source code's grammatical structure (the logical parts of the code and +how they relate to each other). Every construct in your code — +a function definition, a class, an if statement, a variable assignment — becomes a node in the tree, where parent-child +relationships express nesting and ownership. + +For clarity, below is a pruned AST, just to give you a mental model of how a parser sees +a function: a decorated async definition with typed parameters, a return annotation, and a body containing a +docstring and a single return. 
+ +```shell +@app.get("/users") +async def list_users(db: Session) -> list[User]: + """Return all users.""" + return db.query(User).all() + +module +└── decorated_definition + ├── decorator → "@app.get("/users")" + └── function_definition + ├── name → "list_users" + ├── parameters → "(db: Session)" + ├── return_type → "list[User]" + └── body + ├── expression_statement + │ └── string → '"""Return all users."""' + └── return_statement +``` + +What is Tree-sitter? + +Tree-sitter is a parser generator tool and an incremental parsing library. It can build a concrete syntax tree for a +source file and efficiently update the syntax tree as the source file is edited. +[Tree-sitter official documentation](https://tree-sitter.github.io/tree-sitter/) + +What is a Code Symbol in **semcode**? + +A symbol is one named, self-contained unit of code that a language considers meaningful — a function, a class, a method, +an interface, a React component, a hook... In **semcode** a symbol is a CodeSymbol dataclass, +which captures everything needed to search, understand, and locate it without reading the surrounding file. + +What a `CodeSymbol` carries: + +**name / symbol_type / language** — These uniquely describe what kind of thing this is (save, +method, java) so retrieval can filter by language or type before even looking at embeddings. + +**signature** — The declaration line only, e.g. *def save(self, db: Session) -> User*. This is what you'd see in an +IDE's autocomplete popup — compact enough to show in search results without including the full body. + +**source** — The complete raw text of the symbol from open brace to closing brace. This is what gets embedded into the +vector store, giving the model the full implementation context when a chunk is retrieved. + +**start_line / end_line** — Position recorded by Tree-sitter during parsing, used to link a search result back +to an exact location in the file. + +**parent_name / package** — Structural context. 
**parent_name** says which class owns this method; **package** says +which Java +package or Python module the file belongs to. Without these, two methods both named save in different services are +indistinguishable. + +**annotations / extras** — Language-specific enrichment. A Java @GetMapping("/users") lands in annotations; the +extracted +HTTP route string (GET /users) lands in extras. For TypeScript, extras flags whether a component uses hooks, or whether +a function matches the React component signature pattern. + +Example: + +```shell +CodeSymbol( + name="list_users", + symbol_type="api_route", + language="python", + source="async def list_users(db: Session) -> list[User]:\n ...", + file_path="auth-service/routers/users.py", + start_line=2, + end_line=4, + parent_name=None, + package="auth-service.routers.users", + annotations=["app.get(\"/users\")"], + signature="async def list_users (db: Session) -> list[User]", + docstring='"""Return all users."""', + extras={"is_async": True, "http_method": "GET", "http_route": "/users"}, +) +``` + +So the full pipeline is: +Tree-sitter parses code into an AST. The parser goes through that AST node by node, asks each node where it starts/ends +and what it contains, and puts all of that into a **CodeSymbol** — one symbol per meaningful language construct. +--- + +## Section 3 — Building the embedding input + +Now, having knowledge about **CodeSymbols**, we can build the input for a vector database. In **semcode** +[Qdrant](https://qdrant.tech/) is used to store vectors, and we have two types of inputs: dense and sparse. + +What are dense embeddings? + +**Dense embeddings** encode the *meaning* of text into a fixed-size vector of floating-point numbers — typically +hundreds or thousands of dimensions depending on which embedding provider is chosen. Two pieces of text that express the +same idea will land close together in that vector space even if they share no words in common. 
For code search this +means a query like "find the method that handles payment retries" can surface `retryWithBackoff()` +without those words appearing anywhere in the source. + +```shell +dense = [0.2, 0.3, 0.5, 0.7, ...] # several hundred floats +``` + +What are sparse embeddings? + +**Sparse embeddings** work the opposite way: instead of capturing meaning, they represent text as a large vocabulary +vector where almost every entry is zero and only the terms that actually appear get a non-zero weight. BM25 is the +algorithm behind this — it scores each token by how often it appears in a document relative to how common it +is across the whole corpus. This makes sparse embeddings excellent at exact keyword matching: if you search for +`PlaceOrderRequest` or `@Transactional`, BM25 will find every document that contains those tokens precisely. + +```shell +# Taken from Qdrant docs +sparse = [{331: 0.5}, {14136: 0.7}] # 20 key value pairs +# The numbers 331 and 14136 map to specific tokens in the vocabulary e.g. ['Transactional', 'PlaceOrderRequest']. +# The rest of the values are zero. This is why it’s called a sparse vector. +``` + +How does **semcode** build the dense input? + +The whole `CodeSymbol` object is not embedded directly — it is first serialized into a single text string, and that +string is what the embedding model sees. One symbol produces one string, which produces one vector: an array of +floating-point numbers (e.g. 768 or 3072 floats depending on the provider). The `CodeSymbol` fields that carry +*meaning* go into that string. +It starts with a human-readable preamble that names the language, symbol type, parent class, and owning service, then +layers in framework-specific metadata — Spring stereotypes, HTTP method and route, annotations — followed by a truncated +docstring and the full signature. Finally, the raw source body is appended, capped at ~6,000 characters (~1,500 +tokens). 
The goal is to give the embedding model everything it would need to understand the symbol's role, not just +its implementation. +The fields that are useful for *displaying or filtering* results (like `start_line`, +`file_path`, or `parent_name`, `package`) are stored separately as the Qdrant **payload** — they sit next to the vector +but are never embedded. + +How does **semcode** build the sparse input? + +Building BM25 text input is minimal — it concatenates only the signature, docstring, and raw source, with no metadata. +It splits camelCase and snake_case identifiers into their component words while keeping the original form alongside. A +token like `PlaceOrderRequest`becomes `Place Order Request` — so BM25 can match the exact identifier *and* a +natural-language query like "place order request" that doesn't use the original casing. + +So the full picture is: +Every `CodeSymbol` produces two inputs. The dense input is wide and context-rich — it tells the model the symbol's +place in the system. The sparse input is narrow and literal — it gives BM25 the exact tokens to match against. Both +are computed in the same pipeline step and stored together as a single point in Qdrant. 
+ +--- + +## Section 4 — The sparse side: BM25 with code-aware tokenization + +- BM25 input is intentionally coarser: signature + docstring + source only + - Reference: `server/indexer/pipeline.py:94-101` +- Identifier expansion: `CamelCase` and `snake_case` are split so BM25 can match partial queries + - Both original and split forms kept → "PlaceOrderRequest" matches exact lookups *and* "place order" + - Reference: `server/embeddings/code_tokenizer.py:6-16` +- Implementation: fastembed's `Bm25("Qdrant/bm25")`, stored as a native sparse vector in Qdrant + - Reference: `server/embeddings/bm25.py` +- What BM25 solves that dense doesn't: + - Exact symbol-name lookups + - Rare tokens (vocabulary mismatch — domain jargon, project-specific names) + - Queries that are *literal* references rather than intent descriptions + +--- + +## Section 5 — The dense side: pluggable embedding providers + +- Five providers, all behind one interface: Jina API (hosted), self-hosted Jina via TEI, OpenAI, Voyage, Ollama + - Reference: `server/embeddings/{jina_api,jina,openai,voyage,ollama}.py` +- Why pluggable matters for code: dimensions vary (768 → 3072), code-tuned models (jina-code-embeddings, voyage-code-3) + outperform general-purpose ones +- Optional callout: the factory pattern refactor (commit `cd778ee`) — each provider self-registers on import, so adding + a new one doesn't touch `factory.py` (OCP) + - Reference: `server/embeddings/__init__.py`, `server/embeddings/factory.py` + +--- + +## Section 6 — What goes into Qdrant: the named-vector schema + +- One collection (`code_symbols`) with **two named vectors per point**: + - `text-dense` — cosine, provider-dependent dims + - `text-sparse` — Qdrant native BM25 sparse index + - Reference: `server/store/qdrant.py:47-62` +- The payload (the underappreciated half of every vector DB): + - Identity: `symbol_name`, `symbol_type`, `language`, `service`, `file_path`, `package`, `parent_name` + - Display: `signature`, `source`, `docstring`, 
`start_line`, `end_line` + - Filtering: `annotations`, `chunk_tier`, framework `extras` (HTTP method, route, Spring stereotype) + - Bookkeeping: `file_hash` (for incremental reindex), `indexed_at` + - Reference: `server/indexer/pipeline.py:104-125` +- Keyword payload indexes on the high-cardinality filter fields → fast `language=python AND service=catalog` style + filters +- Separate `git_commits` collection — dense-only, message + diff metadata + +--- + +## Section 7 — Hybrid retrieval at query time (RRF in one Qdrant call) + +- The query goes through *both* encoders: dense (full model) and sparse (tokenizer + BM25) +- One Qdrant `query_points` call does the fusion server-side: + ``` + FusionQuery(fusion=Fusion.RRF), + prefetch=[ + Prefetch(query=dense_vec, using="text-dense", limit=K*2), + Prefetch(query=sparse_vec, using="text-sparse", limit=K*2), + ] + ``` + - Reference: `server/store/qdrant.py:203-223` +- How RRF works in one paragraph: each retriever returns a ranked list, RRF scores each doc by `Σ 1/(k + rank_i)`, ties + broken by combined rank. No tuning of weights needed. 
+- Why this beats weighted sum: scale-free, doesn't depend on score calibration between dense cosine and BM25 +- Reference: `server/tools/search.py:20-78` + +--- + +## Section 8 — Indexing flow: incremental, content-addressed + +- Walk the repo (GitHub API or local), apply excludes +- For each file: compute blob SHA → compare against payload's `file_hash` → skip if unchanged +- Parse → build dense + sparse inputs → batch-embed → upsert (delete-then-insert per file path) +- Cleanup pass removes stale symbols for files no longer in the repo +- Reference: `server/indexer/pipeline.py:128-249` +- Why this matters: embedding API costs amortize across reindexes; large monorepos stay tractable + +--- + +## Section 9 — Bonus: indexing git history as a second RAG corpus + +- Separate pipeline embeds **commit messages + file deltas** into the `git_commits` collection +- Dense-only (commit messages are short, sparse adds little) +- Enables "when was retry logic introduced?" style queries +- Reference: `server/indexer/git_history.py:24-63`, `server/tools/history.py` + +--- + +## Section 10 — What I'd do differently / open questions + +- Re-ranker on top of RRF (cross-encoder) — worth the latency? +- Per-language collections vs single collection — when does the trade-off flip? +- Embedding the *call graph* (cross-symbol relationships), not just symbols in isolation +- Tuning the 6000-char source cap per language + +--- + +## Section 11 — Takeaways + +- Symbol-level chunking + rich, language-aware embedding inputs are the foundation +- Hybrid dense+sparse with RRF gives you both "intent" and "exact name" search for free, server-side +- The payload is half the system — invest in it +- Incremental indexing via blob SHAs is what makes this affordable at repo scale + +--- + +## Appendix — Suggested diagrams + +1. Pipeline overview: file → Tree-sitter → `CodeSymbol` → dense input + sparse input → Qdrant +2. Qdrant point anatomy: two named vectors + payload fields, annotated +3. 
Query-time RRF: query → two encoders → two ranked lists → fused result + +## Reference + +https://qdrant.tech/articles/sparse-vectors/ From efcb7482e2cfa3e0dceebe095d4d72276b9892ce Mon Sep 17 00:00:00 2001 From: Nemanja Date: Tue, 19 May 2026 09:54:59 +0200 Subject: [PATCH 2/6] Improve section 3 and remove section 4 --- blog.md | 49 ++++++++++++++----------------------------------- 1 file changed, 14 insertions(+), 35 deletions(-) diff --git a/blog.md b/blog.md index 0989d54..9509d84 100644 --- a/blog.md +++ b/blog.md @@ -180,9 +180,16 @@ How does **semcode** build the sparse input? Building BM25 text input is minimal — it concatenates only the signature, docstring, and raw source, with no metadata. It splits camelCase and snake_case identifiers into their component words while keeping the original form alongside. A -token like `PlaceOrderRequest`becomes `Place Order Request` — so BM25 can match the exact identifier *and* a +token like `PlaceOrderRequest` becomes `Place Order Request` — so BM25 can match the exact identifier *and* a natural-language query like "place order request" that doesn't use the original casing. +Why does sparse matter when the dense input is already rich? Dense embeddings excel at intent — a query like "find +the method that retries payments" can surface `retryWithBackoff` even if no query word appears in the source — but that +power trades precision for meaning, and rare or project-specific identifiers like `PlaceOrderRequest` get smoothed +toward neighboring concepts in the model's vector space. BM25 fills exactly that gap: it matches tokens literally with +no compression, and **semcode's** code-aware tokenization splits `PlaceOrderRequest` into `Place Order Request` alongside +the original, so it handles both exact identifier lookups and natural-language queries that dense alone would miss. + So the full picture is: Every `CodeSymbol` produces two inputs. 
The dense input is wide and context-rich — it tells the model the symbol's place in the system. The sparse input is narrow and literal — it gives BM25 the exact tokens to match against. Both @@ -190,35 +197,7 @@ are computed in the same pipeline step and stored together as a single point in --- -## Section 4 — The sparse side: BM25 with code-aware tokenization - -- BM25 input is intentionally coarser: signature + docstring + source only - - Reference: `server/indexer/pipeline.py:94-101` -- Identifier expansion: `CamelCase` and `snake_case` are split so BM25 can match partial queries - - Both original and split forms kept → "PlaceOrderRequest" matches exact lookups *and* "place order" - - Reference: `server/embeddings/code_tokenizer.py:6-16` -- Implementation: fastembed's `Bm25("Qdrant/bm25")`, stored as a native sparse vector in Qdrant - - Reference: `server/embeddings/bm25.py` -- What BM25 solves that dense doesn't: - - Exact symbol-name lookups - - Rare tokens (vocabulary mismatch — domain jargon, project-specific names) - - Queries that are *literal* references rather than intent descriptions - ---- - -## Section 5 — The dense side: pluggable embedding providers - -- Five providers, all behind one interface: Jina API (hosted), self-hosted Jina via TEI, OpenAI, Voyage, Ollama - - Reference: `server/embeddings/{jina_api,jina,openai,voyage,ollama}.py` -- Why pluggable matters for code: dimensions vary (768 → 3072), code-tuned models (jina-code-embeddings, voyage-code-3) - outperform general-purpose ones -- Optional callout: the factory pattern refactor (commit `cd778ee`) — each provider self-registers on import, so adding - a new one doesn't touch `factory.py` (OCP) - - Reference: `server/embeddings/__init__.py`, `server/embeddings/factory.py` - ---- - -## Section 6 — What goes into Qdrant: the named-vector schema +## Section 4 — What goes into Qdrant: the named-vector schema - One collection (`code_symbols`) with **two named vectors per point**: - `text-dense` — 
cosine, provider-dependent dims @@ -236,7 +215,7 @@ are computed in the same pipeline step and stored together as a single point in --- -## Section 7 — Hybrid retrieval at query time (RRF in one Qdrant call) +## Section 5 — Hybrid retrieval at query time (RRF in one Qdrant call) - The query goes through *both* encoders: dense (full model) and sparse (tokenizer + BM25) - One Qdrant `query_points` call does the fusion server-side: @@ -255,7 +234,7 @@ are computed in the same pipeline step and stored together as a single point in --- -## Section 8 — Indexing flow: incremental, content-addressed +## Section 6 — Indexing flow: incremental, content-addressed - Walk the repo (GitHub API or local), apply excludes - For each file: compute blob SHA → compare against payload's `file_hash` → skip if unchanged @@ -266,7 +245,7 @@ are computed in the same pipeline step and stored together as a single point in --- -## Section 9 — Bonus: indexing git history as a second RAG corpus +## Section 7 — Bonus: indexing git history as a second RAG corpus - Separate pipeline embeds **commit messages + file deltas** into the `git_commits` collection - Dense-only (commit messages are short, sparse adds little) @@ -275,7 +254,7 @@ are computed in the same pipeline step and stored together as a single point in --- -## Section 10 — What I'd do differently / open questions +## Section 8 — What I'd do differently / open questions - Re-ranker on top of RRF (cross-encoder) — worth the latency? - Per-language collections vs single collection — when does the trade-off flip? 
@@ -284,7 +263,7 @@ are computed in the same pipeline step and stored together as a single point in --- -## Section 11 — Takeaways +## Section 9 — Takeaways - Symbol-level chunking + rich, language-aware embedding inputs are the foundation - Hybrid dense+sparse with RRF gives you both "intent" and "exact name" search for free, server-side From 491e098ecb3a6670220231ab9f96a83af5620903 Mon Sep 17 00:00:00 2001 From: Nemanja Date: Fri, 22 May 2026 09:18:17 +0200 Subject: [PATCH 3/6] chore: Add section 4: what goes into Qdrant --- blog.md | 65 +++++++++++++++++++++++++++++++++++++++++++-------------- 1 file changed, 49 insertions(+), 16 deletions(-) diff --git a/blog.md b/blog.md index 9509d84..4a80b36 100644 --- a/blog.md +++ b/blog.md @@ -172,9 +172,9 @@ layers in framework-specific metadata — Spring stereotypes, HTTP method and ro docstring and the full signature. Finally, the raw source body is appended, capped at ~6,000 characters (~1,500 tokens). The goal is to give the embedding model everything it would need to understand the symbol's role, not just its implementation. -The fields that are useful for *displaying or filtering* results (like `start_line`, -`file_path`, or `parent_name`, `package`) are stored separately as the Qdrant **payload** — they sit next to the vector -but are never embedded. +The fields that are useful for *displaying* results (like `start_line`, `end_line`, `file_path`, `signature`, `source`) +or *filtering* them (like `language`, `service`, `symbol_type`) are stored separately as the Qdrant **payload** — +they sit next to the vector but are never embedded. How does **semcode** build the sparse input? 
@@ -199,19 +199,52 @@ are computed in the same pipeline step and stored together as a single point in ## Section 4 — What goes into Qdrant: the named-vector schema -- One collection (`code_symbols`) with **two named vectors per point**: - - `text-dense` — cosine, provider-dependent dims - - `text-sparse` — Qdrant native BM25 sparse index - - Reference: `server/store/qdrant.py:47-62` -- The payload (the underappreciated half of every vector DB): - - Identity: `symbol_name`, `symbol_type`, `language`, `service`, `file_path`, `package`, `parent_name` - - Display: `signature`, `source`, `docstring`, `start_line`, `end_line` - - Filtering: `annotations`, `chunk_tier`, framework `extras` (HTTP method, route, Spring stereotype) - - Bookkeeping: `file_hash` (for incremental reindex), `indexed_at` - - Reference: `server/indexer/pipeline.py:104-125` -- Keyword payload indexes on the high-cardinality filter fields → fast `language=python AND service=catalog` style - filters -- Separate `git_commits` collection — dense-only, message + diff metadata +In Section 3 it's explained that we have two inputs per symbol — dense and sparse — stored together in Qdrant. +This section explains *how* they are stored: the shape of a single stored point and why that shape matters at query time. + +### Named vectors: two vectors, one point + +Qdrant lets a single point carry multiple vectors under distinct names, each with its own distance metric and index. +**semcode** uses this directly: the `code_symbols` collection defines two named vectors per point. + +- `text-dense` — cosine distance, dimensionality set by the embedding provider. +- `text-sparse` — Qdrant's native BM25 sparse index. + +The advantage of named vectors over two parallel collections is that one point ID identifies one symbol everywhere. +Dense and sparse retrievers always agree on what "document 42" means, which is what makes server-side fusion (next +section) possible in a single round-trip. 
+ +### Anatomy of a stored point + +Alongside the two vectors, there is the payload — the non-embedded half of the point. +Payload is a JSON object with the following fields: + +- **Identity & filtering** — `symbol_name`, `symbol_type`, `language`, `service`, + `file_path`, `package`, `parent_name`. These uniquely place the symbol in + the repo, and three of them — `language`, `service`, `symbol_type` — are + wired as active query-time filters. +- **Display** — `signature`, `source`, `docstring`, `start_line`, `end_line`, + `annotations`, `extras` (HTTP method, route, Spring stereotype). These are + what the MCP client renders back to the user — they are never filtered on, + just returned alongside the score (`server/tools/search.py:60-71`). +- **Bookkeeping** — `file_hash`, `indexed_at`. Not exposed at query time, but + critical for the incremental reindex flow: the hash is how the pipeline + decides a file hasn't changed and can be skipped (`server/indexer/pipeline.py:122-123`). + +### Payload indexes: filters before vectors + +By default, when you search Qdrant, it scores vectors first and filters results afterward. That means if you ask for +"OAuth 2.0 implementation in payment-service", Qdrant would still compare your query vector against *every* stored +symbol — then throw away the ones that don't match. + +Payload indexes flip this order. **semcode** indexes six fields — `language`, `service`, `symbol_type`, `chunk_tier`, +`parent_name`, `file_path` — so Qdrant can narrow the candidate set *before* any vector math happens. The +vector search then runs only over the matching symbols, not the whole collection. + +### A second, simpler collection + +Code symbols aren't the only RAG corpus in **semcode**. A separate `git_commits` collection stores commit messages and +diff metadata as dense-only points. 
--- From fcfa7f5279c2d1d91d2b47e46f4d744017b51d9f Mon Sep 17 00:00:00 2001 From: Nemanja Date: Tue, 26 May 2026 22:05:38 +0200 Subject: [PATCH 4/6] chore: Add section explaining RRF --- blog.md | 48 ++++++++++++++++++++++++++++++------------------ 1 file changed, 30 insertions(+), 18 deletions(-) diff --git a/blog.md b/blog.md index 4a80b36..48db017 100644 --- a/blog.md +++ b/blog.md @@ -187,7 +187,8 @@ Why does sparse matter when the dense input is already rich? Dense embeddings ex the method that retries payments" can surface `retryWithBackoff` even if no query word appears in the source — but that power trades precision for meaning, and rare or project-specific identifiers like `PlaceOrderRequest` get smoothed toward neighboring concepts in the model's vector space. BM25 fills exactly that gap: it matches tokens literally with -no compression, and **semcode's** code-aware tokenization splits `PlaceOrderRequest` into `Place Order Request` alongside +no compression, and **semcode's** code-aware tokenization splits `PlaceOrderRequest` into `Place Order Request` +alongside the original, so it handles both exact identifier lookups and natural-language queries that dense alone would miss. So the full picture is: @@ -200,7 +201,8 @@ are computed in the same pipeline step and stored together as a single point in ## Section 4 — What goes into Qdrant: the named-vector schema In Section 3 it's explained that we have two inputs per symbol — dense and sparse — stored together in Qdrant. -This section explains *how* they are stored: the shape of a single stored point and why that shape matters at query time. +This section explains *how* they are stored: the shape of a single stored point and why that shape matters at query +time. ### Named vectors: two vectors, one point @@ -248,23 +250,32 @@ diff metadata as dense-only points. 
--- -## Section 5 — Hybrid retrieval at query time (RRF in one Qdrant call) - -- The query goes through *both* encoders: dense (full model) and sparse (tokenizer + BM25) -- One Qdrant `query_points` call does the fusion server-side: - ``` - FusionQuery(fusion=Fusion.RRF), - prefetch=[ - Prefetch(query=dense_vec, using="text-dense", limit=K*2), - Prefetch(query=sparse_vec, using="text-sparse", limit=K*2), - ] - ``` - - Reference: `server/store/qdrant.py:203-223` -- How RRF works in one paragraph: each retriever returns a ranked list, RRF scores each doc by `Σ 1/(k + rank_i)`, ties - broken by combined rank. No tuning of weights needed. -- Why this beats weighted sum: scale-free, doesn't depend on score calibration between dense cosine and BM25 -- Reference: `server/tools/search.py:20-78` +## Section 5 — Hybrid retrieval at query time +At query time, the same two-track split like in the ingestion phase runs in reverse. The query string goes through both +encoders — the dense model turns it into a floating-point vector, the BM25 turns it into a sparse vector. +Both are sent to Qdrant in a single call, which runs each retriever independently, ranks the top K×2 candidates +from each, and produces two separate ranked lists. + +Qdrant then uses **Reciprocal Rank Fusion (RRF)** to merge those two ranked lists into one before returning the +final top K results. The merge looks like this step by step, using the query _"find the method that retries +failed payments"_ as an example: + +1. Dense retriever returns its ranked list: + `[retryWithBackoff (rank 1), processPayment (rank 2), PlaceOrderRequest (rank 3), ...]` +2. Sparse retriever returns its ranked list: + `[PlaceOrderRequest (rank 1), retryWithBackoff (rank 2), handleTimeout (rank 3), ...]` +3. RRF scores each result with `1 / (k + rank)` from every list it appears in, then sums those contributions +4. 
Everything is re-sorted by that combined score → one final list: + `[retryWithBackoff, PlaceOrderRequest, processPayment, handleTimeout, ...]` + +`retryWithBackoff` ranked first in dense and second in sparse — both retrievers agreed, so it floats to the top. +`PlaceOrderRequest` ranked first in sparse (exact token match) but third in dense — it still surfaces near the top +because the sparse retriever was confident. `processPayment` only appeared in one list despite a good dense rank, +so it scores lower. + +RRF rewards consistent rank across retrievers. The score it produces answers a simpler question: +"how consistently did this result appear near the top across both dense and sparse retrievers?" --- ## Section 6 — Indexing flow: incremental, content-addressed @@ -314,3 +325,4 @@ diff metadata as dense-only points. ## Reference https://qdrant.tech/articles/sparse-vectors/ +https://www.elastic.co/docs/reference/elasticsearch/rest-apis/reciprocal-rank-fusion \ No newline at end of file From dd45324dc0cb069a82cc27190f12abe7d8fe2c29 Mon Sep 17 00:00:00 2001 From: Nemanja Date: Sun, 31 May 2026 21:22:45 +0200 Subject: [PATCH 5/6] feat: Add section on indexing flow --- blog.md | 75 ++++++++++++++++++++++++++++++++------------------------- 1 file changed, 42 insertions(+), 33 deletions(-) diff --git a/blog.md b/blog.md index 48db017..66c0c62 100644 --- a/blog.md +++ b/blog.md @@ -250,7 +250,46 @@ diff metadata as dense-only points. --- -## Section 5 — Hybrid retrieval at query time +## Section 5 — Indexing flow: incremental, content-addressed + +Embedding API calls are the dominant cost in any indexing run, and re-embedding an entire repository on every push would +be expensive at scale. **semcode** avoids this by treating indexing as a diff operation: it uses git blob +SHAs as content fingerprints to identify which files have changed, and only those files are parsed, embedded, and +upserted. 
A service with 1,000 files where 10 changed re-embeds only the symbols in those 10 files, not all 1,000. This section describes +the full indexing pipeline. + +### Step 1 — Discovery via the Git Trees API + +The pipeline opens by calling GitHub's Trees API. One request returns every file in the repository tree. Crucially, +each entry already includes the git `blob_sha` — git's own content hash for that file +— without downloading a single byte of source code. + +### Step 2 — Hash comparison before any file download + +Before fetching any file content, the pipeline loads the `file_hash` values stored in the Qdrant payload for all +already-indexed symbols in this service. It then compares each file's `blob_sha` +against that map. If the hashes match, the file is skipped entirely — no HTTP download, no parsing, no embedding call. +This is the core of the incremental design — instead of re-embedding every symbol on every run, only files whose content +actually changed are embedded again. + +### Step 3 — Fetch, parse, embed, upsert + +For every file that is new or has a changed blob SHA, the pipeline fetches the content by SHA, +parses it into `CodeSymbol` objects, builds both dense and sparse inputs as described in Section 3, +and calls both embedding providers in a batch. + +The upsert is a **delete-then-insert at the file level**: all existing points whose `file_path` matches are removed +first, then the freshly embedded points are inserted. This keeps the index clean when a file loses methods, +gains new ones, or is restructured. + +### Step 4 — Cleanup pass for deleted files + +After the main loop, the pipeline diffs the current repo file set against every `file_path` that exists in Qdrant. +Points for any path no longer present in the repo are deleted. + +--- + +## Section 6 — Hybrid retrieval at query time At query time, the same two-track split like in the ingestion phase runs in reverse. 
The query string goes through both encoders — the dense model turns it into a floating-point vector, the BM25 turns it into a sparse vector. @@ -258,8 +297,7 @@ Both are sent to Qdrant in a single call, which runs each retriever independentl from each, and produces two separate ranked lists. Qdrant then uses **Reciprocal Rank Fusion (RRF)** to merge those two ranked lists into one before returning the -final top K results. The merge looks like this step by step, using the query _"find the method that retries -failed payments"_ as an example: +final top K results. For example, using the query _"find the method that retries failed payments"_ merge looks like this: 1. Dense retriever returns its ranked list: `[retryWithBackoff (rank 1), processPayment (rank 2), PlaceOrderRequest (rank 3), ...]` @@ -278,36 +316,7 @@ RRF rewards consistent rank across retrievers. The score it produces answers a s "how consistently did this result appear near the top across both dense and sparse retrievers?" --- -## Section 6 — Indexing flow: incremental, content-addressed - -- Walk the repo (GitHub API or local), apply excludes -- For each file: compute blob SHA → compare against payload's `file_hash` → skip if unchanged -- Parse → build dense + sparse inputs → batch-embed → upsert (delete-then-insert per file path) -- Cleanup pass removes stale symbols for files no longer in the repo -- Reference: `server/indexer/pipeline.py:128-249` -- Why this matters: embedding API costs amortize across reindexes; large monorepos stay tractable - ---- - -## Section 7 — Bonus: indexing git history as a second RAG corpus - -- Separate pipeline embeds **commit messages + file deltas** into the `git_commits` collection -- Dense-only (commit messages are short, sparse adds little) -- Enables "when was retry logic introduced?" 
style queries -- Reference: `server/indexer/git_history.py:24-63`, `server/tools/history.py` - ---- - -## Section 8 — What I'd do differently / open questions - -- Re-ranker on top of RRF (cross-encoder) — worth the latency? -- Per-language collections vs single collection — when does the trade-off flip? -- Embedding the *call graph* (cross-symbol relationships), not just symbols in isolation -- Tuning the 6000-char source cap per language - ---- - -## Section 9 — Takeaways +## Section 7 — Takeaways - Symbol-level chunking + rich, language-aware embedding inputs are the foundation - Hybrid dense+sparse with RRF gives you both "intent" and "exact name" search for free, server-side From c71482fd6dc8e715348bfff304427c0b386fbce5 Mon Sep 17 00:00:00 2001 From: Nemanja Date: Sun, 31 May 2026 21:40:41 +0200 Subject: [PATCH 6/6] feat: Add conclustion section --- blog.md | 29 +++++++++++++++-------------- 1 file changed, 15 insertions(+), 14 deletions(-) diff --git a/blog.md b/blog.md index 66c0c62..af6d3ec 100644 --- a/blog.md +++ b/blog.md @@ -297,7 +297,8 @@ Both are sent to Qdrant in a single call, which runs each retriever independentl from each, and produces two separate ranked lists. Qdrant then uses **Reciprocal Rank Fusion (RRF)** to merge those two ranked lists into one before returning the -final top K results. For example, using the query _"find the method that retries failed payments"_ merge looks like this: +final top K results. For example, using the query _"find the method that retries failed payments"_ merge looks like +this: 1. Dense retriever returns its ranked list: `[retryWithBackoff (rank 1), processPayment (rank 2), PlaceOrderRequest (rank 3), ...]` @@ -316,22 +317,22 @@ RRF rewards consistent rank across retrievers. The score it produces answers a s "how consistently did this result appear near the top across both dense and sparse retrievers?" 
--- -## Section 7 — Takeaways +## Conclusion -- Symbol-level chunking + rich, language-aware embedding inputs are the foundation -- Hybrid dense+sparse with RRF gives you both "intent" and "exact name" search for free, server-side -- The payload is half the system — invest in it -- Incremental indexing via blob SHAs is what makes this affordable at repo scale +Building a RAG system for code has its own challenges. It is not just RAG with different file types: +it requires rethinking every layer of the pipeline, from how you chunk (by symbol, not paragraph) +to how you embed (rich context for dense vectors, exact tokens for sparse vectors) to how you store +(named vectors with a payload that carries as much signal as the vectors themselves). Hybrid +dense+sparse retrieval with server-side RRF bridges the gap between intent-based queries and exact identifier lookups, +giving you both in a single round-trip. The payload is half the system: without language, service, and type fields +indexed as filters, every search scans the entire collection regardless of how good the vectors are. And without +incremental indexing via blob SHAs, the embedding cost alone would make continuous reindexing impractical at any serious +repository scale. Together these choices form a pipeline that stays accurate, stays fast, and stays affordable as the +codebase grows. --- -## Appendix — Suggested diagrams - -1. Pipeline overview: file → Tree-sitter → `CodeSymbol` → dense input + sparse input → Qdrant -2. Qdrant point anatomy: two named vectors + payload fields, annotated -3. 
Query-time RRF: query → two encoders → two ranked lists → fused result - ## Reference -https://qdrant.tech/articles/sparse-vectors/ -https://www.elastic.co/docs/reference/elasticsearch/rest-apis/reciprocal-rank-fusion \ No newline at end of file +[Sparse Vectors](https://qdrant.tech/articles/sparse-vectors/) +[Reciprocal Rank Fusion (RRF)](https://www.elastic.co/docs/reference/elasticsearch/rest-apis/reciprocal-rank-fusion) \ No newline at end of file
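
As a rough illustration of the RRF merge walked through in the hybrid retrieval section, here is a minimal Python sketch. It is not semcode's or Qdrant's implementation (Qdrant performs the fusion server-side); the function name `rrf_fuse` and the constant `k=60` are assumptions chosen for illustration, and the two ranked lists are the ones from the example query.

```python
from collections import defaultdict


def rrf_fuse(ranked_lists, k=60):
    """Fuse several ranked lists with Reciprocal Rank Fusion.

    Each item receives 1 / (k + rank) from every list it appears in
    (ranks start at 1); the contributions are summed and the items are
    returned sorted by that combined score, best first.
    `k` is a smoothing constant; 60 is an assumed illustrative value.
    """
    scores = defaultdict(float)
    for ranked in ranked_lists:
        for rank, item in enumerate(ranked, start=1):
            scores[item] += 1.0 / (k + rank)
    return sorted(scores, key=lambda item: scores[item], reverse=True)


# Ranked lists from the example query in the post.
dense = ["retryWithBackoff", "processPayment", "PlaceOrderRequest"]
sparse = ["PlaceOrderRequest", "retryWithBackoff", "handleTimeout"]

fused = rrf_fuse([dense, sparse])
print(fused)
```

Only ranks enter the score, never the raw cosine or BM25 values, which is why the two retrievers need no score calibration against each other.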