From ae88a4bb1952850d29f66ae24feaf079f0128137 Mon Sep 17 00:00:00 2001
From: Nemanja <nemanjavasa@gmail.com>
Date: Mon, 18 May 2026 09:14:54 +0200
Subject: [PATCH 1/6] feat: Add blog section

---
 .gitignore |   1 -
 blog.md    | 304 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 304 insertions(+), 1 deletion(-)
 create mode 100644 blog.md

diff --git a/.gitignore b/.gitignore
index d9e067b..8fe139c 100644
--- a/.gitignore
+++ b/.gitignore
@@ -21,7 +21,6 @@ build/
 *.iml
 .idea/
 
-blog.md
 
 # Local config (use config.example.yaml as a template)
 config.yaml
diff --git a/blog.md b/blog.md
new file mode 100644
index 0000000..0989d54
--- /dev/null
+++ b/blog.md
@@ -0,0 +1,304 @@
+# Blog Post Plan: How semcode Builds a RAG System for Code Search
+
+## Context
+
+This blog post explains the RAG (retrieval-augmented generation) pipeline behind
+[**semcode**](https://github.com/GoodbyePlanet/semcode), an MCP server that does
+semantic code search across your GitHub repositories. It covers both parts of the pipeline. The
+**ingestion** side is how repositories are found, how code is parsed into symbols with Tree-sitter,
+how dense and sparse embedding inputs are constructed, and how points land in Qdrant incrementally.
+The **retrieval** side is how queries are encoded into both dense and sparse vectors and fused
+server-side with RRF (Reciprocal Rank Fusion). Along the way we'll cover why a hybrid dense+sparse
+approach beats either one alone for code, and why the *payload* stored next to each vector matters
+as much as the vector itself.
+
+Audience: engineers familiar with RAG, embeddings, and vector DBs, curious about applying RAG to source code
+specifically (not prose).
+
+---
+
+## Section 1 — Why RAG for code is different from RAG for documents
+
+Most RAG systems are built around prose: PDFs, internal documentation, wikis. The content is natural language written
+for humans, so meaning is carried in sentences and semantic search over plain text works well. Add a second-stage
+retriever (a reranker) and you get a system that can answer questions with high confidence.
+Source code is different: it is structured and symbolic, written for compilers and interpreters. Meaning is
+distributed across structure, not sentences:
+
+- A function name (retryWithBackoff) carries intent
+- The signature (attempts: int, delay_ms: int) carries contract
+- The body carries implementation details
+- Annotations (@Retryable, @CircuitBreaker) carry framework behavior
+- The class it belongs to (OrderProcessingService) carries domain context
+
+None of that is a sentence. You can't chunk code by paragraph — you chunk by symbol (function, class, method).
+Let's see how that is implemented in **semcode**.
+
+---
+
+## Section 2 — From source files to Code Symbols - Tree-sitter parsing
+
+What is an AST?
+
+An Abstract Syntax Tree is a tree representation of source code's grammatical structure: the logical parts of the code
+and how they relate to each other. Every construct in your code (a function definition, a class, an if statement,
+a variable assignment) becomes a node in the tree, where parent-child relationships express nesting and
+ownership.
+
+For clarity, below is a pruned AST. It is meant to give you a mental model of how a parser sees
+a function: a decorated async definition with typed parameters, a return annotation, and a body containing a
+docstring and a single return.
+
+```text
+@app.get("/users")
+async def list_users(db: Session) -> list[User]:
+    """Return all users."""
+    return db.query(User).all()
+
+module
+└── decorated_definition
+    ├── decorator              → "@app.get("/users")"
+    └── function_definition
+        ├── name               → "list_users"
+        ├── parameters         → "(db: Session)"
+        ├── return_type        → "list[User]"
+        └── body
+            ├── expression_statement
+            │   └── string     → '"""Return all users."""'
+            └── return_statement
+```
+
+What is Tree-sitter?
+
+Tree-sitter is a parser generator tool and an incremental parsing library. It can build a concrete syntax tree for a
+source file and efficiently update the syntax tree as the source file is edited.
+[Tree-sitter official documentation](https://tree-sitter.github.io/tree-sitter/)
+
+What is a Code Symbol in **semcode**?
+
+A symbol is one named, self-contained unit of code that a language considers meaningful — a function, a class, a method,
+an interface, a React component, a hook... In **semcode** a symbol is a CodeSymbol dataclass,
+which captures everything needed to search, understand, and locate it without reading the surrounding file.
+
+What a `CodeSymbol` carries:
+
+**name / symbol_type / language** — These uniquely describe what kind of thing this is (save,
+method, java) so retrieval can filter by language or type before even looking at embeddings.
+
+**signature** — The declaration line only, e.g. *def save(self, db: Session) -> User*. This is what you'd see in an
+IDE's autocomplete popup — compact enough to show in search results without including the full body.
+
+**source** — The complete raw text of the symbol from open brace to closing brace. This is what gets embedded into the
+vector store, giving the model the full implementation context when a chunk is retrieved.
+
+**start_line / end_line** — Position recorded by Tree-sitter during parsing, used to link a search result back
+to an exact location in the file.
+
+**parent_name / package** — Structural context. **parent_name** says which class owns this method; **package** says
+which Java
+package or Python module the file belongs to. Without these, two methods both named save in different services are
+indistinguishable.
+
+**annotations / extras** — Language-specific enrichment. A Java @GetMapping("/users") lands in annotations; the
+extracted
+HTTP route string (GET /users) lands in extras. For TypeScript, extras flags whether a component uses hooks, or whether
+a function matches the React component signature pattern.
+
+Example:
+
+```python
+CodeSymbol(
+    name="list_users",
+    symbol_type="api_route",
+    language="python",
+    source="async def list_users(db: Session) -> list[User]:\n    ...",
+    file_path="auth-service/routers/users.py",
+    start_line=2,
+    end_line=4,
+    parent_name=None,
+    package="auth-service.routers.users",
+    annotations=["app.get(\"/users\")"],
+    signature="async def list_users(db: Session) -> list[User]",
+    docstring='"""Return all users."""',
+    extras={"is_async": True, "http_method": "GET", "http_route": "/users"},
+)
+```
+
+So the full pipeline is: Tree-sitter parses code into an AST. The parser walks that AST node by node, asks each node
+where it starts and ends and what it contains, and puts that into a **CodeSymbol**: one symbol per meaningful construct.
+
+---
+
+## Section 3 — Building the embedding input
+
+Now, knowing what a **CodeSymbol** is, we can build the input for a vector database. **semcode** uses
+[Qdrant](https://qdrant.tech/) to store vectors, and every symbol produces two types of input: dense and sparse.
+
+What are dense embeddings?
+
+**Dense embeddings** encode the *meaning* of text into a fixed-size vector of floating-point numbers — typically
+hundreds or thousands of dimensions depending on which embedding provider is chosen. Two pieces of text that express the
+same idea will land close together in that vector space even if they share no words in common. For code search this
+means a query like "find the method that handles payment retries" can surface `retryWithBackoff()`
+without those words appearing anywhere in the source.
+
+```python
+dense = [0.2, 0.3, 0.5, 0.7, ...]  # several hundred floats
+```
+
+What are sparse embeddings?
+
+**Sparse embeddings** work the opposite way: instead of capturing meaning, they represent text as a large vocabulary
+vector where almost every entry is zero and only the terms that actually appear get a non-zero weight. BM25 is the
+algorithm behind this — it scores each token by how often it appears in a document relative to how common it
+is across the whole corpus. This makes sparse embeddings excellent at exact keyword matching: if you search for
+`PlaceOrderRequest` or `@Transactional`, BM25 will find every document that contains those tokens precisely.
+
+```python
+# Taken from Qdrant docs
+sparse = [{331: 0.5}, {14136: 0.7}]  # 20 key value pairs
+# The numbers 331 and 14136 map to specific tokens in the vocabulary e.g. ['Transactional', 'PlaceOrderRequest'].
+# The rest of the values are zero. This is why it’s called a sparse vector.
+```
+
+How does **semcode** build the dense input?
+
+The whole `CodeSymbol` object is not embedded directly — it is first serialized into a single text string, and that
+string is what the embedding model sees. One symbol produces one string, which produces one vector: an array of
+floating-point numbers (e.g. 768 or 3072 floats depending on the provider). The `CodeSymbol` fields that carry
+*meaning* go into that string.
+It starts with a human-readable preamble that names the language, symbol type, parent class, and owning service, then
+layers in framework-specific metadata — Spring stereotypes, HTTP method and route, annotations — followed by a truncated
+docstring and the full signature. Finally, the raw source body is appended, capped at ~6,000 characters (~1,500
+tokens). The goal is to give the embedding model everything it would need to understand the symbol's role, not just
+its implementation.
+The fields that are useful for *displaying or filtering* results (like `start_line`,
+`file_path`, or `parent_name`, `package`) are stored separately as the Qdrant **payload** — they sit next to the vector
+but are never embedded.
+
+How does **semcode** build the sparse input?
+
+Building BM25 text input is minimal — it concatenates only the signature, docstring, and raw source, with no metadata.
+It splits camelCase and snake_case identifiers into their component words while keeping the original form alongside. A
+token like `PlaceOrderRequest`becomes `Place Order Request` — so BM25 can match the exact identifier *and* a
+natural-language query like "place order request" that doesn't use the original casing.
+
+So the full picture is:
+Every `CodeSymbol` produces two inputs. The dense input is wide and context-rich — it tells the model the symbol's
+place in the system. The sparse input is narrow and literal — it gives BM25 the exact tokens to match against. Both
+are computed in the same pipeline step and stored together as a single point in Qdrant.
+
+---
+
+## Section 4 — The sparse side: BM25 with code-aware tokenization
+
+- BM25 input is intentionally coarser: signature + docstring + source only
+    - Reference: `server/indexer/pipeline.py:94-101`
+- Identifier expansion: `CamelCase` and `snake_case` are split so BM25 can match partial queries
+    - Both original and split forms kept → "PlaceOrderRequest" matches exact lookups *and* "place order"
+    - Reference: `server/embeddings/code_tokenizer.py:6-16`
+- Implementation: fastembed's `Bm25("Qdrant/bm25")`, stored as a native sparse vector in Qdrant
+    - Reference: `server/embeddings/bm25.py`
+- What BM25 solves that dense doesn't:
+    - Exact symbol-name lookups
+    - Rare tokens (vocabulary mismatch — domain jargon, project-specific names)
+    - Queries that are *literal* references rather than intent descriptions
+
+---
+
+## Section 5 — The dense side: pluggable embedding providers
+
+- Five providers, all behind one interface: Jina API (hosted), self-hosted Jina via TEI, OpenAI, Voyage, Ollama
+    - Reference: `server/embeddings/{jina_api,jina,openai,voyage,ollama}.py`
+- Why pluggable matters for code: dimensions vary (768 → 3072), code-tuned models (jina-code-embeddings, voyage-code-3)
+  outperform general-purpose ones
+- Optional callout: the factory pattern refactor (commit `cd778ee`) — each provider self-registers on import, so adding
+  a new one doesn't touch `factory.py` (OCP)
+    - Reference: `server/embeddings/__init__.py`, `server/embeddings/factory.py`
+
+---
+
+## Section 6 — What goes into Qdrant: the named-vector schema
+
+- One collection (`code_symbols`) with **two named vectors per point**:
+    - `text-dense` — cosine, provider-dependent dims
+    - `text-sparse` — Qdrant native BM25 sparse index
+    - Reference: `server/store/qdrant.py:47-62`
+- The payload (the underappreciated half of every vector DB):
+    - Identity: `symbol_name`, `symbol_type`, `language`, `service`, `file_path`, `package`, `parent_name`
+    - Display: `signature`, `source`, `docstring`, `start_line`, `end_line`
+    - Filtering: `annotations`, `chunk_tier`, framework `extras` (HTTP method, route, Spring stereotype)
+    - Bookkeeping: `file_hash` (for incremental reindex), `indexed_at`
+    - Reference: `server/indexer/pipeline.py:104-125`
+- Keyword payload indexes on the high-cardinality filter fields → fast `language=python AND service=catalog` style
+  filters
+- Separate `git_commits` collection — dense-only, message + diff metadata
+
+---
+
+## Section 7 — Hybrid retrieval at query time (RRF in one Qdrant call)
+
+- The query goes through *both* encoders: dense (full model) and sparse (tokenizer + BM25)
+- One Qdrant `query_points` call does the fusion server-side:
+  ```
+  FusionQuery(fusion=Fusion.RRF),
+  prefetch=[
+      Prefetch(query=dense_vec, using="text-dense", limit=K*2),
+      Prefetch(query=sparse_vec, using="text-sparse", limit=K*2),
+  ]
+  ```
+    - Reference: `server/store/qdrant.py:203-223`
+- How RRF works in one paragraph: each retriever returns a ranked list, RRF scores each doc by `Σ 1/(k + rank_i)`, ties
+  broken by combined rank. No tuning of weights needed.
+- Why this beats weighted sum: scale-free, doesn't depend on score calibration between dense cosine and BM25
+- Reference: `server/tools/search.py:20-78`
+
+---
+
+## Section 8 — Indexing flow: incremental, content-addressed
+
+- Walk the repo (GitHub API or local), apply excludes
+- For each file: compute blob SHA → compare against payload's `file_hash` → skip if unchanged
+- Parse → build dense + sparse inputs → batch-embed → upsert (delete-then-insert per file path)
+- Cleanup pass removes stale symbols for files no longer in the repo
+- Reference: `server/indexer/pipeline.py:128-249`
+- Why this matters: embedding API costs amortize across reindexes; large monorepos stay tractable
+
+---
+
+## Section 9 — Bonus: indexing git history as a second RAG corpus
+
+- Separate pipeline embeds **commit messages + file deltas** into the `git_commits` collection
+- Dense-only (commit messages are short, sparse adds little)
+- Enables "when was retry logic introduced?" style queries
+- Reference: `server/indexer/git_history.py:24-63`, `server/tools/history.py`
+
+---
+
+## Section 10 — What I'd do differently / open questions
+
+- Re-ranker on top of RRF (cross-encoder) — worth the latency?
+- Per-language collections vs single collection — when does the trade-off flip?
+- Embedding the *call graph* (cross-symbol relationships), not just symbols in isolation
+- Tuning the 6000-char source cap per language
+
+---
+
+## Section 11 — Takeaways
+
+- Symbol-level chunking + rich, language-aware embedding inputs are the foundation
+- Hybrid dense+sparse with RRF gives you both "intent" and "exact name" search for free, server-side
+- The payload is half the system — invest in it
+- Incremental indexing via blob SHAs is what makes this affordable at repo scale
+
+---
+
+## Appendix — Suggested diagrams
+
+1. Pipeline overview: file → Tree-sitter → `CodeSymbol` → dense input + sparse input → Qdrant
+2. Qdrant point anatomy: two named vectors + payload fields, annotated
+3. Query-time RRF: query → two encoders → two ranked lists → fused result
+
+## Reference
+
+https://qdrant.tech/articles/sparse-vectors/

From efcb7482e2cfa3e0dceebe095d4d72276b9892ce Mon Sep 17 00:00:00 2001
From: Nemanja <nemanjavasa@gmail.com>
Date: Tue, 19 May 2026 09:54:59 +0200
Subject: [PATCH 2/6] Improve section 3 and remove section 4

---
 blog.md | 49 ++++++++++++++-----------------------------------
 1 file changed, 14 insertions(+), 35 deletions(-)

diff --git a/blog.md b/blog.md
index 0989d54..9509d84 100644
--- a/blog.md
+++ b/blog.md
@@ -180,9 +180,16 @@ How does **semcode** build the sparse input?
 
 Building BM25 text input is minimal — it concatenates only the signature, docstring, and raw source, with no metadata.
 It splits camelCase and snake_case identifiers into their component words while keeping the original form alongside. A
-token like `PlaceOrderRequest`becomes `Place Order Request` — so BM25 can match the exact identifier *and* a
+token like `PlaceOrderRequest` becomes `Place Order Request` — so BM25 can match the exact identifier *and* a
 natural-language query like "place order request" that doesn't use the original casing.
 
+Why does sparse matter when the dense input is already rich? Dense embeddings excel at intent — a query like "find
+the method that retries payments" can surface `retryWithBackoff` even if no query word appears in the source — but that
+power trades precision for meaning, and rare or project-specific identifiers like `PlaceOrderRequest` get smoothed
+toward neighboring concepts in the model's vector space. BM25 fills exactly that gap: it matches tokens literally with
+no compression, and **semcode's** code-aware tokenization splits `PlaceOrderRequest` into `Place Order Request` alongside
+the original, so it handles both exact identifier lookups and natural-language queries that dense alone would miss.
+
 So the full picture is:
 Every `CodeSymbol` produces two inputs. The dense input is wide and context-rich — it tells the model the symbol's
 place in the system. The sparse input is narrow and literal — it gives BM25 the exact tokens to match against. Both
@@ -190,35 +197,7 @@ are computed in the same pipeline step and stored together as a single point in
 
 ---
 
-## Section 4 — The sparse side: BM25 with code-aware tokenization
-
-- BM25 input is intentionally coarser: signature + docstring + source only
-    - Reference: `server/indexer/pipeline.py:94-101`
-- Identifier expansion: `CamelCase` and `snake_case` are split so BM25 can match partial queries
-    - Both original and split forms kept → "PlaceOrderRequest" matches exact lookups *and* "place order"
-    - Reference: `server/embeddings/code_tokenizer.py:6-16`
-- Implementation: fastembed's `Bm25("Qdrant/bm25")`, stored as a native sparse vector in Qdrant
-    - Reference: `server/embeddings/bm25.py`
-- What BM25 solves that dense doesn't:
-    - Exact symbol-name lookups
-    - Rare tokens (vocabulary mismatch — domain jargon, project-specific names)
-    - Queries that are *literal* references rather than intent descriptions
-
----
-
-## Section 5 — The dense side: pluggable embedding providers
-
-- Five providers, all behind one interface: Jina API (hosted), self-hosted Jina via TEI, OpenAI, Voyage, Ollama
-    - Reference: `server/embeddings/{jina_api,jina,openai,voyage,ollama}.py`
-- Why pluggable matters for code: dimensions vary (768 → 3072), code-tuned models (jina-code-embeddings, voyage-code-3)
-  outperform general-purpose ones
-- Optional callout: the factory pattern refactor (commit `cd778ee`) — each provider self-registers on import, so adding
-  a new one doesn't touch `factory.py` (OCP)
-    - Reference: `server/embeddings/__init__.py`, `server/embeddings/factory.py`
-
----
-
-## Section 6 — What goes into Qdrant: the named-vector schema
+## Section 4 — What goes into Qdrant: the named-vector schema
 
 - One collection (`code_symbols`) with **two named vectors per point**:
     - `text-dense` — cosine, provider-dependent dims
@@ -236,7 +215,7 @@ are computed in the same pipeline step and stored together as a single point in
 
 ---
 
-## Section 7 — Hybrid retrieval at query time (RRF in one Qdrant call)
+## Section 5 — Hybrid retrieval at query time (RRF in one Qdrant call)
 
 - The query goes through *both* encoders: dense (full model) and sparse (tokenizer + BM25)
 - One Qdrant `query_points` call does the fusion server-side:
@@ -255,7 +234,7 @@ are computed in the same pipeline step and stored together as a single point in
 
 ---
 
-## Section 8 — Indexing flow: incremental, content-addressed
+## Section 6 — Indexing flow: incremental, content-addressed
 
 - Walk the repo (GitHub API or local), apply excludes
 - For each file: compute blob SHA → compare against payload's `file_hash` → skip if unchanged
@@ -266,7 +245,7 @@ are computed in the same pipeline step and stored together as a single point in
 
 ---
 
-## Section 9 — Bonus: indexing git history as a second RAG corpus
+## Section 7 — Bonus: indexing git history as a second RAG corpus
 
 - Separate pipeline embeds **commit messages + file deltas** into the `git_commits` collection
 - Dense-only (commit messages are short, sparse adds little)
@@ -275,7 +254,7 @@ are computed in the same pipeline step and stored together as a single point in
 
 ---
 
-## Section 10 — What I'd do differently / open questions
+## Section 8 — What I'd do differently / open questions
 
 - Re-ranker on top of RRF (cross-encoder) — worth the latency?
 - Per-language collections vs single collection — when does the trade-off flip?
@@ -284,7 +263,7 @@ are computed in the same pipeline step and stored together as a single point in
 
 ---
 
-## Section 11 — Takeaways
+## Section 9 — Takeaways
 
 - Symbol-level chunking + rich, language-aware embedding inputs are the foundation
 - Hybrid dense+sparse with RRF gives you both "intent" and "exact name" search for free, server-side

From 491e098ecb3a6670220231ab9f96a83af5620903 Mon Sep 17 00:00:00 2001
From: Nemanja <nemanjavasa@gmail.com>
Date: Fri, 22 May 2026 09:18:17 +0200
Subject: [PATCH 3/6] chore: Add section 4: what goes into Qdrant

---
 blog.md | 65 +++++++++++++++++++++++++++++++++++++++++++--------------
 1 file changed, 49 insertions(+), 16 deletions(-)

diff --git a/blog.md b/blog.md
index 9509d84..4a80b36 100644
--- a/blog.md
+++ b/blog.md
@@ -172,9 +172,9 @@ layers in framework-specific metadata — Spring stereotypes, HTTP method and ro
 docstring and the full signature. Finally, the raw source body is appended, capped at ~6,000 characters (~1,500
 tokens). The goal is to give the embedding model everything it would need to understand the symbol's role, not just
 its implementation.
-The fields that are useful for *displaying or filtering* results (like `start_line`,
-`file_path`, or `parent_name`, `package`) are stored separately as the Qdrant **payload** — they sit next to the vector
-but are never embedded.
+The fields that are useful for *displaying* results (like `start_line`, `end_line`, `file_path`, `signature`, `source`)
+or *filtering* them (like `language`, `service`, `symbol_type`) are stored separately as the Qdrant **payload** —
+they sit next to the vector but are never embedded.
 
 How does **semcode** build the sparse input?
 
@@ -199,19 +199,52 @@ are computed in the same pipeline step and stored together as a single point in
 
 ## Section 4 — What goes into Qdrant: the named-vector schema
 
-- One collection (`code_symbols`) with **two named vectors per point**:
-    - `text-dense` — cosine, provider-dependent dims
-    - `text-sparse` — Qdrant native BM25 sparse index
-    - Reference: `server/store/qdrant.py:47-62`
-- The payload (the underappreciated half of every vector DB):
-    - Identity: `symbol_name`, `symbol_type`, `language`, `service`, `file_path`, `package`, `parent_name`
-    - Display: `signature`, `source`, `docstring`, `start_line`, `end_line`
-    - Filtering: `annotations`, `chunk_tier`, framework `extras` (HTTP method, route, Spring stereotype)
-    - Bookkeeping: `file_hash` (for incremental reindex), `indexed_at`
-    - Reference: `server/indexer/pipeline.py:104-125`
-- Keyword payload indexes on the high-cardinality filter fields → fast `language=python AND service=catalog` style
-  filters
-- Separate `git_commits` collection — dense-only, message + diff metadata
+In Section 3 it's explained that we have two inputs per symbol — dense and sparse — stored together in Qdrant.
+This section explains *how* they are stored: the shape of a single stored point and why that shape matters at query time.
+
+### Named vectors: two vectors, one point
+
+Qdrant lets a single point carry multiple vectors under distinct names, each with its own distance metric and index.
+**semcode** uses this directly: the `code_symbols` collection defines two named vectors per point.
+
+- `text-dense` — cosine distance, dimensionality set by the embedding provider.
+- `text-sparse` — Qdrant's native BM25 sparse index.
+
+The advantage of named vectors over two parallel collections is that one point ID identifies one symbol everywhere.
+Dense and sparse retrievers always agree on what "document 42" means, which is what makes server-side fusion (next
+section) possible in a single round-trip.
+
+### Anatomy of a stored point
+
+Alongside the two vectors, there is the payload — the non-embedded half of the point.
+Payload is a JSON object with the following fields:
+
+- **Identity & filtering** — `symbol_name`, `symbol_type`, `language`, `service`,
+  `file_path`, `package`, `parent_name`. These uniquely place the symbol in
+  the repo, and three of them — `language`, `service`, `symbol_type` — are
+  wired as active query-time filters.
+- **Display** — `signature`, `source`, `docstring`, `start_line`, `end_line`,
+  `annotations`, `extras` (HTTP method, route, Spring stereotype). These are
+  what the MCP client renders back to the user — they are never filtered on,
+  just returned alongside the score (`server/tools/search.py:60-71`).
+- **Bookkeeping** — `file_hash`, `indexed_at`. Not exposed at query time, but
+  critical for the incremental reindex flow: the hash is how the pipeline
+  decides a file hasn't changed and can be skipped (`server/indexer/pipeline.py:122-123`).
+
+### Payload indexes: filters before vectors
+
+Qdrant applies filters during the vector search rather than as a separate pass afterward. Without a payload index,
+though, a query like "OAuth 2.0 implementation in payment-service" forces it to check the `service` value of each
+candidate point it visits, and it has no cheap way to estimate how many symbols match the filter.
+
+Payload indexes fix that. **semcode** indexes six fields (`language`, `service`, `symbol_type`, `chunk_tier`,
+`parent_name`, `file_path`) so Qdrant can resolve the filter from the index and, when few symbols match, score only
+those symbols instead of traversing the whole collection.
+
+### A second, simpler collection
+
+Code symbols aren't the only RAG corpus in **semcode**. A separate `git_commits` collection stores commit messages and
+diff metadata as dense-only points.
 
 ---
 

From fcfa7f5279c2d1d91d2b47e46f4d744017b51d9f Mon Sep 17 00:00:00 2001
From: Nemanja <nemanjavasa@gmail.com>
Date: Tue, 26 May 2026 22:05:38 +0200
Subject: [PATCH 4/6] chore: Add section explaining RRF

---
 blog.md | 48 ++++++++++++++++++++++++++++++------------------
 1 file changed, 30 insertions(+), 18 deletions(-)

diff --git a/blog.md b/blog.md
index 4a80b36..48db017 100644
--- a/blog.md
+++ b/blog.md
@@ -187,7 +187,8 @@ Why does sparse matter when the dense input is already rich? Dense embeddings ex
 the method that retries payments" can surface `retryWithBackoff` even if no query word appears in the source — but that
 power trades precision for meaning, and rare or project-specific identifiers like `PlaceOrderRequest` get smoothed
 toward neighboring concepts in the model's vector space. BM25 fills exactly that gap: it matches tokens literally with
-no compression, and **semcode's** code-aware tokenization splits `PlaceOrderRequest` into `Place Order Request` alongside
+no compression, and **semcode's** code-aware tokenization splits `PlaceOrderRequest` into `Place Order Request`
+alongside
 the original, so it handles both exact identifier lookups and natural-language queries that dense alone would miss.
 
 So the full picture is:
@@ -200,7 +201,8 @@ are computed in the same pipeline step and stored together as a single point in
 ## Section 4 — What goes into Qdrant: the named-vector schema
 
 In Section 3 it's explained that we have two inputs per symbol — dense and sparse — stored together in Qdrant.
-This section explains *how* they are stored: the shape of a single stored point and why that shape matters at query time.
+This section explains *how* they are stored: the shape of a single stored point and why that shape matters at query
+time.
 
 ### Named vectors: two vectors, one point
 
@@ -248,23 +250,32 @@ diff metadata as dense-only points.
 
 ---
 
-## Section 5 — Hybrid retrieval at query time (RRF in one Qdrant call)
-
-- The query goes through *both* encoders: dense (full model) and sparse (tokenizer + BM25)
-- One Qdrant `query_points` call does the fusion server-side:
-  ```
-  FusionQuery(fusion=Fusion.RRF),
-  prefetch=[
-      Prefetch(query=dense_vec, using="text-dense", limit=K*2),
-      Prefetch(query=sparse_vec, using="text-sparse", limit=K*2),
-  ]
-  ```
-    - Reference: `server/store/qdrant.py:203-223`
-- How RRF works in one paragraph: each retriever returns a ranked list, RRF scores each doc by `Σ 1/(k + rank_i)`, ties
-  broken by combined rank. No tuning of weights needed.
-- Why this beats weighted sum: scale-free, doesn't depend on score calibration between dense cosine and BM25
-- Reference: `server/tools/search.py:20-78`
+## Section 5 — Hybrid retrieval at query time
 
+At query time, the same two-track split like in the ingestion phase runs in reverse. The query string goes through both
+encoders — the dense model turns it into a floating-point vector, the BM25 turns it into a sparse vector.
+Both are sent to Qdrant in a single call, which runs each retriever independently, ranks the top K×2 candidates
+from each, and produces two separate ranked lists.
+
+Qdrant then uses **Reciprocal Rank Fusion (RRF)** to merge those two ranked lists into one before returning the
+final top K results. The merge looks like this step by step, using the query _"find the method that retries
+failed payments"_ as an example:
+
+1. Dense retriever returns its ranked list:
+   `[retryWithBackoff (rank 1), processPayment (rank 2), PlaceOrderRequest (rank 3), ...]`
+2. Sparse retriever returns its ranked list:
+   `[PlaceOrderRequest (rank 1), retryWithBackoff (rank 2), handleTimeout (rank 3), ...]`
+3. RRF gives each result `1 / (k + rank)` for every list it appears in (`k` is a fixed constant) and sums those contributions
+4. Everything is re-sorted by that combined score → one final list:
+   `[retryWithBackoff, PlaceOrderRequest, processPayment, handleTimeout, ...]`
+
+`retryWithBackoff` ranked first in dense and second in sparse: both retrievers agreed, so it floats to the top.
+`PlaceOrderRequest` ranked first in sparse (exact token match) but third in dense. It still surfaces near the top
+because it appears in both lists and RRF adds the two contributions. `processPayment` appeared in only one list
+despite a good dense rank, so it scores lower.
+
+RRF rewards consistent rank across retrievers. The score it produces answers a simpler question:
+"how consistently did this result appear near the top across both dense and sparse retrievers?"
 ---
 
 ## Section 6 — Indexing flow: incremental, content-addressed
@@ -314,3 +325,4 @@ diff metadata as dense-only points.
 ## Reference
 
 https://qdrant.tech/articles/sparse-vectors/
+https://www.elastic.co/docs/reference/elasticsearch/rest-apis/reciprocal-rank-fusion
\ No newline at end of file

From dd45324dc0cb069a82cc27190f12abe7d8fe2c29 Mon Sep 17 00:00:00 2001
From: Nemanja <nemanjavasa@gmail.com>
Date: Sun, 31 May 2026 21:22:45 +0200
Subject: [PATCH 5/6] feat: Add section on indexing flow

---
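Note on Steps 2 and 4 of the new section: the skip and cleanup decisions can be sketched as one pure function
over two maps. This is a minimal illustration, not semcode's actual code. `IndexPlan` and `plan_index_run` are
hypothetical names, and the sketch assumes the tree listing and the stored `file_hash` values have already been
loaded into plain dictionaries keyed by file path.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IndexPlan:
    to_embed: list[str]   # new or changed paths: fetch, parse, embed, upsert
    to_skip: list[str]    # unchanged paths: no download, no embedding call
    to_delete: list[str]  # indexed paths that no longer exist in the repo


def plan_index_run(tree: dict[str, str], indexed: dict[str, str]) -> IndexPlan:
    """Compare blob SHAs from the repo tree against hashes stored in the index.

    tree:    file path -> blob SHA reported for the current repo tree
    indexed: file path -> file_hash stored in the payload of indexed points
    (Both inputs are assumptions of this sketch.)
    """
    to_embed = sorted(p for p, sha in tree.items() if indexed.get(p) != sha)
    to_skip = sorted(p for p, sha in tree.items() if indexed.get(p) == sha)
    to_delete = sorted(p for p in indexed if p not in tree)
    return IndexPlan(to_embed, to_skip, to_delete)


# Hypothetical data: one unchanged file, one changed, one new, one deleted.
plan = plan_index_run(
    tree={"a.py": "sha1", "b.py": "sha2-new", "c.py": "sha3"},
    indexed={"a.py": "sha1", "b.py": "sha2-old", "d.py": "sha4"},
)
print(plan)
```

Only the paths in `to_embed` would reach the embedding providers. The paths in `to_delete` correspond to the
cleanup pass in Step 4.
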
 blog.md | 75 ++++++++++++++++++++++++++++++++-------------------------
 1 file changed, 42 insertions(+), 33 deletions(-)

diff --git a/blog.md b/blog.md
index 48db017..66c0c62 100644
--- a/blog.md
+++ b/blog.md
@@ -250,7 +250,46 @@ diff metadata as dense-only points.
 
 ---
 
-## Section 5 — Hybrid retrieval at query time
+## Section 5 — Indexing flow: incremental, content-addressed
+
+Embedding API calls are the dominant cost in any indexing run, and re-embedding an entire repository on every push would
+be expensive at scale. **semcode** avoids this by treating indexing as a diff operation: it uses git blob
+SHAs as content fingerprints to identify which files have changed, and only those files are parsed, embedded, and
+upserted. A service with 1,000 files where 10 changed re-embeds the symbols of those 10 files, not all 1,000. This
+section describes the full indexing pipeline.
+
+### Step 1 — Discovery via the Git Trees API
+
+The pipeline opens by calling GitHub's Trees API. One request returns every file in the repository tree. Crucially,
+each entry already includes the git `blob_sha` (git's own content hash for that file), so the pipeline has a
+fingerprint for every file without downloading a single byte of source code.
+
+### Step 2 — Hash comparison before any network I/O
+
+Before fetching any file content, the pipeline loads the `file_hash` values stored in the Qdrant payload for all
+already-indexed symbols in this service and builds a map from file path to stored hash. It then compares each file's
+`blob_sha` against that map. If the hashes match, the file is skipped entirely: no HTTP download, no parsing, no
+embedding call. This is the core of the incremental design: only files whose content actually changed are embedded
+again.
+
+### Step 3 — Fetch, parse, embed, upsert
+
+For every file that is new or has a changed blob SHA, the pipeline fetches the content by SHA,
+parses it into `CodeSymbol` objects, builds both dense and sparse inputs as described in Section 3,
+and calls both embedding providers in a batch.
+
+The upsert is a **delete-then-insert at the file level**: all existing points whose `file_path` matches are removed
+first, then the freshly embedded points are inserted. This keeps the index clean when a file loses methods,
+gains new ones, or is restructured.
+
+### Step 4 — Cleanup pass for deleted files
+
+After the main loop, the pipeline diffs the current repo file set against every `file_path` that exists in Qdrant.
+Any path no longer present in the repo is deleted.
+
+---
+
+## Section 6 — Hybrid retrieval at query time
 
 At query time, the same two-track split like in the ingestion phase runs in reverse. The query string goes through both
 encoders — the dense model turns it into a floating-point vector, the BM25 turns it into a sparse vector.
@@ -258,8 +297,7 @@ Both are sent to Qdrant in a single call, which runs each retriever independentl
 from each, and produces two separate ranked lists.
 
 Qdrant then uses **Reciprocal Rank Fusion (RRF)** to merge those two ranked lists into one before returning the
-final top K results. The merge looks like this step by step, using the query _"find the method that retries
-failed payments"_ as an example:
+final top K results. For example, using the query _"find the method that retries failed payments"_ merge looks like this:
 
 1. Dense retriever returns its ranked list:
    `[retryWithBackoff (rank 1), processPayment (rank 2), PlaceOrderRequest (rank 3), ...]`
@@ -278,36 +316,7 @@ RRF rewards consistent rank across retrievers. The score it produces answers a s
 "how consistently did this result appear near the top across both dense and sparse retrievers?"
 ---
 
-## Section 6 — Indexing flow: incremental, content-addressed
-
-- Walk the repo (GitHub API or local), apply excludes
-- For each file: compute blob SHA → compare against payload's `file_hash` → skip if unchanged
-- Parse → build dense + sparse inputs → batch-embed → upsert (delete-then-insert per file path)
-- Cleanup pass removes stale symbols for files no longer in the repo
-- Reference: `server/indexer/pipeline.py:128-249`
-- Why this matters: embedding API costs amortize across reindexes; large monorepos stay tractable
-
----
-
-## Section 7 — Bonus: indexing git history as a second RAG corpus
-
-- Separate pipeline embeds **commit messages + file deltas** into the `git_commits` collection
-- Dense-only (commit messages are short, sparse adds little)
-- Enables "when was retry logic introduced?" style queries
-- Reference: `server/indexer/git_history.py:24-63`, `server/tools/history.py`
-
----
-
-## Section 8 — What I'd do differently / open questions
-
-- Re-ranker on top of RRF (cross-encoder) — worth the latency?
-- Per-language collections vs single collection — when does the trade-off flip?
-- Embedding the *call graph* (cross-symbol relationships), not just symbols in isolation
-- Tuning the 6000-char source cap per language
-
----
-
-## Section 9 — Takeaways
+## Section 7 — Takeaways
 
 - Symbol-level chunking + rich, language-aware embedding inputs are the foundation
 - Hybrid dense+sparse with RRF gives you both "intent" and "exact name" search for free, server-side

From c71482fd6dc8e715348bfff304427c0b386fbce5 Mon Sep 17 00:00:00 2001
From: Nemanja <nemanjavasa@gmail.com>
Date: Sun, 31 May 2026 21:40:41 +0200
Subject: [PATCH 6/6] feat: Add conclusion section

---
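Note on the RRF walk-through whose intro sentence the first hunk reflows: the fused order in the blog example can
be reproduced with a few lines of Python. This is a sketch of the general RRF formula, not Qdrant's implementation.
`rrf_fuse` is a hypothetical helper, and `k=60` is an assumed constant chosen for illustration; the constant Qdrant
applies server-side may differ, although the order for these two lists is the same for any positive `k`.

```python
from collections import defaultdict


def rrf_fuse(ranked_lists, k=60):
    """Fuse ranked lists with Reciprocal Rank Fusion.

    Each item collects 1 / (k + rank) from every list it appears in
    (ranks start at 1); items are returned by descending summed score.
    """
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, item in enumerate(ranking, start=1):
            scores[item] += 1.0 / (k + rank)
    # Ties are broken alphabetically so the output is deterministic.
    return sorted(scores, key=lambda item: (-scores[item], item))


# The two ranked lists from the blog example, truncated to the entries shown.
dense = ["retryWithBackoff", "processPayment", "PlaceOrderRequest"]
sparse = ["PlaceOrderRequest", "retryWithBackoff", "handleTimeout"]

fused = rrf_fuse([dense, sparse])
print(fused)
```

Items present in both lists collect two contributions, which is why `retryWithBackoff` and `PlaceOrderRequest` end
up ahead of `processPayment` even though the latter ranks second in the dense list.
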
 blog.md | 29 +++++++++++++++--------------
 1 file changed, 15 insertions(+), 14 deletions(-)

diff --git a/blog.md b/blog.md
index 66c0c62..af6d3ec 100644
--- a/blog.md
+++ b/blog.md
@@ -297,7 +297,8 @@ Both are sent to Qdrant in a single call, which runs each retriever independentl
 from each, and produces two separate ranked lists.
 
 Qdrant then uses **Reciprocal Rank Fusion (RRF)** to merge those two ranked lists into one before returning the
-final top K results. For example, using the query _"find the method that retries failed payments"_ merge looks like this:
+final top K results. For example, for the query _"find the method that retries failed payments"_, the merge looks like
+this:
 
 1. Dense retriever returns its ranked list:
    `[retryWithBackoff (rank 1), processPayment (rank 2), PlaceOrderRequest (rank 3), ...]`
@@ -316,22 +317,22 @@ RRF rewards consistent rank across retrievers. The score it produces answers a s
 "how consistently did this result appear near the top across both dense and sparse retrievers?"
 ---
 
-## Section 7 — Takeaways
+## Conclusion
 
-- Symbol-level chunking + rich, language-aware embedding inputs are the foundation
-- Hybrid dense+sparse with RRF gives you both "intent" and "exact name" search for free, server-side
-- The payload is half the system — invest in it
-- Incremental indexing via blob SHAs is what makes this affordable at repo scale
+Building a RAG system for code has its own challenges. It is not just RAG with different file types:
+it requires rethinking every layer of the pipeline, from how you chunk (by symbol, not paragraph)
+to how you embed (rich context for dense vectors, exact tokens for sparse vectors) to how you store
+(named vectors with a payload that carries as much signal as the vectors themselves). Hybrid
+dense+sparse retrieval with server-side RRF bridges the gap between intent-based queries and exact identifier lookups,
+giving you both in a single round-trip. The payload is half the system: without language, service, and type fields
+indexed as filters, every search scans the entire collection regardless of how good the vectors are. And without
+incremental indexing via blob SHAs, the embedding cost alone would make continuous reindexing impractical at any serious
+repository scale. Together these choices form a pipeline that stays accurate, stays fast, and stays affordable as the
+codebase grows.
 
 ---
 
-## Appendix — Suggested diagrams
-
-1. Pipeline overview: file → Tree-sitter → `CodeSymbol` → dense input + sparse input → Qdrant
-2. Qdrant point anatomy: two named vectors + payload fields, annotated
-3. Query-time RRF: query → two encoders → two ranked lists → fused result
-
 ## Reference
 
-https://qdrant.tech/articles/sparse-vectors/
-https://www.elastic.co/docs/reference/elasticsearch/rest-apis/reciprocal-rank-fusion
\ No newline at end of file
+- [Sparse Vectors](https://qdrant.tech/articles/sparse-vectors/)
+- [Reciprocal Rank Fusion (RRF)](https://www.elastic.co/docs/reference/elasticsearch/rest-apis/reciprocal-rank-fusion)
\ No newline at end of file