GoodbyePlanet · GoodbyePlanet · May 31, 2026 · May 18, 2026 · May 19, 2026 · May 22, 2026
diff --git a/.gitignore b/.gitignore
@@ -21,7 +21,6 @@ build/
 *.iml
 .idea/
 
-blog.md
 
 # Local config (use config.example.yaml as a template)
 config.yaml

diff --git a/blog.md b/blog.md
@@ -0,0 +1,338 @@
+# Blog Post Plan: How semcode Builds a RAG System for Code Search
+
+## Context
+
+This blog post explains the RAG (retrieval-augmented generation) pipeline behind
+[**semcode**](https://github.com/GoodbyePlanet/semcode), an MCP server that does
+semantic code search across your GitHub repositories. It covers both halves of the pipeline. The **ingestion** side:
+how repositories are discovered, how code is parsed into symbols with Tree-sitter, how the dense and sparse embedding
+inputs are constructed, and how points land in Qdrant incrementally. The **retrieval** side: how queries are encoded
+into dense and sparse vectors and fused server-side with RRF (Reciprocal Rank Fusion). Along the way we'll cover why a
+hybrid dense+sparse approach beats either one alone for code, and why the *payload* stored next to each vector matters
+as much as the vector itself.
+
+Audience: engineers familiar with RAG, embeddings, and vector DBs, curious about applying RAG to source code
+specifically (not prose).
+
+---
+
+## Section 1 — Why RAG for code is different from RAG for documents
+
+Most RAG systems are built around prose: PDFs, internal documentation, wikis. The content is natural language written
+for humans and meaning is carried in sentences, so semantic search over plain text works well. Add a second retrieval
+stage (a reranker) and you get a system that can answer questions with high confidence.
+Source code is different: it is structured and symbolic, written for compilers and interpreters. Meaning is
+distributed across structure, not sentences:
+
+- A function name (`retryWithBackoff`) carries intent
+- The signature (`attempts: int, delay_ms: int`) carries contract
+- The body carries implementation details
+- Annotations (`@Retryable`, `@CircuitBreaker`) carry framework behavior
+- The class it belongs to (`OrderProcessingService`) carries domain context
+
+None of that is a sentence. You can't chunk code by paragraph — you chunk by symbol (function, class, method).
+Let's see how that is implemented in **semcode**.
+
+---
+
+## Section 2 — From source files to Code Symbols: Tree-sitter parsing
+
+What is an AST?
+
+An Abstract Syntax Tree (AST) is a tree representation of the grammatical structure of source code: its logical parts
+and how they relate to each other. Every construct in your code (a function definition, a class, an if statement,
+a variable assignment) becomes a node in the tree, where parent-child
+relationships express nesting and ownership.
+
+For clarity, below is a Python function followed by a pruned AST for it. It is meant as a mental model of how a parser
+sees a function: a decorated async definition with typed parameters, a return annotation, and a body containing a
+docstring and a single return.
+
+```text
+@app.get("/users")
+async def list_users(db: Session) -> list[User]:
+    """Return all users."""
+    return db.query(User).all()
+
+module
+└── decorated_definition
+    ├── decorator              → "@app.get("/users")"
+    └── function_definition
+        ├── name               → "list_users"
+        ├── parameters         → "(db: Session)"
+        ├── return_type        → "list[User]"
+        └── body
+            ├── expression_statement
+            │   └── string     → '"""Return all users."""'
+            └── return_statement
+```
+
+What is Tree-sitter?
+
+The official documentation describes Tree-sitter as a parser generator tool and an incremental parsing library: it can
+build a concrete syntax tree for a source file and efficiently update that tree as the file is edited
+([Tree-sitter official documentation](https://tree-sitter.github.io/tree-sitter/)).
+
+What is a Code Symbol in **semcode**?
+
+A symbol is one named, self-contained unit of code that a language considers meaningful — a function, a class, a method,
+an interface, a React component, a hook... In **semcode** a symbol is a CodeSymbol dataclass,
+which captures everything needed to search, understand, and locate it without reading the surrounding file.
+
+What a `CodeSymbol` carries:
+
+**name / symbol_type / language** — Together these describe what kind of thing this is (e.g. `save`,
+`method`, `java`) so retrieval can filter by language or type before even looking at embeddings.
+
+**signature** — The declaration line only, e.g. *def save(self, db: Session) -> User*. This is what you'd see in an
+IDE's autocomplete popup — compact enough to show in search results without including the full body.
+
+**source** — The complete raw text of the symbol, from the first line of its declaration to the end of its body.
+It feeds the embedding inputs described in Section 3 and is returned with search results, so a retrieved chunk carries
+the full implementation context.
+
+**start_line / end_line** — Position recorded by Tree-sitter during parsing, used to link a search result back
+to an exact location in the file.
+
+**parent_name / package** — Structural context. **parent_name** says which class owns this method; **package** says
+which Java package or Python module the file belongs to. Without these, two methods both named `save` in different
+services are indistinguishable.
+
+**annotations / extras** — Language-specific enrichment. A Java `@GetMapping("/users")` lands in annotations; the
+extracted HTTP route string (`GET /users`) lands in extras. For TypeScript, extras flags whether a component uses
+hooks, or whether a function matches the React component signature pattern.
+
+Example:
+
+```python
+CodeSymbol(
+    name="list_users",
+    symbol_type="api_route",
+    language="python",
+    source="async def list_users(db: Session) -> list[User]:\n    ...",
+    file_path="auth-service/routers/users.py",
+    start_line=2,
+    end_line=4,
+    parent_name=None,
+    package="auth-service.routers.users",
+    annotations=["app.get(\"/users\")"],
+    signature="async def list_users(db: Session) -> list[User]",
+    docstring='"""Return all users."""',
+    extras={"is_async": True, "http_method": "GET", "http_route": "/users"},
+)
+```
+
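To make the parse-then-extract step concrete, here is a minimal sketch that walks a syntax tree and emits one symbol per function or class. It is not semcode's implementation: it uses Python's standard-library `ast` module as a stand-in for Tree-sitter so that it runs without extra dependencies, and the trimmed `CodeSymbol` dataclass and the `extract_symbols` helper are assumptions made for illustration.

```python
import ast
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class CodeSymbol:
    # Trimmed, hypothetical version of the dataclass described above.
    name: str
    symbol_type: str
    language: str
    source: str
    start_line: int
    end_line: int
    parent_name: Optional[str] = None
    annotations: list[str] = field(default_factory=list)
    signature: str = ""
    docstring: Optional[str] = None


def extract_symbols(code: str) -> list[CodeSymbol]:
    """Walk the tree and emit one CodeSymbol per class, function, or method."""
    symbols: list[CodeSymbol] = []

    def make(node, symbol_type: str, parent: Optional[str]) -> CodeSymbol:
        # The node itself knows where it starts and ends in the file.
        source = ast.get_source_segment(code, node) or ""
        return CodeSymbol(
            name=node.name,
            symbol_type=symbol_type,
            language="python",
            source=source,
            start_line=node.lineno,
            end_line=node.end_lineno,
            parent_name=parent,
            annotations=[ast.unparse(d) for d in node.decorator_list],
            # Simplification: assumes the signature fits on one line.
            signature=source.splitlines()[0].rstrip(":") if source else "",
            docstring=ast.get_docstring(node),
        )

    def visit(node: ast.AST, parent: Optional[str]) -> None:
        for child in ast.iter_child_nodes(node):
            if isinstance(child, ast.ClassDef):
                symbols.append(make(child, "class", parent))
                visit(child, child.name)  # methods get the class as parent
            elif isinstance(child, (ast.FunctionDef, ast.AsyncFunctionDef)):
                kind = "method" if parent else "function"
                symbols.append(make(child, kind, parent))

    visit(ast.parse(code), None)
    return symbols
```

Running `extract_symbols` on the `list_users` function from the AST example gives a single symbol that starts on line 2 and ends on line 4, in line with the `CodeSymbol` shown above. Nested functions and multi-line signatures are deliberately left out to keep the sketch short.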
+So the full pipeline is:
+Tree-sitter parses code into a syntax tree. The parser then walks that tree node by node, asks each node where it
+starts and ends and what it contains, and puts all of that into a **CodeSymbol**: one symbol per meaningful language
+construct.
+
+---
+
+## Section 3 — Building the embedding input
+
+Now that we know what a **CodeSymbol** is, we can build the inputs for the vector database. **semcode** stores its
+vectors in [Qdrant](https://qdrant.tech/), and each symbol produces two types of input: dense and sparse.
+
+What are dense embeddings?
+
+**Dense embeddings** encode the *meaning* of text into a fixed-size vector of floating-point numbers — typically
+hundreds or thousands of dimensions depending on which embedding provider is chosen. Two pieces of text that express the
+same idea will land close together in that vector space even if they share no words in common. For code search this
+means a query like "find the method that handles payment retries" can surface `retryWithBackoff()`
+without those words appearing anywhere in the source.
+
+```python
+dense = [0.2, 0.3, 0.5, 0.7, ...]  # several hundred floats
+```
+
+What are sparse embeddings?
+
+**Sparse embeddings** work the opposite way: instead of capturing meaning, they represent text as a large vocabulary
+vector where almost every entry is zero and only the terms that actually appear get a non-zero weight. BM25 is the
+algorithm behind this — it scores each token by how often it appears in a document relative to how common it
+is across the whole corpus. This makes sparse embeddings excellent at exact keyword matching: if you search for
+`PlaceOrderRequest` or `@Transactional`, BM25 will find every document that contains those tokens precisely.
+
+```python
+# Taken from Qdrant docs
+sparse = [{331: 0.5}, {14136: 0.7}]  # only the non-zero entries are stored; a real document might have ~20 pairs
+# The numbers 331 and 14136 map to specific tokens in the vocabulary e.g. ['Transactional', 'PlaceOrderRequest'].
+# Every other entry is zero, which is why it is called a sparse vector.
+```
+
+How does **semcode** build the dense input?
+
+The whole `CodeSymbol` object is not embedded directly. It is first serialized into a single text string, and that
+string is what the embedding model sees. One symbol produces one string, which produces one vector: an array of
+floating-point numbers (e.g. 768 or 3072 floats depending on the provider). The `CodeSymbol` fields that carry
+*meaning* go into that string.
+
+It starts with a human-readable preamble that names the language, symbol type, parent class, and owning service, then
+layers in framework-specific metadata (Spring stereotypes, HTTP method and route, annotations), followed by a truncated
+docstring and the full signature. Finally, the raw source body is appended, capped at ~6,000 characters (~1,500
+tokens). The goal is to give the embedding model everything it would need to understand the symbol's role, not just
+its implementation.
+
+The fields that are useful for *displaying* results (like `start_line`, `end_line`, `file_path`, `signature`, `source`)
+or *filtering* them (like `language`, `service`, `symbol_type`) are stored separately as the Qdrant **payload**.
+They sit next to the vector as plain data, even when some of them (`signature`, `source`) also appear in the
+embedded string.
+
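The serialization order described above can be sketched as a small function. This is only an illustration of the idea: the function name `build_dense_text`, the dictionary keys, the exact preamble wording, and the 300-character docstring cap are assumptions, while the 6,000-character source cap comes from the text.

```python
MAX_SOURCE_CHARS = 6000      # source cap mentioned in the text
MAX_DOCSTRING_CHARS = 300    # hypothetical cap for the "truncated docstring"


def build_dense_text(symbol: dict, service: str) -> str:
    """Serialize the meaning-bearing fields of a symbol into one string."""
    # 1. Human-readable preamble: language, type, name, parent, service.
    preamble = f"{symbol['language']} {symbol['symbol_type']} {symbol['name']}"
    if symbol.get("parent_name"):
        preamble += f" in class {symbol['parent_name']}"
    preamble += f" (service: {service})"
    parts = [preamble]

    # 2. Framework-specific metadata, when present.
    extras = symbol.get("extras") or {}
    if extras.get("http_method") and extras.get("http_route"):
        parts.append(f"HTTP {extras['http_method']} {extras['http_route']}")
    if symbol.get("annotations"):
        parts.append("Annotations: " + ", ".join(symbol["annotations"]))

    # 3. Truncated docstring, full signature, capped source body.
    if symbol.get("docstring"):
        parts.append(symbol["docstring"][:MAX_DOCSTRING_CHARS])
    parts.append(symbol["signature"])
    parts.append(symbol["source"][:MAX_SOURCE_CHARS])
    return "\n".join(parts)
```

The returned string is what would be handed to the embedding provider; the display and filter fields travel separately in the payload.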
+How does **semcode** build the sparse input?
+
+Building the BM25 text input is minimal: it concatenates only the signature, docstring, and raw source, with no
+metadata. It splits camelCase and snake_case identifiers into their component words while keeping the original form
+alongside. A token like `PlaceOrderRequest` becomes `PlaceOrderRequest Place Order Request`, so BM25 can match the
+exact identifier *and* a natural-language query like "place order request" that doesn't use the original casing.
+
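The identifier splitting can be illustrated with a couple of regular expressions. This is a sketch under assumptions, not semcode's tokenizer: the helper names `expand_identifier` and `build_sparse_text` and the exact regex are hypothetical.

```python
import re
from typing import Optional

# Matches the word-like pieces inside camelCase, PascalCase, and snake_case.
_WORD = re.compile(r"[A-Z]+(?=[A-Z][a-z])|[A-Z]?[a-z]+|[A-Z]+|\d+")
# Matches whole identifiers in the raw text.
_IDENT = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def expand_identifier(identifier: str) -> str:
    """Return the identifier followed by its component words, if it has any."""
    parts = _WORD.findall(identifier)
    if len(parts) <= 1:
        return identifier  # nothing to split, keep as-is
    return identifier + " " + " ".join(parts)


def build_sparse_text(signature: str, docstring: Optional[str], source: str) -> str:
    """Concatenate signature, docstring, and source, then expand identifiers."""
    raw = "\n".join(p for p in (signature, docstring, source) if p)
    return _IDENT.sub(lambda m: expand_identifier(m.group(0)), raw)
```

With this sketch, `expand_identifier("PlaceOrderRequest")` yields `PlaceOrderRequest Place Order Request`, so both the literal identifier and its separate words end up in the text that BM25 tokenizes.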
+Why does sparse matter when the dense input is already rich? Dense embeddings excel at intent: a query like "find
+the method that retries payments" can surface `retryWithBackoff` even if no query word appears in the source. That
+power trades precision for meaning, and rare or project-specific identifiers like `PlaceOrderRequest` get smoothed
+toward neighboring concepts in the model's vector space. BM25 fills exactly that gap: it matches tokens literally with
+no compression, and **semcode's** code-aware tokenization keeps `PlaceOrderRequest` alongside its split form
+`Place Order Request`, so it handles both exact identifier lookups and natural-language queries that dense alone
+would miss.
+
+So the full picture is:
+Every `CodeSymbol` produces two inputs. The dense input is wide and context-rich — it tells the model the symbol's
+place in the system. The sparse input is narrow and literal — it gives BM25 the exact tokens to match against. Both
+are computed in the same pipeline step and stored together as a single point in Qdrant.
+
+---
+
+## Section 4 — What goes into Qdrant: the named-vector schema
+
+Section 3 explained that each symbol produces two inputs, dense and sparse, which are stored together in Qdrant.
+This section explains *how* they are stored: the shape of a single stored point and why that shape matters at query
+time.
+
+### Named vectors: two vectors, one point
+
+Qdrant lets a single point carry multiple vectors under distinct names, each with its own distance metric and index.
+**semcode** uses this directly: the `code_symbols` collection defines two named vectors per point.
+
+- `text-dense` — cosine distance, dimensionality set by the embedding provider.
+- `text-sparse` — Qdrant's native BM25 sparse index.
+
+The advantage of named vectors over two parallel collections is that one point ID identifies one symbol everywhere.
+Dense and sparse retrievers always agree on what "document 42" means, which is what makes server-side fusion
+(Section 6) possible in a single round-trip.
+
+### Anatomy of a stored point
+
+Alongside the two vectors, there is the payload: the non-embedded half of the point.
+The payload is a JSON object with the following fields:
+
+- **Identity & filtering** — `symbol_name`, `symbol_type`, `language`, `service`,
+  `file_path`, `package`, `parent_name`. These uniquely place the symbol in
+  the repo, and three of them — `language`, `service`, `symbol_type` — are
+  wired as active query-time filters.
+- **Display** — `signature`, `source`, `docstring`, `start_line`, `end_line`,
+  `annotations`, `extras` (HTTP method, route, Spring stereotype). These are
+  what the MCP client renders back to the user — they are never filtered on,
+  just returned alongside the score (`server/tools/search.py:60-71`).
+- **Bookkeeping** — `file_hash`, `indexed_at`. Not exposed at query time, but
+  critical for the incremental reindex flow: the hash is how the pipeline
+  decides a file hasn't changed and can be skipped (`server/indexer/pipeline.py:122-123`).
+
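Put together, one stored point could look roughly like the dictionary built below. The vector names and payload fields come from the text; the `build_point` helper, the deterministic UUID scheme for the point ID, and the subset of payload fields shown are assumptions for illustration. The `indices`/`values` layout is the form Qdrant uses for sparse vectors.

```python
import uuid


def build_point(symbol: dict, service: str, dense: list[float],
                sparse: dict[int, float], file_hash: str, indexed_at: str) -> dict:
    """Assemble one point: two named vectors plus a payload."""
    # Hypothetical ID scheme: a deterministic UUID derived from the symbol's location.
    key = f"{service}:{symbol['file_path']}:{symbol.get('parent_name')}:{symbol['name']}"
    indices = sorted(sparse)
    return {
        "id": str(uuid.uuid5(uuid.NAMESPACE_URL, key)),
        "vector": {
            "text-dense": dense,
            "text-sparse": {
                "indices": indices,
                "values": [sparse[i] for i in indices],
            },
        },
        "payload": {
            # Identity and filtering
            "symbol_name": symbol["name"],
            "symbol_type": symbol["symbol_type"],
            "language": symbol["language"],
            "service": service,
            "file_path": symbol["file_path"],
            "parent_name": symbol.get("parent_name"),
            # Display
            "signature": symbol["signature"],
            "source": symbol["source"],
            "start_line": symbol["start_line"],
            "end_line": symbol["end_line"],
            # Bookkeeping for incremental reindexing
            "file_hash": file_hash,
            "indexed_at": indexed_at,
        },
    }
```

Because both vectors hang off the same `id`, a dense hit and a sparse hit for the same symbol can be recognized as the same result when the two ranked lists are fused.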
+### Payload indexes: filters before vectors
+
+Filters are only cheap when Qdrant can look them up. Without a payload index on the filtered field, Qdrant has to read
+and check the payload of each candidate point while it searches. That means if you ask for
+"OAuth 2.0 implementation in payment-service", a lot of work goes into symbols from other services
+that can never be returned.
+
+Payload indexes avoid this. **semcode** indexes six fields (`language`, `service`, `symbol_type`, `chunk_tier`,
+`parent_name`, `file_path`) so Qdrant can resolve a filter from the index and narrow the candidate set cheaply. For a
+selective filter, the vector search then effectively runs over the matching symbols rather than the whole collection.
+
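As an illustration of how the three query-time filters might be turned into a Qdrant filter, here is a small sketch. The `must` / `key` / `match` structure is Qdrant's filter format; the `build_filter` helper and its parameters are hypothetical.

```python
from typing import Optional


def build_filter(language: Optional[str] = None,
                 service: Optional[str] = None,
                 symbol_type: Optional[str] = None) -> Optional[dict]:
    """Build a filter that requires every provided field to match exactly."""
    requested = (
        ("language", language),
        ("service", service),
        ("symbol_type", symbol_type),
    )
    conditions = [
        {"key": key, "match": {"value": value}}
        for key, value in requested
        if value is not None
    ]
    # No conditions means an unfiltered search.
    return {"must": conditions} if conditions else None
```

For the "payment-service" query above, `build_filter(service="payment-service")` produces a single `must` condition on the indexed `service` field.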
+### A second, simpler collection
+
+Code symbols aren't the only RAG corpus in **semcode**. A separate `git_commits` collection stores commit messages and
+diff metadata as dense-only points.
+
+---
+
+## Section 5 — Indexing flow: incremental, content-addressed
+
+Embedding API calls are the dominant cost in any indexing run, and re-embedding an entire repository on every push would
+be expensive at scale. **semcode** avoids this by treating indexing as a diff operation: it uses git blob
+SHAs as content fingerprints to identify which files have changed, and only those files are parsed, embedded, and
+upserted. A service with 1,000 files where 10 changed re-embeds only the symbols from those 10 files, not all 1,000. This section describes
+the full indexing pipeline.
+
+### Step 1 — Discovery via the Git Trees API
+
+The pipeline opens by calling GitHub's Git Trees API. A single recursive request returns every file in the repository
+tree (GitHub may truncate the response for very large repositories). Crucially, each entry already includes the git
+blob SHA (called `blob_sha` here), git's own content hash for that file, without downloading a single byte of source
+code.
+
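The reason a blob SHA works as a content fingerprint is that git derives it purely from the file's bytes. A minimal sketch of that computation, assuming a repository that uses git's default SHA-1 object format (the function name `git_blob_sha` is made up for this example):

```python
import hashlib


def git_blob_sha(content: bytes) -> str:
    """Compute the object ID git assigns to a blob with this content."""
    # Git hashes a small header ("blob <size>\0") followed by the raw bytes.
    header = f"blob {len(content)}\0".encode("ascii")
    return hashlib.sha1(header + content).hexdigest()
```

Two files with identical bytes always get the same SHA regardless of path or commit, so an unchanged SHA is enough to conclude that a file's content has not changed.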
+### Step 2 — Hash comparison before any file download
+
+Before fetching any file content, the pipeline loads the `file_hash` values stored in the Qdrant payload for all
+already-indexed symbols in this service. It then compares each file's `blob_sha`
+against that map. If the hashes match, the file is skipped entirely — no HTTP download, no parsing, no embedding call.
+This is the core of the incremental design — instead of re-embedding every symbol on every run, only files whose content
+actually changed are embedded again.
+
+### Step 3 — Fetch, parse, embed, upsert
+
+For every file that is new or has a changed blob SHA, the pipeline fetches the content by SHA,
+parses it into `CodeSymbol` objects, builds both dense and sparse inputs as described in Section 3,
+and calls both embedding providers in a batch.
+
+The upsert is a **delete-then-insert at the file level**: all existing points whose `file_path` matches are removed
+first, then the freshly embedded points are inserted. This keeps the index clean when a file loses methods,
+gains new ones, or is restructured.
+
+### Step 4 — Cleanup pass for deleted files
+
+After the main loop, the pipeline diffs the current repo file set against every `file_path` that exists in Qdrant.
+Any path no longer present in the repo is deleted.
+
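Steps 2 to 4 boil down to a comparison between two path-to-hash maps. The sketch below shows that planning step only; `plan_reindex` and its argument shapes are assumptions, with `tree` standing for the paths and blob SHAs from Step 1 and `indexed` for the `file_hash` values already stored in Qdrant.

```python
def plan_reindex(tree: dict[str, str], indexed: dict[str, str]):
    """Decide what to do with each file.

    tree:    path -> blob SHA currently in the repository
    indexed: path -> file_hash stored in the vector database
    """
    # New or changed files: fetch, parse, embed, then delete-then-insert.
    to_index = sorted(p for p, sha in tree.items() if indexed.get(p) != sha)
    # Unchanged files: no download, no parsing, no embedding call.
    to_skip = sorted(p for p, sha in tree.items() if indexed.get(p) == sha)
    # Files that disappeared from the repo: remove their points.
    to_delete = sorted(p for p in indexed if p not in tree)
    return to_index, to_skip, to_delete
```

For example, with `tree = {"a.py": "1", "b.py": "2", "c.py": "3"}` and `indexed = {"a.py": "1", "b.py": "old", "d.py": "9"}`, only `b.py` and `c.py` are embedded, `a.py` is skipped, and the points for `d.py` are deleted.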
+---
+
+## Section 6 — Hybrid retrieval at query time
+
+At query time, the same two-track split used during ingestion is applied to the query. The query string goes through
+both encoders: the dense model turns it into a floating-point vector, and BM25 turns it into a sparse vector.
+Both are sent to Qdrant in a single call, which runs each retriever independently and takes the top K×2 candidates
+from each, producing two separate ranked lists.
+
+Qdrant then uses **Reciprocal Rank Fusion (RRF)** to merge those two ranked lists into one before returning the
+final top K results. For example, for the query _"find the method that retries failed payments"_ the merge looks like
+this:
+
+1. Dense retriever returns its ranked list:
+   `[retryWithBackoff (rank 1), processPayment (rank 2), PlaceOrderRequest (rank 3), ...]`
+2. Sparse retriever returns its ranked list:
+   `[PlaceOrderRequest (rank 1), retryWithBackoff (rank 2), handleTimeout (rank 3), ...]`
+3. RRF scores each result with `1 / (k + rank)` for every list it appears in, where `k` is a fixed smoothing constant, then sums those contributions
+4. Everything is re-sorted by that combined score → one final list:
+   `[retryWithBackoff, PlaceOrderRequest, processPayment, handleTimeout, ...]`
+
+`retryWithBackoff` ranked first in dense and second in sparse — both retrievers agreed, so it floats to the top.
+`PlaceOrderRequest` ranked first in sparse (exact token match) but third in dense — it still surfaces near the top
+because the sparse retriever was confident. `processPayment` only appeared in one list despite a good dense rank,
+so it scores lower.
+
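The fusion step in the example can be reproduced in a few lines. This sketch runs client-side purely for illustration (in semcode the fusion happens inside Qdrant), and `k = 60` is an assumed value that is common in the RRF literature; the constant Qdrant applies may differ. For this particular example the final order is the same for any non-negative `k`.

```python
from collections import defaultdict


def rrf(rankings: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Fuse ranked lists: each appearance contributes 1 / (k + rank)."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, item in enumerate(ranking, start=1):  # ranks are 1-based
            scores[item] += 1.0 / (k + rank)
    # Highest combined score first.
    return sorted(scores.items(), key=lambda pair: pair[1], reverse=True)


dense = ["retryWithBackoff", "processPayment", "PlaceOrderRequest"]
sparse = ["PlaceOrderRequest", "retryWithBackoff", "handleTimeout"]
fused = [name for name, _ in rrf([dense, sparse])]
```

Here `fused` comes out as `retryWithBackoff`, `PlaceOrderRequest`, `processPayment`, `handleTimeout`, matching the merged list in step 4: the two symbols that appear in both lists outrank the two that appear in only one.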
+RRF rewards consistent rank across retrievers. It ignores the raw similarity scores, which are not comparable between
+dense and sparse retrieval, and answers a simpler question instead:
+"how consistently did this result appear near the top across both dense and sparse retrievers?"
+
+---
+
+## Conclusion
+
+Building a RAG system for code has its own challenges. It is not just RAG with different file types:
+it requires rethinking every layer of the pipeline, from how you chunk (by symbol, not paragraph)
+to how you embed (rich context for dense vectors, exact tokens for sparse vectors) to how you store
+(named vectors with a payload that carries as much signal as the vectors themselves). Hybrid
+dense+sparse retrieval with server-side RRF bridges the gap between intent-based queries and exact identifier lookups,
+giving you both in a single round-trip. The payload is half the system: without language, service, and type fields
+indexed as filters, a filtered search does far more work than it needs to, regardless of how good the vectors are.
+And without incremental indexing via blob SHAs, the embedding cost alone would make continuous reindexing impractical
+at any serious repository scale. Together these choices form a pipeline that stays accurate, stays fast, and stays
+affordable as the codebase grows.
+
+---
+
+## Reference
+
+- [Sparse Vectors](https://qdrant.tech/articles/sparse-vectors/)
+- [Reciprocal Rank Fusion (RRF)](https://www.elastic.co/docs/reference/elasticsearch/rest-apis/reciprocal-rank-fusion)