diff --git a/blog.md b/blog.md
index af6d3ec..466f74e 100644
--- a/blog.md
+++ b/blog.md
@@ -83,7 +83,7 @@ which captures everything needed to search, understand, and locate it without re
 What a `CodeSymbol` carries:
 
 **name / symbol_type / language** — These uniquely describe what kind of thing this is (save,
-method, java) so retrieval can filter by language or type before even looking at embeddings.
+method, java), and are stored on the point so results can be displayed and grouped by language or type.
 
 **signature** — The declaration line only, e.g. *def save(self, db: Session) -> User*. This is what you'd see in an IDE's
 autocomplete popup — compact enough to show in search results without including the full body.
@@ -173,7 +173,7 @@ docstring and the full signature. Finally, the raw source body is appended, capp
 tokens). The goal is to give the embedding model everything it would need to understand the symbol's role, not just its
 implementation.
 The fields that are useful for *displaying* results (like `start_line`, `end_line`, `file_path`, `signature`, `source`)
-or *filtering* them (like `language`, `service`, `symbol_type`) are stored separately as the Qdrant **payload** —
+or *filtering* them (like `service`) are stored separately as the Qdrant **payload** —
 they sit next to the vector but are never embedded.
 
 How does **semcode** build the sparse input?
@@ -223,8 +223,9 @@ Payload is a JSON object with the following fields:
 
 - **Identity & filtering** — `symbol_name`, `symbol_type`, `language`, `service`,
   `file_path`, `package`, `parent_name`. These uniquely place the symbol in
-  the repo, and three of them — `language`, `service`, `symbol_type` — are
-  wired as active query-time filters.
+  the repo. Only one of them — `service` — is wired as an active query-time
+  filter on semantic search; the others are kept on the payload for display,
+  scoped lookups (e.g. exact-name search), and future use.
 - **Display** — `signature`, `source`, `docstring`, `start_line`, `end_line`,
   `annotations`, `extras` (HTTP method, route, Spring stereotype). These are
   what the MCP client renders back to the user — they are never filtered on,
@@ -241,7 +242,9 @@ symbol — then throw away the ones that don't match.
 
 Payload indexes flip this order. **semcode** indexes six fields — `language`, `service`, `symbol_type`, `chunk_tier`,
 `parent_name`, `file_path` — so Qdrant can narrow the candidate set *before* any vector math happens. The
-vector search then runs only over the matching symbols, not the whole collection.
+vector search then runs only over the matching symbols, not the whole collection. In practice the semantic search
+path only filters on `service`; the other indexes still pay off for direct symbol lookups and the incremental
+reindex flow, which scrolls the collection by `service` and `file_path`.
 
 ### A second, simpler collection
 
@@ -324,8 +327,8 @@ it requires rethinking every layer of the pipeline, from how you chunk (by symbo
 to how you embed (rich context for dense vectors, exact tokens for sparse vectors) to how you store (named vectors with
 a payload that carries as much signal as the vectors themselves). Hybrid dense+sparse retrieval with server-side RRF
 bridges the gap between intent-based queries and exact identifier lookups,
-giving you both in a single round-trip. The payload is half the system: without language, service, and type fields
-indexed as filters, every search scans the entire collection regardless of how good the vectors are. And without
+giving you both in a single round-trip. The payload is half the system: without a `service` filter indexed on the
+payload, every search scans the entire collection regardless of how good the vectors are. And without
 incremental indexing via blob SHAs, the embedding cost alone would make continuous reindexing impractical at any
 serious repository scale.
 Together these choices form a pipeline that stays accurate, stays fast, and stays affordable as the codebase grows.
diff --git a/server/store/qdrant.py b/server/store/qdrant.py
index cafb4ea..8ab2169 100644
--- a/server/store/qdrant.py
+++ b/server/store/qdrant.py
@@ -182,23 +182,15 @@ async def search(
         dense_vector: list[float],
         sparse_vector: SparseVector,
         limit: int = 10,
-        language: str | None = None,
         service: str | None = None,
-        symbol_type: str | None = None,
     ) -> list[ScoredPoint]:
-        must = []
-        if language:
-            must.append(
-                FieldCondition(key="language", match=MatchValue(value=language))
-            )
-        if service:
-            must.append(FieldCondition(key="service", match=MatchValue(value=service)))
-        if symbol_type:
-            must.append(
-                FieldCondition(key="symbol_type", match=MatchValue(value=symbol_type))
+        query_filter = (
+            Filter(
+                must=[FieldCondition(key="service", match=MatchValue(value=service))]
             )
-
-        query_filter = Filter(must=must) if must else None
+            if service
+            else None
+        )
 
         result = await self._client.query_points(
             collection_name=self._collection,
@@ -217,7 +209,6 @@ async def search(
                 ),
             ],
             query=FusionQuery(fusion=Fusion.RRF),
-            query_filter=query_filter,
             limit=limit,
             with_payload=True,
         )
diff --git a/server/tools/search.py b/server/tools/search.py
index 7d67e52..c99146c 100644
--- a/server/tools/search.py
+++ b/server/tools/search.py
@@ -19,19 +19,14 @@ def register_search_tools(mcp: FastMCP) -> None:
     @mcp.tool()
     async def search_code(
         query: str,
-        language: str | None = None,
         service: str | None = None,
-        symbol_type: str | None = None,
         limit: int = 10,
     ) -> str:
         """Semantically search code across indexed services using natural language.
 
         Args:
             query: Natural language description of what you're looking for.
-            language: Filter by language: java, python, typescript
             service: Filter by service name
-            symbol_type: Filter by type: class, method, interface, enum, record, function,
-                react_component, react_hook, type, pydantic_model
             limit: Maximum number of results (default 10)
         """
         embedder = get_embedding_provider()
@@ -44,9 +39,7 @@ async def search_code(
             dense_vector=dense_vector,
             sparse_vector=sparse_vector,
             limit=limit,
-            language=language,
             service=service,
-            symbol_type=symbol_type,
         )
 
         if not results:
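
Reviewer sketch: the single-field filter logic that `search()` now uses can be illustrated without the client library. This is a minimal sketch under assumptions, not the code in this change: `build_service_filter` is a hypothetical helper name, `"billing"` is a made-up service name, and the returned dict follows the JSON shape of a Qdrant filter rather than the `Filter` / `FieldCondition` model objects used in `server/store/qdrant.py`.

```python
from __future__ import annotations

from typing import Any


def build_service_filter(service: str | None) -> dict[str, Any] | None:
    """Return a Qdrant-style filter dict for one service, or None for no filter.

    Hypothetical helper: mirrors the `if service ... else None` branch in the
    diff, but emits a plain dict instead of qdrant-client model objects.
    """
    if not service:  # None and "" both mean "search every service"
        return None
    # One `must` clause: the payload field `service` has to equal the value.
    return {"must": [{"key": "service", "match": {"value": service}}]}


# "billing" is an illustrative service name, not one taken from the repo.
print(build_service_filter("billing"))
print(build_service_filter(None))
```

As in the diff, the check is a truthiness test, so an empty string behaves like `None` and leaves the search unfiltered.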