Merged
17 changes: 10 additions & 7 deletions blog.md
@@ -83,7 +83,7 @@ which captures everything needed to search, understand, and locate it without re
What a `CodeSymbol` carries:

**name / symbol_type / language** — These uniquely describe what kind of thing this is (save,
-method, java) so retrieval can filter by language or type before even looking at embeddings.
+method, java), and are stored on the point so results can be displayed and grouped by language or type.

**signature** — The declaration line only, e.g. *def save(self, db: Session) -> User*. This is what you'd see in an
IDE's autocomplete popup — compact enough to show in search results without including the full body.
@@ -173,7 +173,7 @@ docstring and the full signature. Finally, the raw source body is appended, capp
tokens). The goal is to give the embedding model everything it would need to understand the symbol's role, not just
its implementation.
The fields that are useful for *displaying* results (like `start_line`, `end_line`, `file_path`, `signature`, `source`)
-or *filtering* them (like `language`, `service`, `symbol_type`) are stored separately as the Qdrant **payload** —
+or *filtering* them (like `service`) are stored separately as the Qdrant **payload** —
they sit next to the vector but are never embedded.
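
The split between embedded text and payload can be sketched as follows. This is a minimal illustration, not semcode's actual code: the `CodeSymbol` dataclass here is a hypothetical subset of the fields the post names, and the header line, part ordering, and character cap in `embedding_text` are assumptions.

```python
from dataclasses import dataclass


@dataclass
class CodeSymbol:
    # Hypothetical subset of the fields named in the post.
    name: str
    symbol_type: str
    language: str
    service: str
    file_path: str
    signature: str
    docstring: str
    source: str
    start_line: int
    end_line: int


def embedding_text(sym: CodeSymbol, max_chars: int = 2000) -> str:
    """Text handed to the dense embedding model.

    The ordering and the cap are assumptions made for this sketch.
    """
    parts = [
        f"{sym.language} {sym.symbol_type} {sym.name}",
        sym.docstring,
        sym.signature,
        sym.source[:max_chars],  # cap the raw body
    ]
    return "\n".join(p for p in parts if p)


def payload(sym: CodeSymbol) -> dict:
    """Fields stored next to the vector; none of this dict is embedded."""
    return {
        "symbol_name": sym.name,
        "symbol_type": sym.symbol_type,
        "language": sym.language,
        "service": sym.service,
        "file_path": sym.file_path,
        "signature": sym.signature,
        "source": sym.source,
        "docstring": sym.docstring,
        "start_line": sym.start_line,
        "end_line": sym.end_line,
    }
```

Location fields such as `file_path` and `start_line` appear only in the payload, so moving a symbol within a file would not by itself change the text that gets embedded.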

How does **semcode** build the sparse input?
@@ -223,8 +223,9 @@ Payload is a JSON object with the following fields:

- **Identity & filtering** — `symbol_name`, `symbol_type`, `language`, `service`,
`file_path`, `package`, `parent_name`. These uniquely place the symbol in
-the repo, and three of them — `language`, `service`, `symbol_type` — are
-wired as active query-time filters.
+the repo. Only one of them — `service` — is wired as an active query-time
+filter on semantic search; the others are kept on the payload for display,
+scoped lookups (e.g. exact-name search), and future use.
- **Display** — `signature`, `source`, `docstring`, `start_line`, `end_line`,
`annotations`, `extras` (HTTP method, route, Spring stereotype). These are
what the MCP client renders back to the user — they are never filtered on,
@@ -241,7 +242,9 @@ symbol — then throw away the ones that don't match.

Payload indexes flip this order. **semcode** indexes six fields — `language`, `service`, `symbol_type`, `chunk_tier`,
`parent_name`, `file_path` — so Qdrant can narrow the candidate set *before* any vector math happens. The
-vector search then runs only over the matching symbols, not the whole collection.
+vector search then runs only over the matching symbols, not the whole collection. In practice the semantic search
+path only filters on `service`; the other indexes still pay off for direct symbol lookups and the incremental
+reindex flow, which scrolls the collection by `service` and `file_path`.
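
The filter-first ordering described here can be illustrated with a toy in-memory model. This is not how Qdrant implements payload indexes; the data, the `service_index` mapping, and the `search` helper are all hypothetical and only show the idea of narrowing candidates before scoring.

```python
# Toy collection: each point has an id, a payload field, and a vector.
points = [
    {"id": 1, "service": "billing", "vector": [1.0, 0.0]},
    {"id": 2, "service": "users", "vector": [0.9, 0.1]},
    {"id": 3, "service": "billing", "vector": [0.0, 1.0]},
]
by_id = {p["id"]: p for p in points}

# Toy "payload index": field value -> ids of the points carrying it.
service_index = {}
for p in points:
    service_index.setdefault(p["service"], set()).add(p["id"])


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def search(query, service=None, limit=10):
    """Narrow candidates via the index first, then score only those."""
    if service is None:
        candidates = points
    else:
        candidates = [by_id[i] for i in service_index.get(service, ())]
    ranked = sorted(candidates, key=lambda p: dot(query, p["vector"]), reverse=True)
    return [p["id"] for p in ranked[:limit]]
```

With a `service` given, similarity is computed only for the points listed under that value; without one, every point is scored.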

### A second, simpler collection

@@ -324,8 +327,8 @@ it requires rethinking every layer of the pipeline, from how you chunk (by symbo
to how you embed (rich context for dense vectors, exact tokens for sparse vectors) to how you store
(named vectors with a payload that carries as much signal as the vectors themselves). Hybrid
dense+sparse retrieval with server-side RRF bridges the gap between intent-based queries and exact identifier lookups,
-giving you both in a single round-trip. The payload is half the system: without language, service, and type fields
-indexed as filters, every search scans the entire collection regardless of how good the vectors are. And without
+giving you both in a single round-trip. The payload is half the system: without a `service` filter indexed on the
+payload, every search scans the entire collection regardless of how good the vectors are. And without
incremental indexing via blob SHAs, the embedding cost alone would make continuous reindexing impractical at any serious
repository scale. Together these choices form a pipeline that stays accurate, stays fast, and stays affordable as the
codebase grows.
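
For readers unfamiliar with reciprocal rank fusion, the scoring idea behind the RRF step can be sketched client-side in a few lines. This is an illustration only: in the pipeline described above the fusion happens inside Qdrant, and the constant `k=60` is the value commonly used in the RRF literature, assumed here rather than taken from Qdrant. The symbol names in the example lists are made up.

```python
def rrf_fuse(rankings, k=60, limit=10):
    """Fuse several ranked id lists with reciprocal rank fusion.

    Each id scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. Ties are broken by id for determinism.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda d: (-scores[d], d))[:limit]


# Hypothetical results from the dense and sparse branches.
dense_hits = ["UserRepository.save", "UserService.persist", "AuditLog.write"]
sparse_hits = ["save_user", "UserRepository.save"]
fused = rrf_fuse([dense_hits, sparse_hits])
```

An id that appears in both lists accumulates two terms, which is why a symbol matched by both intent and exact identifier tends to rise to the top.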
21 changes: 6 additions & 15 deletions server/store/qdrant.py
@@ -182,23 +182,15 @@ async def search(
         dense_vector: list[float],
         sparse_vector: SparseVector,
         limit: int = 10,
-        language: str | None = None,
         service: str | None = None,
-        symbol_type: str | None = None,
     ) -> list[ScoredPoint]:
-        must = []
-        if language:
-            must.append(
-                FieldCondition(key="language", match=MatchValue(value=language))
-            )
-        if service:
-            must.append(FieldCondition(key="service", match=MatchValue(value=service)))
-        if symbol_type:
-            must.append(
-                FieldCondition(key="symbol_type", match=MatchValue(value=symbol_type))
+        query_filter = (
+            Filter(
+                must=[FieldCondition(key="service", match=MatchValue(value=service))]
             )
-
-        query_filter = Filter(must=must) if must else None
+            if service
+            else None
+        )
 
         result = await self._client.query_points(
             collection_name=self._collection,
@@ -217,7 +209,6 @@ async def search(
),
],
query=FusionQuery(fusion=Fusion.RRF),
query_filter=query_filter,
limit=limit,
with_payload=True,
)
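
The `if service` guard in the new filter expression treats an empty string the same as `None`. A stdlib-only sketch of that behavior follows, using the dictionary shape of Qdrant's JSON filter syntax in place of the `qdrant_client` model classes. `build_service_filter` is a hypothetical helper name and is not part of this PR.

```python
from typing import Optional


def build_service_filter(service: Optional[str]) -> Optional[dict]:
    """Return a Qdrant-style filter dict, or None when no service is given.

    Mirrors the truthiness check in the diff: both None and "" mean
    "do not filter".
    """
    if not service:
        return None
    return {"must": [{"key": "service", "match": {"value": service}}]}
```

A caller that passes `service=""` therefore gets an unfiltered search rather than a search for points whose `service` is the empty string.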
7 changes: 0 additions & 7 deletions server/tools/search.py
@@ -19,19 +19,14 @@ def register_search_tools(mcp: FastMCP) -> None:
     @mcp.tool()
     async def search_code(
         query: str,
-        language: str | None = None,
         service: str | None = None,
-        symbol_type: str | None = None,
         limit: int = 10,
     ) -> str:
         """Semantically search code across indexed services using natural language.
 
         Args:
             query: Natural language description of what you're looking for.
-            language: Filter by language: java, python, typescript
             service: Filter by service name
-            symbol_type: Filter by type: class, method, interface, enum, record, function,
-                react_component, react_hook, type, pydantic_model
             limit: Maximum number of results (default 10)
         """
         embedder = get_embedding_provider()
@@ -44,9 +39,7 @@ async def search_code(
             dense_vector=dense_vector,
             sparse_vector=sparse_vector,
             limit=limit,
-            language=language,
             service=service,
-            symbol_type=symbol_type,
         )
 
         if not results: