Benchmark: Nomic-embed-text (Ollama) vs Default SentenceTransformer — Results & Migration Path #668

doobidoo (Maintainer) · 2026-04-08T10:51:03Z

Context

With external embedding API support merged (PR #386) and the embedding migration script available (#556), I benchmarked nomic-embed-text via Ollama against the default SentenceTransformer (all-MiniLM-L6-v2) on an M-series Mac.

Benchmark Results

Latency

| Metric | Nomic (Ollama local) | Anthropic/OpenAI API |
| --- | --- | --- |
| First call (warmup) | 596 ms | N/A |
| Single query (warm) | 16-23 ms | 200-400 ms |
| Batch (8 queries) | 11.6 ms/query | N/A |
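
For anyone who wants to reproduce these numbers on their own hardware, a minimal timing harness might look like the sketch below. `measure_latency` and the `embed` callable are hypothetical names, not part of the repo; pass in whatever function sends a single query to your embedding endpoint.

```python
import statistics
import time
from typing import Callable, Sequence


def measure_latency(embed: Callable[[str], object], queries: Sequence[str]) -> dict:
    """Time one cold call, then one warm call per query, in milliseconds.

    `embed` is any callable that embeds a single string (hypothetical; for
    example a thin wrapper around an HTTP request to Ollama).
    """
    if not queries:
        raise ValueError("need at least one query")

    # The first call usually includes model load, so report it separately.
    start = time.perf_counter()
    embed(queries[0])
    warmup_ms = (time.perf_counter() - start) * 1000.0

    warm_ms = []
    for query in queries:
        start = time.perf_counter()
        embed(query)
        warm_ms.append((time.perf_counter() - start) * 1000.0)

    return {
        "warmup_ms": warmup_ms,
        "warm_min_ms": min(warm_ms),
        "warm_median_ms": statistics.median(warm_ms),
        "warm_max_ms": max(warm_ms),
    }
```

Reporting the warmup call separately matters because the first request pays for model loading, which is why the table lists it on its own row.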

Embedding Dimensions

Nomic-embed-text: 768
all-MiniLM-L6-v2: 384
⚠️ Dimension mismatch requires full re-embedding via scripts/maintenance/migrate_embeddings.py
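
As a rough illustration of why the mismatch matters, the sketch below rejects a vector whose length does not match the index before it is written. The function names are hypothetical and are not taken from the repo or from migrate_embeddings.py.

```python
from typing import Sequence

# Dimensions quoted in this post.
MINILM_DIM = 384
NOMIC_DIM = 768


def needs_reembedding(stored_dim: int, model_dim: int) -> bool:
    """A dimension change means every stored vector must be regenerated."""
    return stored_dim != model_dim


def check_vector(vector: Sequence[float], expected_dim: int) -> None:
    """Raise early instead of letting a wrong-sized vector reach the index."""
    if len(vector) != expected_dim:
        raise ValueError(
            f"embedding has {len(vector)} dimensions, index expects {expected_dim}"
        )
```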

Similarity Quality

| Test | Score | Pass? |
| --- | --- | --- |
| "AI agent skill evolution" ↔ "Self-evolving skills" | 0.867 | ✓ |
| "Claude Code performance optimization" ↔ "Making Claude Code faster with bare flag" | 0.574 | Borderline |
| "Vector database for memory storage" ↔ "Embedding search in sqlite-vec" | 0.539 | Borderline |
| "AI agent skill evolution" ↔ "Cloud video transcoding" (dissimilar) | 0.385 | ✓ |
| "Memory consolidation" ↔ "Cold email outreach" (dissimilar) | 0.370 | ✓ |

Observation: Nomic is strong on broad semantic similarity but weaker on domain-specific technical terms. The BM25 hybrid search (enabled by default, 0.3/0.7 weights) effectively compensates by catching exact keyword matches.
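
The scores in the table are assumed here to be cosine similarities between the two embedded strings; the post does not state the metric for this table explicitly. Under that assumption, a dependency-free sketch of the computation is:

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)
```

Feeding it the two vectors returned by the embedding endpoint for each pair should give numbers comparable to the table, model and version permitting.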

Cost

Nomic local: $0.00/month
API embeddings: ~$0.60/month at 1000 queries/day
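
For reference, the implied per-query API price can be backed out of that estimate. The 30-day month is an assumption; the other two numbers come from the line above.

```python
QUERIES_PER_DAY = 1000
DAYS_PER_MONTH = 30  # assumption: a 30-day month
MONTHLY_COST_USD = 0.60

queries_per_month = QUERIES_PER_DAY * DAYS_PER_MONTH
cost_per_1k_queries = MONTHLY_COST_USD / queries_per_month * 1000

print(f"{queries_per_month} queries/month, ~${cost_per_1k_queries:.2f} per 1,000 queries")
```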

Configuration

# Already supported via PR #386
export MCP_EXTERNAL_EMBEDDING_URL=http://localhost:11434/v1/embeddings
export MCP_EXTERNAL_EMBEDDING_MODEL=nomic-embed-text

# Migration from 384 to 768 dimensions
python scripts/maintenance/migrate_embeddings.py
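
Before running the migration, it can be worth checking that the endpoint answers at all. The sketch below is a standalone smoke test, not the code path the service uses. It assumes the endpoint follows the OpenAI-style embeddings shape (a `model` and `input` in the request, `data[0].embedding` in the response), and the helper names are hypothetical.

```python
import json
import os
import urllib.request

DEFAULT_URL = "http://localhost:11434/v1/embeddings"
DEFAULT_MODEL = "nomic-embed-text"


def build_request(text: str) -> urllib.request.Request:
    """Build a POST request from the same environment variables as above."""
    url = os.environ.get("MCP_EXTERNAL_EMBEDDING_URL", DEFAULT_URL)
    model = os.environ.get("MCP_EXTERNAL_EMBEDDING_MODEL", DEFAULT_MODEL)
    body = json.dumps({"model": model, "input": text}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def parse_embedding(response_body: bytes) -> list:
    """Pull the first vector out of an OpenAI-style embeddings response."""
    payload = json.loads(response_body)
    return payload["data"][0]["embedding"]


def embed(text: str, timeout: float = 10.0) -> list:
    """Send one string to the endpoint and return its embedding."""
    with urllib.request.urlopen(build_request(text), timeout=timeout) as resp:
        return parse_embedding(resp.read())
```

With Ollama running, `len(embed("hello"))` should report 768 for nomic-embed-text, matching the dimension listed earlier.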

Recommendation

Nomic-embed-text is a viable local alternative — faster than API calls, zero cost, decent quality. The dimension mismatch (768 vs 384) means migration is required. Best time to switch: during a major version upgrade or when re-embedding is needed anyway.

The already-active BM25 hybrid search compensates for embedding model weaknesses on exact keyword matches regardless of which model you use.
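
To make the compensation effect concrete, here is a minimal weighted-sum sketch. It is not the repo's actual fusion code: the function name is hypothetical, both inputs are assumed to be normalised to [0, 1], and reading the 0.3/0.7 weights as keyword/semantic is an assumption, since the post does not say which weight belongs to which signal.

```python
def hybrid_score(
    keyword_score: float,
    semantic_score: float,
    keyword_weight: float = 0.3,   # assumption: 0.3 applies to BM25
    semantic_weight: float = 0.7,  # assumption: 0.7 applies to the embedding score
) -> float:
    """Weighted sum of two scores that are already normalised to [0, 1]."""
    return keyword_weight * keyword_score + semantic_weight * semantic_score
```

Under those assumptions, the borderline pair at 0.539 combined with a perfect keyword match lands at about 0.68, while the dissimilar pair at 0.385 with no keyword overlap drops to about 0.27, which widens the gap between them.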

Questions for the Community

Has anyone run similar benchmarks with other local models (e.g., bge-base-en-v1.5, e5-small)?
Would a benchmark script in the repo be useful for users evaluating embedding alternatives?
Is there interest in documenting a recommended Ollama + Nomic setup in the wiki?

doobidoo (Maintainer, Author) · 2026-04-08T10:59:26Z

⚠️ Important: Hybrid/Cloudflare Backend Incompatibility

After further investigation, Nomic-embed-text (and any external embedding API) is NOT compatible with the hybrid or cloudflare storage backends.

Root Cause

The Cloudflare backend hardcodes @cf/baai/bge-base-en-v1.5 (768-dim) via Workers AI. There is no way to inject an external embedding model into the Cloudflare path. The code in sqlite_vec.py explicitly detects hybrid/cloudflare backends and disables external embedding API support with a warning:

if storage_backend in ("hybrid", "cloudflare"):
    logger.warning("External embedding API not supported with hybrid/cloudflare backend...")
    external_api_url = None  # Disable external API
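
The same guard, restated as a self-contained function so the behaviour can be tried in isolation. This is a sketch rather than the code in sqlite_vec.py: the function and constant names are hypothetical, and the warning text is illustrative because the real message is truncated in the excerpt above.

```python
import logging
from typing import Optional

logger = logging.getLogger(__name__)

# Backends that generate embeddings themselves, per the excerpt above.
BACKENDS_WITHOUT_EXTERNAL_EMBEDDINGS = ("hybrid", "cloudflare")


def resolve_external_api_url(
    storage_backend: str, external_api_url: Optional[str]
) -> Optional[str]:
    """Return the URL to use, or None when the backend cannot use one."""
    if external_api_url and storage_backend in BACKENDS_WITHOUT_EXTERNAL_EMBEDDINGS:
        # Illustrative wording only; not the repo's actual log message.
        logger.warning(
            "External embedding API ignored for the %s backend", storage_backend
        )
        return None
    return external_api_url
```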

What This Means

| Backend | External Embedding (Nomic/Ollama) | Embedding Source |
| --- | --- | --- |
| `sqlite_vec` | ✅ Works | Ollama / vLLM / TEI |
| `cloudflare` | ❌ Blocked | Workers AI (`bge-base-en-v1.5`) |
| `hybrid` | ❌ Blocked | Local ONNX/SentenceTransformer + Workers AI |

The Blocker Is Architectural, Not Dimensional

Interestingly, the dimensions actually match: both Nomic-embed-text (768) and Cloudflare Workers AI bge-base-en-v1.5 (768) produce 768-dimensional vectors. This means that if the Cloudflare backend were refactored to accept external embeddings, Nomic vectors would be dimensionally compatible with the existing Vectorize index.

The real blocker is purely architectural: the Cloudflare Worker code calls Workers AI directly for embeddings, and there is no hook to substitute an external embedding source.

Note: The dimension mismatch only exists between the default local ONNX model (all-MiniLM-L6-v2, 384-dim) and Nomic (768-dim) — relevant when switching models on the sqlite_vec backend (requires re-embedding all memories).

Recommendation

For users on the hybrid backend (recommended for production), Nomic-embed-text is not a viable option today. The path forward would be:

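Option A: Refactor the Cloudflare backend to accept external embeddings via API. The matching 768 dimensions mean the existing Vectorize index configuration could be reused, but vectors already produced by bge-base-en-v1.5 would still need re-embedding, because embeddings from different models are not comparable even at the same dimension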
Option B: Use Nomic only in sqlite_vec mode and accept no cloud sync
Option C: Wait for Cloudflare to support custom embedding models in Workers AI

Option A is more feasible than initially thought, precisely because the dimensions already align. The refactor would primarily involve routing embedding generation through an external API instead of Workers AI, while keeping the Vectorize storage layer unchanged.

This should be documented more prominently in the external embeddings guide.

kinthaiofficial · 2026-04-29T00:06:25Z

Useful benchmark! A few additional dimensions worth testing when choosing embeddings for agent memory:

Temporal query performance: Agent memory queries are often temporal ("what did we discuss about X recently?" vs "what do I know about X at all?"). Test whether the embedding model preserves temporal distance — embeddings of similar content from different time periods should be distinguishable when combined with a timestamp field.

Update latency matters more than read latency for active agents: In a multi-agent system with 200+ agents each updating shared memory, write throughput is the bottleneck, not read speed. Benchmark: how many concurrent embedding + write operations can you sustain before p99 latency exceeds 50ms?

Cross-agent coherency: If Agent A and Agent B both embed the same factual statement slightly differently phrased, do their embeddings cluster together (good) or diverge (bad)? Test with paraphrase pairs from your actual agent outputs.

Dimensionality vs. agent memory size: Higher-dimensional embeddings give better accuracy but cost more storage and memory. For large agent memory stores (100K+ entries), the difference between 512-d and 1536-d embeddings is significant in both cost and retrieval speed.

For our three-tier memory system (hot/warm/cold), we use different embedding strategies per tier — full-precision on hot (few entries), quantized on cold (many entries). This balances accuracy against storage cost.
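
One common way to get a smaller cold tier is symmetric int8 scalar quantisation, sketched below. This is a generic illustration, not the scheme described in the comment above and not something this project ships; the function names are hypothetical.

```python
from typing import List, Sequence, Tuple


def quantize_int8(vector: Sequence[float]) -> Tuple[List[int], float]:
    """Map floats to integers in [-127, 127] plus one scale per vector."""
    max_abs = max((abs(x) for x in vector), default=0.0)
    if max_abs == 0.0:
        return [0] * len(vector), 1.0
    scale = max_abs / 127.0
    return [round(x / scale) for x in vector], scale


def dequantize_int8(codes: Sequence[int], scale: float) -> List[float]:
    """Approximate reconstruction of the original floats."""
    return [c * scale for c in codes]
```

Storing one byte per dimension instead of a four-byte float32 cuts raw vector storage by about 4x, at the cost of a rounding error of up to half the scale per component.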

Persistent memory architecture: https://blog.kinthai.ai/why-character-ai-forgets-you-persistent-memory-architecture

doobidoo (Maintainer, Author) · 2026-05-15T05:39:45Z

Great additions @kinthaiofficial — let me go through each one against the current architecture:

1. Temporal query performance

The service stores created_at and updated_at as top-level fields but does not currently fuse them into the embedding vector itself — retrieval is purely semantic similarity ranked by cosine distance. Temporal recency is a separate signal applied via quality scoring (implicit_signals.py: access count + recency decay), but it does not influence which memories are retrieved, only how they are ranked post-retrieval. Your point is valid: two semantically near-identical memories from different time periods will receive nearly identical embedding scores. The workaround today is filtering via memory_search(start_date=..., end_date=...) before semantic ranking — not ideal for "what did we discuss recently about X?" queries. This is worth a dedicated improvement.
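
One possible shape for such an improvement is to blend a recency weight into the ranking score, as sketched below. None of this is existing service code and it does not reflect implicit_signals.py: the `Memory` class, both function names, the 30-day half-life and the 0.3 recency share are all illustrative assumptions.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import List, Sequence


@dataclass
class Memory:
    content: str
    created_at: datetime
    similarity: float  # cosine similarity to the query, computed elsewhere


def recency_weight(
    created_at: datetime, now: datetime, half_life_days: float = 30.0
) -> float:
    """Exponential decay: 1.0 for a new memory, 0.5 after one half-life."""
    age_days = max((now - created_at).total_seconds(), 0.0) / 86400.0
    return 0.5 ** (age_days / half_life_days)


def rank_with_recency(
    memories: Sequence[Memory],
    now: datetime,
    recency_share: float = 0.3,
    half_life_days: float = 30.0,
) -> List[Memory]:
    """Sort by a blend of semantic similarity and recency, best first."""

    def score(memory: Memory) -> float:
        recency = recency_weight(memory.created_at, now, half_life_days)
        return (1.0 - recency_share) * memory.similarity + recency_share * recency

    return sorted(memories, key=score, reverse=True)
```

With equal similarity scores, the newer memory ranks first, which is the behaviour a "what did we discuss recently about X?" query needs.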

2. Write throughput for multi-agent systems

The original benchmark covered read latency; you are right that for multi-agent write-heavy workloads, write throughput is the real constraint. Current architecture: SQLite in WAL mode (journal_mode=WAL) allows concurrent reads alongside a single writer. Under sustained concurrent writes, SQLite serialises them — p99 will climb with writer count. For 200+ agents writing simultaneously, the Hybrid backend offloads writes to a background sync queue (local SQLite write → Cloudflare async), which helps. The Milvus backend (v10.42+) is the more appropriate choice for that concurrency profile — it handles concurrent writes natively. We have benchmark_hybrid_sync.py in scripts/benchmarks/ but no multi-writer concurrency benchmark yet; that would be a useful addition.
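
A multi-writer benchmark along those lines could start from something like the sketch below. It uses only the standard library, writes to a throwaway table rather than the real schema, and leaves embedding time out entirely, so it measures SQLite write contention alone. All names here are hypothetical and unrelated to benchmark_hybrid_sync.py.

```python
import os
import sqlite3
import tempfile
import threading
import time
from typing import List


def run_writer(
    db_path: str, writes: int, latencies_ms: List[float], lock: threading.Lock
) -> None:
    """One simulated agent: its own connection, one transaction per write."""
    conn = sqlite3.connect(db_path, timeout=30.0)  # wait on a busy database
    try:
        local = []
        for i in range(writes):
            start = time.perf_counter()
            with conn:  # commits on success
                conn.execute(
                    "INSERT INTO memories (content) VALUES (?)", (f"memory {i}",)
                )
            local.append((time.perf_counter() - start) * 1000.0)
        with lock:
            latencies_ms.extend(local)
    finally:
        conn.close()


def benchmark_writers(writers: int, writes_per_writer: int) -> dict:
    """Run concurrent writers against a WAL-mode database and report p99."""
    if writers < 1 or writes_per_writer < 1:
        raise ValueError("writers and writes_per_writer must be at least 1")

    with tempfile.TemporaryDirectory() as tmp:
        db_path = os.path.join(tmp, "bench.db")
        setup = sqlite3.connect(db_path)
        setup.execute("PRAGMA journal_mode=WAL")
        setup.execute("CREATE TABLE memories (id INTEGER PRIMARY KEY, content TEXT)")
        setup.commit()
        setup.close()

        latencies_ms: List[float] = []
        lock = threading.Lock()
        threads = [
            threading.Thread(
                target=run_writer,
                args=(db_path, writes_per_writer, latencies_ms, lock),
            )
            for _ in range(writers)
        ]
        for thread in threads:
            thread.start()
        for thread in threads:
            thread.join()

        check = sqlite3.connect(db_path)
        rows = check.execute("SELECT COUNT(*) FROM memories").fetchone()[0]
        check.close()

    latencies_ms.sort()
    p99_index = min(len(latencies_ms) - 1, int(0.99 * len(latencies_ms)))
    return {"rows": rows, "p99_ms": latencies_ms[p99_index]}
```

Sweeping `writers` upward, for example 1, 8, 32 and 200, and plotting `p99_ms` would show where serialised writes start to dominate.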

3. Cross-agent coherency

This is actually well-handled by design: semantic similarity search clusters paraphrase pairs together regardless of phrasing, so Agent A and Agent B embedding the same fact differently will both surface on retrieval. The practical risk is the inverse — two semantically similar but factually distinct memories (same topic, different conclusion) being retrieved interchangeably. That is exactly what Phase 4 (Temporal Contradiction Detection, currently in design) aims to address.

4. Dimensionality vs. corpus size

Agreed on the storage/speed tradeoff. The default 384-dim (all-MiniLM-L6-v2) was chosen for this reason — good enough accuracy at minimal footprint. At 100K+ memories, switching to 768-dim roughly doubles both the index size and KNN scan time. External embedding API support (PR #386) lets users substitute any model, but re-embedding an existing corpus requires scripts/maintenance/migrate_embeddings.py — dimension changes are not zero-downtime.
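
The doubling is easy to check for the raw vectors. The arithmetic below assumes float32 storage (4 bytes per component) and ignores index overhead and metadata, so treat the results as a lower bound.

```python
BYTES_PER_FLOAT32 = 4  # assumption: vectors are stored as float32


def raw_vector_bytes(num_memories: int, dim: int) -> int:
    """Bytes needed for the vectors alone, without index or metadata."""
    return num_memories * dim * BYTES_PER_FLOAT32


for dim in (384, 768):
    megabytes = raw_vector_bytes(100_000, dim) / 1_000_000
    print(f"{dim}-dim: {megabytes:.1f} MB")
# prints:
# 384-dim: 153.6 MB
# 768-dim: 307.2 MB
```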

Your hot/warm/cold tier approach with quantized embeddings on cold storage is a compelling pattern that we have not formally supported. The architecture does not currently differentiate embedding precision by tier — worth a feature discussion. The blog post you linked does a good job framing why this matters at scale.
