Replies: 3 comments
| Backend | External Embedding (Nomic/Ollama) | Embedding Source |
|---|---|---|
| sqlite_vec | ✅ Works | Ollama / vLLM / TEI |
| cloudflare | ❌ Blocked | Workers AI (bge-base-en-v1.5) |
| hybrid | ❌ Blocked | Local ONNX/SentenceTransformer + Workers AI |
The Blocker Is Architectural, Not Dimensional
The dimensions match: Nomic-embed-text and Cloudflare Workers AI's bge-base-en-v1.5 both produce 768-dimensional vectors. If the Cloudflare backend were refactored to accept external embeddings, Nomic vectors would therefore be dimensionally compatible with the existing Vectorize index.
The real blocker is purely architectural: the Cloudflare Worker code calls Workers AI directly for embeddings, and there is no hook to substitute an external embedding source.
Note: The dimension mismatch only exists between the default local ONNX model (all-MiniLM-L6-v2, 384-dim) and Nomic (768-dim) — relevant when switching models on the sqlite_vec backend (requires re-embedding all memories).
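The dimension relationships above can be expressed as a small guard. This is a minimal sketch, not code from the project: the `MODEL_DIMS` table only restates the dimensions quoted in this thread, and the helper names `dimensions_match` and `validate_vector` are hypothetical.

```python
from typing import Sequence

# Output dimensions as quoted in this discussion (not read from any library).
MODEL_DIMS = {
    "nomic-embed-text": 768,
    "bge-base-en-v1.5": 768,
    "all-MiniLM-L6-v2": 384,
}


def dimensions_match(model_a: str, model_b: str) -> bool:
    """Return True when both models produce vectors of the same length."""
    return MODEL_DIMS[model_a] == MODEL_DIMS[model_b]


def validate_vector(vector: Sequence[float], index_dim: int) -> None:
    """Raise ValueError if a vector cannot be stored in an index of index_dim."""
    if len(vector) != index_dim:
        raise ValueError(
            f"vector has {len(vector)} dimensions, index expects {index_dim}"
        )


print(dimensions_match("nomic-embed-text", "bge-base-en-v1.5"))  # same length
print(dimensions_match("all-MiniLM-L6-v2", "nomic-embed-text"))  # different length
```

A matching length only means the vectors fit the same index. Vectors produced by different models are still not comparable with each other, so switching models implies re-embedding existing memories either way.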
Recommendation
For users on the hybrid backend (recommended for production), Nomic-embed-text is not a viable option today. The path forward would be:
- Option A: Refactor the Cloudflare backend to accept external embeddings via API. Because both models produce 768 dimensions, Nomic vectors would fit the existing Vectorize index without recreating it, although memories already embedded with bge-base-en-v1.5 would still need re-embedding, since vectors from different models are not comparable
- Option B: Use Nomic only in sqlite_vec mode and accept no cloud sync
- Option C: Wait for Cloudflare to support custom embedding models in Workers AI
Option A is more feasible than initially thought, precisely because the dimensions already align. The refactor would primarily involve routing embedding generation through an external API instead of Workers AI, while keeping the Vectorize storage layer unchanged.
This should be documented more prominently in the external embeddings guide.
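To make Option A more concrete, here is a rough sketch of what "routing embedding generation through an external API" could look like. Everything in it is hypothetical: `EmbeddingProvider`, `CallableProvider`, and `InMemoryVectorStore` are illustrative names, and the store is only a stand-in for the Vectorize layer, not the real Cloudflare API or the project's backend code.

```python
from typing import Callable, Dict, List, Protocol, Sequence


class EmbeddingProvider(Protocol):
    """Hypothetical interface: anything that turns text into a fixed-size vector."""

    dimension: int

    def embed(self, text: str) -> List[float]: ...


class CallableProvider:
    """Wraps any text -> vector function, e.g. an HTTP call to Ollama."""

    def __init__(self, fn: Callable[[str], Sequence[float]], dimension: int) -> None:
        self._fn = fn
        self.dimension = dimension

    def embed(self, text: str) -> List[float]:
        vector = [float(x) for x in self._fn(text)]
        if len(vector) != self.dimension:
            raise ValueError(
                f"provider returned {len(vector)} dims, expected {self.dimension}"
            )
        return vector


class InMemoryVectorStore:
    """Stand-in for the storage layer; it never generates embeddings itself."""

    def __init__(self, provider: EmbeddingProvider, index_dimension: int = 768) -> None:
        # Reject a provider whose output would not fit the existing index.
        if provider.dimension != index_dimension:
            raise ValueError(
                f"provider dimension {provider.dimension} != index {index_dimension}"
            )
        self._provider = provider
        self.rows: Dict[str, List[float]] = {}

    def store(self, memory_id: str, text: str) -> None:
        self.rows[memory_id] = self._provider.embed(text)
```

The design point is that the storage layer only sees vectors of a known dimension, so swapping Workers AI for Nomic becomes a change of provider rather than a change to the index.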
Useful benchmark! A few additional dimensions worth testing when choosing embeddings for agent memory:

Temporal query performance: Agent memory queries are often temporal ("what did we discuss about X recently?" vs "what do I know about X at all?"). Test whether the embedding model preserves temporal distance: embeddings of similar content from different time periods should be distinguishable when combined with a timestamp field.

Update latency matters more than read latency for active agents: In a multi-agent system with 200+ agents each updating shared memory, write throughput is the bottleneck, not read speed. Benchmark: how many concurrent embedding + write operations can you sustain before p99 latency exceeds 50ms?

Cross-agent coherency: If Agent A and Agent B both embed the same factual statement slightly differently phrased, do their embeddings cluster together (good) or diverge (bad)? Test with paraphrase pairs from your actual agent outputs.

Dimensionality vs. agent memory size: Higher-dimensional embeddings give better accuracy but cost more storage and memory. For large agent memory stores (100K+ entries), the difference between 512-d and 1536-d embeddings is significant in both cost and retrieval speed.

For our three-tier memory system (hot/warm/cold), we use different embedding strategies per tier: full-precision on hot (few entries), quantized on cold (many entries). This balances accuracy against storage cost.

Persistent memory architecture: https://blog.kinthai.ai/why-character-ai-forgets-you-persistent-memory-architecture
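The "quantized on cold" idea in this comment can be illustrated with symmetric int8 scalar quantization. This is one common scheme chosen purely for illustration; the comment does not say which quantization method is actually used, and the function names are hypothetical.

```python
from typing import List, Sequence, Tuple


def quantize_int8(vector: Sequence[float]) -> Tuple[List[int], float]:
    """Map floats to integers in [-127, 127] plus one shared scale factor."""
    max_abs = max((abs(x) for x in vector), default=0.0)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    # Clamp defensively in case of floating-point rounding at the extremes.
    codes = [max(-127, min(127, round(x / scale))) for x in vector]
    return codes, scale


def dequantize_int8(codes: Sequence[int], scale: float) -> List[float]:
    """Approximate reconstruction of the original floats."""
    return [c * scale for c in codes]


codes, scale = quantize_int8([0.5, -0.25, 0.125, 0.0])
restored = dequantize_int8(codes, scale)
```

Each dimension then needs one byte instead of the four used by float32, plus a single scale per vector, at the cost of a per-value reconstruction error of about half the scale.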
Great additions @kinthaiofficial. Let me go through each one against the current architecture:

1. Temporal query performance
The service stores

2. Write throughput for multi-agent systems
The original benchmark covered read latency; you are right that for multi-agent write-heavy workloads, write throughput is the real constraint. Current architecture: SQLite in WAL mode (

3. Cross-agent coherency
This is actually well-handled by design: semantic similarity search clusters paraphrase pairs together regardless of phrasing, so Agent A and Agent B embedding the same fact differently will both surface on retrieval. The practical risk is the inverse: two semantically similar but factually distinct memories (same topic, different conclusion) being retrieved interchangeably. That is exactly what Phase 4 (Temporal Contradiction Detection, currently in design) aims to address.

4. Dimensionality vs. corpus size
Agreed on the storage/speed tradeoff. The default 384-dim (all-MiniLM-L6-v2) was chosen for this reason: good enough accuracy at minimal footprint. At 100K+ memories, switching to 768-dim roughly doubles both the index size and KNN scan time. External embedding API support (PR #386) lets users substitute any model, but re-embedding an existing corpus requires

Your hot/warm/cold tier approach with quantized embeddings on cold storage is a compelling pattern that we have not formally supported. The architecture does not currently differentiate embedding precision by tier; worth a feature discussion. The blog post you linked does a good job framing why this matters at scale.
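The "roughly doubles" estimate in point 4 follows from simple arithmetic on raw vector storage. The sketch below assumes float32 values (4 bytes each) and ignores any index or metadata overhead, so the figures are a lower bound rather than a measurement of the actual database; `raw_vector_bytes` is a hypothetical helper.

```python
def raw_vector_bytes(num_vectors: int, dim: int, bytes_per_value: int = 4) -> int:
    """Bytes needed for the raw vectors alone (float32 by default)."""
    return num_vectors * dim * bytes_per_value


small = raw_vector_bytes(100_000, 384)  # all-MiniLM-L6-v2
large = raw_vector_bytes(100_000, 768)  # nomic-embed-text

print(f"384-dim: {small / 1e6:.1f} MB")  # 384-dim: 153.6 MB
print(f"768-dim: {large / 1e6:.1f} MB")  # 768-dim: 307.2 MB
print(f"ratio: {large / small:.1f}x")    # ratio: 2.0x
```

A brute-force KNN scan touches every value of every stored vector, so its cost scales with the same product of corpus size and dimension.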
Context
With external embedding API support merged (PR #386) and the embedding migration script available (#556), I benchmarked nomic-embed-text via Ollama against the default SentenceTransformer (all-MiniLM-L6-v2) on an M-series Mac.
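For anyone who wants to run a similar comparison, a minimal latency harness might look like the sketch below. It is not the script used for the results in this post. It assumes a local Ollama server on its default port (11434) with `nomic-embed-text` already pulled, and the names `ollama_embed` and `benchmark` are illustrative.

```python
import json
import statistics
import time
import urllib.request
from typing import Callable, Dict, List, Sequence


def ollama_embed(
    text: str,
    model: str = "nomic-embed-text",
    url: str = "http://localhost:11434/api/embeddings",  # assumed local default
) -> List[float]:
    """Request one embedding from a local Ollama server."""
    payload = json.dumps({"model": model, "prompt": text}).encode("utf-8")
    request = urllib.request.Request(
        url, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())["embedding"]


def benchmark(
    embed: Callable[[str], Sequence[float]],
    texts: Sequence[str],
    warmup: int = 1,
) -> Dict[str, float]:
    """Time embed() per text and return latency statistics in milliseconds."""
    for text in texts[:warmup]:
        embed(text)  # warm-up calls are not measured (model load, caches)
    samples = []
    for text in texts:
        start = time.perf_counter()
        embed(text)
        samples.append((time.perf_counter() - start) * 1000.0)
    return {
        "mean_ms": statistics.mean(samples),
        "median_ms": statistics.median(samples),
        "max_ms": max(samples),
    }
```

Calling `benchmark(ollama_embed, texts)` and then `benchmark(model.encode, texts)` with a loaded SentenceTransformer model would give comparable numbers for both sides, provided the same texts are used.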
Benchmark Results
Latency
Embedding Dimensions
scripts/maintenance/migrate_embeddings.py
Similarity Quality
Observation: Nomic is strong on broad semantic similarity but weaker on domain-specific technical terms. The BM25 hybrid search (enabled by default, 0.3/0.7 weights) effectively compensates by catching exact keyword matches.
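The compensation effect can be sketched as a weighted score fusion. This is an illustration, not the service's actual ranking code. It assumes that 0.3 is the keyword (BM25) weight and 0.7 the semantic weight, which the post does not state, and that raw BM25 scores are min-max normalized before mixing. The function names and sample scores are made up.

```python
from typing import Dict, List, Tuple


def min_max_normalize(scores: Dict[str, float]) -> Dict[str, float]:
    """Scale raw scores into [0, 1]; all zeros if every score is equal."""
    low, high = min(scores.values()), max(scores.values())
    if high == low:
        return {doc: 0.0 for doc in scores}
    return {doc: (value - low) / (high - low) for doc, value in scores.items()}


def hybrid_rank(
    bm25: Dict[str, float],
    semantic: Dict[str, float],
    keyword_weight: float = 0.3,   # assumption: BM25 share
    semantic_weight: float = 0.7,  # assumption: embedding share
) -> List[Tuple[str, float]]:
    """Combine normalized BM25 and semantic scores, best match first."""
    keyword = min_max_normalize(bm25)
    combined = {
        doc: keyword_weight * keyword[doc] + semantic_weight * semantic[doc]
        for doc in bm25
    }
    return sorted(combined.items(), key=lambda item: item[1], reverse=True)


ranked = hybrid_rank(
    bm25={"doc_a": 12.0, "doc_b": 2.0, "doc_c": 0.0},
    semantic={"doc_a": 0.40, "doc_b": 0.75, "doc_c": 0.60},
)
```

In this made-up example `doc_a` has the weakest semantic score, but its exact keyword match lifts it to the top of the ranking, which is the behavior described above for domain-specific technical terms.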
Cost
Configuration
Recommendation
Nomic-embed-text is a viable local alternative — faster than API calls, zero cost, decent quality. The dimension mismatch (768 vs 384) means migration is required. Best time to switch: during a major version upgrade or when re-embedding is needed anyway.
The already-active BM25 hybrid search compensates for embedding model weaknesses on exact keyword matches regardless of which model you use.
Questions for the Community