The RAG architect's blind spot: treating vector databases as black boxes

We're making the mistake we made with databases in the 1990s. And it's going to cost someone a production outage.

Vector databases are solid technology. But I keep seeing teams deploy them as if they were a math operation rather than infrastructure, exempt from the design work every other system needs. You wouldn't ship a relational database, never look at a query plan, and assume it works at scale. Yet that's exactly what's happening with vector search right now.

The failure mode is straightforward. HNSW—Hierarchical Navigable Small World, the default in most vector databases—works well at 100K documents. Retrieval is fast and accurate. Teams ship it. At 1M documents, recall starts dropping. By 10M documents, you're pulling documents that have nothing to do with the query. Your LLM builds an answer on garbage context and sounds confident doing it.

The dangerous part: your monitoring doesn't catch it. Latency stays green. No errors. No exceptions. The system looks healthy. It's just returning wrong results.

I watched a team spend three weeks on this. Their RAG system kept hallucinating. They had excellent instrumentation around latency, throughput, cost. Zero visibility into whether retrieval was working. When they finally measured it—using an LLM judge against ground truth—recall had fallen from 0.82 to 0.54 over six months. Same latency. Half the relevance.
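The check doesn't need an LLM judge on day one. A small hand-labeled evaluation set gets you most of the signal. Here's a minimal sketch of that kind of probe; the labeled pairs and the search function are stand-ins for whatever your pipeline actually calls.

def recall_at_k(search, labeled_queries, k=10):
    # labeled_queries: list of (query_text, set_of_relevant_doc_ids),
    # maintained by hand. 50 to 200 pairs is enough to see a trend.
    # `search` is a placeholder for your retrieval call; here it is
    # assumed to return an iterable of doc ids.
    scores = []
    for query, relevant in labeled_queries:
        retrieved = set(search(query, top_k=k))
        scores.append(len(retrieved & relevant) / len(relevant))
    return sum(scores) / len(scores)

Run it on a schedule next to the latency dashboards and alert on drops. A slide from 0.82 to 0.54 shows up in weeks instead of six months.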

This isn't a vector database problem. It's an architecture problem. Teams skipped the design work that made relational databases functional.

When relational databases became production infrastructure, teams learned to index strategically, monitor index health, understand query patterns. You measured whether your indexes worked. Those practices vanished when embeddings showed up. Suddenly vector search felt like math, not infrastructure.

HNSW degrades because the embedding space gets denser as data grows, and the hierarchical graph that worked when documents were spread out becomes less effective at steering the search toward the true nearest neighbors. You can raise ef_search so each query evaluates more candidates, but latency grows with it: ef=160 takes roughly 3x longer than ef=40. So teams don't tune. They accept the degraded recall.
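You can see the tradeoff directly. Here's a sketch using hnswlib and synthetic vectors, measuring recall against exact brute-force results at a few ef values. The exact numbers depend on your data and hardware, but the shape of the curve doesn't.

import time
import numpy as np
import hnswlib

dim, n, n_queries, k = 384, 100_000, 200, 10
rng = np.random.default_rng(0)
data = rng.standard_normal((n, dim)).astype(np.float32)
queries = rng.standard_normal((n_queries, dim)).astype(np.float32)

# Exact ground truth by cosine similarity, for measuring recall.
data_n = data / np.linalg.norm(data, axis=1, keepdims=True)
q_n = queries / np.linalg.norm(queries, axis=1, keepdims=True)
true_top = np.argsort(-(q_n @ data_n.T), axis=1)[:, :k]

index = hnswlib.Index(space="cosine", dim=dim)
index.init_index(max_elements=n, ef_construction=200, M=16)
index.add_items(data)

for ef in (40, 80, 160):
    index.set_ef(ef)  # candidates evaluated per query
    start = time.perf_counter()
    labels, _ = index.knn_query(queries, k=k)
    ms = (time.perf_counter() - start) * 1000 / n_queries
    recall = np.mean([len(set(l) & set(t)) / k
                      for l, t in zip(labels, true_top)])
    print(f"ef={ef:>3}  recall@{k}={recall:.3f}  latency={ms:.2f} ms/query")

Rerun it as n grows and watch recall at a fixed ef drift down. That's the degradation latency dashboards never show.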

Relational databases had the same problem. When a single index got too slow, you didn't just make it bigger. You partitioned data. You added metadata filters. You used multiple index types for different queries.

RAG at scale needs the same approach. The teams shipping reliable RAG are not using pure vector similarity. They narrow the search space with metadata filters first—document type, domain, date range. Then vector search runs on a smaller set. Some are combining keyword search with semantic search. Partitioning vectors by domain. Treating the database like infrastructure that needs design.
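As a sketch of what that looks like in practice, here's the filter-then-search pattern with Qdrant as one concrete engine, plus reciprocal rank fusion for merging keyword and vector hit lists. The collection name, payload fields, and embed function are placeholders, and filtering APIs differ across engines and client versions.

from qdrant_client import QdrantClient
from qdrant_client.models import FieldCondition, Filter, MatchValue, Range

client = QdrantClient("localhost", port=6333)

def filtered_search(query_text, embed, top_k=10):
    # Step 1: cut the candidate set with cheap metadata predicates.
    # Field names ("doc_type", "updated_at") are assumptions.
    scope = Filter(must=[
        FieldCondition(key="doc_type", match=MatchValue(value="runbook")),
        FieldCondition(key="updated_at", range=Range(gte=1704067200)),  # unix time
    ])
    # Step 2: vector similarity only ranks documents inside that scope.
    return client.search(
        collection_name="docs",
        query_vector=embed(query_text),
        query_filter=scope,
        limit=top_k,
    )

def rrf_fuse(rankings, k=60):
    # Reciprocal rank fusion: a score-free way to merge a keyword (BM25)
    # ranking with a vector ranking. rankings: lists of doc ids, best first.
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

The point isn't the specific engine. It's that similarity search ends up ranking thousands of candidates instead of millions.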

The vector database alone doesn't scale. That's not a failure of the technology. It's the limit every single-index system hits eventually. But you have to know that going in.

Right now, teams deploy vector databases like they're solved. Six months later, retrieval quality is unreliable, and they're surprised because they never measured it. That's the blind spot.

Written by Yevhen Kim

If this article was useful, there are more notes on architecture, AI workflows, delivery, and engineering practice in the journal.