Vector Databases: Semantic Search & AI System Design
After this topic, you will be able to:
- Analyze vector embedding representation for semantic similarity search
- Evaluate approximate nearest neighbor algorithms (HNSW, IVF) for vector search trade-offs
- Design vector database architectures for AI/ML applications
TL;DR
Vector databases store and query high-dimensional embeddings (dense numerical arrays representing semantic meaning) using specialized indexing algorithms like HNSW and IVF to enable fast similarity search. Unlike traditional databases that answer “find row where id=42,” vector databases answer “what data is most similar to this?” in milliseconds across billions of vectors. Critical for AI applications: semantic search, recommendation engines, and RAG systems where you need to find conceptually related content, not just exact keyword matches.
Background
Traditional databases excel at exact lookups using B-trees and hash indexes, but they fail catastrophically when you need to find “similar” items. The rise of deep learning created a new data primitive: vector embeddings—dense arrays of 384 to 1536+ floating-point numbers that encode semantic meaning. A sentence like “I love pizza” might become [0.23, -0.41, 0.87, …], where similar sentences cluster nearby in high-dimensional space.
Brute-force similarity search requires computing distance against every vector in your database. With 10 million 768-dimensional vectors, that’s 7.68 billion multiply-add operations per query—far too slow for production. Early solutions like Faiss (Facebook, 2017) introduced approximate nearest neighbor (ANN) algorithms that trade perfect accuracy for 100x+ speedups. By 2020, specialized vector databases emerged (Pinecone, Weaviate, Milvus) combining ANN indexes with database features like CRUD operations, filtering, and horizontal scaling.
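The brute-force baseline is easy to sketch. A minimal numpy version, using random vectors as stand-ins for real embeddings (10K vectors here rather than 10M, but the O(N·d) cost per query is the same shape):

```python
import numpy as np

rng = np.random.default_rng(0)
db = rng.standard_normal((10_000, 768)).astype(np.float32)  # stand-in database
query = rng.standard_normal(768).astype(np.float32)

def brute_force_top_k(db, query, k=10):
    # Exact k-NN: one dot product per stored vector -> O(N * d) work per query.
    # Cosine similarity = dot product of L2-normalized vectors.
    db_n = db / np.linalg.norm(db, axis=1, keepdims=True)
    q_n = query / np.linalg.norm(query)
    sims = db_n @ q_n                 # N dot products
    top = np.argsort(-sims)[:k]      # indices of the k most similar vectors
    return top, sims[top]

ids, scores = brute_force_top_k(db, query)
```

Every query touches every vector; ANN indexes exist precisely to avoid this full scan.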
The explosion of LLMs and RAG architectures made vector databases essential infrastructure. When ChatGPT needs to retrieve relevant documents to answer your question, it’s performing vector similarity search against millions of embeddings. Spotify uses vector databases to find songs that “feel similar” even if they share no metadata. Airbnb matches listings to search queries based on semantic understanding, not just keyword overlap. Vector databases transformed “search” from keyword matching to semantic understanding.
Traditional vs Vector Database Query Patterns
graph LR
subgraph Traditional Database
A["Query: id=42"] --> B["B-tree Index"]
B --> C["Exact Match<br/>O(log N)"]
C --> D["Single Result"]
end
subgraph Vector Database
E["Query Vector<br/>[0.23, -0.41, 0.87...]"] --> F["ANN Index<br/>HNSW/IVF"]
F --> G["Similarity Search<br/>O(log N) approx"]
G --> H["Top-K Results<br/>95-99% recall"]
end
Traditional databases answer ‘find exact match’ using B-trees with perfect accuracy. Vector databases answer ‘find similar items’ using approximate algorithms that trade 1-5% accuracy for 100x+ speed, enabling semantic search at scale.
Architecture
A production vector database has four core layers. The ingestion layer receives vectors (typically from embedding models like OpenAI’s text-embedding-ada-002 or open-source alternatives) and assigns them unique IDs. Vectors arrive with metadata—the original text, image URL, or product attributes—stored separately for filtering and retrieval.
The index layer builds specialized data structures optimized for similarity search. Unlike B-trees that organize data for range queries, vector indexes like HNSW create graph structures where each node connects to its nearest neighbors, enabling logarithmic search through high-dimensional space. Multiple index types coexist: HNSW for speed, IVF for memory efficiency, or hybrid approaches.
The query layer accepts a query vector and parameters (top-k results, distance threshold, metadata filters) and orchestrates search across indexes. It handles the critical trade-off between recall (finding all relevant results) and latency. A query might specify “find 10 nearest neighbors with distance < 0.8 AND category=‘electronics’” combining vector similarity with traditional filtering.
The storage layer persists vectors, indexes, and metadata. Vectors themselves are large—1 million 768-dimensional float32 vectors consume 3GB. Indexes add 2-5x overhead. Production systems use tiered storage: hot indexes in memory, warm vectors on SSD, cold data in object storage. Distributed architectures shard vectors across nodes, with replication for availability. A query coordinator routes requests to relevant shards and merges results.
Production Vector Database Architecture
graph TB
Client["Client Application"] --"1. Generate embedding"--> Embed["Embedding Model<br/><i>OpenAI/HuggingFace</i>"]
Embed --"2. Vector + metadata"--> Ingest["Ingestion Layer<br/><i>Assign ID, validate</i>"]
subgraph Vector Database
Ingest --"3. Store"--> Storage[("Storage Layer<br/>Vectors + Metadata")]
Ingest --"4. Build index"--> Index["Index Layer<br/><i>HNSW/IVF graphs</i>"]
Query["Query Layer<br/><i>Coordinator</i>"] --"Read index"--> Index
Query --"Fetch vectors"--> Storage
subgraph Distributed Shards
Shard1["Shard 1<br/>Vectors 0-10M"]
Shard2["Shard 2<br/>Vectors 10-20M"]
Shard3["Shard 3<br/>Vectors 20-30M"]
end
Query --"Route query"--> Shard1
Query --"Route query"--> Shard2
Query --"Route query"--> Shard3
end
Client --"5. Query vector<br/>+ filters"--> Query
Shard1 & Shard2 & Shard3 --"6. Partial results"--> Query
Query --"7. Merged top-K<br/>results"--> Client
Cache["Memory Cache<br/><i>Hot indexes</i>"] -."Fast access".-> Index
SSD["SSD Storage<br/><i>Warm vectors</i>"] -."Medium access".-> Storage
S3[("Object Storage<br/><i>Cold data</i>")] -."Slow access".-> Storage
A production vector database has four layers: ingestion (receives vectors and metadata), index (builds HNSW/IVF graphs), query (coordinates search and filtering), and storage (tiered memory/SSD/object storage). Distributed systems shard vectors across nodes and merge results at query time.
Internals
Vector embeddings are the foundation. An embedding model (neural network) transforms unstructured data into fixed-length arrays where semantic similarity maps to geometric proximity. The sentence “dogs are loyal pets” and “canines make great companions” produce vectors with small cosine distance despite sharing no words. Embeddings capture meaning, not syntax. Dimensionality matters: 384-dim models (like all-MiniLM-L6-v2) are fast but less nuanced; 1536-dim models (OpenAI ada-002) capture finer semantic distinctions but require 4x storage and compute.
Approximate Nearest Neighbor (ANN) algorithms make vector search tractable. Exact k-NN requires O(N) distance calculations—unacceptable at scale. ANN algorithms accept small accuracy losses (missing 1-2% of true nearest neighbors) for massive speedups.
HNSW (Hierarchical Navigable Small World) is the dominant algorithm. It builds a multi-layer graph where each vector is a node. Layer 0 contains all vectors densely connected to nearest neighbors. Higher layers are progressively sparser, containing only a subset of nodes with long-range connections—like highway exits connecting distant cities. Search starts at the top layer, greedily following edges toward the query vector, then descends layers for finer-grained search. HNSW achieves O(log N) query time with high recall (95-99%). The trade-off: memory overhead (each node stores 16-64 neighbor pointers) and slow inserts (adding a node requires updating neighbor connections across layers).
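The core routing step of HNSW—greedily following edges toward the query until no neighbor is closer—can be sketched on a single layer. This is a toy illustration, not a real HNSW implementation: the neighbor lists here are built by exact search for the demo, and real HNSW adds the hierarchical layers and bounded neighbor counts described above.

```python
import numpy as np

def greedy_search(vectors, neighbors, entry, query):
    # Greedy routing: repeatedly hop to whichever neighbor is closest to the
    # query; stop at a local minimum, which is the approximate nearest neighbor.
    current = entry
    current_dist = np.linalg.norm(vectors[current] - query)
    while True:
        best, best_dist = current, current_dist
        for n in neighbors[current]:
            d = np.linalg.norm(vectors[n] - query)
            if d < best_dist:
                best, best_dist = n, d
        if best == current:
            return current, current_dist
        current, current_dist = best, best_dist

rng = np.random.default_rng(1)
vecs = rng.standard_normal((200, 16)).astype(np.float32)
# Crude neighbor graph: each node's 8 nearest neighbors (computed exactly,
# purely for the demo; real HNSW builds this incrementally).
dists = np.linalg.norm(vecs[:, None] - vecs[None, :], axis=-1)
nbrs = {i: list(np.argsort(dists[i])[1:9]) for i in range(200)}

q = rng.standard_normal(16).astype(np.float32)
approx_id, approx_d = greedy_search(vecs, nbrs, entry=0, query=q)
```

The greedy walk can stop at a local minimum that is not the true nearest neighbor—this is exactly the accuracy loss that HNSW's upper layers and `ef_search` candidate lists are designed to reduce.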
IVF (Inverted File Index) partitions vector space into clusters using k-means. Each cluster has a centroid. At query time, compute distance to centroids, search only the nearest few clusters (nprobe parameter), then scan vectors within those clusters. IVF uses less memory than HNSW but has lower recall unless nprobe is large (which increases latency). It’s ideal when memory is constrained and you can tolerate 85-90% recall.
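The IVF idea fits in a few lines: cluster the vectors, then scan only the `nprobe` clusters whose centroids are nearest the query. This sketch uses a crude hand-rolled k-means for self-containment (production systems would use Faiss or scikit-learn):

```python
import numpy as np

def kmeans(data, k, iters=10, seed=0):
    # Minimal k-means: random init, then alternate assign/update steps.
    rng = np.random.default_rng(seed)
    centroids = data[rng.choice(len(data), k, replace=False)]
    for _ in range(iters):
        assign = np.argmin(
            np.linalg.norm(data[:, None] - centroids[None, :], axis=-1), axis=1)
        for c in range(k):
            members = data[assign == c]
            if len(members):
                centroids[c] = members.mean(axis=0)
    return centroids, assign

def ivf_search(data, centroids, assign, query, k=5, nprobe=2):
    # Step 1: pick the nprobe clusters with the nearest centroids.
    probed = np.argsort(np.linalg.norm(centroids - query, axis=1))[:nprobe]
    candidates = np.where(np.isin(assign, probed))[0]
    # Step 2: exact scan inside those clusters only.
    d = np.linalg.norm(data[candidates] - query, axis=1)
    return candidates[np.argsort(d)[:k]]

rng = np.random.default_rng(2)
data = rng.standard_normal((1000, 32)).astype(np.float32)
centroids, assign = kmeans(data, k=16)
hits = ivf_search(data, centroids, assign,
                  rng.standard_normal(32).astype(np.float32))
```

Raising `nprobe` scans more clusters—higher recall, higher latency—which is the tuning knob the text describes.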
Product Quantization (PQ) compresses vectors by splitting dimensions into subspaces and quantizing each subspace against a small codebook. A 768-dim float32 vector occupies 3,072 bytes (768 × 4 bytes) uncompressed; PQ with 96 subspaces and 8-bit codes reduces it to 96 bytes (32x compression) with minimal accuracy loss. Combined with IVF (creating IVFPQ), this enables billion-scale indexes on commodity hardware.
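A minimal PQ encode/decode sketch makes the compression arithmetic concrete. One simplification to note: real PQ trains each subspace codebook with k-means, while this demo just samples random training rows as codebook entries.

```python
import numpy as np

d, m, ksub = 768, 96, 256   # 96 subspaces of 8 dims each, 8-bit codes
dsub = d // m

rng = np.random.default_rng(3)
train = rng.standard_normal((2000, d)).astype(np.float32)
# One codebook per subspace: (m, ksub, dsub). Random-sample "codebooks" here;
# real PQ learns these with per-subspace k-means.
codebooks = np.stack([train[rng.choice(2000, ksub), j*dsub:(j+1)*dsub]
                      for j in range(m)])

def pq_encode(x):
    # Replace each 8-dim subvector with the index of its nearest codebook
    # entry: 768 floats (3,072 bytes) become m = 96 single-byte codes.
    codes = np.empty(m, dtype=np.uint8)
    for j in range(m):
        sub = x[j*dsub:(j+1)*dsub]
        codes[j] = np.argmin(np.linalg.norm(codebooks[j] - sub, axis=1))
    return codes

def pq_decode(codes):
    # Lossy reconstruction: concatenate the chosen codebook entries.
    return np.concatenate([codebooks[j][codes[j]] for j in range(m)])

x = rng.standard_normal(d).astype(np.float32)
codes = pq_encode(x)   # codes.nbytes == 96 vs x.nbytes == 3072: 32x smaller
```

Distances are then computed against the compressed codes (via precomputed lookup tables in real implementations), which is what makes billion-scale IVFPQ indexes fit in memory.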
HNSW Hierarchical Graph Structure and Search Path
graph TB
subgraph Layer 2 - Sparse Long-Range
L2A(("Node A"))
L2B(("Node F"))
L2A -."highway<br/>connection".-> L2B
end
subgraph Layer 1 - Medium Density
L1A(("Node A"))
L1B(("Node C"))
L1C(("Node F"))
L1D(("Node H"))
L1A --> L1B
L1B --> L1C
L1C --> L1D
end
subgraph Layer 0 - Dense All Nodes
L0A(("Node A"))
L0B(("Node B"))
L0C(("Node C"))
L0D(("Node D"))
L0E(("Node E"))
L0F(("Node F"))
L0G(("Node G"))
L0H(("Node H"))
L0I(("Node I"))
L0A --> L0B
L0B --> L0C
L0C --> L0D
L0D --> L0E
L0E --> L0F
L0F --> L0G
L0G --> L0H
L0H --> L0I
L0A -.-> L0C
L0C -.-> L0F
L0F -.-> L0I
end
Query["🎯 Query Vector"] -."1. Start top layer".-> L2A
L2A -."2. Jump to region".-> L2B
L2B -."3. Descend".-> L1C
L1C -."4. Refine search".-> L1D
L1D -."5. Final layer".-> L0H
L0H -."6. Find neighbors".-> L0I
HNSW builds a multi-layer graph where Layer 0 contains all vectors densely connected, and higher layers are progressively sparser with long-range connections. Search starts at the top, greedily navigates toward the query vector, then descends layers for fine-grained search, achieving O(log N) complexity.
Distance Metrics
Choosing the right distance metric is critical—it defines what “similar” means. Cosine similarity measures the angle between vectors, ignoring magnitude. It’s the default for text embeddings where “dog” and “DOG DOG DOG” should be equally similar despite different vector norms. The corresponding cosine distance is 1 - (A·B)/(||A|| ||B||), ranging from 0 (identical direction) to 2 (opposite). Use for: normalized embeddings, text, semantic search.
Euclidean distance (L2) measures straight-line distance in space: sqrt(Σ(A_i - B_i)²). It’s sensitive to magnitude—longer vectors are “farther” even if directions align. Use for: image embeddings, recommendation systems where magnitude encodes confidence or importance. Many systems normalize vectors first, making L2 equivalent to cosine.
Dot product is cosine similarity without normalization: Σ(A_i × B_i). Faster to compute (no square root) but magnitude-dependent. Use for: pre-normalized embeddings, maximum inner product search (MIPS) where you want to favor high-magnitude vectors (e.g., popular items in recommendations).
Real-world impact: Spotify uses cosine for song similarity (genre matters, not play count magnitude). E-commerce recommendations use dot product to boost popular items. Choosing the wrong metric causes subtle bugs—your “similar products” might just be the most-viewed products.
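The magnitude sensitivity of each metric is easy to verify directly. Here `b` points in the same direction as `a` but is 10x longer, mimicking the “dog” vs “DOG DOG DOG” example:

```python
import numpy as np

def cosine_distance(a, b):
    # 1 - cos(angle): 0 for identical direction, 2 for opposite.
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def euclidean(a, b):
    return np.linalg.norm(a - b)

a = np.array([1.0, 2.0, 3.0])
b = 10 * a   # same direction, 10x the magnitude

cos_d = cosine_distance(a, b)   # ~0: cosine ignores magnitude entirely
l2_d = euclidean(a, b)          # large: L2 is magnitude-sensitive
dot_aa = np.dot(a, a)           # dot product scales linearly with magnitude,
dot_ab = np.dot(a, b)           # so dot_ab is exactly 10x dot_aa
```

This is why a dot-product index quietly favors high-magnitude (often popular) items unless you normalize vectors first.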
Distance Metrics Comparison and Use Cases
graph TB
Start["Choose Distance Metric"] --> Q1{"Are vectors<br/>normalized?"}
Q1 -->|Yes| Q2{"Need to favor<br/>magnitude?"}
Q1 -->|No| Q3{"Does magnitude<br/>matter?"}
Q2 -->|No| Cosine["✓ Cosine Similarity<br/>1 - (A·B)/(||A|| ||B||)<br/><i>Range: 0-2</i>"]
Q2 -->|Yes| Dot["✓ Dot Product<br/>Σ(A_i × B_i)<br/><i>Faster, magnitude-aware</i>"]
Q3 -->|Yes| L2["✓ Euclidean (L2)<br/>sqrt(Σ(A_i - B_i)²)<br/><i>Geometric distance</i>"]
Q3 -->|No| Normalize["Normalize first<br/>then use Cosine"]
Cosine --> UC1["Use Cases:<br/>• Text embeddings<br/>• Semantic search<br/>• Document similarity"]
Dot --> UC2["Use Cases:<br/>• Recommendations<br/>• Boost popular items<br/>• MIPS problems"]
L2 --> UC3["Use Cases:<br/>• Image embeddings<br/>• Spatial data<br/>• Magnitude = confidence"]
Example1["Example: 'dog' vs 'DOG DOG'<br/>Cosine: same (angle identical)<br/>L2: different (magnitude differs)"] -.-> Cosine
Example2["Example: Popular songs<br/>Higher magnitude = more plays<br/>Dot product favors them"] -.-> Dot
Cosine similarity measures angle (ignoring magnitude), ideal for normalized text embeddings. Euclidean distance measures geometric distance (magnitude-sensitive), used for image embeddings. Dot product is faster and magnitude-aware, perfect for recommendations where popularity matters.
Performance Characteristics
Modern vector databases achieve sub-10ms p99 latency for k=10 queries against 10M vectors on a single node. HNSW with 768-dim vectors and M=16 (edges per node) delivers 95%+ recall at 5-8ms latency. Scaling to 100M vectors requires sharding across 10-20 nodes; query latency increases to 15-25ms due to network overhead and result merging.
Throughput scales with parallelism. A 32-core server handles 5,000-10,000 QPS for HNSW queries (memory-bound, not CPU-bound). Distributed systems achieve 50,000+ QPS by routing queries to shards. Insert throughput is lower: HNSW inserts are 10-100x slower than queries due to graph updates. Batch inserts (1000+ vectors) amortize overhead, reaching 10,000-50,000 inserts/sec per node.
Memory is the bottleneck. 100M 768-dim float32 vectors = 307GB raw. HNSW index adds 2-3x overhead → 900GB total. Product quantization reduces this to 150GB (6x compression) at the cost of 2-5% recall. Billion-scale systems require distributed memory or disk-based indexes (10-50x slower queries).
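The memory figures above are straightforward back-of-envelope arithmetic (the HNSW overhead factor and 6x PQ compression are the approximate ratios stated in the text, not measured values):

```python
# Memory budget for 100M 768-dim float32 vectors, per the figures above.
n, d = 100_000_000, 768
raw_gb = n * d * 4 / 1e9    # float32 vectors alone: ~307 GB
total_gb = raw_gb * 3        # plus ~2-3x HNSW index overhead: ~900 GB (upper bound)
pq_gb = total_gb / 6         # ~6x compression via product quantization: ~150 GB
```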
Recall vs latency trade-off: HNSW with ef_search=100 achieves 98% recall at 8ms; ef_search=500 reaches 99.5% at 25ms. Production systems tune this per use case—search needs high recall, recommendations tolerate 90% recall for lower latency.
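Recall@k is measured by comparing an index's results against exact brute force on the same query. This sketch simulates a cheap approximate index by scanning only half the vectors—a crude stand-in, but the measurement itself is exactly what you'd run against a real HNSW or IVF index:

```python
import numpy as np

rng = np.random.default_rng(5)
data = rng.standard_normal((5000, 64)).astype(np.float32)
query = rng.standard_normal(64).astype(np.float32)
k = 10

# Ground truth: exact top-k by brute force.
exact = set(np.argsort(np.linalg.norm(data - query, axis=1))[:k].tolist())

# "Approximate" results: exact scan over a random half of the data,
# standing in for an ANN index that misses part of the space.
sample = rng.choice(5000, 2500, replace=False)
d = np.linalg.norm(data[sample] - query, axis=1)
approx = set(sample[np.argsort(d)[:k]].tolist())

# recall@k: fraction of the true top-k that the approximate search found.
recall = len(exact & approx) / k
```

Production tuning sweeps a parameter like `ef_search` or `nprobe`, plots recall against p99 latency, and picks the operating point the use case demands.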
Trade-offs
Strengths: Vector databases excel at semantic similarity where traditional search fails. They find conceptually related content even with zero keyword overlap. They’re essential for AI applications: RAG systems retrieve relevant context for LLMs, recommendation engines surface similar products, and fraud detection identifies unusual patterns. They handle multi-modal search—find images similar to text descriptions or songs matching a mood.
Weaknesses: Exact match queries are slower than traditional databases. Filtering on metadata (“find similar products under $50”) is inefficient—you can’t index both vector similarity and price range optimally. Updates are expensive; changing a vector requires rebuilding graph connections. Vector databases don’t replace traditional databases—they complement them. Most architectures store metadata in Postgres/MongoDB and only vectors in the vector database.
Cost is significant. Memory requirements are 10-100x higher than traditional databases for the same data volume. A 10M row Postgres table might be 5GB; 10M vectors are 300GB+. Cloud vector databases charge $0.10-0.50 per GB-month—10x more than object storage.
Accuracy is probabilistic. ANN algorithms miss true nearest neighbors 1-10% of the time. For applications requiring perfect recall (regulatory compliance, medical diagnosis), you need exact search or hybrid approaches. Debugging is harder—why did the system return these results? Vector similarity is less interpretable than SQL WHERE clauses.
When to Use (and When Not To)
Choose vector databases when you need semantic similarity search over exact matching. Clear use cases: (1) RAG systems where LLMs need relevant context from knowledge bases, (2) recommendation engines finding similar products/content without explicit features, (3) semantic search where users describe what they want conceptually (“comfortable shoes for hiking”), (4) duplicate detection across images/documents, (5) anomaly detection in high-dimensional data.
Don’t use vector databases for: (1) transactional workloads requiring ACID guarantees—vector DBs prioritize read performance over consistency, (2) exact match queries (“find user_id=12345”)—traditional indexes are 100x faster, (3) complex joins and aggregations—vector DBs lack SQL expressiveness, (4) small datasets (<100K vectors)—in-memory libraries like Faiss are simpler and cheaper.
Hybrid architectures are common. Store structured data (user profiles, product catalogs) in Postgres. Generate embeddings for searchable fields (descriptions, reviews) and store in a vector database. At query time, perform vector search to get candidate IDs, then fetch full records from Postgres. This combines semantic search with traditional filtering and joins.
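The two-stage flow can be sketched with in-memory stand-ins—a numpy array playing the vector database and a dict of rows playing Postgres. All names here (`catalog`, `hybrid_search`, the `price`/`in_stock` fields) are illustrative, not any real API; the point is the shape of the flow: vector search for candidate IDs, then a structured fetch-and-filter.

```python
import numpy as np

rng = np.random.default_rng(4)
embeddings = rng.standard_normal((1000, 64)).astype(np.float32)  # "vector DB"
catalog = {i: {"price": float(rng.uniform(10, 200)),             # "Postgres"
               "in_stock": bool(rng.random() > 0.2)}
           for i in range(1000)}

def hybrid_search(query_vec, k=10, max_price=100.0):
    # Stage 1: vector search returns a generous candidate set (top 200 by
    # cosine similarity), standing in for the vector database query.
    norms = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    q = query_vec / np.linalg.norm(query_vec)
    candidates = np.argsort(-(norms @ q))[:200]
    # Stage 2: structured filtering, i.e. the SQL
    # WHERE id IN (...) AND price < :max_price AND in_stock step.
    rows = [(int(i), catalog[int(i)]) for i in candidates]
    filtered = [(i, r) for i, r in rows
                if r["price"] < max_price and r["in_stock"]]
    return filtered[:k]   # candidates arrive in similarity order

results = hybrid_search(rng.standard_normal(64).astype(np.float32))
```

Fetching a larger candidate set than `k` in stage 1 is deliberate: filtering discards candidates, so the vector search must over-fetch to keep `k` results after the WHERE clause.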
Consider managed services (Pinecone, Weaviate Cloud) for faster iteration; self-hosted (Milvus, Qdrant) for cost optimization at scale. Postgres extensions (pgvector) work for <1M vectors but don’t scale to production AI workloads.
Hybrid Architecture: Vector DB + Traditional Database
graph LR
User["User Query<br/>'comfortable hiking shoes'"] --"1. Embed query"--> Embed["Embedding Model"]
Embed --"2. Query vector<br/>[0.23, -0.41...]"--> VectorDB[("Vector Database<br/><i>Pinecone/Weaviate</i>")]
VectorDB --"3. Top 1000<br/>similar product IDs<br/>(semantic match)"--> App["Application Layer<br/><i>Merge & rank</i>"]
subgraph Traditional Database
Postgres[("PostgreSQL<br/><i>Product catalog</i>")]
Meta["Metadata:<br/>• Price<br/>• Stock<br/>• Ratings<br/>• Category"]
end
App --"4. Fetch details<br/>WHERE id IN (...)<br/>AND price < 100<br/>AND in_stock = true"--> Postgres
Postgres --"5. Full product data<br/>(filtered results)"--> App
App --"6. Final ranked<br/>results"--> User
VectorDB -."Stores only:<br/>• Embeddings<br/>• Product IDs".-> VectorDB
Postgres -."Stores:<br/>• Structured data<br/>• Metadata".-> Meta
Hybrid architectures combine vector databases (semantic search) with traditional databases (structured queries). Vector DB returns semantically similar candidates, then Postgres filters by price/stock/category and provides full product details. This approach leverages strengths of both systems.
Real-World Examples
Spotify — music recommendation. Spotify generates embeddings for 100M+ songs using audio features (tempo, key, timbre) and user behavior (skip rates, playlist co-occurrence) to power the “Discover Weekly” and “Song Radio” features. When you play a song, Spotify queries the vector database for nearest neighbors in embedding space, filtered by freshness and user preferences. Interesting detail: Spotify uses a hybrid approach—coarse-grained IVF for initial candidate retrieval (1,000 songs from 100M in under 5ms), then HNSW for fine-grained reranking. This two-stage pipeline balances latency and recall. They also maintain separate indexes for different contexts: “workout music” embeddings differ from “focus music” embeddings even for the same song.
Airbnb — listing search and recommendations. Airbnb embeds listing descriptions, photos, and reviews into 256-dimensional vectors. When users search “cozy cabin near hiking trails,” the query is embedded and matched against 7M+ listings using vector similarity. They combine vector search (semantic matching) with traditional filters (price, dates, location radius) using two-stage retrieval: vector search returns the top 1,000 candidates, then SQL filters and ranks them. Interesting detail: Airbnb discovered that raw text embeddings performed poorly because listing descriptions are marketing copy, not user intent. They fine-tuned their embedding model on click-through data—listings users actually booked after searching—improving relevance by 30% over off-the-shelf models. They also use separate embeddings for photos, enabling visual similarity search.
Interview Essentials
Mid-Level
Explain what vector embeddings are and why traditional indexes don’t work for similarity search. Describe the difference between exact and approximate nearest neighbor search. Explain one ANN algorithm (HNSW or IVF) at a high level—how it reduces search space. Discuss when to use cosine vs Euclidean distance. Be ready to design a simple semantic search system: user query → embedding model → vector DB → results.
Senior
Deep dive into HNSW: explain the hierarchical graph structure, how search navigates layers, and why it achieves O(log N) complexity. Compare HNSW vs IVF trade-offs (memory vs recall vs latency). Discuss product quantization and compression techniques. Design a production RAG system: handle embedding generation, vector storage, metadata filtering, and result reranking. Explain how to scale to billions of vectors—sharding strategies, replication, and query routing. Discuss recall/latency trade-offs and how to tune ef_search and nprobe parameters.
Staff+
Architect a multi-tenant vector database with isolation and resource limits. Discuss cold start problems—how to build indexes incrementally without downtime. Explain hybrid search combining vector similarity with traditional filters (“similar products under $50 in electronics”)—why naive approaches fail and how to optimize. Debate build vs buy: when to use managed services vs self-hosted solutions. Discuss cost optimization: tiered storage, compression, and query caching strategies. Explain how to handle embedding model updates—reindexing billions of vectors without service disruption.
Common Interview Questions
How does HNSW achieve logarithmic search time? (Answer: hierarchical layers create a navigable small-world graph where search starts coarse and refines)
Why can’t you use a B-tree for vector search? (Answer: B-trees organize data for range queries on ordered keys; vectors have no natural ordering, and similarity requires distance calculations)
How do you handle metadata filtering in vector search? (Answer: post-filtering after vector search, or pre-filtering to reduce search space—trade-offs between recall and latency)
What’s the difference between cosine similarity and dot product? (Answer: cosine normalizes by magnitude, dot product doesn’t—matters when vector length encodes information)
How do you scale vector search to billions of vectors? (Answer: sharding by vector ID or clustering, distributed query routing, and result merging)
Red Flags to Avoid
Confusing embeddings with feature vectors or one-hot encodings—embeddings are dense, learned representations
Claiming vector databases replace traditional databases—they’re complementary, not substitutes
Not understanding the recall/latency trade-off in ANN algorithms—thinking you can have both perfect accuracy and millisecond queries
Suggesting brute-force search for production systems—shows lack of understanding of computational complexity
Ignoring memory constraints—not considering compression or tiered storage for large-scale systems
Key Takeaways
Vector databases enable semantic similarity search by storing high-dimensional embeddings and using specialized indexes (HNSW, IVF) to find nearest neighbors in milliseconds, not hours.
HNSW is the dominant algorithm: it builds a hierarchical graph achieving O(log N) search time with 95-99% recall, but requires 2-3x memory overhead and slow inserts.
Distance metrics matter: use cosine similarity for normalized text embeddings, Euclidean for magnitude-sensitive data, and dot product for pre-normalized vectors or when magnitude encodes importance.
Vector databases don’t replace traditional databases—they complement them. Store structured data in Postgres, vectors in a vector DB, and combine them in hybrid architectures.
Production systems face memory constraints (100M vectors = 300GB+), requiring compression (product quantization), tiered storage, and sharding to scale cost-effectively.