Vector Databases

How vector databases store and search embeddings — HNSW, IVF, product quantization, and a comparison of Pinecone, Weaviate, ChromaDB, and pgvector.

Starting from Zero — A Physical Intuition

Before looking at database indexes, let's understand vector search through a physical warehouse analogy:

The Traditional Catalog (Keyword Search): Imagine you run a massive warehouse. Traditional database indices organize items by alphabetized labels or exact numbers: "Row 5, Bin A: Screws". If a customer asks for "fasteners", the system scans the catalog but misses "screws" because the exact word "fasteners" isn't in the label.
The Semantic Map (Vector Space): Instead, you hire a spatial organizer who judges what each item is for and plots it on a massive 3D grid. Items that serve similar purposes are placed at nearby coordinates: hammers, screwdrivers, and drills are in the top-left corner; notebooks, pens, and paper are in the bottom-right.
Vector Querying — Finding Close Neighbors: When a customer asks for "something to attach wood pieces," you translate this query into its coordinates on the grid. The organizer walks directly to that spot and pulls the 10 closest physical items (the Nearest Neighbors), which happen to be nails and wood glue—even though the query didn't mention either word.

Why Vector Databases Exist

Traditional databases answer exact queries: WHERE id = 42 or WHERE name LIKE 'alice%'. They can't answer semantic queries: "find me the 10 most similar documents to this paragraph."

Traditional DB query:
  SELECT * FROM docs WHERE content = 'machine learning'
  → Only exact matches. Misses "deep learning", "neural networks", "AI".

Vector DB query:
  query_vector = embed("machine learning")
  search(query_vector, top_k=10)
  → Returns "neural networks tutorial", "deep learning guide",
    "introduction to AI" — semantically similar content.

Vector databases are optimized for one specific operation: Approximate Nearest Neighbor (ANN) search in high-dimensional spaces (typically 384–3072 dimensions).

How Vector Search Works

Exact Nearest Neighbor (Slow)

For each document vector:
  similarity = cosine_similarity(query_vector, document_vector)
Sort by similarity (highest first), return top-K

Complexity: O(n × d) for n documents and d dimensions
For 10M docs × 1536 dims: ~15 billion multiply-adds per query
Too slow to run on every query at interactive latency.
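To make the cost above concrete, here is a small back-of-the-envelope sketch. The helper name `brute_force_cost` and the assumption of 4-byte float32 storage are illustrative, not taken from any particular database.

```python
def brute_force_cost(n_docs: int, dims: int, bytes_per_float: int = 4) -> dict:
    """Estimate per-query work and raw vector memory for a flat (exact) index."""
    multiply_adds = n_docs * dims                    # one multiply-add per dimension per document
    memory_bytes = n_docs * dims * bytes_per_float   # raw vectors only, no index overhead
    return {
        "multiply_adds": multiply_adds,
        "memory_gib": memory_bytes / 2**30,
    }

cost = brute_force_cost(10_000_000, 1536)
print(f"{cost['multiply_adds']:,} multiply-adds per query")   # 15,360,000,000 multiply-adds per query
print(f"{cost['memory_gib']:.1f} GiB of float32 vectors")     # 57.2 GiB of float32 vectors
```

Every query has to touch all of that memory, which is why exact search is usually reserved for small collections or for checking the recall of an approximate index.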

Approximate Nearest Neighbor (ANN)

Sacrifice a small amount of recall for large speed gains (often orders of magnitude at this scale):

HNSW (Hierarchical Navigable Small World) — the most common index:

Analogy: HNSW is like a highway system.
  Top layer: interstate highways — few nodes, long-range connections
  Middle layers: state highways — more nodes
  Bottom layer: every neighborhood — all documents

Query navigation:
  1. Start at top layer (few connections, fast traversal)
  2. Find the approximate best direction to travel
  3. Move to a denser layer and refine
  4. Repeat until at the bottom layer
  5. Return the nearby nodes as results

Query time: roughly O(log n) in practice, compared with O(n) for a flat scan
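The navigation steps above can be sketched with a toy layered graph. This is a simplified illustration, not a real HNSW implementation: each layer is built by brute force, layer membership is chosen by hand, and the search keeps only a single candidate instead of a candidate list. All names (`build_layer`, `greedy_walk`) are hypothetical.

```python
import numpy as np

def build_layer(vectors: np.ndarray, node_ids: np.ndarray, m: int) -> dict[int, list[int]]:
    """Link each node in a layer to its m nearest neighbors within that layer."""
    pts = vectors[node_ids]
    # Brute-force pairwise squared distances; acceptable only for a toy index.
    d2 = ((pts[:, None, :] - pts[None, :, :]) ** 2).sum(axis=-1)
    np.fill_diagonal(d2, np.inf)  # a node is not its own neighbor
    k = min(m, len(node_ids) - 1)
    nearest = np.argsort(d2, axis=1)[:, :k]
    return {
        int(node_ids[i]): [int(node_ids[j]) for j in nearest[i]]
        for i in range(len(node_ids))
    }

def greedy_walk(vectors: np.ndarray, graph: dict[int, list[int]], entry: int, query: np.ndarray) -> int:
    """Hop to the closest neighbor until no neighbor is closer to the query."""
    current = entry
    best = float(np.linalg.norm(vectors[current] - query))
    while True:
        moved = False
        for neighbor in graph[current]:
            dist = float(np.linalg.norm(vectors[neighbor] - query))
            if dist < best:
                current, best, moved = neighbor, dist, True
        if not moved:
            return current

rng = np.random.default_rng(0)
vectors = rng.normal(size=(400, 8))

# Nested layers: every node in an upper layer also exists in the layers below it.
bottom = np.arange(400)
middle = rng.choice(400, size=40, replace=False)
top = middle[:5]
graphs = [build_layer(vectors, ids, m=8) for ids in (top, middle, bottom)]

query = rng.normal(size=8)
entry_point = int(top[0])
node = entry_point
for graph in graphs:  # top layer first, bottom layer last
    node = greedy_walk(vectors, graph, node, query)

exact = int(np.argmin(np.linalg.norm(vectors - query, axis=1)))
print("approximate:", node, "exact:", exact)
```

The walk can stop at a local minimum, so the approximate result is not guaranteed to match the exact one. Real implementations reduce that risk by tracking several candidates at once during the search.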

IVF (Inverted File Index) — cluster-based approach:

1. K-means cluster all vectors into N clusters (e.g., 1000 clusters)
2. At query time: find the few clusters whose centroids are nearest to the query (this count is often called nprobe), search only within those
3. Trade-off: faster, but can miss vectors in neighboring clusters
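The three IVF steps above can be sketched in plain numpy. This is a minimal illustration under simplifying assumptions: a basic k-means with a fixed number of iterations, 16 clusters instead of 1000, and hypothetical function names (`train_ivf`, `ivf_search`).

```python
import numpy as np

def squared_distances(points: np.ndarray, centers: np.ndarray) -> np.ndarray:
    """Squared L2 distance from every point to every center."""
    return ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)

def train_ivf(vectors: np.ndarray, n_clusters: int, n_iters: int = 10, seed: int = 0):
    """Step 1: k-means, returning the centroids and each vector's cluster id."""
    rng = np.random.default_rng(seed)
    centroids = vectors[rng.choice(len(vectors), n_clusters, replace=False)].copy()
    for _ in range(n_iters):
        assignment = squared_distances(vectors, centroids).argmin(axis=1)
        for c in range(n_clusters):
            members = vectors[assignment == c]
            if len(members) > 0:
                centroids[c] = members.mean(axis=0)
    assignment = squared_distances(vectors, centroids).argmin(axis=1)
    return centroids, assignment

def ivf_search(query, vectors, centroids, assignment, n_probe: int, top_k: int) -> np.ndarray:
    """Step 2: scan only the vectors that belong to the n_probe nearest clusters."""
    centroid_d2 = ((centroids - query) ** 2).sum(axis=1)
    probed = np.argsort(centroid_d2)[:n_probe]
    candidates = np.flatnonzero(np.isin(assignment, probed))
    d2 = ((vectors[candidates] - query) ** 2).sum(axis=1)
    return candidates[np.argsort(d2)[:top_k]]

rng = np.random.default_rng(1)
vectors = rng.normal(size=(1000, 8))
centroids, assignment = train_ivf(vectors, n_clusters=16)
query = rng.normal(size=8)

fast = ivf_search(query, vectors, centroids, assignment, n_probe=2, top_k=5)
full = ivf_search(query, vectors, centroids, assignment, n_probe=16, top_k=5)
exact = np.argsort(((vectors - query) ** 2).sum(axis=1))[:5]
print("n_probe=2 :", fast)
print("n_probe=16:", full)
```

Probing every cluster reproduces the exact result at full cost. Probing fewer clusters is faster but can miss neighbors that sit just across a cluster boundary, which is the trade-off in step 3.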

Key Similarity Metrics

Python

import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Most common for text embeddings. Range: -1 to 1."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def dot_product(a: np.ndarray, b: np.ndarray) -> float:
    """Fast when vectors are normalized (same as cosine similarity for unit vectors)."""
    return np.dot(a, b)

def euclidean_distance(a: np.ndarray, b: np.ndarray) -> float:
    """L2 distance. Good for image embeddings."""
    return np.linalg.norm(a - b)

# Many text embedding models output unit-length (normalized) vectors
# → for those, dot product equals cosine similarity and is cheaper to compute
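The note above about normalized vectors can be checked numerically. The sketch below uses random vectors as stand-ins for real embeddings and also shows a related identity: for unit vectors, squared Euclidean distance equals 2 - 2 × cosine similarity, so all three metrics rank neighbors the same way.

```python
import numpy as np

def normalize(v: np.ndarray) -> np.ndarray:
    """Scale a vector to unit length."""
    return v / np.linalg.norm(v)

rng = np.random.default_rng(42)
a = normalize(rng.normal(size=384))  # random stand-ins for real embeddings
b = normalize(rng.normal(size=384))

cosine = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
dot = np.dot(a, b)
l2_squared = np.sum((a - b) ** 2)

print(np.isclose(cosine, dot))              # True
print(np.isclose(l2_squared, 2 - 2 * dot))  # True
```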

Think it through like the interview

Don't just state that HNSW is fast — walk through how a search query navigates the multi-layer graph structurally.

Think it through: HNSW Graphs vs Flat Indexes (Graph Traversal & Indexing)

Problem: Design a search strategy to find the nearest neighbor to a query vector among 10 million documents. Explain why a flat brute-force index is impractical.

1. Evaluate the flat index cost: “If we store our 10 million vectors in a flat array and perform Cosine similarity on each for every query, what is the time complexity and why does it fail in production?”
2. Navigate the Hierarchical Graph layers: “How does HNSW utilize layers to traverse these 10 million vectors in logarithmic O(log N) time?”
3. Address index build-time constraints: “If HNSW is so fast for queries, what is the drawback during document ingestion/updates?”

Semantic Search from Scratch

To demystify vector databases, let's look at how we can implement a flat (brute-force) semantic search engine from scratch using only Python and numpy.

Python

import numpy as np

# A simulated database of documents with pre-computed vectors (3-dimensional for simplicity)
documents = [
    {
        "id": "doc1", 
        "text": "Deploying a Docker container on AWS EC2 or ECS", 
        "vector": np.array([0.85, 0.40, 0.10])
    },
    {
        "id": "doc2", 
        "text": "Cooking delicious Italian pasta carbonara with egg yolks", 
        "vector": np.array([0.10, 0.90, 0.35])
    },
    {
        "id": "doc3", 
        "text": "Linear algebra foundations: eigenvalues, eigenvectors, and SVD", 
        "vector": np.array([0.20, 0.15, 0.95])
    },
]

# Query: "How to host a web application on cloud infrastructure"
# The embedding model maps this query to a vector close to the tech topic
query_vector = np.array([0.80, 0.35, 0.15])

def cosine_similarity(v1: np.ndarray, v2: np.ndarray) -> float:
    # Formula: (A . B) / (||A|| * ||B||)
    return np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))

def search_scratch(query_vector: np.ndarray, db: list[dict], top_k: int = 2) -> list[tuple[float, str]]:
    results = []
    for doc in db:
        sim = cosine_similarity(query_vector, doc["vector"])
        results.append((sim, doc["text"]))
    
    # Sort descending by similarity score
    results.sort(key=lambda x: x[0], reverse=True)
    return results[:top_k]

# Execute query
hits = search_scratch(query_vector, documents, top_k=2)
for score, text in hits:
    print(f"[{score:.4f}] {text}")
# Output:
# [0.9976] Deploying a Docker container on AWS EC2 or ECS
# [0.5203] Cooking delicious Italian pasta carbonara with egg yolks
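The Python loop above is easy to follow but slow for large collections. A common alternative, sketched here as one possible approach, stacks the vectors into a matrix so that all similarities come from a single matrix-vector product. The function name `search_matrix` is hypothetical; the three vectors are the same ones used above.

```python
import numpy as np

def search_matrix(query: np.ndarray, matrix: np.ndarray, top_k: int = 2) -> list[tuple[int, float]]:
    """Cosine-similarity search over all rows of `matrix` at once."""
    matrix_unit = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)
    query_unit = query / np.linalg.norm(query)
    scores = matrix_unit @ query_unit  # one dot product per row
    top_k = min(top_k, len(scores))
    # argpartition selects the top_k best rows without sorting the whole array
    top = np.argpartition(-scores, top_k - 1)[:top_k]
    top = top[np.argsort(-scores[top])]  # order only the selected rows
    return [(int(i), float(scores[i])) for i in top]

matrix = np.array([
    [0.85, 0.40, 0.10],  # doc1
    [0.10, 0.90, 0.35],  # doc2
    [0.20, 0.15, 0.95],  # doc3
])
query = np.array([0.80, 0.35, 0.15])

hits = search_matrix(query, matrix, top_k=2)
for row, score in hits:
    print(f"[{score:.4f}] row {row}")
# [0.9976] row 0
# [0.5203] row 1
```

This is still an exact O(n × d) scan. It only moves the work from the Python interpreter into optimized numpy routines.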

Embedding Models Comparison

Before feeding vectors to a database, you must choose an embedding model. Different models yield different vector dimensions, semantic recall quality, and operational costs:

Model	Provider	Dimensions	Max Tokens (Input)	Key Strength	Best For
text-embedding-3-small	OpenAI	1536 (default)	8,191	Highly cost-efficient, standard performance, supports dimension shortening	Standard RAG applications, budget-friendly search
text-embedding-3-large	OpenAI	3072 (default)	8,191	Advanced semantic search, handles complex relations, dimension shortening	High-accuracy retrieval, multi-lingual matching
bge-large-en-v1.5	BAAI (Open Source)	1024	512	Top-tier open-source retrieval performance	Private cloud hosting, retrieval benchmarks
e5-large-v2	Microsoft (Open Source)	1024	512	Strong zero-shot performance across tasks	General-purpose open-source embeddings
embed-english-v3.0	Cohere	1024	512	Optimized for retrieval search, supports compression (binary embeddings)	Enterprise search, compressed vector databases
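The "dimension shortening" listed for the OpenAI models means a vector can be cut to its first few dimensions and rescaled to unit length. The sketch below shows that manual step on a random stand-in vector; `shorten_embedding` is a hypothetical helper. The embeddings API also accepts a `dimensions` parameter that returns shortened vectors directly. Truncation is only expected to preserve quality for models trained to support it.

```python
import numpy as np

def shorten_embedding(embedding, dims: int) -> np.ndarray:
    """Keep the first `dims` values and rescale the result to unit length."""
    vec = np.asarray(embedding, dtype=np.float32)
    if not 0 < dims <= vec.shape[0]:
        raise ValueError("dims must be between 1 and the original dimension")
    truncated = vec[:dims]
    norm = np.linalg.norm(truncated)
    if norm == 0:
        raise ValueError("cannot normalize an all-zero vector")
    return truncated / norm

rng = np.random.default_rng(7)
full = rng.normal(size=1536)  # random stand-in for a real 1536-dimensional embedding
short = shorten_embedding(full, 256)

print(short.shape)  # (256,)
print(f"{full.size * 4} bytes -> {short.size * 4} bytes per float32 vector")  # 6144 bytes -> 1024 bytes
```

Shorter vectors reduce storage and distance-computation cost in the database, usually at some loss of retrieval quality, so the target dimension is worth measuring on your own data.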

Vector Database Comparison

	ChromaDB	Pinecone	Weaviate	pgvector	Qdrant
Type	Open-source	Managed cloud	Open-source / cloud	PostgreSQL extension	Open-source
Best for	Dev, prototypes	Production, scale	Multi-modal, semantic search	Already on Postgres	High performance, filtering
Setup	1 line of Python	API key	Docker / cloud	`CREATE EXTENSION`	Docker / cloud
Filtering	Metadata WHERE	Metadata filter	GraphQL	SQL WHERE	Rich filter
Scale	Millions	Hundreds of millions	Large scale	Millions (with tuning)	Large scale
Cost	Free (self-hosted)	Paid (usage-based)	Free/paid	Free (self-hosted)	Free/paid

ChromaDB — Start Here for Development

Python

import chromadb
from openai import OpenAI

client = OpenAI()
chroma = chromadb.Client()                        # in-memory
# chroma = chromadb.PersistentClient("./chroma") # persist to disk

collection = chroma.get_or_create_collection(
    name="documents",
    metadata={"hnsw:space": "cosine"}
)

def add_documents(docs: list[dict]):
    """docs: list of {id, text, metadata}"""
    # Embed all texts in one API call; results come back in input order
    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=[d["text"] for d in docs],
    )
    embeddings = [item.embedding for item in response.data]
    collection.add(
        ids=[d["id"] for d in docs],
        embeddings=embeddings,
        documents=[d["text"] for d in docs],
        metadatas=[d.get("metadata", {}) for d in docs],
    )

def search(query: str, top_k: int = 5, filter: dict | None = None) -> list[dict]:
    query_embedding = client.embeddings.create(
        model="text-embedding-3-small", input=query
    ).data[0].embedding

    results = collection.query(
        query_embeddings=[query_embedding],
        n_results=top_k,
        where=filter,  # e.g., {"source": "Q3-report"}
    )
    return [
        {"text": doc, "metadata": meta, "distance": dist}
        for doc, meta, dist in zip(
            results["documents"][0],
            results["metadatas"][0],
            results["distances"][0]
        )
    ]

# Usage
add_documents([
    {"id": "1", "text": "Python is a dynamically-typed language", "metadata": {"source": "docs"}},
    {"id": "2", "text": "TypeScript adds static typing to JavaScript", "metadata": {"source": "docs"}},
    {"id": "3", "text": "The Eiffel Tower is in Paris, France", "metadata": {"source": "travel"}},
])

results = search("statically typed programming languages", top_k=2)
# Expected: the TypeScript and Python docs (both about typing)
# NOT the Eiffel Tower doc (semantically unrelated)
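The `distance` field returned by `search` is a distance, not a similarity. For a collection configured with `"hnsw:space": "cosine"` as above, it is 1 minus the cosine similarity, so lower is better. The sketch below converts it back and applies a cutoff; the helper names, the sample hits, and the 0.5 threshold are illustrative assumptions.

```python
def cosine_distance_to_similarity(distance: float) -> float:
    """Cosine distance is 1 - cosine similarity, so invert it."""
    return 1.0 - distance

def filter_by_similarity(hits: list[dict], min_similarity: float) -> list[dict]:
    """Keep only hits at or above the similarity threshold, adding a 'similarity' key."""
    kept = []
    for hit in hits:
        similarity = cosine_distance_to_similarity(hit["distance"])
        if similarity >= min_similarity:
            kept.append({**hit, "similarity": similarity})
    return kept

# Hypothetical hits in the shape returned by search() above
sample_hits = [
    {"text": "TypeScript adds static typing to JavaScript", "metadata": {"source": "docs"}, "distance": 0.20},
    {"text": "Python is a dynamically-typed language", "metadata": {"source": "docs"}, "distance": 0.35},
    {"text": "The Eiffel Tower is in Paris, France", "metadata": {"source": "travel"}, "distance": 0.90},
]

relevant = filter_by_similarity(sample_hits, min_similarity=0.5)
print([hit["text"] for hit in relevant])
```

A cutoff like this is useful because a top-k query always returns k results, even when none of them is actually related to the query.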

Pinecone — Managed Cloud Vector Database

For production applications with millions or billions of vectors, a managed cloud vector database such as Pinecone is a common choice. With a serverless index, the service handles indexing, scaling, and availability for you.

Python

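import os

from pinecone import Pinecone, ServerlessSpec

# Initialize the Pinecone client; the key is read from the environment
# (the variable name PINECONE_API_KEY is a convention assumed here)
pc = Pinecone(api_key=os.environ["PINECONE_API_KEY"])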

index_name = "knowledge-base"

# Create index if it doesn't exist
if index_name not in pc.list_indexes().names():
    pc.create_index(
        name=index_name,
        dimension=1536, # Must match the embedding model (e.g. OpenAI text-embedding-3-small)
        metric="cosine",
        spec=ServerlessSpec(
            cloud="aws",
            region="us-east-1"
        )
    )

# Connect to the index
index = pc.Index(index_name)

# Upsert vectors with metadata
index.upsert(
    vectors=[
        {
            "id": "doc_1", 
            "values": [0.015, -0.023, 0.091, ...], # 1536 float values
            "metadata": {"category": "coding", "source": "docs"}
        },
        {
            "id": "doc_2", 
            "values": [0.088, 0.124, -0.005, ...],
            "metadata": {"category": "travel", "source": "blogs"}
        }
    ]
)

# Search the index
results = index.query(
    vector=[0.012, -0.020, 0.085, ...], # Query embedding
    top_k=2,
    include_metadata=True,
    filter={"category": "coding"} # Optional metadata filtering (Pre-filtering)
)

for match in results["matches"]:
    print(f"Doc: {match['id']} | Score: {match['score']:.4f} | Meta: {match['metadata']}")
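The upsert above sends two vectors in one request. For larger collections, requests are usually split into batches. The sketch below shows only the batching logic; the helper name `batched` and the batch size of 100 are assumptions, so check the current Pinecone request limits before choosing a size.

```python
from collections.abc import Iterator

def batched(items: list[dict], batch_size: int) -> Iterator[list[dict]]:
    """Yield consecutive slices of `items` with at most `batch_size` elements each."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

# Placeholder records in the same shape as the upsert payload above
vectors = [
    {"id": f"doc_{i}", "values": [0.0, 0.0, 0.0], "metadata": {"source": "docs"}}
    for i in range(250)
]

batches = list(batched(vectors, batch_size=100))
print([len(batch) for batch in batches])  # [100, 100, 50]

# With a real index, each batch would be sent separately:
# for batch in batches:
#     index.upsert(vectors=batch)
```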

pgvector — Vectors in PostgreSQL

If you're already using PostgreSQL, pgvector is often the simplest option — no new infrastructure:

-- Install the extension
CREATE EXTENSION IF NOT EXISTS vector;

-- Create table with a vector column
CREATE TABLE documents (
    id          SERIAL PRIMARY KEY,
    content     TEXT NOT NULL,
    embedding   vector(1536),           -- OpenAI text-embedding-3-small dimension
    source      TEXT,
    created_at  TIMESTAMPTZ DEFAULT NOW()
);

-- Create HNSW index for fast ANN search
CREATE INDEX idx_documents_embedding ON documents
USING hnsw (embedding vector_cosine_ops)
WITH (m = 16, ef_construction = 64);
-- m: connections per node (more = better recall, more memory)
-- ef_construction: build-time search depth (higher = better quality, slower build)

-- Insert a document with its embedding
INSERT INTO documents (content, embedding, source)
VALUES ('Python is a dynamically typed language', '[0.1, 0.2, ...]'::vector, 'docs');

-- Similarity search (<=> is cosine distance: lower = more similar)
SELECT content, source, embedding <=> '[0.05, 0.18, ...]'::vector AS distance
FROM documents
ORDER BY embedding <=> '[0.05, 0.18, ...]'::vector  -- your query embedding
LIMIT 5;

-- Filtered search: combine vector ordering with metadata predicates ($1 is the query embedding)
SELECT content, embedding <=> $1 AS distance
FROM documents
WHERE source = 'docs'
  AND created_at > NOW() - INTERVAL '30 days'
ORDER BY distance
LIMIT 10;
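The `<=>` operator used above is one of three distance operators pgvector provides: `<->` is L2 distance, `<#>` is the negative inner product, and `<=>` is cosine distance. The numpy functions below are a sketch of what each operator computes, useful for sanity-checking query results. The function names are my own.

```python
import numpy as np

def l2_distance(a: np.ndarray, b: np.ndarray) -> float:
    """Equivalent of pgvector's <-> operator."""
    return float(np.linalg.norm(a - b))

def negative_inner_product(a: np.ndarray, b: np.ndarray) -> float:
    """Equivalent of <#>: negated so that smaller still means more similar."""
    return float(-np.dot(a, b))

def cosine_distance(a: np.ndarray, b: np.ndarray) -> float:
    """Equivalent of <=>: 1 minus cosine similarity."""
    return float(1 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

a = np.array([3.0, 4.0])
b = np.array([4.0, 3.0])

print(f"{l2_distance(a, b):.4f}")      # 1.4142
print(negative_inner_product(a, b))    # -24.0
print(f"{cosine_distance(a, b):.4f}")  # 0.0400
```

All three are defined so that `ORDER BY ... ASC` returns the most similar rows first. The index operator class (`vector_cosine_ops` in the index above) should match the operator used in the query.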

Python

# Python with psycopg2 + pgvector
import os

import numpy as np
import psycopg2
from pgvector.psycopg2 import register_vector

# Connection string read from the environment (the variable name is an assumption)
conn = psycopg2.connect(os.environ["DATABASE_URL"])
register_vector(conn)  # registers the vector type on this connection

def search_postgres(query_embedding: list[float], top_k: int = 5) -> list[dict]:
    # The adapter handles numpy arrays; a plain list would be sent as a Postgres array
    vec = np.array(query_embedding)
    with conn.cursor() as cur:
        cur.execute("""
            SELECT content, source, embedding <=> %s AS distance
            FROM documents
            ORDER BY embedding <=> %s
            LIMIT %s
        """, (vec, vec, top_k))
        return [{"content": row[0], "source": row[1], "distance": row[2]}
                for row in cur.fetchall()]
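If the pgvector Python adapter is not available, an embedding can also be passed as a text literal and cast with `::vector`, the same form the SQL examples above use. The helper below is a hypothetical sketch of that formatting step only.

```python
def to_vector_literal(values: list[float]) -> str:
    """Format a list of floats as pgvector's text form, e.g. '[0.1,0.2,0.3]'."""
    return "[" + ",".join(repr(float(v)) for v in values) + "]"

literal = to_vector_literal([0.1, 0.2, 0.3])
print(literal)  # [0.1,0.2,0.3]

# With a real cursor, the literal is passed as a parameter and cast in SQL:
# cur.execute(
#     "SELECT content FROM documents ORDER BY embedding <=> %s::vector LIMIT %s",
#     (literal, 5),
# )
```

Passing the literal as a bound parameter, rather than formatting it into the SQL string, keeps the query safe from injection.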

Practice

Basic: Create a semantic search engine over 100 Wikipedia articles using ChromaDB + OpenAI embeddings. Compare results with keyword search.
pgvector: Set up pgvector in PostgreSQL, store product descriptions as embeddings, and build "similar products" search.
RAG integration: Combine your vector DB with an LLM to build a Q&A bot over a document collection. Add metadata filtering by document date.

Next: Fine-tuning — customizing LLMs for specific domains.
