Concept · 2026-05-16
Embeddings — definition, models, and how to choose
Embeddings turn text (or images, audio, code) into dense numerical vectors. Similar inputs produce similar vectors. They're the retrieval backbone of RAG, semantic search, classification, and most LLM-era infrastructure.
Definition
An embedding is a dense numerical vector that represents an input — text chunk, image, audio clip, code snippet — such that semantically similar inputs produce numerically similar vectors.
Concretely: text-embedding-3-small (OpenAI) maps any input string to a 1536-dimensional vector of floats. Two sentences about the same topic produce vectors with high cosine similarity (typically > 0.7). Two unrelated sentences produce vectors with low cosine similarity (typically < 0.3).
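A minimal sketch of that check, assuming the openai Python package (v1-style client) with an API key in the environment; the two example sentences are invented:

```python
import numpy as np
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def embed(texts):
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([d.embedding for d in resp.data])  # one 1536-dim row per input

a, b = embed([
    "How do I reset my password?",
    "Steps to recover a forgotten account password",
])
# OpenAI embeddings come back unit-normalized, so the dot product is cosine similarity.
print(float(a @ b))  # related sentences: expect a high score
```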
Why it matters
Before embeddings, "does this query relate to this document?" required either keyword overlap (BM25, TF-IDF) or hand-engineered features. Embeddings learn the answer.
Pretrained embedding models give you semantic retrieval for free: embed every document in your corpus once, embed each query at runtime, and a similarity search returns the most relevant documents. This is the retrieval half of RAG.
A short history
The lineage:
- Word2Vec (Mikolov et al., Google 2013) — first widely-used neural word embeddings. Local-window objective; learns word-level vectors.
- GloVe (Pennington, Socher, Manning, Stanford NLP 2014) — global co-occurrence matrix factorization. Different objective, similar shape.
- ELMo (Peters et al. 2018) — first contextual embeddings; same word produces different vectors in different sentences.
- BERT (Devlin et al., Google 2018-2019) — bidirectional transformer encoder; CLS-token output became the default sentence embedding for classification + retrieval.
- Sentence-BERT / sentence-transformers (Reimers + Gurevych 2019) — fine-tuned BERT specifically for sentence-level similarity. Made embeddings practical for RAG.
- OpenAI text-embedding-ada-002 (2022) → text-embedding-3-small/large (2024). API-served, no self-hosting. Took over production.
How to use them
Three primary use cases:
- Semantic search. Embed all your documents, embed a query, return top-K nearest neighbors. The retrieval half of RAG; sketches after this list.
- Classification. Embed labeled examples, train a small classifier (logistic regression, kNN) on the embeddings. Cheap; works surprisingly well.
- Clustering. Embed your corpus, run k-means or HDBSCAN. Discover thematic groups without labels.
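A minimal sketch of the first two use cases, assuming sentence-transformers with the all-MiniLM-L6-v2 model (an arbitrary open-weight choice; any embedding model works the same way) and an invented three-document corpus:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

docs = [
    "Reset your password from the account settings page.",
    "Our office is closed on public holidays.",
    "Refunds are processed within 5 business days.",
]
# Unit-normalized vectors, one row per document.
doc_vecs = model.encode(docs, normalize_embeddings=True)

def search(query, k=2):
    q = model.encode([query], normalize_embeddings=True)[0]
    scores = doc_vecs @ q             # cosine similarity via dot product
    top = np.argsort(-scores)[:k]     # indices of the k best matches
    return [(round(float(scores[i]), 3), docs[i]) for i in top]

print(search("how do I get my money back?"))
```

Classification reuses the same document vectors; a small scikit-learn classifier on top is often enough (the labels here are invented):

```python
from sklearn.linear_model import LogisticRegression

labels = ["account", "hours", "billing"]  # one label per document above
clf = LogisticRegression(max_iter=1000).fit(doc_vecs, labels)
print(clf.predict(model.encode(["when are you open?"], normalize_embeddings=True)))
```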
How to choose an embedding model
Three trade-offs:
- Quality vs cost. text-embedding-3-large (3072 dims) outperforms text-embedding-3-small (1536 dims) on most benchmarks but costs ~6× per token. Cohere embed-english-v3 is competitive. Open-weight: BGE-large, e5-large, gte-large.
- API vs self-host. OpenAI/Cohere/Voyage APIs are easiest. Self-hosting open-weight models (BGE, e5) saves money + keeps data on-prem; cost: GPU infrastructure.
- Dimensions vs storage. Higher dims = better quality but more storage + slower nearest-neighbor search. Matryoshka-style models (text-embedding-3) let you truncate dimensions if cost matters more than quality (sketch below).
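A sketch of Matryoshka-style truncation, assuming unit-normalized embeddings from a Matryoshka-trained model (random vectors stand in below):

```python
import numpy as np

def truncate(vecs, dims):
    """Keep the leading `dims` coordinates, then re-normalize.

    Only sound for Matryoshka-trained models (e.g. text-embedding-3),
    which pack the most information into the leading dimensions."""
    cut = vecs[:, :dims]
    return cut / np.linalg.norm(cut, axis=1, keepdims=True)

full = np.random.randn(100, 3072).astype("float32")  # stand-in for real embeddings
full /= np.linalg.norm(full, axis=1, keepdims=True)
small = truncate(full, 256)  # 12x less storage, modest quality hit
```

The text-embedding-3 models expose the same idea server-side via a `dimensions` parameter on the embeddings call, so you can request truncated vectors directly.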
Benchmarks
The standard evaluation is MTEB (Massive Text Embedding Benchmark, Muennighoff et al. 2022) — 56 datasets spanning retrieval, classification, clustering, and more. Check the live leaderboard at huggingface.co/spaces/mteb/leaderboard before picking; top performers move around monthly.
One caveat: MTEB is English-heavy. For multilingual production, test on your specific languages first; some English-leaderboard leaders underperform on lower-resource languages.
Storing + searching embeddings
You need a vector database (or vector index) to scale retrieval past ~10k documents. Options:
- FAISS (Johnson, Douze, Jégou, Facebook AI 2017) — library, not a database. Embed it in your app. Fastest; simplest; sketch after this list.
- Pinecone (founded 2019) — managed cloud vector database. Easiest production deployment.
- Weaviate, Qdrant, Milvus, Chroma — open source + managed cloud. Trade-offs differ; for solo developers Qdrant + Chroma are the easiest local options.
- Postgres + pgvector — if you already have Postgres, the extension gives you vector search without adding another service.
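A minimal FAISS sketch (the library option above), assuming the faiss-cpu package; random vectors stand in for real embeddings:

```python
import faiss
import numpy as np

d = 1536
doc_vecs = np.random.randn(10_000, d).astype("float32")
faiss.normalize_L2(doc_vecs)        # normalize so inner product == cosine

index = faiss.IndexFlatIP(d)        # exact inner-product search
index.add(doc_vecs)

query = np.random.randn(1, d).astype("float32")
faiss.normalize_L2(query)
scores, ids = index.search(query, 5)  # top-5 nearest neighbors
print(ids[0], scores[0])
```

IndexFlatIP is exact brute force; at millions of vectors you'd swap in an approximate index (e.g. HNSW) and trade a little recall for speed.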
Common anti-patterns
- Embedding raw documents whole. Use chunked embeddings (500-2000 token chunks); the larger the chunk the more semantically diluted the vector.
- Cosine similarity threshold = absolute relevance. 0.7 cosine in one corpus means something different from 0.7 in another. Calibrate per-corpus.
- Storing embeddings at full precision forever. Quantize old embeddings (8-bit, 4-bit) to save storage; quality loss is small (sketch after this list).
- Switching models without budgeting the re-embed. Vectors from different models aren't comparable, so a model change means re-embedding the entire corpus. Plan upgrades with that cost and downtime budgeted.
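A sketch of the int8 idea, assuming simple symmetric per-vector quantization (one of several possible schemes; random unit vectors stand in for real embeddings):

```python
import numpy as np

def quantize_int8(vecs):
    """Symmetric per-vector int8 quantization: ~4x smaller than float32."""
    scale = np.abs(vecs).max(axis=1, keepdims=True) / 127.0
    return np.round(vecs / scale).astype(np.int8), scale.astype(np.float32)

def dequantize(q, scale):
    return q.astype(np.float32) * scale

vecs = np.random.randn(1_000, 1536).astype("float32")
vecs /= np.linalg.norm(vecs, axis=1, keepdims=True)

q, scale = quantize_int8(vecs)
recovered = dequantize(q, scale)
recovered /= np.linalg.norm(recovered, axis=1, keepdims=True)
# Cosine similarity between original and reconstruction stays near 1.0.
print(np.mean(np.sum(vecs * recovered, axis=1)))
```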
What embeddings don't do
Embeddings are a similarity tool, not a verification tool. Two sentences can have high cosine similarity but contradict each other. The classic example:
- "The model has 7B parameters."
- "The model has 7B billion parameters."
Cosine similarity ~ 0.99. Factual relationship: one is wrong. Embeddings retrieve; they don't verify. That's where RAG vs VERITAS enters: combine semantic retrieval (embeddings) with claim-level verification (VERITAS) to cover both axes.
Related
- LLM grounding — the broader frame
- RAG vs VERITAS — when embeddings aren't enough
- RAG + retrieval topic hub
- Foundational papers — Word2Vec, GloVe, BERT
- LlamaIndex integration — embedding-first RAG