Concept · 2026-05-16
Embeddings — definition, models, and how to choose
Embeddings turn text (or images, audio, code) into dense numerical vectors. Similar inputs produce similar vectors. They're the retrieval backbone of RAG, semantic search, classification, and most LLM-era infrastructure.
Definition
An embedding is a dense numerical vector that represents an input — text chunk, image, audio clip, code snippet — such that semantically similar inputs produce numerically similar vectors.
Concretely: text-embedding-3-small (OpenAI) maps any input string to a 1536-dimensional vector of floats. Two sentences about the same topic produce vectors with high cosine similarity (typically > 0.7). Two unrelated sentences produce vectors with low cosine similarity (typically < 0.3).
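A minimal sketch of that check, assuming the openai Python package (v1-style client) with an API key in the environment; the two example sentences are invented:

```python
import numpy as np
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def embed(texts):
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([d.embedding for d in resp.data])  # one 1536-dim row per input

a, b = embed([
    "How do I reset my password?",
    "Steps to recover a forgotten account password",
])
# OpenAI embeddings come back unit-normalized, so the dot product is cosine similarity.
print(float(a @ b))  # related sentences: expect a high score
```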
Why it matters
Before embeddings, "does this query relate to this document?" required either keyword overlap (BM25, TF-IDF) or hand-engineered features. Embeddings learn the answer.
Pretrained embedding models give you semantic retrieval for free: embed every document in your corpus once, embed each query at runtime, and a similarity search returns the most relevant documents. This is the retrieval half of RAG.
A short history
The lineage:
- Word2Vec (Mikolov et al., Google 2013) — first widely-used neural word embeddings. Local-window objective; learns word-level vectors.
- GloVe (Pennington, Socher, Manning, Stanford NLP 2014) — global co-occurrence matrix factorization. Different objective, similar shape.
- ELMo (Peters et al. 2018) — first contextual embeddings; same word produces different vectors in different sentences.
- BERT (Devlin et al., Google 2018-2019) — bidirectional transformer encoder; CLS-token output became the default sentence embedding for classification + retrieval.
- Sentence-BERT / sentence-transformers (Reimers + Gurevych 2019) — fine-tuned BERT specifically for sentence-level similarity. Made embeddings practical for RAG.
- OpenAI text-embedding-ada-002 (2022) → text-embedding-3-small/large (2024). API-served, no self-hosting. Took over production.
How to use them
Three primary use cases:
- Semantic search. Embed all your documents, embed a query, return top-K nearest neighbors. The retrieval half of RAG; sketches after this list.
- Classification. Embed labeled examples, train a small classifier (logistic regression, kNN) on the embeddings. Cheap; works surprisingly well.
- Clustering. Embed your corpus, run k-means or HDBSCAN. Discover thematic groups without labels.
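A minimal sketch of the first two use cases, assuming sentence-transformers with the all-MiniLM-L6-v2 model (an arbitrary open-weight choice; any embedding model works the same way) and an invented three-document corpus:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

docs = [
    "Reset your password from the account settings page.",
    "Our office is closed on public holidays.",
    "Refunds are processed within 5 business days.",
]
# Unit-normalized vectors, one row per document.
doc_vecs = model.encode(docs, normalize_embeddings=True)

def search(query, k=2):
    q = model.encode([query], normalize_embeddings=True)[0]
    scores = doc_vecs @ q             # cosine similarity via dot product
    top = np.argsort(-scores)[:k]     # indices of the k best matches
    return [(round(float(scores[i]), 3), docs[i]) for i in top]

print(search("how do I get my money back?"))
```

Classification reuses the same document vectors; a small scikit-learn classifier on top is often enough (the labels here are invented):

```python
from sklearn.linear_model import LogisticRegression

labels = ["account", "hours", "billing"]  # one label per document above
clf = LogisticRegression(max_iter=1000).fit(doc_vecs, labels)
print(clf.predict(model.encode(["when are you open?"], normalize_embeddings=True)))
```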
How to choose an embedding model
Three trade-offs:
- Quality vs cost. text-embedding-3-large (3072 dims) outperforms text-embedding-3-small (1536 dims) on most benchmarks but costs ~6× per token. Cohere embed-english-v3 is competitive. Open-weight: BGE-large, e5-large, gte-large.
- API vs self-host. OpenAI/Cohere/Voyage APIs are easiest. Self-hosting open-weight models (BGE, e5) saves money + keeps data on-prem; cost: GPU infrastructure.
- Dimensions vs storage. Higher dims = better quality but more storage + slower nearest-neighbor search. Matryoshka-style models (text-embedding-3) let you truncate dimensions if cost matters more than quality (sketch below).
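A sketch of Matryoshka-style truncation, assuming unit-normalized embeddings from a Matryoshka-trained model (random vectors stand in below):

```python
import numpy as np

def truncate(vecs, dims):
    """Keep the leading `dims` coordinates, then re-normalize.

    Only sound for Matryoshka-trained models (e.g. text-embedding-3),
    which pack the most information into the leading dimensions."""
    cut = vecs[:, :dims]
    return cut / np.linalg.norm(cut, axis=1, keepdims=True)

full = np.random.randn(100, 3072).astype("float32")  # stand-in for real embeddings
full /= np.linalg.norm(full, axis=1, keepdims=True)
small = truncate(full, 256)  # 12x less storage, modest quality hit
```

The text-embedding-3 models expose the same idea server-side via a `dimensions` parameter on the embeddings call, so you can request truncated vectors directly.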
Benchmarks
The standard evaluation is MTEB (Massive Text Embedding Benchmark, Muennighoff et al. 2022) — 56 datasets spanning retrieval, classification, clustering, and more. Check the live leaderboard at huggingface.co/spaces/mteb/leaderboard before picking; top performers move around monthly.
One caveat: MTEB is English-heavy. For multilingual production, test on your specific languages first; some English-leaderboard leaders underperform on lower-resource languages.
Storing + searching embeddings
You need a vector database (or vector index) to scale retrieval past ~10k documents. Options:
- FAISS (Johnson, Douze, Jégou, Facebook AI 2017) — library, not a database. Embed it in your app. Fastest; simplest; sketch after this list.
- Pinecone (founded 2019) — managed cloud vector database. Easiest production deployment.
- Weaviate, Qdrant, Milvus, Chroma — open source + managed cloud. Trade-offs differ; for solo developers Qdrant + Chroma are the easiest local options.
- Postgres + pgvector — if you already have Postgres, the extension gives you vector search without adding another service.
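A minimal FAISS sketch (the library option above), assuming the faiss-cpu package; random vectors stand in for real embeddings:

```python
import faiss
import numpy as np

d = 1536
doc_vecs = np.random.randn(10_000, d).astype("float32")
faiss.normalize_L2(doc_vecs)        # normalize so inner product == cosine

index = faiss.IndexFlatIP(d)        # exact inner-product search
index.add(doc_vecs)

query = np.random.randn(1, d).astype("float32")
faiss.normalize_L2(query)
scores, ids = index.search(query, 5)  # top-5 nearest neighbors
print(ids[0], scores[0])
```

IndexFlatIP is exact brute force; at millions of vectors you'd swap in an approximate index (e.g. HNSW) and trade a little recall for speed.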
Common anti-patterns
- Embedding raw documents whole. Use chunked embeddings (500-2000 token chunks); the larger the chunk the more semantically diluted the vector.
- Cosine similarity threshold = absolute relevance. 0.7 cosine in one corpus means something different from 0.7 in another. Calibrate per-corpus.
- Storing embeddings at full precision forever. Quantize old embeddings (8-bit, 4-bit) to save storage; quality loss is small (sketch after this list).
- Switching models without budgeting the re-embed. Vectors from different models aren't comparable, so a model change means re-embedding the entire corpus. Plan upgrades with that cost and downtime budgeted.
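A sketch of the int8 idea, assuming simple symmetric per-vector quantization (one of several possible schemes; random unit vectors stand in for real embeddings):

```python
import numpy as np

def quantize_int8(vecs):
    """Symmetric per-vector int8 quantization: ~4x smaller than float32."""
    scale = np.abs(vecs).max(axis=1, keepdims=True) / 127.0
    return np.round(vecs / scale).astype(np.int8), scale.astype(np.float32)

def dequantize(q, scale):
    return q.astype(np.float32) * scale

vecs = np.random.randn(1_000, 1536).astype("float32")
vecs /= np.linalg.norm(vecs, axis=1, keepdims=True)

q, scale = quantize_int8(vecs)
recovered = dequantize(q, scale)
recovered /= np.linalg.norm(recovered, axis=1, keepdims=True)
# Cosine similarity between original and reconstruction stays near 1.0.
print(np.mean(np.sum(vecs * recovered, axis=1)))
```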
What embeddings don't do
Embeddings are a similarity tool, not a verification tool. Two sentences can have high cosine similarity but contradict each other. The classic example:
- "The model has 7B parameters."
- "The model has 7B billion parameters."
Cosine similarity ~ 0.99. Factual relationship: one is wrong. Embeddings retrieve; they don't verify. That's where RAG vs VERITAS enters: combine semantic retrieval (embeddings) with claim-level verification (VERITAS) to cover both axes.
Related
- LLM grounding — the broader frame
- RAG vs VERITAS — when embeddings aren't enough
- RAG + retrieval topic hub
- Foundational papers — Word2Vec, GloVe, BERT
- LlamaIndex integration — embedding-first RAG