Context window — what it is, why size matters, and the 2018-2025 explosion

What is a context window?

The context window of an LLM is the maximum number of tokens the model can attend to at one time: system instructions, input prompt, retrieved documents, and the generated output all count against the same limit. It is measured in tokens (one token is roughly 0.75 English words). Two things bound it: the position-encoding scheme and sequence lengths chosen at training time, and the inference-time memory budget.
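The shared budget described above can be sketched in a few lines. This is only an illustration: the 0.75 words-per-token ratio is the rough English average quoted in the text, `estimate_tokens` and `window_budget` are hypothetical helper names, and a real application would count tokens with the model's own tokenizer.

```python
import math

# Rough English average quoted in the text; real ratios depend on the tokenizer.
WORDS_PER_TOKEN = 0.75


def estimate_tokens(text: str) -> int:
    """Crude token estimate from a whitespace word count (not a real tokenizer)."""
    words = len(text.split())
    return math.ceil(words / WORDS_PER_TOKEN)


def window_budget(input_tokens: dict[str, int], max_output_tokens: int,
                  window: int) -> tuple[bool, int]:
    """Return (fits, tokens_left) when inputs and output share one window."""
    used = sum(input_tokens.values()) + max_output_tokens
    return used <= window, window - used


if __name__ == "__main__":
    # Hypothetical request against an 8,192-token window.
    parts = {"system": 500, "prompt": 200, "retrieved_docs": 6000}
    print(window_budget(parts, max_output_tokens=1000, window=8192))  # (True, 492)
```

Reserving the output allowance up front is the design choice that matters here: a prompt that fills the window exactly leaves the model no room to answer.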

Every modern LLM has a context window. The number is one of the most marketed-and-misunderstood specs in AI.

The 2018-2025 explosion

2018 · GPT-1: 512 tokens (~380 words). Could read a paragraph.
2019 · GPT-2: 1,024 tokens. Could read a page.
2020 · GPT-3: 2,048 tokens.
2022 · GPT-3.5 / ChatGPT: 4,096 tokens (~3,000 words).
2023-03 · GPT-4: 8,192 tokens base / 32,768 with -32k variant.
2023-05 · Anthropic Claude 1.3 100k — first jump to 6-figure tokens.
2023-07 · Llama 2: 4,096. The open ecosystem lagged early on.
2023-11 · GPT-4 Turbo: 128,000 tokens (~96,000 words — a short novel).
2024-02 · Gemini 1.5 Pro: 128,000 tokens standard at launch, with up to 1 million in limited preview; 2 million announced at I/O 2024. Crossed the book-length threshold.
2024-04 · Llama 3: 8,192. Llama 3.1 (2024-07) raised this to 128,000.
2024-07 · Mistral Nemo: 128,000 (Mistral + NVIDIA collab).
2025-02 · Claude 3.7 Sonnet: 200,000.
2025-03 · Gemma 3: 128,000. Another open-weight family reaches 128k.

The 5 architectural enablers

RoPE (Rotary Position Embedding). Most 2023+ LLMs use RoPE. Enables context extension via position-interpolation (PI), NTK-aware scaling, YaRN. Foundational to Llama 2/3, Mistral, Qwen.
ALiBi (Attention with Linear Biases). Alternative to RoPE. Used by MPT, BLOOM. Cleaner extrapolation but slightly worse base quality.
FlashAttention + variants. I/O-aware attention (Dao et al. Stanford 2022) makes attention tractable at long lengths. FlashAttention-2 (2023) and FlashAttention-3 (2024) extend further.
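Sliding-window attention. Each token attends only to a fixed window of recent tokens. Mistral 7B (2023) paired a 4,096-token window with a 32k maximum sequence length; stacked layers let information propagate beyond a single window. Trades direct global access for tractable compute.
Ring Attention + Striped Attention. Distribute attention computation across multiple devices by passing key/value blocks around a ring (Liu et al. 2023). Techniques of this kind make million-token contexts feasible; whether Gemini 1.5 Pro's 2M tokens rely on them has not been published.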
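As a rough illustration of the first enabler, the sketch below applies RoPE to a single vector and shows Position Interpolation as a plain rescaling of the position index. It is a minimal pure-Python sketch, not any model's actual implementation: the function names are hypothetical, the base of 10000 is the commonly used default, and NTK-aware scaling and YaRN (which adjust frequencies rather than positions) are not shown.

```python
import math


def rope_angles(position: float, head_dim: int, base: float = 10000.0) -> list[float]:
    """One rotation angle per (even, odd) dimension pair at this position."""
    return [position * base ** (-2 * i / head_dim) for i in range(head_dim // 2)]


def apply_rope(vec: list[float], position: float, base: float = 10000.0) -> list[float]:
    """Rotate each consecutive pair of dimensions by its position-dependent angle."""
    if len(vec) % 2 != 0:
        raise ValueError("head dimension must be even")
    out: list[float] = []
    for i, theta in enumerate(rope_angles(position, len(vec), base)):
        x, y = vec[2 * i], vec[2 * i + 1]
        out.append(x * math.cos(theta) - y * math.sin(theta))
        out.append(x * math.sin(theta) + y * math.cos(theta))
    return out


def interpolate_position(position: int, trained_len: int, target_len: int) -> float:
    """Position Interpolation: squeeze positions back into the trained range."""
    return position * (trained_len / target_len)


if __name__ == "__main__":
    # A model trained on 4,096 positions, stretched to 16,384:
    # position 8,192 is presented to RoPE as position 2,048.
    scaled = interpolate_position(8192, trained_len=4096, target_len=16384)
    print(scaled)  # 2048.0
    print(apply_rope([1.0, 0.0, 1.0, 0.0], scaled))
```

Because the rotation depends only on position, the dot product of a rotated query and key depends only on their relative offset, which is why rescaling positions can extend context without retraining from scratch.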

The "Lost in the Middle" problem

Liu et al. (Stanford 2023) showed empirically that even when a model nominally supports a long context, recall of facts buried in the middle of long inputs degrades sharply. Plots of accuracy against the position of the fact in the context typically show a U-shape: best at the start, second-best at the end, worst in the middle 50%.

The 2024 follow-ups (Gemini 1.5 Pro's "multi-needle in a haystack") show major improvement but the gap hasn't closed: nominal context window ≠ effective context. Treat 200k as "can read 200k tokens without OOM" not "will use 200k tokens equally well."

The Needle in a Haystack benchmark

Greg Kamradt's NIAH benchmark (2023) became the de facto stress test. Setup: place a distinctive fact at varying positions in a long input, then ask the model to retrieve it. Plot accuracy as a function of (context-length, needle-position). Most pre-2024 long-context claims fall apart on NIAH. Gemini 1.5 Pro (2024-02) was the first model to pass NIAH to 1M+ at high accuracy; Claude 3 Opus and Claude 3.5 Sonnet improved further on the harder variants: "multi-needle" and "haystack of related distractors" tests.
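The setup above can be sketched as a small harness. This is an illustrative sketch, not Kamradt's code: `ask_model` is a hypothetical callable standing in for whatever model API is used, haystack length is counted in filler sentences rather than tokens, and scoring is a plain substring match.

```python
from typing import Callable


def build_haystack(filler: list[str], needle: str, depth: float) -> str:
    """Insert `needle` at a fractional depth (0.0 = start, 1.0 = end)."""
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be between 0.0 and 1.0")
    index = round(depth * len(filler))
    return " ".join(filler[:index] + [needle] + filler[index:])


def run_niah(ask_model: Callable[[str], str], filler: list[str], needle: str,
             question: str, expected: str, lengths: list[int],
             depths: list[float]) -> dict[tuple[int, float], bool]:
    """Return {(haystack_length, needle_depth): retrieved?} for every grid cell."""
    results: dict[tuple[int, float], bool] = {}
    for length in lengths:
        for depth in depths:
            context = build_haystack(filler[:length], needle, depth)
            prompt = f"{context}\n\nQuestion: {question}\nAnswer:"
            answer = ask_model(prompt)
            # Simplified scoring: the expected string must appear in the answer.
            results[(length, depth)] = expected.lower() in answer.lower()
    return results
```

The resulting grid is what gets drawn as the familiar heatmap, with context length on one axis and needle depth on the other.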

7 failure modes that limit usable context

Effective context shrinks with task complexity. Single-needle retrieval works at 200k+. Multi-hop reasoning across long context collapses much earlier.
Tokenizer inflation for non-English. The same 100k tokens hold ~75k English words but, with many tokenizers, only ~30-40k Chinese/Japanese characters. Effective context therefore differs by language.
Quadratic inference cost at long lengths. Naive attention is O(n²) in sequence length, so the attention compute for a 2M-token Gemini 1.5 Pro query could be on the order of 100× that of a 200k query (10× the length, squared). Pricing sometimes hides this.
Recency bias. Models lean toward content at the end of the input — often whichever instruction came last wins, regardless of earlier constraints.
System-prompt dilution in long context. Long retrieved contexts can dilute system-prompt instructions, so the model forgets it was supposed to format output a certain way.
KV cache memory explosion. Inference memory grows linearly with context length. At fp16, Llama 3.1 70B (80 layers, 8 KV heads, head dimension 128) needs about 320 KiB of KV cache per token: roughly 40 GiB for a single 128k-token query, and about 65 GB for a 200k-token sequence at the same shape.
Position-encoding break beyond training length. A model trained on 8k and stretched to 32k via naive RoPE-scaling often produces incoherent output past the training distribution boundary. YaRN, Position Interpolation, and continued pretraining mitigate this but don't eliminate it.
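Two of the failure modes above reduce to simple arithmetic, sketched here. The model shape is an assumption for illustration (a Llama-3-70B-like configuration with 80 layers, 8 KV heads, head dimension 128, fp16 values); the function names are hypothetical, and the quadratic ratio covers only naive attention, not the rest of the forward pass.

```python
def kv_cache_bytes(seq_len: int, n_layers: int, n_kv_heads: int,
                   head_dim: int, bytes_per_value: int = 2) -> int:
    """KV cache size for one sequence: a K and a V tensor per layer."""
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value
    return per_token * seq_len


def attention_cost_ratio(long_len: int, short_len: int) -> float:
    """Relative cost of naive O(n^2) attention at two sequence lengths."""
    return (long_len / short_len) ** 2


if __name__ == "__main__":
    # Assumed shape: 80 layers, 8 KV heads (grouped-query attention), head_dim 128, fp16.
    size = kv_cache_bytes(131_072, n_layers=80, n_kv_heads=8, head_dim=128)
    print(size / 1024**3)  # 40.0 (GiB)
    print(attention_cost_ratio(2_000_000, 200_000))  # 100.0
```

Halving `bytes_per_value` (an 8-bit KV cache) halves the memory figure, which is why cache quantization is a common lever at long contexts.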

When long context wins vs when RAG wins

The pragmatic question: should you stuff your knowledge into a long context window, or retrieve-then-generate with RAG?

Long context wins: single complex document (legal contract, codebase, scientific paper) where every chunk could matter; multi-document reasoning across a small set of related documents; conversation history that must be preserved exactly.
RAG wins: large knowledge corpus (10k+ docs); knowledge that updates frequently; need citation back to source; cost-sensitive at scale (1k tokens per query × 1M queries = 1B input tokens, versus 200B tokens at 200k per query, a 200× difference).
Both: production AI assistants that need both stable knowledge (RAG) and per-session memory (long context). Most modern chat applications use this hybrid.
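The cost argument for RAG can be made concrete with a short calculation. The per-token price below is a placeholder chosen only to make the arithmetic visible, not a real provider's rate, and the helper names are hypothetical; only the ratio between the two strategies carries over to real pricing.

```python
def total_input_tokens(tokens_per_query: int, n_queries: int) -> int:
    """Input tokens consumed across a whole workload."""
    return tokens_per_query * n_queries


def cost_usd(total_tokens: int, usd_per_million_tokens: float) -> float:
    """Workload cost at a flat per-token price."""
    return total_tokens / 1_000_000 * usd_per_million_tokens


if __name__ == "__main__":
    PLACEHOLDER_PRICE = 1.00  # hypothetical USD per million input tokens
    n_queries = 1_000_000

    rag = total_input_tokens(1_000, n_queries)             # retrieved chunks only
    long_context = total_input_tokens(200_000, n_queries)  # full context every time

    print(cost_usd(rag, PLACEHOLDER_PRICE))           # 1000.0
    print(cost_usd(long_context, PLACEHOLDER_PRICE))  # 200000.0
    print(long_context // rag)                        # 200
```

Prompt caching and tiered long-context pricing can shift the absolute numbers, so a real comparison should plug in the provider's current rates.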

For verification specifically: see RAG vs VERITAS — signed-claim verification works regardless of context length choice.

LLM grounding — long context is one grounding strategy
Embeddings — what powers retrieval when long context isn't enough
Quantization — reduces KV-cache memory cost at long contexts
Fine-tuning — continued-pretraining on long sequences fixes position-encoding break
Topic hub: Inference optimization — FlashAttention + sliding-window attention + KV-cache strategies