Quantization — running large models on small hardware (GGUF, GPTQ, AWQ, bitsandbytes)
Definition, the 5 canonical quantization techniques (post-training quantization, GPTQ, AWQ, GGUF/GGML, bitsandbytes 4/8-bit), the precision-quality-speed tradeoff, when each format wins, and the 2022-2024 timeline that made 7B-on-laptop / 70B-on-workstation realistic.
What is quantization?
Quantization is the process of representing a model's weights (and sometimes activations) at lower numerical precision than its native training format — typically 16-bit float → 8-bit int → 4-bit int — to reduce memory footprint and increase inference speed.
The math: a 70B-parameter model at FP16 needs about 140GB for its weights alone (2 bytes per parameter). At 4-bit, that drops to ~35GB, which fits on a single A100 80GB or across two RTX 4090s. At 2-bit, it is ~17.5GB, within reach of high-end consumer hardware. These figures cover weights only; the KV cache and activations add to the total at inference time. Quantization is the main technique that made open-weight LLMs deployable outside data centers.
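The arithmetic above can be reproduced with a few lines of Python. This is a minimal sketch: the helper name `weight_memory_gb` is illustrative, it uses decimal gigabytes, and it ignores per-block scale overhead, KV cache, and activations.

```python
def weight_memory_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate memory for the weights alone, in decimal gigabytes."""
    bytes_total = n_params * bits_per_weight / 8  # 8 bits per byte
    return bytes_total / 1e9


# Weight footprint of a 70B-parameter model at common bit widths.
for bits in (16, 8, 4, 2):
    print(f"70B at {bits}-bit: ~{weight_memory_gb(70e9, bits):.1f} GB")
```

Real quantized files are somewhat larger than this estimate, because most formats store scales (and sometimes zero points) alongside the low-bit codes.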
The 5 canonical quantization techniques
- Post-Training Quantization (PTQ). Apply quantization after training, no retraining required. Fast, zero training cost. Trade-off: more quality loss than quantization-aware training. The foundation for all modern LLM quantization formats.
- GPTQ. Approximate second-order optimization quantizes weights one column at a time, using the inverse Hessian to compensate for errors. 3-4 bit with <1% accuracy drop on most benchmarks. GPU-optimized via auto-gptq + ExLlama (Frantar et al. ICLR 2023).
- AWQ. Activation-aware Weight Quantization identifies salient weights (top ~1%) based on activation magnitude and protects them by scaling the corresponding channels before quantization, which avoids a mixed-precision storage format. Often beats GPTQ on instruction-tuned models (Lin et al., MIT 2023).
- GGUF (formerly GGML). Single-file format for quantized LLMs targeting CPU + GPU + Apple Silicon inference. Supports K-quants (Q2_K, Q3_K_M, Q4_K_M, Q5_K_M, Q6_K, Q8_0). Standard for llama.cpp + Ollama + LM Studio (Gerganov, 2023).
- bitsandbytes 4-bit + 8-bit. Dynamic quantization at load-time via PyTorch CUDA kernels (Dettmers et al. 2022). Used by Hugging Face Transformers via load_in_4bit=True. Foundational to QLoRA (4-bit base + LoRA adapters).
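Most of the techniques above build on the same basic step: map a group of weights to small integers plus a shared scale. The sketch below shows plain round-to-nearest, symmetric, block-wise quantization in pure Python. It is a simplified illustration and not the algorithm of any specific format: the function names, the block size of 32, and the symmetric absmax scaling are assumptions chosen for clarity, and GPTQ and AWQ add error compensation or channel scaling on top of a step like this.

```python
from typing import List, Tuple


def quantize_blockwise(
    weights: List[float], bits: int = 4, block_size: int = 32
) -> Tuple[List[int], List[float]]:
    """Round-to-nearest symmetric quantization with one scale per block."""
    qmax = 2 ** (bits - 1) - 1  # e.g. 7 for 4-bit, so codes lie in [-7, 7]
    codes: List[int] = []
    scales: List[float] = []
    for start in range(0, len(weights), block_size):
        block = weights[start:start + block_size]
        absmax = max(abs(w) for w in block)
        scale = absmax / qmax if absmax > 0 else 1.0
        scales.append(scale)
        # Clamp guards against floating-point overshoot at the block maximum.
        codes.extend(max(-qmax, min(qmax, round(w / scale))) for w in block)
    return codes, scales


def dequantize_blockwise(
    codes: List[int], scales: List[float], block_size: int = 32
) -> List[float]:
    """Reconstruct approximate weights from integer codes and block scales."""
    return [c * scales[i // block_size] for i, c in enumerate(codes)]


# Deterministic toy weights in [-0.2, 0.2]; real weights come from a checkpoint.
weights = [0.02 * ((i * 37) % 21 - 10) for i in range(64)]
codes, scales = quantize_blockwise(weights)
restored = dequantize_blockwise(codes, scales)
```

Each restored weight differs from the original by at most half a scale step, which is why smaller blocks (tighter scales) cost more storage but lose less accuracy.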
Format comparison — which to pick
- GGUF (Q4_K_M): best for CPU + Apple Silicon inference; widest tooling support (llama.cpp, Ollama, LM Studio); single file. Pick when: on-device, no GPU, hobbyist deployment.
- GPTQ (4-bit): best for GPU inference with mature kernels (ExLlama, ExLlamaV2). Pick when: consumer/prosumer GPU (4090, A6000), production throughput.
- AWQ (4-bit): often beats GPTQ on instruction-tuned chat models; slightly less tooling. Pick when: deploying chat-tuned 13B+ on GPU and quality matters more than tooling familiarity.
- bitsandbytes 4-bit: dynamic load-in-4bit via Transformers. Pick when: research iteration, QLoRA fine-tuning, or one-shot inference where you don't want to pre-quantize.
- FP8 (H100+): hardware-native FP8 on H100 and newer; minimal quality loss. Pick when: running on H100 or newer and you need quality close to FP16.
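The "pick when" rules above can be written down as a small decision helper. This is only a sketch of the list as stated: the function name, the parameter names, and especially the order in which the conditions are checked are assumptions, and real deployments weigh more factors than these four flags.

```python
def pick_format(
    has_gpu: bool,
    gpu_is_h100_or_newer: bool = False,
    fine_tuning: bool = False,
    chat_tuned: bool = False,
) -> str:
    """Return a starting-point format based on the comparison list."""
    if not has_gpu:
        return "GGUF (Q4_K_M)"       # CPU or Apple Silicon, on-device use
    if fine_tuning:
        return "bitsandbytes 4-bit"  # QLoRA or quick research iteration
    if gpu_is_h100_or_newer:
        return "FP8"                 # hardware-native, near-FP16 quality
    if chat_tuned:
        return "AWQ (4-bit)"         # chat-tuned model where quality matters
    return "GPTQ (4-bit)"            # consumer/prosumer GPU, mature kernels


print(pick_format(has_gpu=False))
print(pick_format(has_gpu=True, chat_tuned=True))
```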
The precision-quality-speed tradeoff
A rough rule of thumb (varies by model family):
- FP16 → 8-bit: ~0.1-0.5% benchmark drop; 2× memory reduction; ~10-30% speedup
- FP16 → 4-bit: ~1-3% benchmark drop; 4× memory reduction; ~50-150% speedup
- FP16 → 3-bit: ~3-6% benchmark drop; ~5× memory reduction; speedup similar to 4-bit
- FP16 → 2-bit: ~10-20% benchmark drop; 8× memory reduction; quality depends heavily on technique (IQ2 and K-quant schemes mitigate the loss)
Below 3-bit, perplexity climbs sharply for general LLMs (2024-12 community measurements). The 4-bit sweet spot is where most production deployments land.
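The reason quality falls off faster at low bit widths can be seen in a toy experiment: with fewer integer levels, each level has to cover a wider range of weight values, so the reconstruction error grows. The sketch below measures mean absolute round-trip error at 8, 4, 3, and 2 bits on synthetic weights. The helper name and the data are illustrative assumptions, and reconstruction error on toy values is not the same thing as a benchmark drop on a real model.

```python
def roundtrip_error(weights, bits):
    """Mean absolute error after symmetric round-to-nearest quantization."""
    qmax = 2 ** (bits - 1) - 1  # 127, 7, 3, 1 for 8, 4, 3, 2 bits
    scale = max(abs(w) for w in weights) / qmax
    restored = [
        max(-qmax, min(qmax, round(w / scale))) * scale for w in weights
    ]
    return sum(abs(w - r) for w, r in zip(weights, restored)) / len(weights)


# Deterministic synthetic weights spread over [-0.2, 0.2].
weights = [((i * 37) % 101 - 50) / 250 for i in range(256)]
errors = {bits: roundtrip_error(weights, bits) for bits in (8, 4, 3, 2)}

for bits, err in errors.items():
    print(f"{bits}-bit mean abs error: {err:.5f}")
```

On this data the error rises at every step down in precision, with the largest jump at 2 bits, where only three code values (-1, 0, 1) remain.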
Timeline — 2022 to 2024
- 2022-08 · LLM.int8() (Dettmers et al.) — first practical 8-bit LLM inference; basis for bitsandbytes.
- 2022-10 · GPTQ paper (Frantar et al.) — sets the 4-bit PTQ benchmark.
- 2023-03 · llama.cpp released (Gerganov) — gets Llama running on an M1 MacBook after ~24 hours of development.
- 2023-05 · QLoRA (Dettmers et al., U Washington) — 4-bit base + LoRA fine-tuning lets a 65B model train on a single 48GB GPU.
- 2023-06 · AWQ paper (Lin et al., MIT) — activation-aware variant beats GPTQ on chat models.
- 2023-08 · GGUF replaces GGML in llama.cpp — adds metadata, model-format versioning, named-tensor support.
- 2023-10 · ExLlamaV2 GPTQ kernel — 2-3× speedup vs auto-gptq on Llama-class models.
- 2024-02 · NVIDIA TensorRT-LLM ships production-grade quantization kernels including FP8 and INT4 GPTQ/AWQ paths.
- 2024-06 · IQ2 + IQ3 quants in llama.cpp — mixed-precision K-quants beat naive 2/3-bit by 10-30%.
- 2024-07 · K-quants become the new GGUF default — Q4_K_M now standard for community releases.
5 common failure modes
- Outlier-induced collapse. Below 4 bits, a few outlier values can dominate the quantization range and degrade accuracy for everything else. Mitigation: AWQ-style salient-weight protection, IQ-quants, or a higher bit width.
- Chat-template misalignment. Quantizing without preserving the chat template & system prompt structure can degrade instruction-following. Always test on the actual chat format the model expects.
- Tokenizer drift. If the quantization pipeline strips or rebuilds the tokenizer, special tokens (BOS, EOS, padding) can be encoded incorrectly. Compare token IDs before and after conversion.
- KV-cache precision mismatch. Quantizing weights to 4-bit but leaving the KV cache at FP16 wastes memory; quantizing the KV cache too aggressively kills long-context quality. An 8-bit cache (for example q8_0 in llama.cpp) is a common balance.
- Benchmark on wrong distribution. Most published quant-quality metrics use perplexity on WikiText, which is a weak proxy for chat quality. Run your domain-specific eval before deploying.
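The tokenizer-drift check from the list above can be automated by encoding the same probe strings with the tokenizer from before and after conversion and reporting any mismatch. The sketch below is a minimal version with hypothetical names (`find_token_drift`, `make_toy_encoder`); the toy whitespace encoders stand in for real tokenizers so the example runs on its own. With Hugging Face tokenizers, passing each tokenizer's `encode` method should work the same way.

```python
from typing import Callable, Dict, List, Tuple

Encoder = Callable[[str], List[int]]


def find_token_drift(
    encode_before: Encoder, encode_after: Encoder, probes: List[str]
) -> Dict[str, Tuple[List[int], List[int]]]:
    """Return probes whose token IDs differ between the two encoders."""
    drift = {}
    for text in probes:
        before, after = encode_before(text), encode_after(text)
        if before != after:
            drift[text] = (before, after)
    return drift


def make_toy_encoder(vocab: Dict[str, int], unk_id: int = 0) -> Encoder:
    """Whitespace tokenizer stand-in; unknown tokens map to unk_id."""
    return lambda text: [vocab.get(tok, unk_id) for tok in text.split()]


original = make_toy_encoder({"<s>": 1, "</s>": 2, "hello": 3})
rebuilt = make_toy_encoder({"<s>": 1, "hello": 3})  # "</s>" was dropped

drift = find_token_drift(original, rebuilt, ["<s> hello", "<s> hello </s>"])
for text, (before, after) in drift.items():
    print(f"{text!r}: {before} -> {after}")
```

Probe strings should include every special token and a full example in the chat format the model expects, since those are the cases most likely to change.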
When NOT to quantize
- You have FP16/BF16 budget on H100. No need to quantize; native precision is the highest quality.
- Safety-critical applications — medical, legal, financial — where the 1-3% quality drop matters. Run full-precision unless cost or latency forces otherwise.
- Speculative decoding draft models. Already small; quantizing further trades quality for negligible memory savings.
- Training (not inference). Training requires FP32/BF16/FP16 gradients, and the techniques here are applied after training. QLoRA is not an exception: it quantizes only the frozen base, while the adapter weights stay in FP16/BF16.
Related
- Fine-tuning — QLoRA uses 4-bit base + LoRA adapters
- LLM grounding — the grounding pattern works regardless of quantization level
- Multimodal — quantizing VLMs tends to preserve text quality, but the vision tower is more sensitive to reduced precision
- Topic hub: Inference optimization — quantization is one of many techniques (alongside FlashAttention, KV cache, batching, speculative decoding)
- Topic hub: Open-weight models — what you'd quantize and run locally