SourceScore


Quantization — running large models on small hardware (GGUF, GPTQ, AWQ, bitsandbytes)

Definition, the 5 canonical quantization techniques (post-training quantization, GPTQ, AWQ, GGUF/GGML, bitsandbytes 4/8-bit), the precision-quality-speed tradeoff, when each format wins, and the 2022-2024 timeline that made 7B-on-laptop / 70B-on-workstation realistic.

What is quantization?

Quantization is the process of representing a model's weights (and sometimes activations) at lower numerical precision than its native training format — typically 16-bit float → 8-bit int → 4-bit int — to reduce memory footprint and increase inference speed.

The math: a 70B-parameter model at FP16 needs 140GB of memory for its weights (2 bytes per parameter). At 4-bit, that drops to ~35GB, which fits on a single A100 80GB or two RTX 4090s. At 2-bit it is ~17.5GB, within reach of a single 24GB consumer GPU. These figures cover weights only; the KV cache and activations come on top. Quantization is the single technique that made open-weight LLMs deployable outside data centers.
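The arithmetic above can be reproduced with a few lines of Python. This is a minimal sketch: the helper name weight_memory_gb is illustrative, it counts weights only, and it assumes 1 GB = 1e9 bytes with no per-group metadata.

```python
def weight_memory_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate weight-only memory in GB (1 GB = 1e9 bytes).

    Ignores KV cache, activations, and quantization metadata (scales,
    zero points), so real files are somewhat larger.
    """
    return n_params * bits_per_weight / 8 / 1e9


# 70B parameters at the precisions discussed in the text.
for bits in (16, 8, 4, 2):
    print(f"{bits:>2}-bit: {weight_memory_gb(70e9, bits):.1f} GB")
```

For a 70B model this prints 140.0, 70.0, 35.0, and 17.5 GB, matching the figures in the paragraph.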

The 5 canonical quantization techniques

  1. Post-Training Quantization (PTQ). Apply quantization after training, no retraining required. Fast, zero training cost. Trade-off: more quality loss than quantization-aware training. The foundation for all modern LLM quantization formats.
  2. GPTQ. Uses approximate second-order information to quantize weights one column at a time, applying the inverse Hessian to adjust the remaining weights and compensate for the rounding error. 3-4 bit with <1% accuracy drop on most benchmarks. GPU-optimized via AutoGPTQ + ExLlama (Frantar et al., ICLR 2023).
  3. AWQ. Activation-aware Weight Quantization identifies salient weights (top ~1%) based on activation magnitude and protects them by scaling the corresponding channels before quantization, rather than storing them at higher precision. Typically beats GPTQ on instruction-tuned models (Lin et al., MIT, 2023).
  4. GGUF (formerly GGML). Single-file format for quantized LLMs targeting CPU + GPU + Apple Silicon inference. Supports K-quants (Q2_K, Q3_K_M, Q4_K_M, Q5_K_M, Q6_K) alongside legacy types such as Q8_0. Standard for llama.cpp + Ollama + LM Studio (Gerganov, 2023).
  5. bitsandbytes 4-bit + 8-bit. Quantizes weights on the fly at load time using custom CUDA kernels for PyTorch (Dettmers et al., 2022). Exposed in Hugging Face Transformers via load_in_4bit=True. Foundational to QLoRA (4-bit base + LoRA adapters).
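To make the common core of these techniques concrete, here is a minimal sketch of the simplest form of post-training quantization: symmetric, group-wise round-to-nearest with an absmax scale. It is not GPTQ, AWQ, or any GGUF type (those add error compensation, channel scaling, or more elaborate block layouts on top); the function names and the group size of 32 are assumptions chosen for illustration.

```python
def quantize_group(weights, bits=4):
    """Symmetric absmax round-to-nearest quantization of one group."""
    qmax = 2 ** (bits - 1) - 1                 # 7 for 4-bit
    absmax = max(abs(w) for w in weights)
    scale = absmax / qmax if absmax > 0 else 1.0
    # Round each weight to the nearest integer code and clamp to range.
    codes = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return codes, scale


def quantize(weights, group_size=32, bits=4):
    """Split a flat weight list into groups, each with its own scale."""
    return [
        quantize_group(weights[start:start + group_size], bits)
        for start in range(0, len(weights), group_size)
    ]


def dequantize(groups):
    """Reconstruct approximate float weights from (codes, scale) pairs."""
    restored = []
    for codes, scale in groups:
        restored.extend(code * scale for code in codes)
    return restored


weights = [0.82, -0.31, 0.05, -0.77, 0.40, 0.0, -0.12, 0.66]
groups = quantize(weights, group_size=4)
restored = dequantize(groups)
worst = max(abs(a - b) for a, b in zip(weights, restored))
print(f"max reconstruction error: {worst:.4f}")
```

Each weight lands within half a quantization step (scale / 2) of its original value. Smaller groups shrink the step but store more scales, which is the overhead the real formats trade off.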

Format comparison — which to pick

  • GGUF (Q4_K_M): best for CPU + Apple Silicon inference; widest tooling support (llama.cpp, Ollama, LM Studio); single file. Pick when: on-device, no GPU, hobbyist deployment.
  • GPTQ (4-bit): best for GPU inference with mature kernels (ExLlama, ExLlamaV2). Pick when: consumer/prosumer GPU (4090, A6000), production throughput.
  • AWQ (4-bit): often beats GPTQ on instruction-tuned chat models; slightly less tooling. Pick when: deploying chat-tuned 13B+ on GPU and quality matters more than tooling familiarity.
  • bitsandbytes 4-bit: dynamic load-in-4bit via Transformers. Pick when: research iteration, QLoRA fine-tuning, or one-shot inference where you don't want to pre-quantize.
  • FP8 (H100+): hardware-native FP8 on H100 and newer; minimal quality loss. Pick when: running on H100 and you need quality close to FP16.
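The guidance above can be read as a small decision procedure. The sketch below encodes it as a hypothetical helper; the function name, its parameters, and above all the order in which the conditions are checked are assumptions made for illustration, not a rule stated by any of the tools.

```python
def suggest_format(has_gpu: bool, h100_class: bool = False,
                   fine_tuning: bool = False, chat_tuned: bool = False) -> str:
    """Map a deployment situation to one of the formats listed above.

    The precedence (no GPU, then fine-tuning, then H100, then chat-tuned)
    is an illustrative assumption.
    """
    if not has_gpu:
        return "GGUF (Q4_K_M)"        # CPU / Apple Silicon, on-device
    if fine_tuning:
        return "bitsandbytes 4-bit"   # QLoRA or quick research iteration
    if h100_class:
        return "FP8"                  # hardware-native, near-FP16 quality
    if chat_tuned:
        return "AWQ (4-bit)"          # quality on instruction-tuned models
    return "GPTQ (4-bit)"             # mature GPU kernels, throughput


print(suggest_format(has_gpu=False))
print(suggest_format(has_gpu=True, chat_tuned=True))
```

Real choices also depend on which formats your serving stack supports, so treat the output as a starting point.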

The precision-quality-speed tradeoff

A rough rule of thumb (varies by model family):

  • FP16 → 8-bit: ~0.1-0.5% benchmark drop; 2× memory reduction; ~10-30% speedup
  • FP16 → 4-bit: ~1-3% benchmark drop; 4× memory reduction; ~50-150% speedup
  • FP16 → 3-bit: ~3-6% benchmark drop; ~5× memory reduction; speedup similar to 4-bit
  • FP16 → 2-bit: ~10-20% benchmark drop; 8× memory reduction; depends heavily on technique (IQ2 and K-quants mitigate the loss)

Below 3-bit, perplexity climbs sharply for general LLMs (2024-12 community measurements). The 4-bit sweet spot is where most production deployments land.
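The memory ratios above are nominal. Group-wise formats also store a scale (and sometimes a zero point) per group of weights, so the effective bits per weight is slightly higher than the headline number. A minimal sketch of that arithmetic, assuming a 16-bit scale per group and hypothetical helper names:

```python
def effective_bits_per_weight(bits, group_size, scale_bits=16, zero_point_bits=0):
    """Bits per weight once per-group metadata is amortized over the group."""
    return bits + (scale_bits + zero_point_bits) / group_size


def compression_vs_fp16(bits, group_size, **metadata):
    """Memory reduction factor relative to 16-bit weights."""
    return 16 / effective_bits_per_weight(bits, group_size, **metadata)


for group_size in (32, 128):
    bpw = effective_bits_per_weight(4, group_size)
    ratio = compression_vs_fp16(4, group_size)
    print(f"4-bit, group {group_size}: {bpw:.3f} bits/weight, {ratio:.2f}x vs FP16")
```

With these assumptions, 4-bit weights in groups of 32 cost 4.5 bits each (about 3.56× smaller than FP16), and groups of 128 cost 4.125 bits (about 3.88×), so the "4×" figure is an upper bound.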

Timeline — 2022 to 2024

  • 2022-08 · LLM.int8() (Dettmers et al.): first practical 8-bit LLM inference; basis for bitsandbytes.
  • 2022-10 · GPTQ paper (Frantar et al.): sets the 4-bit PTQ benchmark.
  • 2023-03 · llama.cpp released (Gerganov): LLaMA running on an M1 MacBook after ~24 hours of development.
  • 2023-05 · QLoRA (Dettmers et al., U Washington): 4-bit base + LoRA fine-tuning lets a 65B model train on a single 48GB GPU.
  • 2023-06 · AWQ paper (Lin et al., MIT): activation-aware variant beats GPTQ on chat models.
  • 2023-08 · GGUF replaces GGML in llama.cpp: adds metadata, model-format versioning, named-tensor support.
  • 2023-10 · ExLlamaV2 GPTQ kernel: 2-3× speedup vs AutoGPTQ on Llama-class models.
  • 2024-02 · NVIDIA TensorRT-LLM ships production-grade quantization kernels including FP8 and INT4 GPTQ/AWQ paths.
  • 2024-06 · IQ2 + IQ3 quants in llama.cpp: these i-quants beat naive 2/3-bit quantization by 10-30%.
  • 2024-07 · K-quants become the new GGUF default: Q4_K_M now standard for community releases.

5 common failure modes

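  1. Outlier-induced collapse. Below 4 bits, a few outlier values can dominate the quantization range, leaving too few levels for everything else and degrading output badly. Mitigation: AWQ-style salient-weight protection, IQ-quants, or a higher bit width.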
  2. Chat-template misalignment. Quantizing without preserving the chat template & system prompt structure can degrade instruction-following. Always test on the actual chat format the model expects.
  3. Tokenizer drift. If the quantization pipeline strips or rebuilds the tokenizer, special tokens (BOS, EOS, padding) can be mis-encoded. Compare token IDs before and after.
  4. KV-cache precision mismatch. Quantizing weights to 4-bit but leaving the KV cache at FP16 wastes memory; quantizing the KV cache too aggressively kills long-context quality. A lightly quantized cache, such as q8_0 in llama.cpp, is a common balance.
  5. Benchmark on the wrong distribution. Most published quant-quality metrics use perplexity on WikiText, which is a weak proxy for chat quality. Run your domain-specific eval before deploying.
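For the KV-cache point in item 4, it helps to estimate how large the cache actually is next to the weights. The sketch below uses the standard size formula (one key and one value tensor per layer). The helper name is hypothetical, the example shape (80 layers, 8 KV heads, head dimension 128) is an assumption modeled on a Llama-2-70B-class model with grouped-query attention, and per-block scale overhead of quantized cache types is ignored.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len,
                   bits_per_value=16, batch=1):
    """Approximate KV-cache size in bytes.

    The factor 2 accounts for storing both keys and values in every layer.
    """
    values = 2 * n_layers * n_kv_heads * head_dim * seq_len * batch
    return values * bits_per_value // 8


# Assumed 70B-class shape: 80 layers, 8 KV heads, head_dim 128.
for bits in (16, 8, 4):
    size = kv_cache_bytes(80, 8, 128, seq_len=4096, bits_per_value=bits)
    print(f"{bits:>2}-bit cache at 4096 tokens: {size / 2**30:.3f} GiB")
```

Under these assumptions the FP16 cache is 1.25 GiB at 4096 tokens and grows linearly with context length and batch size, which is why cache precision starts to matter at long contexts.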

When NOT to quantize

  • You have FP16/BF16 budget on H100. No need to quantize; native precision is the highest quality.
  • Safety-critical applications — medical, legal, financial — where the 1-3% quality drop matters. Run full-precision unless cost or latency forces otherwise.
  • Speculative decoding draft models. Already small; quantizing further trades quality for negligible memory savings.
  • Training (not inference). Training requires FP32/BF16/FP16 gradients. Quantization is a post-training technique (QLoRA quantizes the frozen base; adapter weights stay FP16/BF16).
