Fine-tuning — the complete reference (LoRA, QLoRA, DPO, RLHF, PEFT)

What is fine-tuning?

Fine-tuning is the process of taking a pre-trained foundation model and continuing training on a smaller, task-specific dataset to adapt the model's weights toward a specialized capability or behavior. The base model is fixed-cost (someone else trained it); the fine-tune is variable-cost (you train this one).

Three things distinguish fine-tuning from prompting and from retrieval (RAG):

Persistent change. The weights of the model change. The skill survives without the prompt.
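Inference cost is roughly unchanged. A fully fine-tuned model, or a LoRA adapter merged back into the base, is the same size as the base model, so self-hosted latency and compute per token stay the same. Hosted providers may price fine-tuned endpoints differently.
Data scale is moderate. Hundreds to hundreds of thousands of examples, not the trillions of tokens used for pretraining.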

The seven canonical fine-tuning techniques

Full supervised fine-tuning (SFT). Train all parameters end-to-end on (input, target) pairs. Most expressive, most expensive — needs the full GPU memory for the model + gradients + optimizer state.
Instruction tuning. SFT specifically on (instruction, response) pairs to make the model follow commands instead of continuing prefixes. Foundational to Flan-T5 (2022), InstructGPT (Ouyang et al. 2022), Alpaca (Stanford 2023).
LoRA. Freeze the base weights and inject small trainable low-rank matrices (rank 4-64 is typical) into the attention projection matrices, and optionally other linear layers. Often only around 0.1-1% of the parameters are trained, and the authors report quality on par with full fine-tuning on their benchmarks (Hu et al., Microsoft 2021).
QLoRA. LoRA on top of a 4-bit-quantized base. Enables fine-tuning 65B-parameter models on a single 48GB GPU, which put large-model fine-tuning within reach of hobbyists and academic labs (Dettmers et al., U Washington 2023).
RLHF. Three-stage: SFT → reward model on preference pairs → PPO RL against reward model. Foundational to ChatGPT, Claude, Llama-2-chat (Christiano et al. 2017; Ouyang et al. OpenAI 2022).
DPO. Direct Preference Optimization skips the reward-model + RL step. It optimizes the policy directly on preference pairs with a simple classification-style loss, using the frozen SFT model as a reference. Simpler, more stable, now default in many open-weight chat models (Rafailov et al., Stanford 2023).
Constitutional AI. Anthropic's RLAIF (RL from AI Feedback) approach. The model critiques and revises its own outputs against a constitution of principles, and AI-generated preference labels replace human raters for harmlessness (Bai et al., Anthropic 2022).
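
As a rough illustration of two entries above, the sketch below writes out LoRA's low-rank update and the DPO loss for a single preference pair in plain Python. The function names, the alpha/rank scaling convention, and the example shapes are assumptions chosen for illustration, not any library's API; a real run would use a tensor framework and batched log-probabilities.

```python
import math


def lora_trainable_params(d_out: int, d_in: int, rank: int) -> int:
    """Trainable values in one LoRA adapter: B is d_out x rank, A is rank x d_in."""
    return rank * (d_out + d_in)


def _matvec(matrix, vector):
    """Multiply a list-of-rows matrix by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def lora_forward(x, W, A, B, alpha, rank):
    """Compute y = W x + (alpha / rank) * B (A x).

    W is the frozen base weight; only A and B would receive gradients.
    """
    base = _matvec(W, x)
    update = _matvec(B, _matvec(A, x))
    scale = alpha / rank
    return [b + scale * u for b, u in zip(base, update)]


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one (chosen, rejected) pair: -log(sigmoid(beta * margin)).

    Each argument is the summed log-probability of a full response under
    the trainable policy or the frozen reference (SFT) model.
    """
    margin = ((policy_chosen_logp - ref_chosen_logp)
              - (policy_rejected_logp - ref_rejected_logp))
    z = beta * margin
    # -log(sigmoid(z)) == log(1 + exp(-z)), written in a numerically stable form.
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))
```

For a 4096 x 4096 projection at rank 8, `lora_trainable_params` gives 65,536 trainable values against 16,777,216 frozen ones, roughly 0.4% for that one matrix; the model-wide fraction depends on how many matrices are adapted. With B initialized to zero, the adapted layer starts out identical to the base layer. In `dpo_loss`, a policy that matches the reference yields log 2, and the loss falls as the policy favors the chosen response more than the reference does.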

Fine-tune vs RAG: the decision tree

The most common AI-engineering question. Fine-tuning changes behavior. RAG changes context.

Fine-tune when: the model needs to learn a style (tone, voice, format), a structured output schema, domain-specific vocabulary, or a multi-step reasoning pattern that no amount of prompting reliably elicits. The knowledge is small, stable, and worth burning into weights.
RAG when: the knowledge is large, changes frequently, requires citation, or is too long to fit in the context budget. Examples: customer-support docs that update weekly, a legal corpus, an internal knowledge base.
Both when: domain-specific assistant (fine-tune for tone + reasoning) that retrieves domain-specific facts (RAG for current knowledge). Most production systems use both.
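
The decision tree above can be written down as a small function. This is a sketch only: the `Need` fields and the `choose_approach` name are hypothetical, and the "prompting" fallback is an assumption drawn from the guidance elsewhere in this reference that a better prompt often solves the task.

```python
from dataclasses import dataclass


@dataclass
class Need:
    """Hypothetical summary of what a project requires."""
    changes_behavior: bool         # style, output schema, vocabulary, reasoning pattern
    knowledge_is_large: bool       # too big to burn into weights or fit in context
    knowledge_changes_often: bool  # e.g. docs that update weekly
    needs_citation: bool           # answers must point at a source


def choose_approach(need: Need) -> str:
    """Return 'fine-tune', 'rag', 'both', or 'prompting'."""
    wants_rag = (need.knowledge_is_large
                 or need.knowledge_changes_often
                 or need.needs_citation)
    if need.changes_behavior and wants_rag:
        return "both"
    if need.changes_behavior:
        return "fine-tune"
    if wants_rag:
        return "rag"
    # Assumption: with no behavior change and no retrieval need, try prompting first.
    return "prompting"
```

A support assistant that needs a house tone and cites weekly-updated docs would come out as "both", matching the production pattern described above.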

See RAG vs VERITAS for the verification layer that augments both approaches.

Timeline — 2017 to 2024

2017 · Deep Reinforcement Learning from Human Preferences (Christiano et al., OpenAI + DeepMind) — foundational paper introducing the preference-modeling pattern that becomes RLHF.
2019 · Parameter-Efficient Transfer Learning (Houlsby et al., Google ICML 2019) — adapter layers, the conceptual ancestor of LoRA.
2021 · LoRA: Low-Rank Adaptation of Large Language Models (Hu et al., Microsoft) — the paper that made PEFT mainstream.
2022 · Training language models to follow instructions with human feedback (Ouyang et al., OpenAI) — InstructGPT, the paper that productionized RLHF and led directly to ChatGPT.
2022 · Constitutional AI: Harmlessness from AI Feedback (Bai et al., Anthropic) — RLAIF and Constitutional AI introduced.
2023 · QLoRA: Efficient Finetuning of Quantized LLMs (Dettmers et al., U Washington) — 65B fine-tuning on a single GPU.
2023 · Direct Preference Optimization: Your Language Model is Secretly a Reward Model (Rafailov et al., Stanford) — DPO replaces the reward-model + RL stage of RLHF.
2024-11 · Tülu 3 (Allen AI) — open post-training recipe built on Llama 3.1 base models, matching Llama 3.1 Instruct quality with full data + code + training-script transparency.

5 common fine-tuning failure modes

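Catastrophic forgetting. The fine-tune teaches a new skill but the model loses general capabilities. Mitigation: include diverse general-capability data in the training mix; prefer LoRA, which keeps the base weights frozen and tends to forget less than full SFT, though the adapted model can still regress.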
Overfitting on small data. <500 examples + full SFT = memorization, not generalization. Use LoRA + early stopping + held-out eval set.
Reward hacking (RLHF/DPO). Model learns superficial features that score well on the reward model but produce worse outputs to humans. Mitigation: rotate preference annotators, KL penalty against the SFT checkpoint.
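Distribution mismatch. Training data is English and academic, but production users write casual, multilingual text. Mitigation: build the eval set from production-like inputs so the mismatch shows up in evaluation before it shows up in deployed behavior.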
Hidden capability degradation. Fine-tune improves measured task but quietly degrades safety, reasoning, or multilingual abilities. Run a broad eval suite (per evaluation harness), not just the target task.
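
One way to catch hidden capability degradation is to score the base and fine-tuned checkpoints on the same broad suite and flag any benchmark that drops. The sketch below assumes the scores are already computed as floats in [0, 1]; the benchmark names, the numbers, and the `tolerance` default are made up for illustration.

```python
def find_regressions(before, after, tolerance=0.01):
    """Return {benchmark: delta} for benchmarks that dropped by more than `tolerance`.

    `before` and `after` map benchmark names to scores for the base and
    fine-tuned checkpoints. Benchmarks missing from `after` are skipped.
    """
    regressions = {}
    for name, old_score in before.items():
        if name not in after:
            continue  # not re-measured, so nothing to compare
        delta = after[name] - old_score
        if delta < -tolerance:
            regressions[name] = delta
    return regressions


# Hypothetical scores: the target task improves while safety quietly drops.
base_scores = {"target_task": 0.62, "safety": 0.91, "reasoning": 0.70, "multilingual": 0.55}
tuned_scores = {"target_task": 0.81, "safety": 0.84, "reasoning": 0.70, "multilingual": 0.545}

regressed = find_regressions(base_scores, tuned_scores)
print(sorted(regressed))  # ['safety']
```

The tolerance exists because small score differences are often noise; what counts as a meaningful drop depends on the benchmark and its sample size.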

When NOT to fine-tune

You have <100 examples. Use prompting + few-shot in-context examples instead.
Your knowledge changes weekly. Use RAG — you don't want to re-train every Tuesday.
You need citation. Fine-tuning bakes facts into weights; can't cite a weight. Use RAG + verification.
The model already does the task. Many tasks people fine-tune for are solved by a better prompt or a different base model.
You don't have eval data. Without an eval set you can't tell if fine-tuning helped or hurt. Skip the training; build the eval first.

Cost reality check (2024 pricing)

OpenAI gpt-4o fine-tune: ~$25 per 1M training tokens. Typical 10k-example dataset = ~$30-100 run.
Anthropic Claude fine-tune: available through Bedrock; pricing varies by base model. Custom partner agreements.
Open-weight LoRA on rented GPU: Lambda Labs A100-80GB ~$1.10/hour. A 7B-parameter LoRA on 10k examples = ~2-4 hours = ~$2-5.
QLoRA on consumer hardware: an RTX 4090 with 24GB of VRAM can fine-tune 13B-parameter models with QLoRA. For hobbyists who already own the card, the marginal cost of a run is mostly electricity.
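
The figures above are simple multiplication, sketched below so the assumptions are visible. Tokens per example and epoch count are assumed values (only the per-token price and a cost range are given above), billing per epoch is an assumption to check against the provider's current terms, and the memory function counts quantized weights only, ignoring activations, adapter weights, optimizer state, and quantization overhead.

```python
def api_training_cost(n_examples, tokens_per_example, epochs, usd_per_million_tokens):
    """Hosted fine-tune cost, assuming every token is billed once per epoch."""
    billed_tokens = n_examples * tokens_per_example * epochs
    return billed_tokens / 1_000_000 * usd_per_million_tokens


def gpu_rental_cost(hours, usd_per_hour):
    """Cost of renting a single GPU for a run."""
    return hours * usd_per_hour


def weight_memory_gb(n_params, bits_per_param):
    """Memory for the model weights alone, in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


# Assumed workload: 10k examples, 150 tokens each, 2 epochs, $25 per 1M tokens.
print(api_training_cost(10_000, 150, 2, 25.0))   # 75.0

# 2 to 4 hours on a GPU rented at $1.10/hour.
print(gpu_rental_cost(2, 1.10), gpu_rental_cost(4, 1.10))

# 4-bit weights: 13B parameters, then 65B parameters.
print(weight_memory_gb(13_000_000_000, 4))       # 6.5
print(weight_memory_gb(65_000_000_000, 4))       # 32.5
```

The weight-only numbers suggest why a 13B model fits within 24GB and a 65B model within 48GB under 4-bit quantization, with the remaining memory going to the overheads the function leaves out.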

LLM grounding — the broader pattern, of which fine-tuning is one approach
RAG vs VERITAS — the other knowledge-injection approach
Hallucination — the failure mode fine-tuning tries to reduce
Evaluation harness — how to test whether fine-tuning helped
Topic hub: Alignment + RLHF — catalog of alignment-related claims
Use case: customer-support bot — where fine-tuning + RAG + verification combine