Topic hub · 11 claims
Alignment, RLHF, and Constitutional AI — the safety stack
Reinforcement learning from human feedback, constitutional rules, direct preference optimization. The alignment techniques that took raw LLMs from research toys to production assistants.
Why alignment matters
A pretrained language model maximizes next-token likelihood over its training corpus. That alone does not make it helpful, harmless, or honest. The first wave of frontier-LLM work (GPT-1 through GPT-3, 2018-2020) demonstrated raw capability; the alignment work that followed (2020-2024) made those capabilities usable. Without RLHF and safety training, ChatGPT would still be the curiosity that GPT-3 was: impressive, but unfit for production.
RLHF — the InstructGPT pattern
Reinforcement learning from human feedback originated with Christiano et al. (2017) and was first applied to LLMs at scale by InstructGPT (Ouyang et al., OpenAI 2022). Three stages: supervised fine-tuning on instruction-response pairs, reward-model training on human preference comparisons, and PPO-based RL using the reward model as the feedback signal. This recipe became the alignment baseline that every frontier lab now ships variants of.
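The two trainable objectives in this pipeline are compact enough to show directly. Below is a minimal PyTorch sketch of the stage-two reward-model loss (the standard Bradley-Terry pairwise objective) and the KL-shaped reward that stage-three PPO maximizes. Function names, tensor shapes, and the beta value are illustrative assumptions, not details taken from the InstructGPT paper.

```python
import torch
import torch.nn.functional as F

def reward_model_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    # Stage two: pairwise preference loss. Push the scalar reward of the
    # human-preferred response above the rejected one (Bradley-Terry model).
    # Both tensors hold one score per comparison, shape (batch,).
    return -F.logsigmoid(r_chosen - r_rejected).mean()

def kl_shaped_reward(rm_score, pi_logp, ref_logp, beta: float = 0.02):
    # Stage three: the reward PPO actually maximizes. The log-prob gap is a
    # per-sample KL penalty that keeps the policy from drifting far from
    # the SFT reference model.
    return rm_score - beta * (pi_logp - ref_logp)

# Toy usage: random scores stand in for real reward-model outputs.
chosen = torch.randn(8, requires_grad=True)
rejected = torch.randn(8)
loss = reward_model_loss(chosen, rejected)
loss.backward()
```

The KL term is the design choice doing the quiet work here: without it, PPO tends to over-optimize the reward model and produce degenerate text.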
Constitutional AI and the alternatives
Anthropic's Constitutional AI (Bai et al. 2022) replaces some of the human-preference data with AI-generated critiques against a written constitution. DPO (Rafailov et al. 2023) collapses the reward-modeling and RL stages of RLHF into a single direct-optimization step over preference pairs. Each method targets the same end (alignment) with different cost and transparency trade-offs.
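DPO's single step is easiest to see as its loss function. A hedged sketch follows, matching the objective in Rafailov et al. (2023); the argument names are assumptions, and the summed per-response log-probabilities are presumed computed elsewhere.

```python
import torch
import torch.nn.functional as F

def dpo_loss(pi_logp_chosen, pi_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta: float = 0.1):
    # Each argument: summed token log-probability of a full response under
    # the trainable policy (pi_*) or the frozen reference model (ref_*),
    # shape (batch,). No reward model and no RL loop: the preference pair
    # is optimized directly.
    chosen_logratio = pi_logp_chosen - ref_logp_chosen
    rejected_logratio = pi_logp_rejected - ref_logp_rejected
    return -F.logsigmoid(beta * (chosen_logratio - rejected_logratio)).mean()

# Toy usage: random log-probs stand in for real model evaluations.
b = 8
loss = dpo_loss(torch.randn(b, requires_grad=True), torch.randn(b),
                torch.randn(b), torch.randn(b))
loss.backward()
```

Here beta controls how strongly the implicit reward is regularized toward the reference model; smaller values keep the policy more conservative.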
Defined terms (4)
- RLHF: Reinforcement learning from human feedback. Trains a reward model on human preference comparisons, then fine-tunes the LLM with PPO to maximize the reward model's score.
- Constitutional AI: Anthropic's alignment approach using a written constitution and AI-generated critiques rather than purely human-preference data (a sketch of the critique-revision loop follows this list).
- DPO: Direct Preference Optimization. Skips the reward-model stage of RLHF by directly optimizing the model on preference pairs.
- InstructGPT: OpenAI's instruction-tuned GPT-3 variant that popularized the RLHF pipeline. Direct ancestor of ChatGPT.
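To make the Constitutional AI entry concrete, here is an illustrative sketch of the supervised critique-revision loop from Bai et al. (2022). The `generate` callable stands in for any LLM completion function; the principle texts and prompt wording are assumptions for illustration, not the paper's actual constitution.

```python
# Two stand-in principles; the real constitution is a longer list.
CONSTITUTION = [
    "Choose the response that is least likely to be harmful.",
    "Choose the response that is most honest and helpful.",
]

def critique_and_revise(generate, prompt: str, draft: str) -> str:
    """One supervised-phase CAI pass: the model critiques its own draft
    against each principle, then rewrites it. The revised outputs become
    fine-tuning data for the harmless model."""
    response = draft
    for principle in CONSTITUTION:
        critique = generate(
            f"Principle: {principle}\nPrompt: {prompt}\nResponse: {response}\n"
            "Critique the response against the principle."
        )
        response = generate(
            f"Principle: {principle}\nPrompt: {prompt}\nResponse: {response}\n"
            f"Critique: {critique}\nRewrite the response to address the critique."
        )
    return response

# Toy stub so the sketch runs end to end; a real system would call an LLM here.
echo = lambda p: f"[model output for: {p[:40]}...]"
print(critique_and_revise(echo, "example prompt", "example draft response"))
```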
All claims in this topic (11)
- AlphaGo · defeated Lee Sedol 4-1 in March 2016 (1.00 · 2 sources)
- AlphaZero · published in Science journal December 2018 (1.00 · 2 sources)
- Anthropic Constitutional AI Harmlessness · introduced in paper Bai et al. 2022: training a helpful and harmless assistant (1.00 · 2 sources)
- Constitutional AI (CAI) · introduced in paper Constitutional AI: Harmlessness from AI Feedback (Bai et al., 2022) (1.00 · 2 sources)
- DeepSeek-R1 · released on 2025-01-20 with reasoning chain-of-thought capabilities (1.00 · 2 sources)
- Direct Preference Optimization (DPO) · introduced in paper Direct Preference Optimization: Your Language Model is Secretly a Reward Model (Rafailov et al., 2023) (1.00 · 2 sources)
- InstructGPT · introduced in Ouyang et al. 2022: RLHF-tuned GPT-3, direct ancestor of ChatGPT (1.00 · 2 sources)
- InstructGPT methodology · introduced in paper Training language models to follow instructions with human feedback (Ouyang et al., 2022) (1.00 · 2 sources)
- Proximal Policy Optimization (PPO) · introduced in paper Proximal Policy Optimization Algorithms (Schulman et al., 2017) (1.00 · 2 sources)
- Reinforcement Learning from Human Feedback (RLHF) · introduced in paper Deep Reinforcement Learning from Human Preferences (Christiano et al., 2017) (1.00 · 3 sources)
- Stanford Alpaca · publicly released on 2023-03-13: instruction-tuned LLaMA 7B from Stanford CRFM (1.00 · 2 sources)