SourceScore

Topic hub · 11 claims

Alignment, RLHF, and Constitutional AI — the safety stack

Reinforcement learning from human feedback, constitutional rules, direct preference optimization. The alignment techniques that took raw LLMs from research toys to production assistants.

Why alignment matters

A pretrained language model maximizes next-token likelihood over its training corpus. That doesn't make it helpful, harmless, or honest. The first three years of frontier-LLM work (GPT-1 through GPT-3) demonstrated capability; the alignment work that followed (2020-2024) made those capabilities usable. Without RLHF + safety training, ChatGPT would still be the curiosity that GPT-3 was — impressive but unfit for production.

RLHF — the InstructGPT pattern

Reinforcement learning from human feedback predates LLMs (Christiano et al. 2017) but was first applied to language models at scale by InstructGPT (Ouyang et al., OpenAI 2022). Three stages: supervised fine-tuning on instruction-response pairs, reward-model training on human preference comparisons, then PPO-based RL against the reward model's scores, with a KL penalty toward the SFT model to discourage reward hacking. This recipe became the alignment baseline every frontier lab now ships variants of.
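The second stage above fits a Bradley-Terry preference model: minimize -log sigmoid(r(chosen) - r(rejected)) over human comparisons. A minimal sketch in plain Python; the scalar scores here are toy stand-ins for a real reward model's outputs, not any actual API:

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def preference_loss(r_chosen: float, r_rejected: float) -> float:
    """Bradley-Terry loss on one human comparison:
    -log P(chosen beats rejected) = -log sigmoid(r_chosen - r_rejected)."""
    return -math.log(sigmoid(r_chosen - r_rejected))

# Toy scores from a hypothetical reward model. The loss is small when the
# model already ranks the chosen response higher ...
low = preference_loss(r_chosen=2.0, r_rejected=-1.0)
# ... and large when the ranking is inverted, pushing the scores apart.
high = preference_loss(r_chosen=-1.0, r_rejected=2.0)
assert low < high
```

At equal scores the loss is exactly log 2 (the model is indifferent); training drives it below that by widening the margin on correctly ranked pairs.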

Constitutional AI and the alternative

Anthropic's Constitutional AI (Bai et al. 2022) replaces some of the human-preference data with AI-generated critiques and revisions judged against a written constitution. DPO (Rafailov et al. 2023) collapses the reward-modeling and RL stages of RLHF into a single classification-style loss on preference pairs (supervised fine-tuning still comes first). Each method targets the same end (alignment) with different cost + transparency trade-offs.
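DPO's single step optimizes a closed-form objective per preference pair: -log sigmoid(beta * [(log-prob margin under the policy) - (log-prob margin under the frozen SFT reference)]). A hedged sketch with toy sequence log-probabilities (all numbers illustrative, not drawn from any real model):

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def dpo_loss(logp_chosen: float, logp_rejected: float,
             ref_logp_chosen: float, ref_logp_rejected: float,
             beta: float = 0.1) -> float:
    """DPO objective (Rafailov et al. 2023) for one preference pair:
    -log sigmoid(beta * ((logp_c - ref_c) - (logp_r - ref_r)))."""
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    return -math.log(sigmoid(beta * margin))

# Toy log-probs: the policy prefers the chosen answer more strongly than the
# frozen reference does, so the loss falls below the indifference point log(2).
loss = dpo_loss(logp_chosen=-4.0, logp_rejected=-9.0,
                ref_logp_chosen=-5.0, ref_logp_rejected=-8.0)
assert loss < math.log(2)
```

No reward model and no PPO rollout appear anywhere: the frozen reference plays the role of RLHF's KL anchor, and beta controls how far the policy may drift from it.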

Defined terms (4)

RLHF
Reinforcement learning from human feedback. Trains a reward model on human preference comparisons, then fine-tunes the LLM with PPO to maximize the reward model's score.
Constitutional AI
Anthropic's alignment approach using a written constitution + AI-generated critiques rather than purely human-preference data.
DPO
Direct Preference Optimization. Skips the reward-model and RL stages of RLHF by optimizing the policy directly on preference pairs.
InstructGPT
OpenAI's instruction-tuned GPT-3 variant that popularized the RLHF pipeline. Direct ancestor of ChatGPT.
