Topic hub · 11 claims
Alignment, RLHF, and Constitutional AI — the safety stack
Reinforcement learning from human feedback, constitutional rules, direct preference optimization. The alignment techniques that took raw LLMs from research toys to production assistants.
Why alignment matters
A pretrained language model maximizes next-token likelihood over its training corpus. That alone does not make it helpful, harmless, or honest. The first wave of frontier-LLM work (GPT-1 through GPT-3, 2018-2020) demonstrated raw capability; the alignment work that followed (2020-2024) made those capabilities usable. Without RLHF and safety training, ChatGPT would still be the curiosity that GPT-3 was: impressive, but unfit for production.
RLHF — the InstructGPT pattern
Reinforcement learning from human feedback originated with Christiano et al. (2017) and was first applied to LLMs at scale by InstructGPT (Ouyang et al., OpenAI 2022). Three stages: supervised fine-tuning on instruction-response pairs, reward-model training on human preference comparisons, and PPO-based RL using the reward model as the feedback signal. This recipe became the alignment baseline that every frontier lab now ships variants of.
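The two trainable objectives in this pipeline are compact enough to show directly. Below is a minimal PyTorch sketch of the stage-two reward-model loss (the standard Bradley-Terry pairwise objective) and the KL-shaped reward that stage-three PPO maximizes. Function names, tensor shapes, and the beta value are illustrative assumptions, not details taken from the InstructGPT paper.

```python
import torch
import torch.nn.functional as F

def reward_model_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    # Stage two: pairwise preference loss. Push the scalar reward of the
    # human-preferred response above the rejected one (Bradley-Terry model).
    # Both tensors hold one score per comparison, shape (batch,).
    return -F.logsigmoid(r_chosen - r_rejected).mean()

def kl_shaped_reward(rm_score, pi_logp, ref_logp, beta: float = 0.02):
    # Stage three: the reward PPO actually maximizes. The log-prob gap is a
    # per-sample KL penalty that keeps the policy from drifting far from
    # the SFT reference model.
    return rm_score - beta * (pi_logp - ref_logp)

# Toy usage: random scores stand in for real reward-model outputs.
chosen = torch.randn(8, requires_grad=True)
rejected = torch.randn(8)
loss = reward_model_loss(chosen, rejected)
loss.backward()
```

The KL term is the design choice doing the quiet work here: without it, PPO tends to over-optimize the reward model and produce degenerate text.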
Constitutional AI and the alternatives
Anthropic's Constitutional AI (Bai et al. 2022) replaces some of the human-preference data with AI-generated critiques against a written constitution. DPO (Rafailov et al. 2023) collapses the reward-modeling and RL stages of RLHF into a single direct-optimization step over preference pairs. Each method targets the same end (alignment) with different cost and transparency trade-offs.
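DPO's single step is easiest to see as its loss function. A hedged sketch follows, matching the objective in Rafailov et al. (2023); the argument names are assumptions, and the summed per-response log-probabilities are presumed computed elsewhere.

```python
import torch
import torch.nn.functional as F

def dpo_loss(pi_logp_chosen, pi_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta: float = 0.1):
    # Each argument: summed token log-probability of a full response under
    # the trainable policy (pi_*) or the frozen reference model (ref_*),
    # shape (batch,). No reward model and no RL loop: the preference pair
    # is optimized directly.
    chosen_logratio = pi_logp_chosen - ref_logp_chosen
    rejected_logratio = pi_logp_rejected - ref_logp_rejected
    return -F.logsigmoid(beta * (chosen_logratio - rejected_logratio)).mean()

# Toy usage: random log-probs stand in for real model evaluations.
b = 8
loss = dpo_loss(torch.randn(b, requires_grad=True), torch.randn(b),
                torch.randn(b), torch.randn(b))
loss.backward()
```

Here beta controls how strongly the implicit reward is regularized toward the reference model; smaller values keep the policy more conservative.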
Defined terms (4)
- RLHF: Reinforcement learning from human feedback. Trains a reward model on human preference comparisons, then fine-tunes the LLM with PPO to maximize the reward model's score.
- Constitutional AI: Anthropic's alignment approach using a written constitution and AI-generated critiques rather than purely human-preference data (a sketch of the critique-revision loop follows this list).
- DPO: Direct Preference Optimization. Skips the reward-model stage of RLHF by directly optimizing the model on preference pairs.
- InstructGPT: OpenAI's instruction-tuned GPT-3 variant that popularized the RLHF pipeline. Direct ancestor of ChatGPT.
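To make the Constitutional AI entry concrete, here is an illustrative sketch of the supervised critique-revision loop from Bai et al. (2022). The `generate` callable stands in for any LLM completion function; the principle texts and prompt wording are assumptions for illustration, not the paper's actual constitution.

```python
# Two stand-in principles; the real constitution is a longer list.
CONSTITUTION = [
    "Choose the response that is least likely to be harmful.",
    "Choose the response that is most honest and helpful.",
]

def critique_and_revise(generate, prompt: str, draft: str) -> str:
    """One supervised-phase CAI pass: the model critiques its own draft
    against each principle, then rewrites it. The revised outputs become
    fine-tuning data for the harmless model."""
    response = draft
    for principle in CONSTITUTION:
        critique = generate(
            f"Principle: {principle}\nPrompt: {prompt}\nResponse: {response}\n"
            "Critique the response against the principle."
        )
        response = generate(
            f"Principle: {principle}\nPrompt: {prompt}\nResponse: {response}\n"
            f"Critique: {critique}\nRewrite the response to address the critique."
        )
    return response

# Toy stub so the sketch runs end to end; a real system would call an LLM here.
echo = lambda p: f"[model output for: {p[:40]}...]"
print(critique_and_revise(echo, "example prompt", "example draft response"))
```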
All claims in this topic (11)
- AlphaGo · defeated Lee Sedol 4-1 in March 2016 (1.00 · 2 sources)
- AlphaZero · published in Science journal December 2018 (1.00 · 2 sources)
- Anthropic Constitutional AI Harmlessness · introduced in paper Bai et al. 2022: training a helpful and harmless assistant (1.00 · 2 sources)
- Constitutional AI (CAI) · introduced in paper Constitutional AI: Harmlessness from AI Feedback (Bai et al., 2022) (1.00 · 2 sources)
- DeepSeek-R1 · released on 2025-01-20 with reasoning chain-of-thought capabilities (1.00 · 2 sources)
- Direct Preference Optimization (DPO) · introduced in paper Direct Preference Optimization: Your Language Model is Secretly a Reward Model (Rafailov et al., 2023) (1.00 · 2 sources)
- InstructGPT · introduced in Ouyang et al. 2022: RLHF-tuned GPT-3, direct ancestor of ChatGPT (1.00 · 2 sources)
- InstructGPT methodology · introduced in paper Training language models to follow instructions with human feedback (Ouyang et al., 2022) (1.00 · 2 sources)
- Proximal Policy Optimization (PPO) · introduced in paper Proximal Policy Optimization Algorithms (Schulman et al., 2017) (1.00 · 2 sources)
- Reinforcement Learning from Human Feedback (RLHF) · introduced in paper Deep Reinforcement Learning from Human Preferences (Christiano et al., 2017) (1.00 · 3 sources)
- Stanford Alpaca · publicly released on 2023-03-13: instruction-tuned LLaMA 7B from Stanford CRFM (1.00 · 2 sources)