Tag: alignment
Four verified claims carry this tag. Each is backed by at least two primary sources and carries an HMAC-SHA256 signature (a hypothetical sketch of one possible signature scheme follows the list).
Reinforcement Learning from Human Feedback (RLHF) introduced in paper: Deep Reinforcement Learning from Human Preferences (Christiano et al., 2017).
67866330cd60e54d · 3 sources · 100% confidence
Direct Preference Optimization (DPO) introduced in paper: Direct Preference Optimization: Your Language Model is Secretly a Reward Model (Rafailov et al., 2023).
a3e691683a4577af · 2 sources · 100% confidence
Constitutional AI (CAI) introduced in paper: Constitutional AI: Harmlessness from AI Feedback (Bai et al., 2022).
ba1eb83c14795107 · 2 sources · 100% confidence
InstructGPT methodology introduced in paper: Training language models to follow instructions with human feedback (Ouyang et al., 2022).
5da8f8dffc038b8e · 2 sources · 100% confidence
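The listing does not say how its signatures are produced or how the 16-character identifiers relate to them. Purely as an illustration, the sketch below assumes each identifier is an HMAC-SHA256 over the claim text, truncated to 8 bytes, under a signing key that is not part of this page; the key, the truncation, and the function names sign_claim and verify_claim are all hypothetical.

```python
import hmac
import hashlib

# Placeholder key: the real signing key (if any) is not given in the listing.
SIGNING_KEY = b"example-signing-key"


def sign_claim(claim_text: str, key: bytes = SIGNING_KEY) -> str:
    """Return a truncated hex HMAC-SHA256 tag for a claim string (assumed scheme)."""
    digest = hmac.new(key, claim_text.encode("utf-8"), hashlib.sha256).digest()
    return digest[:8].hex()  # 16 hex characters, matching the identifier format above


def verify_claim(claim_text: str, tag: str, key: bytes = SIGNING_KEY) -> bool:
    """Constant-time comparison of a claim against its truncated HMAC tag."""
    return hmac.compare_digest(sign_claim(claim_text, key), tag)


if __name__ == "__main__":
    claim = (
        "Reinforcement Learning from Human Feedback (RLHF) introduced in paper: "
        "Deep Reinforcement Learning from Human Preferences (Christiano et al., 2017)."
    )
    tag = sign_claim(claim)
    print(tag, verify_claim(claim, tag))
```

Truncating the digest keeps the identifier short but weakens it to a 64-bit tag, so under this assumed scheme the value would serve as an integrity check and record ID rather than a full signature.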