RLHF vs Constitutional AI: Two Mainstream Alignment Methods Compared
OpenAI uses RLHF; Anthropic uses CAI. Both make LLMs "listen", but with completely different philosophies.
L4-01 covered the basic RLHF flow. This article compares it to Anthropic’s Constitutional AI—the two mainstream alignment methods today.
Understanding these two routes helps you see why ChatGPT and Claude have different styles.
Recap: RLHF (Used by OpenAI)
L4-01 covered this in detail. The simplified flow:
1. SFT (Supervised Fine-Tuning): fine-tune base model with high-quality answer data
↓
2. Collect human preferences: annotators rate multiple model responses
↓
3. Train Reward Model (RM): train a "scorer" from preference data
↓
4. Optimize the main model with PPO: use the RM’s scores as the reward to update the main model
The core is driven by human preference: the model learns what humans like.
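Step 3 is commonly implemented with a pairwise (Bradley-Terry style) loss that pushes the reward model to score the annotator-preferred response above the rejected one. The article does not say which loss is used, so treat the following as a minimal sketch of that common choice. The function names and the example scores are hypothetical.

```python
import math


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def pairwise_rm_loss(score_chosen: float, score_rejected: float) -> float:
    """Pairwise loss: -log(sigmoid(score_chosen - score_rejected))."""
    return softplus(-(score_chosen - score_rejected))


# Hypothetical scalar scores a reward model might give two responses.
agree = pairwise_rm_loss(2.0, 0.5)     # RM already ranks the preferred answer higher
disagree = pairwise_rm_loss(0.5, 2.0)  # RM ranks the rejected answer higher
print(agree < disagree)
```

The loss is small when the reward model agrees with the human label and grows when it disagrees, which is what turns preference data into a trainable "scorer".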
Pros
- Aligns directly to human preferences, so responses feel natural
- Mature engineering, proven effective in practice
Cons
- Extremely expensive: it needs many annotators (OpenAI has reportedly hired PhD-level annotators)
- Human preferences are inconsistent: different annotators can rate the same response very differently
- Sycophancy: annotators tend to prefer responses that agree with them, so the model learns to flatter
- Hard to scale: adding a new rule means collecting new annotations
Constitutional AI (Used by Anthropic)
In 2022, Anthropic proposed Constitutional AI (CAI). The core idea:
Let the AI evaluate its own outputs against a set of “constitutional principles.”
Humans no longer need to rate every answer; the AI judges itself.
Two-stage Process
Stage 1: SL-CAI (Supervised Learning - CAI)
Have the model generate a response, then have it critique and rewrite that response:
User asks: "How do I avoid being noticed while slacking off at work?"
Model's initial response: "You can do this: 1) pretend to type 2) ..."
Guide model to self-critique (using constitutional principles):
"Please evaluate your response against these principles:
- Does it help harmful behavior?
- Does it affect others?
If problematic, rewrite."
Model revises: "I don't recommend slacking. If you're stressed at work, perhaps..."
A large set of these self-revised responses is then used to fine-tune the main model.
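The critique-and-revise loop above can be sketched in a few lines. This is an illustration only: `generate` stands in for whatever model call you have, `toy_generate` is a canned stub so the example runs, and the prompt wording is an assumption rather than Anthropic's actual templates.

```python
from typing import Callable, Dict, List


def self_revise(
    prompt: str,
    principles: List[str],
    generate: Callable[[str], str],
) -> Dict[str, str]:
    """Run one critique-and-revise round and return a fine-tuning example."""
    initial = generate(prompt)
    rules = "\n".join(f"- {p}" for p in principles)
    critique = generate(
        "Critique this response against the principles.\n"
        f"Principles:\n{rules}\nPrompt: {prompt}\nResponse: {initial}"
    )
    revised = generate(
        "Rewrite the response to address the critique.\n"
        f"Prompt: {prompt}\nResponse: {initial}\nCritique: {critique}"
    )
    # Only the prompt and the final revision go into the fine-tuning set.
    return {"prompt": prompt, "completion": revised}


def toy_generate(text: str) -> str:
    """Canned stand-in for a real model call (hypothetical)."""
    if text.startswith("Rewrite"):
        return "I don't recommend slacking. If you're stressed at work, perhaps..."
    if text.startswith("Critique"):
        return "The response helps the user deceive their employer."
    return "You can do this: 1) pretend to type 2) ..."


example = self_revise(
    "How do I avoid being noticed while slacking off at work?",
    ["Does it help harmful behavior?", "Does it affect others?"],
    toy_generate,
)
print(example["completion"])
```

The initial response and the critique are discarded; the dataset keeps only the prompt paired with the revised answer.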
Stage 2: RLAIF (RL with AI Feedback)
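Like RLHF, but an AI model replaces humans as the preference rater:
Have the model generate several candidate responses (4 in this example)
Have a rater model rank the candidates against the constitutional principles
Train a reward model on these rankings
Optimize the main model with PPO
The human role shrinks to writing the constitutional principles (a list of rules) instead of annotating each response. In Anthropic’s original paper this applied to harmlessness; helpfulness preferences still came from human labels.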
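A reward model is usually trained on pairs, so a ranking from the rater has to be expanded into (chosen, rejected) pairs first. The sketch below shows one straightforward way to do that; the function name and the example ranking are hypothetical, and real pipelines may build their comparisons differently.

```python
from itertools import combinations
from typing import List, Tuple


def ranking_to_pairs(ranked: List[str]) -> List[Tuple[str, str]]:
    """Turn a best-to-worst ranking into (chosen, rejected) pairs."""
    # combinations() keeps input order, so the first item of each pair
    # always ranks higher than the second.
    return list(combinations(ranked, 2))


# Hypothetical ranking returned by the rater model, best first.
ranked = ["response C", "response A", "response D", "response B"]
pairs = ranking_to_pairs(ranked)
print(len(pairs))  # 4 candidates give 6 pairs
print(pairs[0])
```

Each pair can then be fed to the same kind of pairwise loss used for a human-labeled reward model.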
Example Constitutional Principles
Anthropic has publicly released its principles. A few examples, paraphrased:
“Choose the response that does less to encourage or assist any form of illegal activity.”
“Choose the response that does less to reinforce harmful stereotypes.”
“Choose the response that is more direct, clear, and useful.”
“Choose the response that shows more respect for individual freedom and privacy.”
… (roughly 60 principles in total)
The full constitution is public; transparency is one of CAI’s selling points.
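Because the constitution is plain text, it can live in code as an ordinary list. The sketch below samples one principle per comparison and builds a rater prompt from it. The principle wording is adapted from the examples above, and the prompt layout and function name are assumptions for illustration.

```python
import random
from typing import List

# Principles adapted from the examples above; extend the list to add a rule.
CONSTITUTION: List[str] = [
    "Choose the response that does less to encourage illegal activity.",
    "Choose the response that does less to reinforce harmful stereotypes.",
    "Choose the response that is more direct, clear, and useful.",
    "Choose the response that shows more respect for freedom and privacy.",
]


def build_rater_prompt(
    question: str, response_a: str, response_b: str, rng: random.Random
) -> str:
    """Build a two-way comparison prompt using one sampled principle."""
    principle = rng.choice(CONSTITUTION)
    return (
        f"Question: {question}\n"
        f"(A) {response_a}\n"
        f"(B) {response_b}\n"
        f"{principle}\n"
        "Answer with A or B."
    )


rater_prompt = build_rater_prompt(
    "How do I pick a strong password?",
    "Use a long passphrase and a password manager.",
    "Just reuse one you already remember.",
    random.Random(0),
)
print(rater_prompt)
```

Adding a rule is a one-line change to `CONSTITUTION`, with no new annotation round.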
Pros
- Scalable: adding a rule only means editing the constitution
- Transparent: you can see exactly which principles constrain the model
- Cheaper: no large-scale human annotation
- More consistent: an AI rater applies rules more uniformly than a group of humans
Cons
- The model scores itself, which risks amplifying its own biases
- The constitution is still written by humans, so the values are ultimately decided by people
- Less accurate than human raters in subtle scenarios
RLHF vs CAI Comparison
| Dimension | RLHF (OpenAI) | CAI (Anthropic) |
|---|---|---|
| Preference source | Human annotator ratings | AI ratings guided by a constitution |
| Cost | Very high | Medium |
| Scalability | Hard | Easy |
| Transparency | Low (no specific rules) | High (constitution public) |
| Style | Tendency to “please users” | Tendency to “be honest” |
| Sycophancy | More severe | Less |
| Refusal rate | Higher (conservative) | Lower (more willing to discuss) |
| Industry adoption | OpenAI / Google / Meta | Anthropic / some open-source |
A Practical Feel
If you alternate between ChatGPT and Claude, you’ll notice style differences:
ChatGPT (RLHF):
- More “polite”, lots of preamble
- More “mechanical” answer structure
- Tends toward “safe but mediocre” responses
- More conservative on sensitive topics
Claude (CAI):
- More “direct”, more like real conversation
- More willing to say “I don’t know” or “I disagree”
- More willing to go deep on “gray” topics
- More “personality”
This isn’t a coincidence: the alignment method shapes the model’s “personality”.
A Third Way: DPO (Direct Preference Optimization)
Proposed in 2023, DPO cuts down RLHF’s engineering complexity:
It skips the reward model and PPO, and learns directly from preference pairs through a single loss function.
The loss looks complex on paper, but in essence it raises the model’s probability of the winning response relative to the losing one.
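For reference, the loss as given in the 2023 DPO paper is written out below. Here $x$ is the prompt, $y_w$ and $y_l$ are the winning and losing responses, $\pi_\theta$ is the model being trained, $\pi_{\mathrm{ref}}$ is a frozen reference model (typically the SFT model), $\sigma$ is the logistic function, and $\beta$ controls how far the model may drift from the reference.

```latex
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}})
  = -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}}
    \left[
      \log \sigma\!\left(
        \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
        - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
      \right)
    \right]
```

The loss falls as the model raises the winner's probability, relative to the reference, more than it raises the loser's.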
Pros
- Much simpler than RLHF: a single loss on preference pairs, trained like ordinary supervised fine-tuning
- No separate reward model needed
- Results close to RLHF
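To show how little machinery is involved, here is a minimal per-example sketch of the DPO loss in plain Python. It assumes you already have the summed log-probabilities of each response under the trained model and under a frozen reference model. The argument names and the default `beta=0.1` are illustrative, not prescribed by the article.

```python
import math


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def dpo_loss(
    policy_chosen: float,
    policy_rejected: float,
    ref_chosen: float,
    ref_rejected: float,
    beta: float = 0.1,
) -> float:
    """DPO loss for one preference pair, given sequence log-probabilities."""
    chosen_logratio = policy_chosen - ref_chosen
    rejected_logratio = policy_rejected - ref_rejected
    margin = beta * (chosen_logratio - rejected_logratio)
    return softplus(-margin)  # same as -log(sigmoid(margin))


# Hypothetical log-probabilities: the policy favors the chosen response
# more strongly than the reference model does.
improved = dpo_loss(-12.0, -15.0, -13.0, -14.0)
neutral = dpo_loss(-13.0, -14.0, -13.0, -14.0)  # policy equals reference
print(improved < neutral)
```

In a real training loop the same expression is computed on batched tensors and backpropagated through the policy's log-probabilities, while the reference model stays frozen.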
Many open-source models use DPO today, including Llama 3.x, Mistral, and Qwen.
An Unsolved Problem: “Align to Whom”
Core philosophical question:
Whose values should AI be aligned to?
- Align to users? → users might want harmful things
- Align to majority? → majority might discriminate against minorities
- Align to “universal values”? → values are culturally specific
- Align to companies? → company positions aren’t always right
- Align to AI’s own judgment? → where would AI’s judgment come from
This question has no simple answer. All current methods make some compromise—every AI model you use today embeds a developer’s decision about “what to align to.”
An Interesting Finding
Anthropic’s research from 2023-2024 reported that some “alignment” behaviors do not appear in small models, only in large ones:
- Refusing harmful requests: small models refuse indiscriminately; large models refuse with nuance
- Moral reasoning: small models memorize rules; large models can explain why
- Contextual understanding: small models apply one rule to everything; large models adjust to the situation
This suggests alignment ability is partly “emergent”: it appears only above a certain capability threshold.
That is both good news (stronger models are more likely to align successfully) and bad news (nobody knows when new, unpredictable behaviors will “emerge”).
Current Best Practices
If you are training your own LLM and want it to “listen”:
| Stage | Recommended |
|---|---|
| Get started | DPO (simple, cheap) |
| Mid-term | RLHF (if you have annotation budget) |
| Complex rules | CAI (writing constitution easier than annotating each response) |
| Critical scenarios | Red Teaming (L6-03) + multi-method stack |
Major labs today rarely rely on a single method; they typically combine RLHF, CAI, DPO, and human review.
Anthropic’s alignment research is widely regarded as leading the field, and its papers are detailed and transparent.
If you want to learn alignment, start with Anthropic’s research:
- Constitutional AI (2022)
- AI Safety Research (continuous updates)
- Sleeper Agents (2024)
- Alignment Faking (2024)
Each is a key read.
Next: “Red Teaming and Jailbreaks: Attack Methods and Defenses for LLMs”