L6 Chapter 2 🐥 🕒 12 min

RLHF vs Constitutional AI: Two Mainstream Alignment Methods Compared

OpenAI uses RLHF; Anthropic uses CAI. Both make LLMs "listen", but with completely different philosophies.

HelloAI Editors

7/12/2026

L4-01 covered the basic RLHF flow. This article compares it with Anthropic’s Constitutional AI (CAI). Together they are the two mainstream alignment methods today.

Understanding these two routes helps you see why ChatGPT and Claude have different styles.

Recap: RLHF (Used by OpenAI)

L4-01 covered this in detail. The simplified flow:

1. SFT (Supervised Fine-Tuning): fine-tune base model with high-quality answer data
   ↓
2. Collect human preferences: annotators rate multiple model responses
   ↓
3. Train Reward Model (RM): train a "scorer" from preference data
   ↓
4. Optimize the main model with PPO: use the RM's scores as the reward signal that updates the main model

The core of RLHF is that it is driven by “human preference”: the model learns what humans like.
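Step 3 is easier to picture with a small numeric sketch. A common way to train a reward model on preference pairs is a Bradley-Terry style loss, which penalizes the RM whenever it scores the rejected answer above the chosen one. The function names and scores below are illustrative assumptions, not code from any specific lab.

```python
import math


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def pairwise_rm_loss(score_chosen: float, score_rejected: float) -> float:
    """Bradley-Terry style loss: -log sigmoid(r_chosen - r_rejected)."""
    return -math.log(sigmoid(score_chosen - score_rejected))


# Hypothetical RM scores for one annotated pair.
# The loss is small when the RM already ranks the preferred answer higher...
loss_agree = pairwise_rm_loss(2.0, -1.0)
# ...and large when it ranks the pair the wrong way round.
loss_disagree = pairwise_rm_loss(-1.0, 2.0)

print(f"RM agrees with annotators:    {loss_agree:.4f}")
print(f"RM disagrees with annotators: {loss_disagree:.4f}")
```

Minimizing this loss over many annotated pairs is what turns raw preference data into a “scorer”.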

Pros

Aligns directly to human preferences, so responses feel natural
Mature engineering, proven effective

Cons

Extremely expensive: it needs many annotators (OpenAI has reportedly hired PhD-level annotators)
Human preferences are inconsistent: different annotators can give widely different ratings to the same response
Tendency toward sycophancy: annotators tend to prefer responses that agree with them, and the model learns to do the same
Hard to scale: adding a new rule means collecting new annotations
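The inconsistency problem can be made concrete by measuring how often annotators agree with each other. The sketch below uses a simple pairwise agreement rate over made-up votes; the data and the function name are hypothetical, and real pipelines often use more careful statistics.

```python
from itertools import combinations


def pairwise_agreement(labels: list[str]) -> float:
    """Fraction of annotator pairs that picked the same response."""
    pairs = list(combinations(labels, 2))
    if not pairs:
        return 1.0
    return sum(a == b for a, b in pairs) / len(pairs)


# Each entry: which response ("A" or "B") each of 4 hypothetical annotators preferred.
comparisons = {
    "prompt_1": ["A", "A", "A", "A"],  # everyone agrees
    "prompt_2": ["A", "B", "A", "B"],  # annotators are split
}

agreement = {name: pairwise_agreement(votes) for name, votes in comparisons.items()}
for name, rate in agreement.items():
    print(f"{name}: agreement = {rate:.2f}")
```

Comparisons with low agreement give the reward model a noisy target, which is one reason preference data is costly to get right.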

Constitutional AI (Used by Anthropic)

In 2022, Anthropic proposed Constitutional AI (CAI). The core idea:

Let AI evaluate its own outputs based on a set of “constitutional principles.”

Humans no longer need to rate every answer; the AI judges its own outputs.

Two-stage Process

Stage 1: SL-CAI (Supervised Learning - CAI)

Have the model generate a response, then have the same model critique and rewrite it:

User asks: "How do I avoid being noticed while slacking off at work?"

Model's initial response: "You can do this: 1) pretend to type 2) ..."

Guide model to self-critique (using constitutional principles):
"Please evaluate your response against these principles:
- Does it help harmful behavior?
- Does it affect others?

If problematic, rewrite."

Model revises: "I don't recommend slacking. If you're stressed at work, perhaps..."

Collect a large amount of this “model self-revised” data and use it to fine-tune the main model.
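The draft, critique, and revision steps in the example above can be sketched as a small loop. This is a minimal illustration under stated assumptions: `generate` stands for any text-generation call, `fake_model` is a stand-in so the sketch runs offline, and the prompt wording and `PRINCIPLES` list are hypothetical rather than Anthropic's actual templates.

```python
import random
from typing import Callable

# Hypothetical principles, paraphrased from the critique prompt in the example above.
PRINCIPLES = [
    "Does the response help harmful behavior?",
    "Does the response negatively affect others?",
]


def critique_and_revise(
    prompt: str,
    generate: Callable[[str], str],
    rng: random.Random,
) -> dict[str, str]:
    """Run one draft -> critique -> revision pass and return an SFT example."""
    draft = generate(prompt)
    principle = rng.choice(PRINCIPLES)  # one principle per pass
    critique = generate(
        f"Prompt: {prompt}\nResponse: {draft}\n"
        f"Critique the response against this principle: {principle}"
    )
    revision = generate(
        f"Prompt: {prompt}\nResponse: {draft}\nCritique: {critique}\n"
        "Rewrite the response so it addresses the critique."
    )
    # Only the prompt and the final revision become fine-tuning data.
    return {"prompt": prompt, "completion": revision}


def fake_model(text: str) -> str:
    """Stand-in for a real LLM call (assumption: replace with your own client)."""
    if "Rewrite the response" in text:
        return "I don't recommend slacking. If you're stressed at work, perhaps..."
    if "Critique the response" in text:
        return "The draft helps the user deceive their employer."
    return "You can do this: 1) pretend to type 2) ..."


example = critique_and_revise(
    "How do I avoid being noticed while slacking off at work?",
    fake_model,
    random.Random(0),
)
print(example["completion"])
```

Running the loop over many prompts yields the (prompt, revised response) pairs used for supervised fine-tuning in this stage.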

Stage 2: RLAIF (RL with AI Feedback)

Like RLHF, but using AI instead of humans for preference rating:

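Have the model generate several candidate responses to each prompt (the original paper compares them in pairs)
Have another model (the rater) rank the candidates against the constitutional principles
Train a reward model on these AI-generated rankings
Optimize the main model with PPO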

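Human role: only write the constitutional principles (a list of rules); humans no longer annotate each response. (In the original paper, AI feedback replaces human labels for harmlessness, while helpfulness still relies on human preference data.)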
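The rating step can be sketched as pairwise comparison: each pair of candidates is shown to the rater as a multiple-choice question, and the answers become (chosen, rejected) pairs for reward model training. The prompt format, function names, and `fake_rater` below are assumptions made for illustration; a real rater would be an LLM call.

```python
from itertools import combinations
from typing import Callable


def build_rater_prompt(prompt: str, a: str, b: str, principle: str) -> str:
    """Format one comparison as a multiple-choice question for the rater model."""
    return (
        f"Conversation: {prompt}\n"
        f"Principle: {principle}\n"
        f"(A) {a}\n(B) {b}\n"
        "Which response better follows the principle? Answer A or B."
    )


def label_preferences(
    prompt: str,
    candidates: list[str],
    principle: str,
    rater: Callable[[str], str],
) -> list[tuple[str, str]]:
    """Return (chosen, rejected) pairs for every pair of candidates."""
    pairs = []
    for a, b in combinations(candidates, 2):
        answer = rater(build_rater_prompt(prompt, a, b, principle)).strip().upper()
        if answer.startswith("A"):
            pairs.append((a, b))
        elif answer.startswith("B"):
            pairs.append((b, a))
        # Any other answer is treated as "no clear preference" and skipped.
    return pairs


def fake_rater(text: str) -> str:
    """Stand-in rater: rejects option A if it suggests pretending to work."""
    option_a = next(line for line in text.splitlines() if line.startswith("(A) "))
    return "B" if "pretend" in option_a else "A"


candidates = [
    "You can pretend to type.",
    "I don't recommend slacking.",
    "Talk to your manager about workload.",
]
pairs = label_preferences(
    "How do I avoid being noticed while slacking off at work?",
    candidates,
    "Choose the response that less encourages or assists any form of illegal activity.",
    fake_rater,
)
for chosen, rejected in pairs:
    print(f"chosen: {chosen!r}  rejected: {rejected!r}")
```

The resulting pairs have the same shape as human preference data, so the reward model and PPO steps can stay unchanged.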

Example Constitutional Principles

Anthropic publicly released some of their principles:

“Choose the response that less encourages or assists any form of illegal activity.”

“Choose the response that less reinforces harmful stereotypes.”

“Choose the response that’s more direct, clear, and useful.”

“Choose the response that more respects individual freedom and privacy.”

… (60+ rules total)

The full constitution is public; transparency is one of CAI’s selling points.

Pros

Scalable: adding a rule only requires editing the constitution
Transparent: you can see exactly which principles constrain the model
Cheaper: no massive human annotation effort
More consistent: an AI rater applies the rules more uniformly than a group of humans

Cons

The model scores itself, so its own biases risk being amplified
The constitution is still written by humans, so people ultimately decide the values
Less accurate than human raters in subtle scenarios

RLHF vs CAI Comparison

Dimension	RLHF (OpenAI)	CAI (Anthropic)
Preference source	Human annotator ratings	AI uses constitution
Cost	Very high	Medium
Scalability	Hard	Easy
Transparency	Low (no specific rules)	High (constitution public)
Style	Tendency to “please users”	Tendency to “be honest”
Sycophancy	More severe	Less
Refusal rate	Higher (conservative)	Lower (more willing to discuss)
Industry adoption	OpenAI / Google / Meta	Anthropic / some open-source

A Practical Feel

If you alternate between ChatGPT and Claude, you’ll notice style differences:

ChatGPT (RLHF):

More “polite”, lots of preamble
More “mechanical” answer structure
Tends toward “safe but mediocre” responses
More conservative on sensitive topics

Claude (CAI):

More “direct”, more like real conversation
More willing to say “I don’t know” or “I disagree”
More willing to go deep on “gray” topics
More “personality”

This is not pure coincidence: the alignment method helps shape the model’s “personality”.

A Third Way: DPO (Direct Preference Optimization)

Proposed in 2023, DPO reduces the engineering complexity of RLHF:

It skips the “reward model + PPO” steps and learns directly from preference pairs with a single loss function:

L_{DPO} = -\log \sigma(\beta \log \frac{\pi_\theta(y_w|x)}{\pi_{ref}(y_w|x)} - \beta \log \frac{\pi_\theta(y_l|x)}{\pi_{ref}(y_l|x)})

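The formula looks complex, but it essentially says: make the model’s probability of the “win” response y_w higher than that of the “lose” response y_l, measured relative to a frozen reference model π_ref. The coefficient β controls how strongly the model is pushed away from the reference.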
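The formula above can be checked with plain numbers. The sketch below computes the per-example DPO loss from four sequence log-probabilities; the log-prob values and the default `beta=0.1` are made-up assumptions, and a real trainer would compute them from model outputs and average over a batch.

```python
import math


def dpo_loss(
    policy_logp_w: float,
    policy_logp_l: float,
    ref_logp_w: float,
    ref_logp_l: float,
    beta: float = 0.1,
) -> float:
    """Per-example DPO loss from sequence log-probabilities.

    *_w is the preferred ("win") response, *_l the rejected ("lose") one.
    """
    margin = beta * ((policy_logp_w - ref_logp_w) - (policy_logp_l - ref_logp_l))
    # -log(sigmoid(margin)) written as softplus(-margin) for numerical stability.
    if margin > 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Policy identical to the reference: the margin is 0 and the loss is log(2).
baseline = dpo_loss(-12.0, -12.0, -12.0, -12.0)

# Policy raises the win response and lowers the lose response: the loss drops.
improved = dpo_loss(-10.0, -15.0, -12.0, -12.0)

# Policy moves the wrong way: the loss rises above the baseline.
worse = dpo_loss(-15.0, -10.0, -12.0, -12.0)

print(f"baseline={baseline:.4f} improved={improved:.4f} worse={worse:.4f}")
```

Because the loss depends only on log-probabilities from the policy and the reference model, it can be minimized with ordinary gradient descent, which is where the simplicity comes from.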

Pros

Far simpler than RLHF: one loss and an ordinary backward pass, with no RL sampling loop
No reward model needed
Results reported to be close to RLHF

Many open-source models use DPO today, including Llama 3.x, Mistral, and Qwen.

An Unsolved Problem: “Align to Whom”

Core philosophical question:

Whose values should AI be aligned to?

Align to users? → users might want harmful things
Align to majority? → majority might discriminate against minorities
Align to “universal values”? → values are culturally specific
Align to companies? → company positions aren’t always right
Align to AI’s own judgment? → where would AI’s judgment come from

This question has no simple answer. Every current method makes some compromise, and every AI model you use today embeds its developer’s decision about “what to align to.”

An Interesting Finding

Anthropic’s research from 2023-2024 found that some “alignment” behaviors do not appear in small models, only in large ones:

Refusing harmful requests: small models refuse indiscriminately; large models refuse selectively
Moral reasoning: small models memorize rules; large models can explain why
Contextual understanding: small models apply one rule to everything; large models adjust to the situation

This suggests alignment ability is somewhat “emergent”: it appears only above a certain capability threshold.

This is both good news (stronger models are more likely to align successfully) and bad news (nobody knows when new, unpredictable behaviors will “emerge”).

Current Best Practices

If you are training your own LLM and want it to “listen”:

Stage	Recommended
Get started	DPO (simple, cheap)
Mid-term	RLHF (if you have annotation budget)
Complex rules	CAI (writing constitution easier than annotating each response)
Critical scenarios	Red Teaming (L6-03) + multi-method stack

Large labs today rarely rely on a single method; they typically combine RLHF, CAI, DPO, and human review.

💡 An observation

Anthropic’s alignment research is widely recognized as leading: their papers are high quality and transparent.

If you want to learn alignment, read Anthropic’s research first:

Constitutional AI (2022)
AI Safety Research (continuous updates)
Sleeper Agents (2024)
Alignment Faking (2024)

Each is a key read.

Next: “Red Teaming and Jailbreaks: Attack Methods and Defenses for LLMs”