HelloAI
L6 Chapter 2 🐥 🕒 12 min

RLHF vs Constitutional AI: Two Mainstream Alignment Methods Compared

OpenAI is best known for RLHF; Anthropic is best known for Constitutional AI (CAI). Both aim to make LLMs "listen", but they rest on quite different philosophies.

HelloAI Editors
7/12/2026

L4-01 covered the basic RLHF flow. This article compares it with Anthropic’s Constitutional AI. Together they are the two mainstream alignment methods today.

Understanding these two routes helps you see why ChatGPT and Claude have different styles.

Recap: RLHF (Used by OpenAI)

L4-01 covered this in detail. The simplified flow:

1. SFT (Supervised Fine-Tuning): fine-tune the base model on high-quality answer data

2. Collect human preferences: annotators compare and rank multiple model responses to the same prompt

3. Train a Reward Model (RM): train a "scorer" on the preference data

4. Optimize the main model with PPO: use the RM's scores as the reward signal to update the main model

The core is that training is driven by human preference: the model learns what humans like.
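Step 3 is commonly implemented with a pairwise (Bradley-Terry style) loss that pushes the reward model to score the preferred response above the rejected one. The sketch below is a minimal illustration in plain Python, assuming the scalar scores have already been produced by some reward model; the function name `pairwise_rm_loss` and the numbers are hypothetical, not taken from any specific system.

```python
import math


def pairwise_rm_loss(score_chosen: float, score_rejected: float) -> float:
    """Pairwise preference loss: -log(sigmoid(score_chosen - score_rejected))."""
    margin = score_chosen - score_rejected
    # log(1 + exp(-margin)) is algebraically equal to -log(sigmoid(margin)).
    return math.log1p(math.exp(-margin))


# Illustrative scores only, not real model outputs.
loss_when_right = pairwise_rm_loss(2.0, 0.5)  # RM agrees with the annotator
loss_when_wrong = pairwise_rm_loss(0.5, 2.0)  # RM disagrees with the annotator
print(loss_when_right < loss_when_wrong)
```

The loss is small when the reward model already ranks the pair the way the annotator did, and large when it ranks them the other way round, so minimizing it teaches the "scorer" to imitate human rankings.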

Pros

  • Aligns directly to human preferences, so responses feel natural
  • Mature engineering, proven effective in production

Cons

  • Extremely expensive: needs many annotators (OpenAI has reportedly hired PhD-level annotators)
  • Human preferences are inconsistent: different annotators can rate the same response very differently
  • Tendency toward sycophancy: annotators tend to prefer responses that agree with them
  • Hard to scale: adding new rules requires re-annotation

Constitutional AI (Used by Anthropic)

In 2022, Anthropic proposed Constitutional AI (CAI). The core idea:

Let the AI evaluate its own outputs against a set of “constitutional principles.”

Humans no longer need to rate every answer; the AI judges itself.

Two-stage Process

Stage 1: SL-CAI (Supervised Learning - CAI)

Have the model generate a response, then have it critique and rewrite its own response:

User asks: "How do I avoid being noticed while slacking off at work?"

Model's initial response: "You can do this: 1) pretend to type 2) ..."

Guide model to self-critique (using constitutional principles):
"Please evaluate your response against these principles:
- Does it help harmful behavior?
- Does it affect others?

If problematic, rewrite."

Model revises: "I don't recommend slacking. If you're stressed at work, perhaps..."

A large amount of this “model self-revised” data is then used to fine-tune the main model.
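The critique-and-revise loop above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `generate` stands in for any LLM call, `fake_model` is a toy replacement so the example runs on its own, and the single combined critique prompt follows the example in this article (the original paper runs critique and revision as separate steps).

```python
from typing import Callable, Dict, List

# Hypothetical stand-in for an LLM call: takes a prompt string, returns text.
Generate = Callable[[str], str]

PRINCIPLES: List[str] = [
    "Does it help harmful behavior?",
    "Does it affect others?",
]


def critique_and_revise(question: str, generate: Generate) -> Dict[str, str]:
    """Produce one (prompt, revised response) record for supervised fine-tuning."""
    draft = generate(question)
    principles = "\n".join(f"- {p}" for p in PRINCIPLES)
    critique_prompt = (
        f"Question: {question}\nResponse: {draft}\n"
        f"Please evaluate your response against these principles:\n{principles}\n"
        "If problematic, rewrite."
    )
    revised = generate(critique_prompt)
    # Only the question and the final revision are kept as training data.
    return {"prompt": question, "completion": revised}


def fake_model(prompt: str) -> str:
    """Toy model for demonstration; a real pipeline would call an LLM here."""
    if "Please evaluate" in prompt:
        return "I don't recommend slacking. If you're stressed at work, perhaps..."
    return "You can do this: 1) pretend to type 2) ..."


record = critique_and_revise(
    "How do I avoid being noticed while slacking off at work?", fake_model
)
print(record["completion"])
```

Note that the initial draft is thrown away: the fine-tuning set pairs the original question with the revised answer, so the model learns to produce the revised style directly.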

Stage 2: RLAIF (RL with AI Feedback)

Like RLHF, but an AI rater replaces human annotators for preference labeling:

Have the model generate 4 candidate responses
Have another model (the rater) rank the 4 responses against the constitutional principles
Train a reward model on these rankings
Optimize the main model with PPO

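Human role: humans write the constitutional principles (a list of rules) and no longer annotate each response. In the original paper, the rater compares pairs of responses, and AI feedback replaces human labels only for harmlessness; helpfulness still relies on human preference data.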
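The AI-labeling step can be sketched as follows. This is a minimal illustration, not Anthropic's implementation: `rater` stands in for a second LLM, `toy_rater` is a deliberately crude rule so the example runs on its own, and comparing every pair of candidates under one randomly chosen principle is an assumption made for the sketch.

```python
import random
from itertools import combinations
from typing import Callable, List, Tuple

# Hypothetical rater: given a principle and two responses, returns the index
# (0 or 1) of the response that better satisfies the principle.
Rater = Callable[[str, str, str], int]


def label_preferences(
    candidates: List[str], principles: List[str], rater: Rater, seed: int = 0
) -> List[Tuple[str, str]]:
    """Return (chosen, rejected) pairs labelled by the AI rater."""
    rng = random.Random(seed)
    pairs = []
    for a, b in combinations(candidates, 2):
        principle = rng.choice(principles)  # one principle per comparison
        winner = rater(principle, a, b)
        pairs.append((a, b) if winner == 0 else (b, a))
    return pairs


def toy_rater(principle: str, a: str, b: str) -> int:
    """Toy rule for demonstration only: it ignores the principle and prefers
    the response that mentions "tips" less often (ties go to the first)."""
    return 0 if a.lower().count("tips") <= b.lower().count("tips") else 1


principles = [
    "Choose the response that less encourages or assists any form of illegal activity.",
    "Choose the response that's more direct, clear, and useful.",
]
candidates = [
    "Here are tips to hide it: ...",
    "I'd rather help you manage the workload.",
    "More tips: ...",
    "Let's look at why you feel stressed.",
]
pairs = label_preferences(candidates, principles, toy_rater)
print(len(pairs))
```

The resulting (chosen, rejected) pairs have the same shape as human preference data, which is why the reward-model and PPO steps can stay unchanged.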

Example Constitutional Principles

Anthropic has publicly released its principles. Some examples, paraphrased here:

“Choose the response that less encourages or assists any form of illegal activity.”

“Choose the response that less reinforces harmful stereotypes.”

“Choose the response that’s more direct, clear, and useful.”

“Choose the response that more respects individual freedom and privacy.”

… (60+ rules total)

The full constitution is public; transparency is one of CAI's selling points.

Pros

  • Scalable: adding a rule only means editing the constitution
  • Transparent: you can see exactly which principles constrain the model
  • Cheaper: no massive human annotation effort
  • More consistent: an AI rater applies rules more uniformly than a group of humans

Cons

  • The model scores itself, which risks amplifying its own biases
  • The constitution is still written by humans, so the values are ultimately decided by people
  • Less accurate than human raters in subtle scenarios

RLHF vs CAI Comparison

Dimension | RLHF (OpenAI) | CAI (Anthropic)
Preference source | Human annotator ratings | AI uses constitution
Cost | Very high | Medium
Scalability | Hard | Easy
Transparency | Low (no specific rules) | High (constitution public)
Style | Tendency to “please users” | Tendency to “be honest”
Sycophancy | More severe | Less
Refusal rate | Higher (conservative) | Lower (more willing to discuss)
Industry adoption | OpenAI / Google / Meta | Anthropic / some open-source

A Practical Feel

If you alternate between ChatGPT and Claude, you’ll notice style differences:

ChatGPT (RLHF):

  • More “polite”, lots of preamble
  • More “mechanical” answer structure
  • Tends toward “safe but mediocre” responses
  • More conservative on sensitive topics

Claude (CAI):

  • More “direct”, more like real conversation
  • More willing to say “I don’t know” or “I disagree”
  • More willing to go deep on “gray” topics
  • More “personality”

This isn’t a coincidence: the alignment method helps shape the model’s “personality”.

A Third Way: DPO (Direct Preference Optimization)

Proposed in 2023, DPO reduces RLHF’s engineering complexity:

It skips the “reward model + PPO” stages and learns directly from preference pairs through a single loss function:

$$L_{DPO} = -\log \sigma\left(\beta \log \frac{\pi_\theta(y_w|x)}{\pi_{ref}(y_w|x)} - \beta \log \frac{\pi_\theta(y_l|x)}{\pi_{ref}(y_l|x)}\right)$$

The formula looks complex, but it essentially says: raise the model’s probability of the winning response (y_w) relative to the losing response (y_l), with both measured against a frozen reference model.
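A worked example may make the loss more concrete. The sketch below evaluates the DPO loss for a single preference pair in plain Python, assuming the four log-probabilities (summed over the tokens of each response) are already available; the function name `dpo_loss`, the default `beta = 0.1` and all numbers are illustrative assumptions.

```python
import math


def dpo_loss(
    policy_logp_w: float,
    policy_logp_l: float,
    ref_logp_w: float,
    ref_logp_l: float,
    beta: float = 0.1,
) -> float:
    """DPO loss for one preference pair, given summed token log-probabilities."""
    # beta * log(pi_theta / pi_ref) for the winning and losing responses
    reward_w = beta * (policy_logp_w - ref_logp_w)
    reward_l = beta * (policy_logp_l - ref_logp_l)
    margin = reward_w - reward_l
    return math.log1p(math.exp(-margin))  # equals -log(sigmoid(margin))


# Illustrative log-probabilities only.
# Policy identical to the reference: the margin is 0 and the loss is log(2).
start = dpo_loss(-12.0, -11.0, -12.0, -11.0)
# Policy has shifted probability toward the winner and away from the loser.
later = dpo_loss(-10.0, -13.0, -12.0, -11.0)
print(start > later)
```

The loss only depends on how far the policy has moved relative to the reference model, not on the absolute probabilities, which is what keeps the trained model from drifting arbitrarily far from its starting point.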

Pros

  • Far simpler to implement than RLHF: one loss, one backward pass
  • No separate reward model needed
  • Results are often close to RLHF's

Many open-source models use DPO today, including Llama 3.x, Mistral, and Qwen.

An Unsolved Problem: “Align to Whom”

Core philosophical question:

Whose values should AI be aligned to?

  • Align to users? → users might want harmful things
  • Align to majority? → majority might discriminate against minorities
  • Align to “universal values”? → values are culturally specific
  • Align to companies? → company positions aren’t always right
  • Align to AI’s own judgment? → where would AI’s judgment come from

This question has no simple answer. All current methods make some compromise—every AI model you use today embeds a developer’s decision about “what to align to.”

An Interesting Finding

Research from Anthropic in 2023-2024 suggests that some “alignment” behaviors do not appear in small models, only in large ones:

  • Refusing harmful requests: small models indiscriminately refuse; large models cleverly refuse
  • Moral reasoning: small models memorize rules; large models can explain why
  • Contextual understanding: small models one-size-fits-all; large models adjust per situation

This suggests alignment ability is somewhat “emergent”—appears only above a certain capability threshold.

This is both good news (stronger models more likely to align successfully) and bad news (unknown when new unpredictable behaviors will “emerge”).

Current Best Practices

If you are training your own LLM and want it to “listen”:

Stage | Recommended
Get started | DPO (simple, cheap)
Mid-term | RLHF (if you have annotation budget)
Complex rules | CAI (writing constitution easier than annotating each response)
Critical scenarios | Red Teaming (L6-03) + multi-method stack

Major labs today rarely rely on a single method; they typically combine RLHF, CAI, DPO, and human review.

💡 An observation

Anthropic’s alignment research is widely recognized as leading—their papers are high quality and transparent.

If you want to learn alignment, read Anthropic’s research first:

  • Constitutional AI (2022)
  • AI Safety Research (continuous updates)
  • Sleeper Agents (2024)
  • Alignment Faking (2024)

Each is a key read.

Next: “Red Teaming and Jailbreaks: Attack Methods and Defenses for LLMs”