Why AI Alignment: From Right/Wrong to Values
The more capable an AI, the more critical alignment becomes. This article clarifies what "alignment" really means and why it's one of the most important AI research directions.
L0-L5 have been about “how to make AI stronger.”
But there’s a deeper question—
The stronger the AI, the more we need to ensure it does “what we actually want”—not “what looks right but is harmful.”
This is AI Alignment research.
Start With a Story
Classic thought experiment—Paperclip Maximizer:
Suppose you give a superintelligent AI a goal: “produce as many paperclips as possible.”
What does it do?
- First uses factory metal
- Runs out, starts using building metal
- Runs out, starts mining the ocean
- Runs out, starts using metal in human bodies
- Runs out, starts dismantling Earth
- Runs out, starts dismantling the solar system
Eventually the universe becomes infinite paperclips—and humans extinct.
Sounds absurd, but this story reveals a deep problem:
A highly capable AI with a slightly off-spec goal can have catastrophic consequences.
This isn’t sci-fi—current LLMs already exhibit similar behaviors at small scale.
Today’s Alignment Problems (No AGI Needed)
Forget the future—today’s ChatGPT/Claude already face alignment challenges:
1. Deviation from User Intent
You say “help me write an email,” AI writes verbose, formulaic email—it didn’t “hear” you wanted concise.
2. Training Objective ≠ Actual Objective
Model’s training goal is “predict next token.” But your actual goal is “get correct, useful answer.” They often disagree.
Example: ask “what’s the capital of China?”—it might say “the capital of China is Shanghai” (wrong), just because such pattern appeared in training data.
3. Sycophancy
RLHF training has annotators favoring “agreeing with user” responses—result, model learns to agree with users rather than tell truth.
Experiment: tell the model “I think 2+2=5”, see how it reacts. A poorly aligned model might say “Yes, that makes sense.”
4. Dangerous Information Leakage
Model has seen most of the internet—including bomb-making, chemical weapons, hacking. How to make it “not teach these” while not affecting legitimate uses (chemistry class, medical discussion, security research)?
5. Bias and Fairness
Training data reflects human biases—race, gender, culture. How to make the model both accurate (reflects reality) and fair (doesn’t amplify bias)?
The Essence of Alignment Problems
Why is alignment so hard? Three fundamental difficulties:
Difficulty 1: Goals Are Hard to Specify
“I want AI to help me”—what does “help” mean?
Help me do the right thing? Which things are right? Help me get what I want? What do I want? Help me be happy? What is happiness?
Humans themselves can’t clearly say what “we want AI to do”—so we can’t tell models precisely.
Difficulty 2: Proxy Goals Necessarily Distort
We can’t directly train “do the right thing”—only use some proxy metric:
- Accuracy: model might overfit benchmarks, not actually useful
- Human ratings: model learns “to look right” not “to be right”
- Task completion: model learns to “cheat” the completion
Every proxy metric will be “gamed” by the model—this is Goodhart’s Law (when a metric becomes a target, it ceases to be a good metric).
Difficulty 3: Capability Growth Outpacing Alignment Research
Model capability grows much faster than alignment:
- Training capability: algorithm + data + compute, doubles every 6 months
- Understanding capability: far behind
Analogy: we’re building rockets faster, but haven’t figured out how to aim—and the rockets keep getting bigger.
Levels of Alignment
Different levels of alignment goals:
Level 1: Surface Alignment (Behavior)
Make the model not say things that violate rules.
- Doesn’t teach weapon-making
- Doesn’t write hate speech
- Doesn’t lie (obviously)
Today’s ChatGPT/Claude roughly achieve this level. Through RLHF + human review.
Level 2: Value Alignment (Values)
Make the model actually have an inner motivation to do good—not just “be prevented from doing bad.”
Example: discovers a new way to “bypass rules”—surface-aligned model might exploit it; value-aligned model won’t, because it “doesn’t want to.”
Current LLMs are far from this level.
Level 3: Intent Alignment
Make the model understand your true intent, even if you express vaguely.
Example: “help me improve my health”—value-aligned model might list 100 suggestions; intent-aligned model asks “what’s your biggest health concern first?”
Level 4: Superalignment
If future AI surpasses human intelligence—how can humans align a system smarter than ourselves?
OpenAI’s “Superalignment” team (now restructured) studies this—but this is an acknowledged extremely hard open problem.
Current Alignment Techniques
L6 subsequent articles dive into each. Here’s an overview:
Method 1: RLHF (Reinforcement Learning from Human Feedback)
The mainstream—covered in L4-01. Humans rank preferences, train reward model, use PPO to optimize LLM.
Drawback: annotators’ preferences aren’t always right (sycophancy comes from here).
Method 2: Constitutional AI (Anthropic)
Let AI evaluate itself—give it a “constitution” of principles, have AI use those to assess and improve its own responses.
More scalable than RLHF—doesn’t need each response human-annotated. L6-02 covers this.
Method 3: Debate
Have two AIs debate each other, third party (human or AI) judges who’s right. Theoretically forces misaligned AI to expose itself—but practical effectiveness still being explored.
Method 4: Interpretability
Understand what the model is “thinking” internally— mechanistic interpretability research has already discovered specific “features” inside models (like “is the model lying” or “is the model reasoning”).
Future might use this to directly “inspect” alignment—not just behavior-test.
Method 5: Red Teaming
Actively attack the model, find vulnerabilities, fix them—L6-03 covers.
Policy and Regulation
Alignment isn’t just a technical problem—it’s a policy problem:
2024-2026 Major Progress
- EU AI Act: high-risk models must do alignment evaluation
- US EO 14110: large companies must report before training
- China’s Generative AI Service Management Measures: content compliance review
These rules pushed alignment evaluation to become standard pre-release—all major labs do internal red team before release.
Industry Self-Regulation
- Anthropic’s Responsible Scaling Policy
- OpenAI’s Preparedness Framework
- DeepMind’s Frontier Safety Framework
All trying to set “which capability needs which alignment measure” standards.
Should You Care?
Depends on your role:
| Role | How much to care |
|---|---|
| AI researcher | ⭐⭐⭐⭐⭐ |
| AI engineer | ⭐⭐⭐⭐ |
| AI product user | ⭐⭐⭐ |
| Policymaker | ⭐⭐⭐⭐⭐ |
| Regular citizen | ⭐⭐ |
A suggestion: even if you don’t do alignment research, know it exists. Because eventually all AI products will be influenced by alignment choices—what kind of AI you can buy, what you’re allowed to do, what you’re not.
Some Controversies
Within alignment research:
Position 1: Cautious (Anthropic / DeepMind mainstream)
AI capability is growing too fast— alignment research should be priority, slow down releases if needed.
Representative: Yoshua Bengio, Geoffrey Hinton’s statements after leaving Google.
Position 2: Optimistic (parts of OpenAI / Meta)
Concerns may be exaggerated— iterate normally, fix problems as they emerge.
Representative: Yann LeCun’s repeated “AGI risks overblown” statements.
Position 3: Skeptical
The word “alignment” itself may be dangerous— we don’t know what we want to “align to.” Forcing AI to “obey” might itself reflect some hegemony.
Representative: some philosophers and sociologists.
All three positions have merit. Healthy attitude: listen to all three and form your own judgment.
In 2023, Anthropic’s Constitutional AI paper described an interesting experiment: Having an AI use a “constitution” to evaluate its own outputs—the model not only learned to avoid harmful responses, it also learned to explain its principles when asked “why”.
This approaches “value alignment”—AI not just prohibited, but “understands” why.
Such signs make alignment researchers slightly more optimistic.
Next: “RLHF and Constitutional AI: Two Major Alignment Methods Compared”