How LLMs Are Trained: Pretrain → SFT → RLHF, the Full Pipeline
Training an LLM isn't one step; it's three completely different stages, each with its own data, objective, and cost. After this article, you'll be able to hold your own in a conversation with ML engineers.
After L0/L1/L3 you have all the prerequisites. This article opens the LLM training black box: you’ll see how ChatGPT goes from random parameters to “all-knowing assistant.”
Stage 1: Pre-training
Goal
Make the model learn “language itself”—grammar, facts, reasoning, world knowledge.
Data
Close to the entire public internet, plus curated sources:
- Common Crawl (internet scrape, tens of TB)
- Wikipedia (high-quality encyclopedia)
- Books (millions)
- Code (GitHub)
- Academic papers (arXiv)
- …
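GPT-3’s raw Common Crawl data was about 45TB of compressed text, which shrank to roughly 570GB after quality filtering and deduplication. GPT-4’s training set has not been disclosed; it is estimated to be several times larger.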
Objective: predict the next token
Just one thing—given some text, predict the next token.
"Today the weather is really" → model predicts probability of "good" / "nice" / "bad" / ...
Loss: cross entropy (covered in L1-04).
This is self-supervised learning—the data itself constructs the labels (every position’s “real next token”). No human annotation needed—this is the core reason LLMs scale.
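To make the "data constructs its own labels" idea concrete, here is a minimal sketch in plain Python. The helper names (`next_token_pairs`, `cross_entropy`) and the probabilities are made up for illustration; a real pipeline works on token IDs and batches, not word lists.

```python
import math

def next_token_pairs(tokens):
    """Build (context, target) examples from one token sequence.

    The label at each position is simply the token that actually
    comes next, so no human annotation is involved.
    """
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

def cross_entropy(predicted_probs, target):
    """Cross-entropy loss for a single position: -log p(target)."""
    return -math.log(predicted_probs[target])

tokens = ["Today", "the", "weather", "is", "really", "good"]
pairs = next_token_pairs(tokens)

# Hypothetical model output for the final context (numbers are invented).
probs = {"good": 0.6, "nice": 0.3, "bad": 0.1}
context, target = pairs[-1]
loss = cross_entropy(probs, target)
print(context, "->", target, f"loss={loss:.3f}")
```

A six-token sentence yields five training examples for free, which is why raw text scales so well as training data.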
Training Scale
- Compute: measured in FLOPs (floating-point operations); GPT-3 took about 3 × 10^23 FLOPs
- Time: weeks to months (thousands of GPUs in parallel)
- Cost: not publicly disclosed; estimates run from roughly $10M to $100M+ depending on model scale
This is the most expensive stage of LLM training—almost all compute cost is here.
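For a feel of where the compute number comes from, a widely used rule of thumb (an assumption brought in here, not something this article derives) is that training costs about 6 FLOPs per parameter per token. The sketch below plugs in GPT-3's published size of 175B parameters and 300B training tokens; the function name is hypothetical.

```python
def training_flops(n_params, n_tokens):
    """Rough training compute: ~6 FLOPs per parameter per token
    (about 2 for the forward pass, about 4 for the backward pass).
    This is an approximation, not an exact accounting."""
    return 6 * n_params * n_tokens

gpt3_flops = training_flops(n_params=175e9, n_tokens=300e9)
print(f"{gpt3_flops:.2e}")  # prints 3.15e+23
```

The same formula lets you sanity-check claims about other models: double the parameters or the tokens and the compute bill doubles with it.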
Output: Base Model
After pretraining, you get a model that only knows how to “continue text”. It can write decent sentences, but doesn’t follow your instructions.
If you ask a base model “write me a poem”—it might continue “write me a poem requires inspiration and…” It treats your instruction as the start of text to continue.
So it’s not directly usable yet.
Stage 2: SFT (Supervised Fine-Tuning)
Goal
Make the model learn to follow instructions.
Data
Human-written high-quality “instruction-response” pairs:
Instruction: Write a short poem about spring.
Response: Flowers bloom in March's warm,
Gentle breeze stirs swallows' wings,
Where do green hills end?
Springtime fills each line of verse.
Instruction: Write a Python function to check if a number is prime.
Response:
def is_prime(n):
    if n < 2:
        return False
    for i in range(2, int(n**0.5) + 1):
        if n % i == 0:
            return False
    return True
Typically 50K to 500K such pairs.
Key: this data must be high quality, so it is usually written by professional writers and domain experts. OpenAI reportedly hired PhD-level experts to write SFT data for GPT-4.
Training
Still “predict next token”—but only compute loss on the response part, not the instruction (you don’t want the model to learn “imitate user prompts”).
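The "loss on the response only" rule is usually implemented as a mask over token positions. Below is a minimal sketch under that assumption; `sft_loss` and the toy numbers are illustrative, and real frameworks apply the same idea to tensors rather than Python lists.

```python
import math

def sft_loss(token_log_probs, is_response):
    """Average negative log-likelihood over response tokens only.

    token_log_probs: log-probability the model gave each actual token.
    is_response: True for response tokens, False for instruction tokens
    (instruction positions are masked out and contribute no loss).
    """
    kept = [-lp for lp, keep in zip(token_log_probs, is_response) if keep]
    return sum(kept) / len(kept)

# Toy sequence: 3 instruction tokens followed by 2 response tokens.
log_probs = [math.log(0.1), math.log(0.2), math.log(0.3),
             math.log(0.5), math.log(0.25)]
mask = [False, False, False, True, True]

loss = sft_loss(log_probs, mask)
print(f"{loss:.4f}")
```

Because the instruction positions are masked, how well the model predicts the user's prompt has no effect on the gradient; only the quality of the response does.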
Cost and scale:
- Data volume: 10,000+ times smaller than pretraining
- Training time: hours to days
- Cost: roughly $10K to $100K
Output: SFT Model
At this point, the model follows instructions—can write poems, code, answer questions.
But issues remain:
- May output harmful/unhelpful answers
- Has no “style preferences”—correct but mechanical
So we need step three.
Stage 3: RLHF (Reinforcement Learning from Human Feedback)
This is the core step that makes ChatGPT good.
Goal
Make the model learn “what humans like”—not just correct, but useful, safe, polite, appropriately detailed.
Process (three sub-steps)
Step 3.1: Collect Preference Data
For the same instruction, have the model generate 4-5 candidate responses and have human annotators rank them:
Instruction: Explain why the sky is blue.
Candidate A: The sky is blue because of light and atmosphere.
Candidate B: The sky appears blue because sunlight is made of various colors.
Short-wavelength light (blue) scatters more easily in the atmosphere...
(detailed, accurate)
Candidate C: Because of water reflection. (wrong)
Candidate D: The blue sky reminds me of childhood summers, those carefree afternoons...
(off-topic)
Human ranking: B > A > D > C
Collect tens of thousands of such rankings.
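In practice, a ranking like the one above is often expanded into pairwise comparisons before training. The sketch below shows that expansion; the function name is hypothetical and the candidate labels come from the sky example.

```python
from itertools import combinations

def ranking_to_pairs(ranking):
    """Expand a best-to-worst ranking into (preferred, rejected) pairs.

    combinations() keeps the input order, so the first element of each
    pair is always the one ranked higher.
    """
    return list(combinations(ranking, 2))

pairs = ranking_to_pairs(["B", "A", "D", "C"])
for preferred, rejected in pairs:
    print(f"{preferred} > {rejected}")
```

One ranking of four candidates gives six comparisons, so ranking several responses at once is a cheaper way to collect preference data than judging pairs one at a time.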
Step 3.2: Train Reward Model
Use preference data to train a small model—input “instruction + response”, output “how good” (a numerical score):
RewardModel(instruction, response) → score
Each ranking is broken into pairs, and the reward model is trained to score the preferred response above the rejected one. In effect, it learns “what kind of response humans like.”
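One common way to train such a reward model is a pairwise loss: penalize the model whenever the human-preferred response does not score clearly higher than the rejected one. This is a sketch of that standard formulation, not necessarily what any specific lab uses; the function name and scores are illustrative.

```python
import math

def pairwise_reward_loss(score_chosen, score_rejected):
    """Pairwise preference loss: -log(sigmoid(score_chosen - score_rejected)).

    Written as log(1 + exp(-margin)), which is the same quantity.
    The loss shrinks as the chosen response outscores the rejected one.
    """
    margin = score_chosen - score_rejected
    return math.log1p(math.exp(-margin))

# Hypothetical reward-model scores for candidates B (preferred) and C (rejected).
good_ordering = pairwise_reward_loss(score_chosen=2.0, score_rejected=-1.0)
bad_ordering = pairwise_reward_loss(score_chosen=-1.0, score_rejected=2.0)
print(f"{good_ordering:.3f} {bad_ordering:.3f}")
```

Only the difference between the two scores matters, which is why reward scores are meaningful relative to each other rather than on an absolute scale.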
Step 3.3: Optimize Main Model with PPO
Now apply reinforcement learning with the PPO algorithm: the main model generates responses, the reward model scores them, and training reinforces high-scoring behavior while discouraging low-scoring behavior.
LLM generates → RewardModel scores
  high score → reinforce this behavior
  low score → reduce this behavior
This is technically very complex: keeping PPO stable for LLMs is an engineering challenge in its own right.
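Part of how PPO stays stable is its clipped objective, which limits how far a single update can move the model. The sketch below shows that standard clipped term for one sample; the function name and the default `clip_eps=0.2` are illustrative assumptions, and real RLHF systems add more pieces (for example a penalty for drifting too far from the SFT model).

```python
def ppo_clipped_objective(ratio, advantage, clip_eps=0.2):
    """PPO clipped surrogate objective for a single sample.

    ratio: new_policy_prob / old_policy_prob for the generated tokens.
    advantage: how much better than expected the response scored
               (derived from the reward model's score).
    Clipping removes the incentive to push the ratio far from 1.
    """
    clipped_ratio = max(1 - clip_eps, min(ratio, 1 + clip_eps))
    return min(ratio * advantage, clipped_ratio * advantage)

# Good response, policy already moved a lot: the gain is capped.
capped = ppo_clipped_objective(ratio=1.5, advantage=1.0)
# Policy unchanged: the objective is just the advantage.
unchanged = ppo_clipped_objective(ratio=1.0, advantage=2.0)
print(capped, unchanged)
```

The training loop maximizes this value, so high-advantage responses become more likely and low-advantage ones less likely, but only in small, bounded steps.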
Cost
RLHF is 10-100x more expensive than SFT—mostly human annotation cost.
The GPT-4 RLHF stage is estimated to have cost several million dollars, mostly in annotator wages.
Output: Your ChatGPT / Claude / Gemini
At this stage, the model becomes:
- Obedient (follows instructions)
- Polite (no swearing)
- Detailed (doesn’t give lazy one-line answers)
- Safe (refuses harmful requests)
- Stylistically consistent (“How may I help you?” pattern)
A Diagram
Original Transformer architecture
↓
[Stage 1 · Pretrain]
Data: massive internet text
Goal: predict next token
Output: base model (continues but doesn't obey)
Cost: $10M - $100M+
↓
[Stage 2 · SFT]
Data: 50-500K high-quality instructions+responses
Goal: learn to follow instructions
Output: SFT model (obeys but mechanical)
Cost: $10K-100K
↓
[Stage 3 · RLHF]
Data: tens of thousands of preference pairs
Goal: stylization, align to human preferences
Output: your ChatGPT / Claude
Cost: $1M-10M
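Pretraining teaches roughly 99% of the capability; SFT + RLHF tune the remaining 1% of “behavior”.
This is why a relatively small amount of SFT and RLHF on a GPT-3-class base model can produce something like ChatGPT: the base model already “knew” almost everything, it just didn’t know how to use it.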
Modern Variants: DPO / ORPO etc.
RLHF is complex to engineer and expensive. Simpler alternatives emerged in 2023-2024:
DPO (Direct Preference Optimization)
Core idea: skip the “reward model + PPO” steps and train directly on preference data with a simplified loss.
Results are close to RLHF, with roughly 10x less engineering complexity. Llama 3, Mistral, and Qwen all use DPO.
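The simplified loss can be sketched in a few lines. This follows the commonly published DPO formulation: compare how much more the policy favors the chosen response over the rejected one, relative to a frozen reference model. The function name, argument names, and `beta=0.1` are illustrative assumptions.

```python
import math

def dpo_loss(policy_chosen_lp, policy_rejected_lp,
             ref_chosen_lp, ref_rejected_lp, beta=0.1):
    """DPO loss for one preference pair.

    Each argument is the summed log-probability of a full response
    under the policy being trained or the frozen reference model.
    """
    chosen_gain = policy_chosen_lp - ref_chosen_lp
    rejected_gain = policy_rejected_lp - ref_rejected_lp
    logits = beta * (chosen_gain - rejected_gain)
    # -log(sigmoid(logits)), written as log(1 + exp(-logits))
    return math.log1p(math.exp(-logits))

# Policy identical to the reference: the loss is log(2).
start = dpo_loss(-10.0, -12.0, -10.0, -12.0)
# Policy has shifted probability toward the chosen response: lower loss.
improved = dpo_loss(-8.0, -14.0, -10.0, -12.0)
print(f"{start:.4f} {improved:.4f}")
```

No separate reward model and no sampling loop are needed: the preference pairs feed a plain supervised-style loss, which is where the engineering savings come from.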
ORPO / KTO etc.
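A more radical simplification. ORPO folds SFT and preference alignment into a single training stage; KTO learns from individual good/bad labels instead of paired comparisons.
Trend: RLHF remains the “gold standard,” but the open-source ecosystem increasingly uses DPO because of cost pressure.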
Open vs Closed Training
| Stage | Open (Llama 3) | Closed (GPT-4) |
|---|---|---|
| Pretrain data | 15T tokens | est. 13T tokens |
| Model params | 70B / 405B | est. 1.7T (MoE) |
| SFT | tens of thousands manual + synthetic | hundreds of thousands |
| Preference alignment | DPO mainly | RLHF + multi-round iteration |
| Total cost | ~$50M | $100M+ |
Closed models’ advantage lies mainly in RLHF polish; their architecture and pretraining data are not as mysterious as they seem.
Some “Strange” Facts
LLMs Don’t Know They Were Trained
When you ask ChatGPT “how were you trained”—its answers are what it read in OpenAI’s public blog, not its actual training memory. It has no “autobiography.”
Training Data Determines Everything
Why is GPT slightly better at Chinese than Claude? Why is Llama’s Chinese weaker? It largely comes down to the proportion of Chinese data in training.
Model architectures are similar; differences are in data.
Fine-tuning Is 10,000x Cheaper Than Pretrain
This is why you can also “fine-tune an LLM” (L4-05 covers LoRA)—no re-pretraining, just hours of fine-tuning on a base model.
Big companies release base models (like Llama 3) so the world helps them do SFT and applications—they provide “the foundation,” community does “verticalization.”
One-Line Summary
Training an LLM = make it “read the entire internet” first, then “learn to listen”, finally “get groomed by humans to be likable”.
Each step uses completely different data, objectives, methods. All three are essential.
Next: “RAG from 0 to 1” — you know how LLMs are made; next, how to build apps with them.