L4 Chapter 1 🐣 🕒 11 min

How LLMs Are Trained: Pretrain → SFT → RLHF, the Full Pipeline

Training an LLM isn't one step—it's three completely different stages. Each uses different data, different objectives. After this, you can talk to ML engineers.

HelloAI Editors

7/4/2026

After L0/L1/L3 you have all the prerequisites. This article opens the LLM training black box—you’ll see how ChatGPT goes from random parameters to “all-knowing assistant.”

A secret: training an LLM happens in three completely different stages, each with different data, different objectives, and different costs.

Stage 1: Pre-training

Goal

Make the model learn “language itself”—grammar, facts, reasoning, world knowledge.

Data

The entire internet. Literally.

Common Crawl (internet scrape, tens of TB)
Wikipedia (high-quality encyclopedia)
Books (millions)
Code (GitHub)
Academic papers (arXiv)
…

GPT-3's Common Crawl data started as about 45TB of compressed text and shrank to roughly 570GB after quality filtering and deduplication. GPT-4 is estimated to have used several times that.

Objective: predict the next token

Just one thing—given some text, predict the next token.

"Today the weather is really" → model predicts probability of "good" / "nice" / "bad" / ...

Loss: cross entropy (covered in L1-04).

This is self-supervised learning—the data itself constructs the labels (every position’s “real next token”). No human annotation needed—this is the core reason LLMs scale.
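To make the "data constructs its own labels" point concrete, here is a minimal sketch that shifts a token sequence by one position to build (context, next token) pairs, then evaluates cross entropy for a single prediction. The whitespace tokenizer and the probability values are illustrative assumptions; real models use subword tokenizers and produce probabilities over a full vocabulary.

```python
import math

def next_token_pairs(tokens):
    """Turn one token sequence into (context, target) training pairs."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

def cross_entropy(predicted_probs, target):
    """Loss for one position: negative log-probability of the true next token."""
    return -math.log(predicted_probs[target])

# Toy whitespace "tokenizer" (an assumption; real models use subword tokenizers).
tokens = "Today the weather is really good".split()
pairs = next_token_pairs(tokens)

# Hypothetical model output for the context "Today the weather is really".
probs = {"good": 0.5, "nice": 0.3, "bad": 0.2}
loss = cross_entropy(probs, "good")
```

Every position in the text yields one training example, so a six-token sentence gives five labeled pairs without anyone writing a label.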

Training Scale

Compute: measured in FLOPs (floating point ops)—GPT-3 about $3 \times 10^{23}$ FLOPs
Time: weeks to months (thousands of GPUs in parallel)
Cost: GPT-3 estimated $4.6M, GPT-4 estimated $100M

This is the most expensive stage of LLM training—almost all compute cost is here.
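The compute figure can be sanity-checked with the common rule of thumb that training costs roughly 6 FLOPs per parameter per token. The sketch below applies it to GPT-3's published size (175B parameters, about 300B training tokens). The rule is an approximation, and the GPU throughput and utilization numbers are assumptions chosen only for illustration.

```python
def training_flops(n_params, n_tokens):
    """Rule-of-thumb estimate: ~6 FLOPs per parameter per training token."""
    return 6 * n_params * n_tokens

def gpu_days(total_flops, flops_per_gpu_per_sec, utilization):
    """Convert a FLOP budget into GPU-days at a given sustained utilization."""
    seconds = total_flops / (flops_per_gpu_per_sec * utilization)
    return seconds / 86_400

gpt3_flops = training_flops(175e9, 300e9)   # about 3.15e23

# Assumed hardware numbers for illustration only: 100 TFLOP/s peak, 40% utilization.
days_on_one_gpu = gpu_days(gpt3_flops, 100e12, 0.4)
days_on_1000_gpus = days_on_one_gpu / 1000
```

Under these assumed hardware numbers, a thousand GPUs would need on the order of three months, which is in line with the "weeks to months" range above.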

Output: Base Model

After pretraining, you get a model that only knows how to “continue text”. It can write decent sentences, but doesn’t follow your instructions.

If you ask a base model “write me a poem”—it might continue “write me a poem requires inspiration and…” It treats your instruction as the start of text to continue.

So it’s not directly usable yet.

Stage 2: SFT (Supervised Fine-Tuning)

Goal

Make the model learn to follow instructions.

Data

Human-written high-quality “instruction-response” pairs:

Instruction: Write a short poem about spring.
Response: Flowers bloom in March's warm,
          Gentle breeze stirs swallows' wings,
          Where do green hills end?
          Springtime fills each line of verse.

Instruction: Write a Python function to check if a number is prime.
Response: def is_prime(n):
              if n < 2: return False
              for i in range(2, int(n**0.5)+1):
                  if n % i == 0: return False
              return True

Typically 50K-500K of these.

Key: this data must be high quality, so it is usually written by professional writers and domain experts. OpenAI reportedly hired PhD-level experts to write SFT data for GPT-4.

Training

Still “predict next token”—but only compute loss on the response part, not the instruction (you don’t want the model to learn “imitate user prompts”).
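A minimal sketch of that response-only loss, assuming we already have the log-probability the model assigned to each true token plus a flag marking which tokens belong to the response. The function name and the numbers are hypothetical; training frameworks typically achieve the same effect by marking instruction positions as ignored in the label tensor.

```python
import math

def masked_sft_loss(token_log_probs, is_response):
    """Average negative log-likelihood over response tokens only.

    token_log_probs: log-probability the model gave each true token.
    is_response: True for response tokens, False for instruction tokens.
    """
    response_losses = [-lp for lp, keep in zip(token_log_probs, is_response) if keep]
    return sum(response_losses) / len(response_losses)

# Hypothetical per-token log-probs for 3 instruction tokens + 2 response tokens.
log_probs = [math.log(0.1), math.log(0.2), math.log(0.1), math.log(0.5), math.log(0.25)]
mask = [False, False, False, True, True]

loss = masked_sft_loss(log_probs, mask)
```

The three instruction tokens have poor log-probabilities here, but they contribute nothing to the loss, so the model is never pushed to get better at imitating the user's side of the conversation.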

Cost and scale:

Data volume 10000+ times smaller than pretrain
Training time: hours to days
Cost: thousands to tens of thousands of dollars

Output: SFT Model

At this point, the model follows instructions—can write poems, code, answer questions.

But issues remain:

May output harmful/unhelpful answers
Has no “style preferences”—correct but mechanical

So we need step three.

Stage 3: RLHF (Reinforcement Learning from Human Feedback)

This is the core step that makes ChatGPT good.

Goal

Make the model learn “what humans like”—not just correct, but useful, safe, polite, appropriately detailed.

Process (three sub-steps)

Step 3.1: Collect Preference Data

For the same instruction, have the model generate 4-5 candidate responses and have human annotators rank them:

Instruction: Explain why the sky is blue.

Candidate A: The sky is blue because of light and atmosphere.
Candidate B: The sky appears blue because sunlight is made of various colors.
            Short-wavelength light (blue) scatters more easily in the atmosphere...
            (detailed, accurate)
Candidate C: Because of water reflection. (wrong)
Candidate D: The blue sky reminds me of childhood summers, those carefree afternoons...
            (off-topic)

Human ranking: B > A > D > C

Collect tens of thousands of such rankings.
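A ranking is usually consumed as pairwise comparisons: a ranking of k responses expands into k(k-1)/2 "preferred vs. rejected" pairs. The helper below is a hypothetical sketch of that expansion using the example ranking above.

```python
from itertools import combinations

def ranking_to_pairs(ranking):
    """Expand a best-to-worst ranking into (preferred, rejected) pairs."""
    # combinations() keeps input order, so the first item of each pair ranks higher.
    return list(combinations(ranking, 2))

pairs = ranking_to_pairs(["B", "A", "D", "C"])
```

The four ranked candidates produce six pairs, starting with ("B", "A") and ending with ("D", "C").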

Step 3.2: Train Reward Model

Use the preference data to train a separate reward model. Its input is “instruction + response” and its output is a numerical score for how good the response is:

RewardModel(instruction, response) → score

This reward model is trained on preference data—it learned “what kind of response humans like.”
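One common way to train on such comparisons is a pairwise (Bradley-Terry style) loss that pushes the preferred response's score above the rejected one's. The sketch below shows that loss for a single pair. It is an illustration under that assumption, not a description of any specific lab's training code, and the scores are made up.

```python
import math

def pairwise_reward_loss(score_preferred, score_rejected):
    """-log(sigmoid(preferred - rejected)): small when the preferred response scores higher."""
    margin = score_preferred - score_rejected
    return math.log1p(math.exp(-margin))

loss_correct_order = pairwise_reward_loss(2.0, -1.0)   # reward model agrees with humans
loss_wrong_order = pairwise_reward_loss(-1.0, 2.0)     # reward model disagrees
loss_tie = pairwise_reward_loss(0.0, 0.0)              # reward model is undecided
```

Only the difference between the two scores matters, which is why a reward model's absolute numbers are hard to interpret on their own.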

Step 3.3: Optimize Main Model with PPO

Now apply reinforcement learning (the PPO algorithm): the main model generates responses, the reward model scores them, and the update encourages high-scoring behavior and discourages low-scoring behavior.

LLM generates → RewardModel scores → high → reinforce this behavior
                                  ↓
                                  low → reduce this behavior

Technically extremely complex—PPO stability for LLMs is an engineering challenge.
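The "reinforce high scores, reduce low scores" loop can be illustrated with a toy policy that chooses among three canned responses. Note the assumptions: this is plain policy gradient with an expected-reward baseline, not PPO (which adds clipped updates and, in RLHF setups, typically a penalty for drifting too far from the SFT model), and the reward values are hypothetical.

```python
import math

def softmax(logits):
    """Convert logits to probabilities in a numerically stable way."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def policy_gradient_step(logits, rewards, lr=0.5):
    """One exact expected-gradient step on the policy's expected reward."""
    probs = softmax(logits)
    baseline = sum(p * r for p, r in zip(probs, rewards))   # expected reward
    # d E[reward] / d logit_i = p_i * (r_i - baseline)
    return [z + lr * p * (r - baseline) for z, p, r in zip(logits, probs, rewards)]

# Three canned candidate responses with hypothetical reward-model scores.
rewards = [2.0, 0.5, -1.0]
logits = [0.0, 0.0, 0.0]          # the policy starts out uniform
before = softmax(logits)
for _ in range(20):
    logits = policy_gradient_step(logits, rewards)
after = softmax(logits)
```

After the updates, probability mass has shifted toward the highest-scoring response and away from the lowest-scoring one. A real LLM does the same thing over token sequences rather than three fixed options, which is where the stability problems come from.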

Cost

RLHF is 10-100x more expensive than SFT—mostly human annotation cost.

GPT-4 RLHF stage estimated to cost several million dollars—mostly annotator wages.

Output: Your ChatGPT / Claude / Gemini

At this stage, the model becomes:

Obedient (follows instructions)
Polite (no swearing)
Detailed (doesn't give lazy one-line answers)
Safe (refuses harmful requests)
Stylistically consistent (“How may I help you?” pattern)

A Diagram

Original Transformer architecture
        ↓
[Stage 1 · Pretrain]
   Data: massive internet text
   Goal: predict next token
   Output: base model (continues but doesn't obey)
   Cost: $10M - $100M+
        ↓
[Stage 2 · SFT]
   Data: 50-500K high-quality instructions+responses
   Goal: learn to follow instructions
   Output: SFT model (obeys but mechanical)
   Cost: $10K-100K
        ↓
[Stage 3 · RLHF]
   Data: tens of thousands of preference pairs
   Goal: stylization, align to human preferences
   Output: your ChatGPT / Claude
   Cost: $1M-10M

💡 A counter-intuitive note

Pretraining teaches 99% of the capability; SFT + RLHF tunes 1% of “behavior”.

This is why a relatively small amount of SFT and RLHF on top of a GPT-3.5-series base model was enough to produce ChatGPT: the base model already “knew” almost everything, it just didn't know how to use it.

Modern Variants: DPO / ORPO etc.

RLHF is complex to engineer and expensive. Simpler alternatives appeared in 2023-2024:

DPO (Direct Preference Optimization)

Core idea: skip “reward model + PPO”, directly use a simplified loss to train from preference data.

Quality is close to RLHF, but the pipeline is far simpler to engineer: there is no separate reward model and no RL loop. Llama 3, Mistral, and Qwen have all used DPO.
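For reference, the DPO loss for one preference pair can be written in a few lines. The sketch assumes we already have the summed log-probability of each full response under the model being trained (the policy) and under a frozen reference copy, usually the SFT model. The numbers and the beta value are illustrative.

```python
import math

def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for one preference pair, from sequence log-probabilities."""
    # How much more (or less) likely each response became relative to the reference.
    chosen_gain = policy_logp_chosen - ref_logp_chosen
    rejected_gain = policy_logp_rejected - ref_logp_rejected
    margin = beta * (chosen_gain - rejected_gain)
    return math.log1p(math.exp(-margin))   # equals -log(sigmoid(margin))

# Policy identical to the reference: no preference learned yet.
loss_at_start = dpo_loss(-12.0, -15.0, -12.0, -15.0)
# Policy now favors the chosen response more than the reference does.
loss_improved = dpo_loss(-10.0, -18.0, -12.0, -15.0)
```

The loss falls as the policy raises the chosen response's likelihood relative to the rejected one, all measured against the reference model. This is an ordinary supervised loss, so it trains with the same machinery as SFT.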

ORPO / KTO etc.

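These simplify further. ORPO folds SFT and preference alignment into a single stage, while KTO learns from unpaired “good” / “bad” labels instead of ranked pairs.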

Trend: RLHF remains the “gold standard,” but the open-source ecosystem increasingly uses DPO because of cost pressure.

Open vs Closed Training

Aspect	Open (Llama 3)	Closed (GPT-4)
Pretrain data	15T tokens	est. 13T tokens
Model params	70B / 405B	est. 1.7T (MoE)
SFT	tens of thousands manual + synthetic	hundreds of thousands
Preference alignment	DPO mainly	RLHF + multi-round iteration
Total cost	~$50M	$100M+

Closed models' advantage lies mainly in RLHF polish. Their architecture and pretraining data are less mysterious than they seem.

Some “Strange” Facts

LLMs Don’t Know They Were Trained

When you ask ChatGPT “how were you trained”—its answers are what it read in OpenAI’s public blog, not its actual training memory. It has no “autobiography.”

Training Data Determines Everything

Why is GPT slightly better at Chinese than Claude? Why is Llama's Chinese weaker? It depends largely on the proportion of Chinese data in training.

Model architectures are similar; differences are in data.

Fine-tuning Is 10,000x Cheaper Than Pretrain

This is why you can also “fine-tune an LLM” (L4-05 covers LoRA)—no re-pretraining, just hours of fine-tuning on a base model.

Big companies release base models (like Llama 3) so the world helps them do SFT and applications—they provide “the foundation,” community does “verticalization.”

One-Line Summary

Training an LLM = make it “read the entire internet” first, then “learn to listen”, finally “get groomed by humans to be likable”.

Each step uses completely different data, objectives, methods. All three are essential.

Next: “RAG from 0 to 1” — you know how LLMs are made; next, how to build apps with them.