Probability and Maximum Likelihood: Why ML Is a Probability Problem
Most ML models are doing one thing at heart: finding parameters that maximize the probability of the observed data. Once this clicks, much of ML becomes transparent.
By now you know:
- AI models are essentially matrix transformations (L1-02)
- Training = adjusting matrix entries to minimize loss (L1-03)
But there’s one unanswered question:
How exactly is “loss” defined? Why so many different loss functions?
The answer lies in probability theory. This article changes the lens—viewing all ML as a probability problem—and you’ll see all “loss functions” share one mathematical structure.
Re-examining ML’s Goal
Concrete example. Task: spam classification.
You show the model an email, and it outputs “spam” or “not spam.”
Naive description: “Make the model output the right label.”
A more accurate description:
Make the model output the probability of “this is spam.”
The model outputs a number 0-1: 0.92 means “92% likely spam.”
This has huge benefits:
- You see the model’s confidence
- You can set thresholds (for example, mark an email as spam when the probability exceeds 0.5)
- You can compare different models’ outputs (which is more confident)
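As a small illustration of the threshold idea, here is a sketch that turns predicted probabilities into labels. The probabilities and both cutoffs are made-up values for demonstration, not output from any real model.

```python
import numpy as np

# Hypothetical predicted spam probabilities for four emails (made-up values).
spam_prob = np.array([0.92, 0.15, 0.55, 0.48])

# Assumed default cutoff: anything above 0.5 is marked as spam.
is_spam = spam_prob > 0.5

# A stricter (also assumed) cutoff flags only very confident predictions.
is_spam_strict = spam_prob > 0.9

print(is_spam)
print(is_spam_strict)
```

Because the model outputs a probability rather than a hard label, the cutoff can be changed without retraining anything.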
Likelihood vs Probability
These two sound alike but mean very different things.
Probability
Given parameters, what’s the chance of seeing the data?
Example: given a fair coin (p=0.5), what’s the chance of getting “heads, tails, heads” in 3 flips?
Likelihood
Given data, what parameters best explain it?
Example: I flipped 3 times and got “heads, tails, heads”. What’s the coin’s $p$ (its probability of heads)?
We try different values of $p$ and see which makes the data most “reasonable”:

| $p$ | $P(\text{data} \mid p) = p^2(1-p)$ |
|---|---|
| 0.1 | 0.009 |
| 0.3 | 0.063 |
| 0.5 | 0.125 |
| 0.667 | 0.148 |
| 0.8 | 0.128 |
| 0.9 | 0.081 |

$p = 2/3 \approx 0.667$ maximizes the data’s probability. This is maximum likelihood estimation.
Intuition matches: 3 flips, 2 heads, 1 tail, so the best estimate is $p = 2/3$.
Maximum Likelihood Estimation (MLE) = pick parameters that make your observed data most likely.
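The coin example can be checked numerically. The sketch below evaluates the likelihood $p^2(1-p)$ of “heads, tails, heads” on a grid of candidate values and picks the largest; the grid resolution and the helper name `likelihood` are choices made for this illustration.

```python
import numpy as np

def likelihood(p, heads=2, tails=1):
    """Probability of one specific flip sequence with the given head/tail counts."""
    return p**heads * (1 - p)**tails

# Candidate values 0.001, 0.002, ..., 0.999 (grid resolution is an arbitrary choice).
p_grid = np.linspace(0.001, 0.999, 999)
best_p = p_grid[np.argmax(likelihood(p_grid))]

print(f"Probability view: P(H, T, H | p = 0.5) = {likelihood(0.5):.3f}")
print(f"Likelihood view:  best p on the grid   = {best_p:.3f}")
```

The same function answers both questions: fix $p$ and ask about the data (probability), or fix the data and search over $p$ (likelihood).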
Generalizing to ML
Back to spam classification.
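- Data: emails with true labels $(x_i, y_i)$, for $i = 1, \dots, N$
- Parameters: all model matrices (NN weights), written $\theta$
- Goal: find the $\theta$ that makes the model’s output probability match the true labels
Mathematically, treating the examples as independent:
$$\theta^* = \arg\max_\theta \prod_{i=1}^{N} P(y_i \mid x_i; \theta)$$
Meaning: find the $\theta$ that maximizes the product of the probabilities the model assigns to every observed label.
But products have a problem: multiplying many numbers between 0 and 1 gives a result so tiny that floating-point arithmetic underflows to zero.
So we take a log. The log is monotonically increasing, so the maximizing $\theta$ does not change:
$$\theta^* = \arg\max_\theta \sum_{i=1}^{N} \log P(y_i \mid x_i; \theta)$$
Products become sums. Then we usually like “minimize” instead of “maximize,” so add a negative:
$$\theta^* = \arg\min_\theta \left( -\sum_{i=1}^{N} \log P(y_i \mid x_i; \theta) \right)$$
This is ML’s famous loss: Cross Entropy.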
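To see why the log matters in practice, here is a minimal sketch of the precision problem. It assumes a hypothetical model that assigns probability 0.5 to the true label of each of 2,000 examples; the numbers are illustrative only.

```python
import numpy as np

# Hypothetical setup: 2,000 examples, each true label given probability 0.5.
probs = np.full(2000, 0.5)

# Raw likelihood: 0.5**2000 is far below the smallest float64, so it underflows.
likelihood = np.prod(probs)

# Log-likelihood: a sum of 2,000 moderate numbers, which stays finite.
log_likelihood = np.sum(np.log(probs))
negative_log_likelihood = -log_likelihood

print(f"product of probabilities: {likelihood}")
print(f"negative log likelihood:  {negative_log_likelihood:.2f}")
```

The product carries no usable information once it hits zero, while the summed logs still distinguish a better model from a worse one.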
“Cross Entropy” = “Negative Log Likelihood”
Yes—they’re the same thing, just different names.
- Probability view: maximize log likelihood
- ML engineering view: minimize cross entropy
The negative sign turns the max into a min, but the objective is identical.
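Standard binary cross-entropy, as computed in PyTorch by `torch.nn.BCELoss` with its default mean reduction:
$$\mathcal{L} = -\frac{1}{N} \sum_{i=1}^{N} \left[ y_i \log \hat{y}_i + (1 - y_i) \log(1 - \hat{y}_i) \right]$$
Where $\hat{y}_i$ is the model’s predicted probability and $y_i$ is the true label (0 or 1).
Translation:
- When truth is spam ($y = 1$), the closer $\hat{y}$ is to 1, the closer $-\log \hat{y}$ is to 0 (small loss)
- When truth is spam, $\hat{y}$ near 0 means $-\log \hat{y} \to \infty$ (huge loss explosion)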
This penalizes “confidently wrong”.
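A quick way to feel this penalty is to tabulate the per-example loss for a spam email ($y = 1$) as the prediction gets worse. The predicted values below are arbitrary sample points chosen for this sketch.

```python
import numpy as np

# True label is spam (y = 1), so the per-example loss reduces to -log(y_hat).
# The predictions are arbitrary sample points, ordered from confident-right
# to confident-wrong.
y_hat = np.array([0.99, 0.9, 0.5, 0.1, 0.01, 0.001])
loss = -np.log(y_hat)

for p, l in zip(y_hat, loss):
    print(f"y_hat = {p:<6} loss = {l:.3f}")
```

A coin-flip prediction of 0.5 costs about 0.693, while a confident miss at 0.001 costs about 6.908, roughly ten times as much.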
Why LLMs Are the Same Math
Back to ChatGPT. How is it trained?
ChatGPT is predicting the next token. Given prior text, it outputs a probability distribution over the vocabulary:
Input: "Today the weather is really"
Model output: { "good": 0.45, "nice": 0.22, "bad": 0.10, "cold": 0.08, ... }
When training:
- Data: massive text, where each position has a “real next token”
- Loss: cross entropy; the more probability the model puts on the real next token, the better
This one formula trains all large models:
Loss = -log P(real next word | prior text, model params)
Sum this over hundreds of billions of tokens (GPT-3 was trained on roughly 300 billion) and you have GPT-3’s training loss.
LLM pretraining is essentially maximum likelihood estimation. “It’s predicting the next word,” “it’s minimizing cross entropy,” and “it’s doing MLE” are three ways to say the same thing.
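The per-position loss can be sketched with the weather example. The dictionary below copies only the four probabilities shown earlier and leaves the rest of the vocabulary out; `token_loss` is a hypothetical helper name used for this illustration.

```python
import math

# Probabilities copied from the example; the rest of the vocabulary is omitted.
next_token_probs = {"good": 0.45, "nice": 0.22, "bad": 0.10, "cold": 0.08}

def token_loss(probs, real_token):
    """Cross entropy at one position: -log P(real next token | prior text)."""
    return -math.log(probs[real_token])

for token in next_token_probs:
    loss = token_loss(next_token_probs, token)
    print(f"real next token = {token!r}: loss = {loss:.3f}")
```

If the real text continues with "good", the loss is small because the model already favored that token. If it continues with "cold", the loss is larger, and training pushes probability toward "cold" in that context.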
A Glimpse of Bayesian
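MLE’s “rival” is the Bayesian approach, which considers not only the data $D$ but also “priors”:
$$P(\theta \mid D) \propto P(D \mid \theta)\, P(\theta)$$
Meaning:
- $P(D \mid \theta)$: likelihood (how reasonable the data is)
- $P(\theta)$: prior (your belief about parameters themselves)
- $P(\theta \mid D)$: posterior (updated belief after seeing data)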
This is the other big school of statistics. Modern deep learning mostly uses MLE, but Bayesian thinking still matters in:
- Model uncertainty (“how confident is it”)
- Small data scenarios
- Specialized research (Bayesian NNs, variational inference)
L2 covers Bayesian methods specifically.
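To make the prior concrete, here is a sketch that revisits the three coin flips (2 heads, 1 tail). It assumes a Beta(5, 5) prior, which encodes a belief that the coin is probably close to fair. That prior is an illustrative choice, not something the data dictates.

```python
heads, tails = 2, 1

# MLE ignores any prior belief: just the observed fraction of heads.
p_mle = heads / (heads + tails)

# Assumed prior: Beta(a, b) with a = b = 5 (coin is probably near fair).
a, b = 5, 5

# A Beta prior combined with coin-flip data gives a Beta(a + heads, b + tails) posterior.
post_a, post_b = a + heads, b + tails

# Posterior mode (the MAP estimate) and posterior mean.
p_map = (post_a - 1) / (post_a + post_b - 2)
p_mean = post_a / (post_a + post_b)

print(f"MLE estimate:   {p_mle:.3f}")
print(f"MAP estimate:   {p_map:.3f}")
print(f"Posterior mean: {p_mean:.3f}")
```

With only three flips, the prior pulls the estimate from about 0.667 back toward 0.5 (the MAP estimate is about 0.545). With thousands of flips the data would dominate and the two estimates would nearly coincide, which is one reason the Bayesian view matters most in small-data settings.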
Let’s See It in Python
import numpy as np
# Spam prediction example
# y is true label (0/1), y_hat is predicted probability (0-1)
y = np.array([1, 0, 1, 1, 0])
y_hat = np.array([0.9, 0.2, 0.8, 0.65, 0.4])
# Binary cross entropy
ce = -np.mean(y * np.log(y_hat) + (1-y) * np.log(1 - y_hat))
print(f"Cross entropy loss: {ce:.4f}")
# Suppose model improved
y_hat_better = np.array([0.95, 0.05, 0.95, 0.85, 0.1])
ce_better = -np.mean(y * np.log(y_hat_better) + (1-y) * np.log(1 - y_hat_better))
print(f"Better model: {ce_better:.4f}") # Smaller
You’ll see: more accurate, more confident predictions → smaller cross entropy. That’s why ML training targets it.
KL Divergence (a glimpse)
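A related concept: KL divergence. It measures “how different two probability distributions are”:
$$D_{KL}(P \,\|\, Q) = \sum_x P(x) \log \frac{P(x)}{Q(x)}$$
Key properties:
- $D_{KL}(P \,\|\, P) = 0$ (no difference with self)
- always $D_{KL}(P \,\|\, Q) \ge 0$
- Asymmetric: $D_{KL}(P \,\|\, Q) \ne D_{KL}(Q \,\|\, P)$ in general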
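These three properties are easy to check numerically. The sketch below uses two made-up discrete distributions and a small helper, `kl_divergence`, written for this illustration; it assumes every entry is strictly positive.

```python
import numpy as np

def kl_divergence(p, q):
    """D_KL(P || Q) for discrete distributions with strictly positive entries."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return float(np.sum(p * np.log(p / q)))

# Two made-up distributions over three outcomes.
p = [0.7, 0.2, 0.1]
q = [0.4, 0.4, 0.2]

print(f"KL(P || P) = {kl_divergence(p, p):.4f}")  # zero with itself
print(f"KL(P || Q) = {kl_divergence(p, q):.4f}")  # non-negative
print(f"KL(Q || P) = {kl_divergence(q, p):.4f}")  # differs from KL(P || Q)
```

The last two values differ, which is why KL divergence is not a true distance: the order of the arguments matters.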
KL appears everywhere in ML:
- VAE (Variational Autoencoder) has it in its loss
- Reinforcement learning PPO uses it for “policy constraints”
- Knowledge distillation uses it to make small models learn big models’ distributions
Trivia: Cross entropy = entropy + KL divergence, that is, $H(P, Q) = H(P) + D_{KL}(P \,\|\, Q)$. During training the entropy term $H(P)$ is constant (the labels are fixed), so minimizing cross entropy is the same as minimizing KL divergence: it pulls the model’s predicted distribution toward the true distribution.
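The decomposition can be verified in a few lines. The two distributions below are made up for the check; any pair with strictly positive entries would work.

```python
import numpy as np

# Made-up "true" distribution P and model distribution Q over three outcomes.
p = np.array([0.7, 0.2, 0.1])
q = np.array([0.4, 0.4, 0.2])

entropy = -np.sum(p * np.log(p))          # H(P)
cross_entropy = -np.sum(p * np.log(q))    # H(P, Q)
kl = np.sum(p * np.log(p / q))            # D_KL(P || Q)

print(f"H(P) + KL = {entropy + kl:.4f}")
print(f"H(P, Q)   = {cross_entropy:.4f}")

# With a one-hot label (all mass on outcome 0), H(P) is 0, so the cross
# entropy and the KL divergence both reduce to -log of the model's
# probability for the true outcome.
one_hot_cross_entropy = -np.log(q[0])
print(f"One-hot cross entropy = {one_hot_cross_entropy:.4f}")
```

Since $H(P)$ does not depend on the model, changing $Q$ moves the cross entropy and the KL divergence by exactly the same amount.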
One-line Summary
Machine learning = assume a probability model + find parameters that best explain the data.
Cross entropy / Maximum likelihood / KL divergence—all the same thing from different angles.
When you meet a new loss function, ask which two probability distributions it is aligning. For many losses that is exactly what is going on, just in different ways.
Want More
👀 LLM sampling visualization — see the model’s output probability distribution, play with temperature / top-k / top-p.
Reaching this point, you’ve mastered:
- Linear algebra (the form of data flow)
- Calculus (the mechanism of learning)
- Probability (the essence of loss)
L1 has 4 more articles on information theory, matrix multiplication’s engineering meaning, chain rule’s detailed derivation, etc. Finishing all gives you the ability to read 90% of formulas in AI papers.
Recommended next: L2-09 Optimizer Deep Dive — you’ve understood gradient descent; next is why Adam is so strong.