L1 Chapter 4 🐣 🕒 9 min

Probability and Maximum Likelihood: Why ML Is a Probability Problem

All ML models are actually doing one thing: finding parameters that maximize the probability of seeing the data. Once this clicks, all of ML becomes transparent.

HelloAI Editors

6/13/2026

By now you know:

AI models are essentially matrix transformations (L1-02)
Training = adjusting matrix entries to minimize loss (L1-03)

But there’s one unanswered question:

How exactly is “loss” defined? Why are there so many different loss functions?

The answer lies in probability theory. This article changes the lens—viewing all ML as a probability problem—and you’ll see all “loss functions” share one mathematical structure.

Re-examining ML’s Goal

Concrete example. Task: spam classification.

You show the model an email, and it outputs “spam” or “not spam.”

Naive description: “Make the model output the right label.”

A more accurate description:

Make the model output the probability of “this is spam.”

The model outputs a number between 0 and 1: 0.92 means “92% likely spam.”

This has huge benefits:

You see the model’s confidence
You can set thresholds (>0.5 marks as spam)
You can compare different models’ outputs (which is more confident)
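As a small illustration of the threshold idea, here is a minimal sketch. The probabilities are made-up values, and the stricter 0.9 threshold is an assumption added for contrast, not something from the article.

```python
import numpy as np

# Hypothetical predicted spam probabilities for five emails (made-up values).
spam_prob = np.array([0.92, 0.51, 0.07, 0.65, 0.30])

# Default rule from the text: anything above 0.5 is marked as spam.
flag_default = spam_prob > 0.5

# A stricter rule (assumed for illustration), e.g. when blocking a real email is costly.
flag_strict = spam_prob > 0.9

print("threshold 0.5:", flag_default.tolist())
print("threshold 0.9:", flag_strict.tolist())
```

A model that only output a hard label would not allow this kind of adjustment after training.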

Likelihood vs Probability

These two sound alike but mean very different things.

Probability

Given parameters, what’s the chance of seeing the data?

Example: given a fair coin (p=0.5), what’s the chance of getting “heads, tails, heads” in 3 flips?

$$P(\text{HTH} \mid p=0.5) = 0.5 \times 0.5 \times 0.5 = 0.125$$

Likelihood

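Given the data, how well does each candidate parameter value explain it? The formula is the same as for probability, but now the data is fixed and the parameter is what varies.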

Example: I flipped 3 times and got “heads, tails, heads”—what’s the coin’s $p$ ?

We try different values of $p$ , see which makes the data most “reasonable”:

| $p$ | $P(\text{HTH} \mid p) = p \times (1-p) \times p$ |
|---|---|
| 0.1 | $0.1 \times 0.9 \times 0.1 = 0.009$ |
| 0.3 | $0.3 \times 0.7 \times 0.3 = 0.063$ |
| 0.5 | $0.5 \times 0.5 \times 0.5 = 0.125$ |
| 0.667 | $0.667 \times 0.333 \times 0.667 = \mathbf{0.148}$ |
| 0.8 | $0.8 \times 0.2 \times 0.8 = 0.128$ |
| 0.9 | $0.9 \times 0.1 \times 0.9 = 0.081$ |

Among these candidates, $p = 0.667$ gives the observed data the highest probability. Picking the parameter this way is maximum likelihood estimation.

Intuition matches—3 flips, 2 heads, 1 tail, best estimate is $p \approx 2/3$ .
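The "try different values of $p$" search can be automated. The following is a minimal sketch using a brute-force grid, which is an assumption chosen for readability rather than how MLE is usually computed in practice.

```python
import numpy as np

# Candidate values of p on a fine grid (endpoints excluded to stay inside (0, 1)).
p_grid = np.linspace(0.001, 0.999, 999)

# Likelihood of the observed sequence H, T, H for each candidate p.
likelihood = p_grid * (1 - p_grid) * p_grid

# The maximum likelihood estimate is the grid point with the largest likelihood.
p_mle = p_grid[np.argmax(likelihood)]
print(f"MLE on the grid: p = {p_mle:.3f}")
print(f"Likelihood there: {likelihood.max():.4f}")
```

Calculus gives the same answer without a grid: setting the derivative of $2\log p + \log(1-p)$ to zero gives $2/p = 1/(1-p)$, so $p = 2/3$.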

💡 One-line summary

Maximum Likelihood Estimation (MLE) = pick parameters that make your observed data most likely.

Generalizing to ML

Back to spam classification.

Data: emails with true labels ( $x_i, y_i$ )
Parameters: all model matrices $\theta$ (NN weights)
Goal: find $\theta$ that makes “model’s output probability” match “true label”

Mathematically:

$$\theta^* = \arg\max_\theta \prod_i P(y_i | x_i, \theta)$$

Meaning: find $\theta$ that maximizes the product of all data’s “explained probabilities”.

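But products have a problem: multiplying many numbers between 0 and 1 gives a result so tiny that floating-point arithmetic underflows to zero.

So we take a log. Because log is monotonically increasing, the $\theta$ that maximizes the product also maximizes the log of the product: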

$$\theta^* = \arg\max_\theta \sum_i \log P(y_i | x_i, \theta)$$

Products become sums. Then we usually like “minimize” instead of “maximize,” so add a negative:

$$\theta^* = \arg\min_\theta -\sum_i \log P(y_i | x_i, \theta)$$

This $-\sum \log P$ is ML’s famous loss—Cross Entropy.
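To see why the log matters numerically, here is a minimal sketch. The 2000 examples and the value 0.5 are made-up numbers, chosen only so the product falls below what a 64-bit float can represent.

```python
import numpy as np

# Hypothetical per-example probabilities P(y_i | x_i, theta): 2000 examples,
# each explained with probability 0.5 (made-up values for illustration).
probs = np.full(2000, 0.5)

# The raw product is 0.5**2000, far below the smallest positive float64,
# so it underflows to exactly 0.0.
product = np.prod(probs)

# The sum of logs is an ordinary-sized number and carries the same information.
log_likelihood = np.sum(np.log(probs))

# Negative log likelihood: the quantity that gets minimized.
nll = -log_likelihood

print("product of probabilities:", product)
print("log likelihood:", log_likelihood)
print("negative log likelihood:", nll)
```

Once the product has collapsed to zero, every $\theta$ looks equally bad, while the log form still distinguishes them.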

“Cross Entropy” = “Negative Log Likelihood”

Yes—they’re the same thing, just different names.

Probability view: maximize log likelihood
ML engineering view: minimize cross entropy

The minus sign turns the max into a min, but the objective is identical.

Standard binary cross-entropy (PyTorch’s BCELoss computes this, averaged over examples by default rather than summed):

$$L = -\sum_i [y_i \log \hat{y}_i + (1-y_i) \log(1-\hat{y}_i)]$$

Where $\hat{y}_i$ is the model’s predicted probability that example $i$ is positive, and $y_i$ is the true label (0 or 1).

Translation:

When truth is spam ( $y=1$ ), the loss is $-\log \hat{y}$, so the closer $\hat{y}$ is to 1, the closer the loss is to 0
When truth is spam but $\hat{y}$ is near 0, $\log \hat{y} \to -\infty$, so the loss grows without bound

This penalizes “confidently wrong”.
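The sketch below puts numbers on this penalty and checks that the summed binary cross-entropy equals the negative log of the likelihood product. The labels and predictions are made-up values, and `p_true` is a helper name introduced only for this illustration.

```python
import numpy as np

# Per-example loss when the true label is spam (y = 1): loss = -log(y_hat).
for y_hat in [0.99, 0.9, 0.5, 0.1, 0.01]:
    print(f"y_hat = {y_hat:<4}  loss = {-np.log(y_hat):.3f}")

# Check that the BCE formula equals the negative log of the Bernoulli likelihood.
y = np.array([1, 0, 1])              # made-up labels
y_pred = np.array([0.8, 0.3, 0.6])   # made-up predicted probabilities

# Probability the model assigns to each true label.
p_true = np.where(y == 1, y_pred, 1 - y_pred)

nll = -np.log(np.prod(p_true))
bce_sum = -np.sum(y * np.log(y_pred) + (1 - y) * np.log(1 - y_pred))

print(f"negative log likelihood: {nll:.6f}")
print(f"summed BCE:              {bce_sum:.6f}")
```

The loss barely moves between 0.99 and 0.9, but it climbs steeply as the prediction for a true spam email approaches 0.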

Why LLMs Are the Same Math

Back to ChatGPT. How is it trained?

ChatGPT is predicting the next token. Given prior text, it outputs a probability distribution over the vocabulary:

Input: "Today the weather is really"
Model output: { "good": 0.45, "nice": 0.22, "bad": 0.10, "cold": 0.08, ... }

When training:

Data: massive text, each position has “real next token”
Loss: cross entropy, where the higher the probability on the “real word”, the lower the loss

This one formula trains all large models:

Loss = -log P(real next word | prior text, model params)

Sum this over every token in the training text (roughly 300 billion tokens for GPT-3) and you have the pretraining loss.
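For a single position, the loss can be sketched with the example distribution above. The "..." in that example hides the rest of the vocabulary. Here it is lumped into one hypothetical `<other>` bucket so the probabilities sum to 1, and `next_token_loss` is a name made up for this illustration.

```python
import math

# Predicted distribution from the example above. The "<other>" bucket is an
# assumption standing in for the rest of the vocabulary.
next_token_probs = {"good": 0.45, "nice": 0.22, "bad": 0.10, "cold": 0.08, "<other>": 0.15}

def next_token_loss(probs, real_token):
    """Cross entropy for one position: -log P(real next token)."""
    return -math.log(probs[real_token])

for token in ["good", "nice", "cold"]:
    loss = next_token_loss(next_token_probs, token)
    print(f"real next token = {token!r}: loss = {loss:.3f}")
```

If the real next word is one the model already favors, the loss is small. If it is a word the model considered unlikely, the loss is large, and training nudges the parameters to raise that word's probability.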

🔬 An aha moment

LLM pretraining is essentially maximum likelihood estimation. “It’s predicting the next word” + “It’s minimizing cross entropy” + “It’s doing MLE”: these are three ways to say the same thing.

A Glimpse of Bayesian

MLE’s “rival” is the Bayesian approach, which considers not only the data but also “priors.”

$$P(\theta | \text{data}) \propto P(\text{data} | \theta) \cdot P(\theta)$$

Meaning:

$P(\text{data} | \theta)$ : likelihood (how reasonable the data is)
$P(\theta)$ : prior (your belief about parameters themselves)
$P(\theta | \text{data})$ : posterior (updated belief after seeing data)
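Applied to the earlier coin flips, the three pieces can be sketched on a grid. The Beta(2, 2) prior (a mild belief that the coin is near fair) is an assumption chosen for illustration, and the grid approximation is used only to keep the code short.

```python
import numpy as np

# Grid of candidate coin biases p.
p_grid = np.linspace(0.0, 1.0, 1001)

# Likelihood of the earlier data "heads, tails, heads".
likelihood = p_grid**2 * (1 - p_grid)

# Assumed prior for illustration: Beta(2, 2), proportional to p * (1 - p).
prior = p_grid * (1 - p_grid)

# Posterior is proportional to likelihood times prior; normalize over the grid.
posterior = likelihood * prior
posterior /= posterior.sum()

p_mle = p_grid[np.argmax(likelihood)]
p_map = p_grid[np.argmax(posterior)]
print(f"MLE estimate: {p_mle:.3f}")
print(f"Posterior mode with the Beta(2, 2) prior: {p_map:.3f}")
```

With only three flips, the prior pulls the estimate from about 2/3 toward 0.5. With more data, the likelihood would dominate and the two estimates would move closer together.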

This is the other big school of statistics. Modern deep learning mostly uses MLE, but Bayesian thinking still matters in:

Model uncertainty (“how confident is it”)
Small data scenarios
Specialized research (Bayesian NNs, variational inference)

L2 covers Bayesian methods specifically.

Let’s See It in Python

import numpy as np

# Spam prediction example
# y is true label (0/1), y_hat is predicted probability (0-1)

y      = np.array([1, 0, 1, 1, 0])
y_hat  = np.array([0.9, 0.2, 0.8, 0.65, 0.4])

# Binary cross entropy
ce = -np.mean(y * np.log(y_hat) + (1-y) * np.log(1 - y_hat))
print(f"Cross entropy loss: {ce:.4f}")

# Suppose model improved
y_hat_better = np.array([0.95, 0.05, 0.95, 0.85, 0.1])
ce_better = -np.mean(y * np.log(y_hat_better) + (1-y) * np.log(1 - y_hat_better))
print(f"Better model: {ce_better:.4f}")  # Smaller

You’ll see: more accurate, more confident predictions → smaller cross entropy. That’s why ML training targets it.

KL Divergence (a glimpse)

A related concept is KL divergence, which measures “how different two probability distributions are”:

$$KL(P \| Q) = \sum_i P(i) \log \frac{P(i)}{Q(i)}$$

Key properties:

$KL(P \| P) = 0$ (no difference with self)
$KL \ge 0$ always, with equality only when the two distributions are identical
Asymmetric: in general $KL(P \| Q) \ne KL(Q \| P)$
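These properties can be checked numerically. The sketch below uses two made-up distributions and a small helper, `kl_divergence`, written for this illustration. It assumes every probability is strictly positive, so it does not handle zeros.

```python
import numpy as np

def kl_divergence(p, q):
    """KL(P || Q) for discrete distributions with strictly positive entries."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return float(np.sum(p * np.log(p / q)))

# Two made-up distributions over three outcomes.
P = [0.7, 0.2, 0.1]
Q = [0.5, 0.3, 0.2]

print(f"KL(P || P) = {kl_divergence(P, P):.4f}")
print(f"KL(P || Q) = {kl_divergence(P, Q):.4f}")
print(f"KL(Q || P) = {kl_divergence(Q, P):.4f}")
```

The last two printed values differ, which is why the order of the arguments matters when KL appears in a loss.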

KL appears everywhere in ML:

VAE (Variational Autoencoder) has it in its loss
Reinforcement learning PPO uses it for “policy constraints”
Knowledge distillation uses it to make small models learn big models’ distributions

Trivia: Cross entropy = entropy + KL divergence, i.e. $H(P, Q) = H(P) + KL(P \| Q)$. During training, the entropy of the labels is constant (the labels are fixed), so minimizing cross entropy is the same as minimizing KL divergence: making the model’s predicted distribution close to the true distribution.
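The identity can be verified directly. This is a minimal sketch with made-up distributions, where P plays the role of the "true" distribution and Q the model's prediction.

```python
import numpy as np

# Made-up "true" distribution P and model distribution Q over three classes.
P = np.array([0.7, 0.2, 0.1])
Q = np.array([0.5, 0.3, 0.2])

entropy = -np.sum(P * np.log(P))          # H(P)
cross_entropy = -np.sum(P * np.log(Q))    # H(P, Q)
kl = np.sum(P * np.log(P / Q))            # KL(P || Q)

print(f"H(P) + KL(P || Q) = {entropy + kl:.6f}")
print(f"H(P, Q)           = {cross_entropy:.6f}")
```

When the label is one-hot (all probability on the correct class), the entropy term is zero, so cross entropy and KL divergence are the same number.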

One-line Summary

Machine learning = assume a probability model + find parameters that best explain the data.

Cross entropy / Maximum likelihood / KL divergence—all the same thing from different angles.

Most loss functions you will meet are, at heart, “aligning two probability distributions”, just in different ways.

Want More

👀 LLM sampling visualization — see the model’s output probability distribution, play with temperature / top-k / top-p.

💡 L1 Math Block Recap

Reaching this point, you’ve mastered:

Linear algebra (the form of data flow)
Calculus (the mechanism of learning)
Probability (the essence of loss)

L1 has 4 more articles on information theory, matrix multiplication’s engineering meaning, chain rule’s detailed derivation, etc. Finishing all gives you the ability to read 90% of formulas in AI papers.

Recommended next: L2-09 Optimizer Deep Dive — you’ve understood gradient descent; next is why Adam is so strong.