L1 Chapter 3 🐣 🕒 13 min

Derivatives and Gradients: The Math Definition of "Learning"

The essence of "training a neural network" is derivatives. Once you understand this word, AI training stops being mysterious.

HelloAI Editors

6/12/2026

Back to the question at the end of L1-02: where do the numbers in matrix $W$ come from?

Answer: training.

More specifically: adjusted via “gradient descent”.

This article makes “gradient” crystal clear. By the end you’ll see that it’s just the math word for “which way is uphill,” so its opposite tells you “which way is downhill.”

Station 1: What Is a Derivative

Let me use an example. You’re driving and the speedometer shows 60 km/h.

What is “60”? It’s the derivative of distance with respect to time:

“If I hold this speed for another second, I’ll travel 60/3600 km, about 16.7 m.”

A derivative is a rate of change. It answers:

“If the input changes a tiny bit, how much does the output change?”

Math notation:

f'(x) = \frac{df}{dx} = \lim_{\Delta x \to 0} \frac{f(x + \Delta x) - f(x)}{\Delta x}

Don’t be intimidated by this formula. It says: when $x$ changes by $\Delta x$, how much does $f$ change, divided by $\Delta x$? That is just a ratio, taken as $\Delta x$ shrinks toward 0.

Some Intuitions

Function $f(x) = x^2$ , at $x = 3$ , derivative = 6 (because $f' = 2x$ ). Meaning: when $x$ increases by a tiny bit, $f$ increases by about 6 times that.
Function $f(x) = 5$ (constant), derivative = 0. Meaning: no matter how $x$ moves, $f$ stays the same.
Function $f(x) = 3x + 7$ , derivative = 3. Meaning: $x$ increases by 1, $f$ increases by 3.
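The three intuitions above can be checked numerically. Here is a minimal sketch that applies the limit formula with a small but finite step; the helper name `numerical_derivative` and the step size `dx=1e-6` are assumptions chosen for illustration, not part of any library.

```python
def numerical_derivative(f, x, dx=1e-6):
    """Approximate f'(x) as (f(x + dx) - f(x)) / dx with a small, finite dx."""
    return (f(x + dx) - f(x)) / dx

def square(x):
    return x ** 2        # f(x) = x^2, analytic derivative 2x

def constant(x):
    return 5.0           # f(x) = 5, analytic derivative 0

def line(x):
    return 3 * x + 7     # f(x) = 3x + 7, analytic derivative 3

d_square = numerical_derivative(square, 3.0)
d_constant = numerical_derivative(constant, 3.0)
d_line = numerical_derivative(line, 3.0)

print(d_square, d_constant, d_line)
```

The printed values should land very close to 6, 0, and 3. They are approximations, because `dx` is small rather than truly going to 0.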

Geometric Meaning

Derivative = the slope of the function graph’s tangent line at that point

Tangent pointing up → positive derivative (value increasing)
Tangent pointing down → negative derivative (value decreasing)
Horizontal tangent → zero derivative (a candidate for a local max or min)

💡 A key intuition

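Where the derivative is 0, the function is flat, and that is where a smooth function’s minimum or maximum has to sit. So finding a minimum comes down to finding a point where the derivative is 0, then checking that it really is a low point and not a peak or a flat shoulder (like $x^3$ at $x = 0$). This is the core of the upcoming “gradient descent.”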

Station 2: What Is a Gradient

If a function has one input (like $f(x)$ ), its “derivative” is one number.

But functions in AI have millions or billions of inputs. E.g., a neural network’s loss function:

L(W) = L(w_1, w_2, w_3, \ldots, w_{175000000000})

Here every one of GPT-3’s 175B parameters counts as an input.

How do we define “derivative” for such functions?

Answer: take the derivative with respect to each parameter separately, holding all the others fixed (a partial derivative). The collection of all these partial derivatives is called the gradient.

\nabla L = \left[ \frac{\partial L}{\partial w_1}, \frac{\partial L}{\partial w_2}, \ldots, \frac{\partial L}{\partial w_n} \right]

Each component tells you: “if I only tune this one parameter, how does the loss change?”

The gradient itself is a vector.
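To make “one partial derivative per parameter” concrete, here is a small sketch that nudges each parameter in turn and records how the loss responds. The function `numerical_gradient` and the two-parameter `toy_loss` are hypothetical, made up for this illustration.

```python
def numerical_gradient(loss, w, dw=1e-6):
    """Approximate the gradient by nudging one parameter at a time."""
    base = loss(w)
    grad = []
    for i in range(len(w)):
        nudged = list(w)
        nudged[i] += dw                          # change only parameter i
        grad.append((loss(nudged) - base) / dw)  # partial derivative wrt w[i]
    return grad

def toy_loss(w):
    # Hypothetical loss: L(w1, w2) = w1^2 + 3 * w2^2
    return w[0] ** 2 + 3 * w[1] ** 2

grad = numerical_gradient(toy_loss, [1.0, 2.0])
print(grad)  # analytic gradient is [2*w1, 6*w2], i.e. [2, 12] at this point
```

The result is a list with one entry per parameter, which is exactly the vector described above. Real frameworks do not compute gradients this way (it needs one extra loss evaluation per parameter), but it is a handy sanity check.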

Graphical Intuition

Imagine standing in mountainous terrain. Your altitude = loss value. Your position (latitude/longitude) = model parameters.

Gradient = “the steepest uphill direction”.

To find the lowest point (minimum loss), walk in the opposite direction of the gradient—this is “gradient descent.”

Each step:

W_{new} = W_{old} - \alpha \cdot \nabla L(W_{old})

$\alpha$ is the step size (also called the “learning rate”)
$\nabla L$ is the gradient at the current position
The negative sign means “opposite direction,” i.e. downhill

This is the core algorithm of almost all AI training.
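The update rule can be sketched on a tiny two-parameter “terrain.” The bowl-shaped loss, the starting point, and the learning rate below are all assumptions picked so the lowest point is easy to verify by hand: it sits at $(1, -2)$.

```python
def grad_bowl(w):
    # Gradient of the hypothetical bowl L(w1, w2) = (w1 - 1)^2 + (w2 + 2)^2
    return [2 * (w[0] - 1), 2 * (w[1] + 2)]

w = [5.0, 5.0]   # arbitrary starting position on the map
alpha = 0.1      # learning rate (step size)

for _ in range(100):
    g = grad_bowl(w)
    # W_new = W_old - alpha * gradient, applied to each coordinate
    w = [wi - alpha * gi for wi, gi in zip(w, g)]

print(w)
```

After 100 steps `w` should be very close to `[1, -2]`, the bottom of the bowl. Each step shrinks the remaining distance by a constant factor, so the steps get smaller as the slope flattens.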

Station 3: Chain Rule

But a neural network isn’t as simple as a single $f(x)$. It’s nested layer by layer:

y = f_3(f_2(f_1(x)))

Input $x$ passes through layer $f_1$ , then $f_2$ , then $f_3$ —that’s a “3-layer neural network.”

Now the question: how do we take a derivative of this nested function?

Answer: the chain rule. It is often taught in high school, but most people don’t realize how important it is.

\frac{dy}{dx} = \frac{dy}{df_2} \cdot \frac{df_2}{df_1} \cdot \frac{df_1}{dx}

Translation: the input’s influence on the final output = the product of each layer’s own local derivative, each evaluated at the value that layer actually receives.
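Here is a small numeric check of the chain rule. The three “layers” `f1`, `f2`, `f3` are hypothetical functions chosen only because their derivatives are easy to write down; the product of local derivatives is compared against a finite-difference estimate of the whole nested function.

```python
import math

# Three hypothetical "layers" with known derivatives
def f1(x):
    return 2 * x            # derivative: 2

def f2(u):
    return u ** 2           # derivative: 2u

def f3(v):
    return math.sin(v)      # derivative: cos(v)

x = 0.5
u = f1(x)    # output of layer 1
v = f2(u)    # output of layer 2
y = f3(v)    # final output

# Chain rule: dy/dx = dy/df2 * df2/df1 * df1/dx
chain = math.cos(v) * (2 * u) * 2

# Finite-difference estimate on the full nested function, for comparison
dx = 1e-6
numeric = (f3(f2(f1(x + dx))) - y) / dx

print(chain, numeric)
```

The two numbers should agree to several decimal places. Note that each local derivative is evaluated at the intermediate value (`u`, `v`) produced by the forward computation, not at `x` itself.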

Backpropagation

Computing these derivatives starting from the final layer and working backward is called backpropagation.

It’s the core algorithm of neural network training, popularized in 1986 by Rumelhart, Hinton, and Williams (see L0-02 AI history).

Key insight: chain rule lets us avoid “directly computing” the whole network’s derivative—we can compute layer by layer, mechanically, cheaply.

This is a big part of why 100-layer or 1000-layer networks can still be trained: the chain rule keeps the gradient computation tractable.
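A minimal sketch of the backward pass, assuming a made-up scalar network $y = w_3 \tanh(w_2 \tanh(w_1 x))$. The `tanh` activations, the weight values, and the function names are assumptions for illustration. The point is that each backward step reuses the derivative already computed for the layer after it.

```python
import math

def forward(x, w1, w2, w3):
    """Forward pass; return intermediate values needed by the backward pass."""
    h1 = math.tanh(w1 * x)
    h2 = math.tanh(w2 * h1)
    y = w3 * h2
    return h1, h2, y

def backward(x, w1, w2, w3, h1, h2):
    """Walk from the output back toward the input, reusing upstream results."""
    dy_dw3 = h2                       # last layer: y = w3 * h2
    dy_dh2 = w3
    dh2_dz2 = 1 - h2 ** 2             # derivative of tanh at layer 2
    dy_dw2 = dy_dh2 * dh2_dz2 * h1
    dy_dh1 = dy_dh2 * dh2_dz2 * w2    # signal passed further back
    dh1_dz1 = 1 - h1 ** 2             # derivative of tanh at layer 1
    dy_dw1 = dy_dh1 * dh1_dz1 * x
    return dy_dw1, dy_dw2, dy_dw3

x, w1, w2, w3 = 0.7, 0.5, -1.2, 2.0
h1, h2, y = forward(x, w1, w2, w3)
g1, g2, g3 = backward(x, w1, w2, w3, h1, h2)

# Finite-difference check for the deepest weight, w1
eps = 1e-6
numeric_g1 = (forward(x, w1 + eps, w2, w3)[2] - y) / eps

print(g1, numeric_g1)
```

One forward pass plus one backward pass yields the derivative for every weight. Estimating them one by one with finite differences would need a separate forward pass per weight.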

A Complete Example

Let me walk through the simplest example.

Task: given data points $(1, 3), (2, 5), (3, 7)$ , find a line $y = wx$ to fit them.

We’re looking for parameter $w$ .

Intuitive answer: $w \approx 2.4$. The data lie exactly on $y = 2x + 1$, but our simplified model has no intercept, so the best $w$ ends up a bit above 2.

How the machine learns:

1. Define Loss Function

L(w) = \sum_i (w \cdot x_i - y_i)^2

Meaning: predict with our $w$, then sum the squared differences from the actual values.

Plug in data:

L(w) = (w \cdot 1 - 3)^2 + (w \cdot 2 - 5)^2 + (w \cdot 3 - 7)^2

2. Take Derivative

\frac{dL}{dw} = 2(w-3) + 2 \cdot 2(2w-5) + 2 \cdot 3(3w-7) = 2w - 6 + 8w - 20 + 18w - 42 = 28w - 68
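As a sanity check on the algebra, the formula $28w - 68$ can be compared with a numerical derivative of the loss. This is a sketch; the helper names and the test values of $w$ are arbitrary choices.

```python
xs = [1, 2, 3]
ys = [3, 5, 7]

def loss(w):
    # Sum of squared errors for the line y = w * x
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys))

def analytic_grad(w):
    return 28 * w - 68          # result of the derivation above

def numeric_grad(w, dw=1e-6):
    # Central difference: symmetric nudge around w
    return (loss(w + dw) - loss(w - dw)) / (2 * dw)

for w in (0.0, 1.0, 3.4):
    print(w, analytic_grad(w), numeric_grad(w))

# Setting 28w - 68 = 0 gives the minimizer directly
w_star = 68 / 28
print(w_star)
```

The analytic and numeric columns should match closely, and `w_star` (about 2.4286) is the value gradient descent will be heading toward in the next step.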

3. Gradient Descent

Initialize $w = 0$ , learning rate $\alpha = 0.05$ :

Step	$w$	Gradient	New $w$
0	0.00	$28(0)-68 = -68$	$0 - 0.05 \times (-68) = 3.40$
1	3.40	$28(3.4)-68 = 27.2$	$3.40 - 0.05 \times 27.2 = 2.04$
2	2.04	$28(2.04)-68 = -10.88$	$2.04 + 0.54 = 2.58$
3	2.58	$28(2.58)-68 = 4.24$	$2.58 - 0.21 = 2.37$
4	2.37	$28(2.37)-68 = -1.64$	$2.37 + 0.08 = 2.45$
⋯
Converged	$w ≈ 2.43$	gradient ≈ 0	Found optimum

This is gradient descent in full: start from an initial point (here $w = 0$), keep stepping opposite the gradient, and stop when the gradient is approximately 0.

Run with NumPy

import numpy as np

# Data
X = np.array([1, 2, 3])
y = np.array([3, 5, 7])

# Initialize parameter and learning rate
w = 0.0
lr = 0.05

# Training loop
for step in range(50):
    # 1. Forward: predict
    y_pred = w * X

    # 2. Compute loss: sum of squared errors, matching L(w) above
    loss = np.sum((y_pred - y) ** 2)

    # 3. Compute gradient dL/dw = sum of 2 * (w*x_i - y_i) * x_i = 28w - 68
    grad = 2 * np.sum((y_pred - y) * X)

    # 4. Gradient descent update
    w = w - lr * grad

    if step % 10 == 0:
        print(f"step {step}: w={w:.4f}, loss={loss:.4f}, grad={grad:.4f}")

print(f"\nFinal w = {w:.4f}")

Run it; you’ll see $w$ start from 0 and settle near 2.43 (the exact optimum is $68/28 \approx 2.4286$). This is what “learned” means.

Real Neural Network Training

Real neural networks are way more complex:

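There isn’t 1 parameter; there are billions
The loss often isn’t squared error; for classification and language models it’s typically cross-entropy (next article)
The gradient isn’t computed on the full dataset; it’s estimated on a mini-batch
The update usually isn’t vanilla gradient descent; it’s a variant such as Adam or Momentum (see L2-09 Optimizers Deep Dive)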

But the core idea is identical:

Forward: predict with current parameters
Compute loss
Backward: compute gradient via chain rule
Update parameters with gradient
Repeat millions of times
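The loop above can be sketched with mini-batches on the same three data points from the line-fitting example. This is an illustrative sketch, not how any particular framework does it: the batch size of 2, the learning rate, the step count, and the fixed random seed are all assumptions.

```python
import random

xs = [1.0, 2.0, 3.0]
ys = [3.0, 5.0, 7.0]

random.seed(0)      # fixed seed so repeated runs behave the same
w = 0.0
lr = 0.01           # assumed learning rate
batch_size = 2      # assumed mini-batch size

for step in range(1000):
    # Mini-batch: use a random subset of the data instead of all of it
    batch = random.sample(range(len(xs)), batch_size)

    # 1. Forward: predict with the current parameter
    preds = [w * xs[i] for i in batch]

    # 2. Compute loss (mean squared error over the batch)
    loss = sum((p - ys[i]) ** 2 for p, i in zip(preds, batch)) / batch_size

    # 3. Backward: gradient of the batch loss with respect to w
    grad = sum(2 * (p - ys[i]) * xs[i] for p, i in zip(preds, batch)) / batch_size

    # 4. Update the parameter with the gradient
    w = w - lr * grad

print(w)
```

Because each step sees only part of the data, `w` does not settle on one exact value. It jitters in a small neighborhood of the full-batch optimum (about 2.43). That noise is the price of cheaper steps, and it is one reason real training uses refinements such as Momentum or Adam.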

🔬 An aha moment

“Training” sounds mystical—but at heart it’s repeating four things until convergence. The machine doesn’t learn “knowledge”; it learns numbers that minimize loss.

Want to “See” It

👀 Open the Gradient Descent Hiker visualization — click anywhere on the map to set a start point, watch SGD / Momentum / Adam trajectories descend simultaneously.

One-Line Summary

Derivatives tell you “which direction is better”; gradient descent makes you “take one step that way.”

Repeat millions of times, and the machine has learned.

Next: “Probability and Maximum Likelihood: Why ML Is a Probability Problem”