Derivatives and Gradients: The Math Definition of "Learning"
The essence of "training a neural network" is derivatives. Once you understand this word, AI training stops being mysterious.
Back to the question at the end of L1-02: where do the numbers in a matrix come from?
Answer: training.
More specifically: they are adjusted via “gradient descent”.
This article makes “gradient” crystal clear. By the end you’ll see that it’s just the math word for “which way is uphill”, and that flipping its sign tells you which way is downhill.
Station 1: What Is a Derivative
Let me use an example. You’re driving and the speedometer shows 60 km/h.
What is “60”? It’s “distance’s derivative with respect to time”:
“If I hold this speed for another second, I’ll travel 60/3600 km (about 17 m).”
A derivative is a rate of change. It answers:
“If the input changes a tiny bit, how much does the output change?”
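Math notation:
$$f'(x) = \lim_{\Delta x \to 0} \frac{f(x + \Delta x) - f(x)}{\Delta x}$$
Don’t be intimidated by this formula. It’s saying “when $x$ changes by $\Delta x$, how much does $f(x)$ change, divided by $\Delta x$”: a ratio.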
Some Intuitions
- Function $f(x) = x^2$, at $x = 3$, derivative = 6 (because $f'(x) = 2x$). Meaning: when $x$ increases by a tiny bit, $f(x)$ increases by about 6 times that.
- Function $f(x) = c$ (constant), derivative = 0. Meaning: no matter how $x$ moves, $f(x)$ stays the same.
- Function $f(x) = 3x$, derivative = 3. Meaning: $x$ increases by 1, $f(x)$ increases by 3.
Geometric Meaning
Derivative = the slope of the function graph’s tangent line at that point
- Tangent pointing up → positive derivative (value increasing)
- Tangent pointing down → negative derivative (value decreasing)
- Horizontal tangent → zero derivative (typically a local max or min)
A derivative of 0 marks a candidate optimum: at the minimum of a smooth function the derivative must be 0, although a zero derivative can also be a maximum or a flat spot. Finding a function’s minimum therefore starts with finding where the derivative is 0. This is the core of the upcoming “gradient descent.”
Station 2: What Is a Gradient
If a function has one input (like $f(x)$), its “derivative” is one number.
But functions in AI have millions or billions of inputs. E.g., a neural network’s loss function:
$$L(w_1, w_2, \ldots, w_n)$$
Every parameter counts as an input; for GPT-3 that is all 175B of them.
How do we define “derivative” for such functions?
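Answer: take the derivative with respect to each parameter separately (a partial derivative). The collection is called the gradient:
$$\nabla L = \left( \frac{\partial L}{\partial w_1}, \frac{\partial L}{\partial w_2}, \ldots, \frac{\partial L}{\partial w_n} \right)$$
Each component tells you: “if I only tune this one parameter, how does the loss change?”
The gradient itself is a vector with one entry per parameter.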
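To make "one derivative per parameter" concrete, here is a minimal sketch that estimates a gradient by nudging each parameter in turn while holding the others fixed. The function `numerical_gradient` and the two-parameter `toy_loss` are hypothetical, picked only so the result can be checked by hand.

```python
import numpy as np


def numerical_gradient(loss, w, h=1e-6):
    """Estimate the gradient by nudging one parameter at a time."""
    grad = np.zeros_like(w, dtype=float)
    for i in range(w.size):
        step = np.zeros_like(w, dtype=float)
        step[i] = h  # only parameter i moves
        grad[i] = (loss(w + step) - loss(w - step)) / (2 * h)
    return grad


def toy_loss(w):
    # Hypothetical loss with two parameters: L = w0^2 + 3 * w1^2
    return w[0] ** 2 + 3 * w[1] ** 2


w = np.array([1.0, 2.0])
grad = numerical_gradient(toy_loss, w)
print(grad)  # analytic gradient is (2 * w0, 6 * w1) = (2, 12)
```

Real training code does not compute gradients this way, since it needs two loss evaluations per parameter. It is shown here only because it mirrors the definition directly.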
Graphical Intuition
Imagine standing in mountainous terrain. Your altitude = loss value. Your position (latitude/longitude) = model parameters.
Gradient = “the steepest uphill direction”.
To find the lowest point (minimum loss), walk in the opposite direction of the gradient—this is “gradient descent.”
Each step:
$$w \leftarrow w - \eta \, \nabla L(w)$$
- $\eta$ is the step size (also called the “learning rate”)
- $\nabla L(w)$ is the current gradient
- The negative sign = “opposite direction”, i.e. downhill
This is the core algorithm of almost all AI training.
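As a sketch of that update rule in action, the snippet below walks downhill on a bowl-shaped loss with two parameters. The loss, its hand-written gradient, the starting point, and the learning rate of 0.1 are all assumptions made for this illustration.

```python
import numpy as np


def toy_loss(w):
    # Hypothetical bowl-shaped loss with its minimum at (0, 0)
    return w[0] ** 2 + 3 * w[1] ** 2


def toy_grad(w):
    # Gradient of toy_loss, derived by hand
    return np.array([2 * w[0], 6 * w[1]])


w = np.array([4.0, -3.0])  # arbitrary starting position on the "mountain"
lr = 0.1                   # step size (learning rate)
history = [toy_loss(w)]    # altitude after each step

for _ in range(100):
    w = w - lr * toy_grad(w)  # step opposite the gradient
    history.append(toy_loss(w))

print(f"start loss = {history[0]:.2f}, final loss = {history[-1]:.2e}, w = {w}")
```

With this step size the loss drops at every step. A larger step size can overshoot the bottom of the bowl, which is why the learning rate needs tuning in practice.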
Station 3: Chain Rule
But a neural network isn’t that simple. It’s nested layer by layer:
$$y = f_3(f_2(f_1(x)))$$
Input $x$ passes through layer $f_1$, then $f_2$, then $f_3$. That’s a “3-layer neural network.”
Now the question: how do we take a derivative of this nested function?
Answer: the chain rule. This is taught in high school, but most people don’t realize how important it is. Writing $h_1 = f_1(x)$ and $h_2 = f_2(h_1)$:
$$\frac{dy}{dx} = \frac{dy}{dh_2} \cdot \frac{dh_2}{dh_1} \cdot \frac{dh_1}{dx}$$
Translation: the input’s influence on the final output = the product of each layer’s own derivative.
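A quick way to convince yourself is to compare the chain-rule product with a brute-force numerical derivative. The three "layers" below (multiply by 3, square, sine) are hypothetical stand-ins chosen because their derivatives are easy to write down.

```python
import math


# Hypothetical layers and their derivatives
def f1(x):
    return 3 * x


def df1(x):
    return 3.0


def f2(h):
    return h ** 2


def df2(h):
    return 2 * h


f3, df3 = math.sin, math.cos


def forward(x):
    return f3(f2(f1(x)))


x = 0.5
h1 = f1(x)   # output of layer 1
h2 = f2(h1)  # output of layer 2

# Chain rule: multiply each layer's own derivative, evaluated at that layer's input
chain_rule = df3(h2) * df2(h1) * df1(x)

# Brute force: nudge x and watch the final output
eps = 1e-6
finite_diff = (forward(x + eps) - forward(x - eps)) / (2 * eps)

print(chain_rule, finite_diff)
```

The two numbers should agree to several decimal places. Note that each layer's derivative is evaluated at the value that actually flowed into that layer, which is why the intermediate outputs `h1` and `h2` are kept.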
Backpropagation
Computing these derivatives starting from the final layer and working backward is called backpropagation.
It’s the core algorithm of neural network training. It was popularized in 1986 by Hinton et al. (see L0-02 AI history).
Key insight: the chain rule lets us avoid “directly computing” the whole network’s derivative in one piece. We can compute it layer by layer, mechanically and cheaply, reusing each layer’s result for the layer before it.
This is a large part of why 100-layer or 1000-layer networks can still be trained: thanks to the chain rule.
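Here is a minimal hand-written backward pass, assuming a tiny scalar network with three weights and tanh activations. This architecture, the weight values, and the function names are all made up for illustration. The point is that each line of `backward` reuses the quantity computed on the line before it, moving from the loss back toward the input.

```python
import math


def forward(params, x):
    w1, w2, w3 = params
    a1 = math.tanh(w1 * x)   # layer 1
    a2 = math.tanh(w2 * a1)  # layer 2
    y = w3 * a2              # layer 3 (linear output)
    return a1, a2, y


def loss(params, x, target):
    return (forward(params, x)[2] - target) ** 2


def backward(params, x, target):
    w1, w2, w3 = params
    a1, a2, y = forward(params, x)
    dL_dy = 2 * (y - target)         # start at the loss
    dL_dw3 = dL_dy * a2
    dL_da2 = dL_dy * w3              # pass the signal back to layer 2
    dL_dz2 = dL_da2 * (1 - a2 ** 2)  # tanh'(z) = 1 - tanh(z)^2
    dL_dw2 = dL_dz2 * a1
    dL_da1 = dL_dz2 * w2             # pass the signal back to layer 1
    dL_dz1 = dL_da1 * (1 - a1 ** 2)
    dL_dw1 = dL_dz1 * x
    return [dL_dw1, dL_dw2, dL_dw3]


params = [0.5, -1.2, 0.8]  # arbitrary weights
x, target = 1.5, 0.3       # arbitrary training example
grads = backward(params, x, target)

# Sanity check against finite differences, one weight at a time
eps = 1e-6
numeric = []
for i in range(3):
    up, down = list(params), list(params)
    up[i] += eps
    down[i] -= eps
    numeric.append((loss(up, x, target) - loss(down, x, target)) / (2 * eps))

print(grads)
print(numeric)
```

One forward pass plus one backward pass yields all three gradients, whereas the finite-difference check needs two extra forward passes per weight. That gap is what makes backpropagation practical at scale.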
A Complete Example
Let me walk through the simplest example.
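Task: given data points $(1, 3)$, $(2, 5)$, $(3, 7)$, find a line $y = wx$ to fit them.
We’re looking for parameter $w$.
Intuitive answer: $w \approx 2.4$ (the data follows $y = 2x + 1$; the model is simplified to have no intercept, so the slope has to compensate).
How the machine learns:
1. Define Loss Function
$$L(w) = \sum_i (w x_i - y_i)^2$$
Meaning: predict with our $w$, then sum the squared differences from the actual values.
Plug in data:
$$L(w) = (w - 3)^2 + (2w - 5)^2 + (3w - 7)^2$$
2. Take Derivative
$$\frac{dL}{dw} = 2(w - 3) + 4(2w - 5) + 6(3w - 7) = 28w - 68$$
3. Gradient Descent
Initialize $w = 0$, learning rate $\eta = 0.05$: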
| Step | $w$ | Gradient $dL/dw$ | New $w$ |
|---|---|---|---|
| 0 | 0.00 | -68.00 | 3.40 |
| 1 | 3.40 | 27.20 | 2.04 |
| 2 | 2.04 | -10.88 | 2.58 |
| 3 | 2.58 | 4.35 | 2.37 |
| 4 | 2.37 | -1.74 (approaching 0) | ≈ 2.45 |
| ⋯ | ⋯ | ⋯ | ⋯ |
| Converged | ≈ 2.43 | gradient ≈ 0 | Found optimum |
This is gradient descent in full: pick a starting point (here 0; in practice usually random), keep walking opposite the gradient, and stop when the gradient is approximately 0.
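The back-and-forth pattern in the table (3.40, 2.04, 2.58, 2.37) comes from the step size. For the sum-of-squares loss on the three data points, the derivative works out to 28w - 68, so each update multiplies the distance to the optimum by 1 - 28 × (learning rate). The sketch below tries three learning rates. The helper name `run_descent` and the values 0.01 and 0.08 are choices made here for illustration.

```python
def run_descent(lr, steps=20, w=0.0):
    """Plain gradient descent on L(w) = (w-3)^2 + (2w-5)^2 + (3w-7)^2."""
    for _ in range(steps):
        grad = 28 * w - 68  # dL/dw for the loss above
        w = w - lr * grad
    return w


w_small = run_descent(0.01)  # factor 0.72 per step: creeps up toward the optimum
w_table = run_descent(0.05)  # factor -0.4 per step: overshoots, but the error shrinks
w_large = run_descent(0.08)  # factor -1.24 per step: the error grows every step

print(w_small, w_table, w_large)
```

For this particular loss, any learning rate above 2/28 (about 0.071) makes the updates diverge, so 0.05 is close to the upper end of what works.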
Run with NumPy
```python
import numpy as np

# Data
X = np.array([1, 2, 3])
y = np.array([3, 5, 7])

# Initialize parameter and learning rate
w = 0.0
lr = 0.05

# Training loop
for step in range(50):
    # 1. Forward: predict
    y_pred = w * X
    # 2. Compute loss (for printing): sum of squared errors, as defined in the text
    loss = np.sum((y_pred - y) ** 2)
    # 3. Compute gradient (dL/dw)
    grad = 2 * np.sum((y_pred - y) * X)
    # 4. Gradient descent update
    w = w - lr * grad
    if step % 10 == 0:
        print(f"step {step}: w={w:.4f}, loss={loss:.4f}, grad={grad:.4f}")

print(f"\nFinal w = {w:.4f}")
```
Run it; you’ll see $w$ start from 0 and settle at about 2.43. This is what “learned” means.
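For a problem this small, the value 2.43 can also be found without any iteration. Setting the derivative of the sum-of-squares loss to zero gives w = Σ(x·y) / Σ(x·x). The sketch below is only a cross-check on the same three data points; the variable names are chosen here.

```python
import numpy as np

X = np.array([1.0, 2.0, 3.0])
y = np.array([3.0, 5.0, 7.0])

# Setting dL/dw = 2 * sum((w*x - y) * x) to zero gives w = sum(x*y) / sum(x*x)
w_star = np.sum(X * y) / np.sum(X * X)

# The gradient at that point should vanish (up to floating-point rounding)
grad_at_w_star = 2 * np.sum((w_star * X - y) * X)

print(f"w* = {w_star:.4f}, gradient there = {grad_at_w_star:.2e}")
```

Neural networks have no such closed-form solution, which is why training relies on the iterative procedure.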
Real Neural Network Training
Real neural networks are way more complex:
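- There isn’t 1 parameter; there are billions
- The loss usually isn’t squared error; for classification and language models it’s cross-entropy (next article)
- The gradient isn’t computed on the full dataset; it’s estimated on a mini-batch
- The update isn’t vanilla gradient descent; it’s a variant such as Momentum or Adam (see L2-09 Optimizers Deep Dive)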
But the core idea is identical:
- Forward: predict with current parameters
- Compute loss
- Backward: compute gradient via chain rule
- Update parameters with gradient
- Repeat millions of times
“Training” sounds mystical—but at heart it’s repeating four things until convergence. The machine doesn’t learn “knowledge”; it learns numbers that minimize loss.
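The same loop can be sketched with mini-batches and two parameters instead of one. Everything specific below is an assumption made for illustration: the synthetic noise-free data on the line y = 2x + 1, the batch size of 20, the learning rate of 0.1, and the squared-error loss (kept so the gradients can be written by hand).

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic, noise-free data on the line y = 2x + 1
X = rng.uniform(-1.0, 1.0, size=200)
y = 2.0 * X + 1.0

w, b = 0.0, 0.0  # slope and intercept, both learned
lr, batch_size = 0.1, 20

for epoch in range(50):
    order = rng.permutation(X.size)  # reshuffle the data every epoch
    for start in range(0, X.size, batch_size):
        idx = order[start:start + batch_size]
        xb, yb = X[idx], y[idx]

        y_pred = w * xb + b             # forward
        err = y_pred - yb
        loss = np.mean(err ** 2)        # loss on this mini-batch
        grad_w = 2 * np.mean(err * xb)  # backward (derived by hand here)
        grad_b = 2 * np.mean(err)
        w -= lr * grad_w                # update
        b -= lr * grad_b

print(f"w = {w:.4f}, b = {b:.4f}, last batch loss = {loss:.2e}")
```

Each mini-batch gives only an estimate of the full gradient, yet the parameters still head toward slope 2 and intercept 1. A deep-learning framework would replace the two hand-derived gradient lines with automatic backpropagation.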
Want to “See” It
👀 Open the Gradient Descent Hiker visualization — click anywhere on the map to set a start point, watch SGD / Momentum / Adam trajectories descend simultaneously.
One-Line Summary
Derivatives tell you “which direction is better”; gradient descent makes you “take one step that way.”
Repeat millions of times, and the machine has learned.
Next: “Probability and Maximum Likelihood: Why ML Is a Probability Problem”