CNN Convolution Principles: From Filters to ResNet
Before Transformer dominated, CNN ruled computer vision. Today it remains the default for image processing. This article explains "convolution" clearly.
In L0-02 we mentioned that in 2012, AlexNet’s 8-layer CNN cut the ImageNet top-5 error from about 26% to about 15%, igniting the deep learning revolution.
That was CNN’s spotlight moment. Today, although Transformer has stolen attention in many fields, for processing images, CNN remains the default—its inductive bias is naturally suited for vision.
This article makes the core of CNN—convolution—crystal clear.
🎮 Recommended: first go play with CNN Convolution Scanner visualization for 5 minutes. Manually dragging a few kernels beats reading 1000 words.
Why Can’t We Use Fully-Connected for Images
The naivest neural network is Fully Connected (FC)—every input connects to every output.
But FC for images has three fatal problems:
Problem 1: Parameter Explosion
A 224×224 color image has 224 × 224 × 3 = 150,528 input values. If the first layer has 1000 neurons, it needs 150,528 × 1000 ≈ 150 million weights.
For one layer!
Problem 2: Losing “Locality”
Key info in images is local—a cat’s ears are in a small region. FC treats all pixels equally, shredding local structure.
Problem 3: No “Translation Invariance”
Whether the cat is in the top-left or the bottom-right should make no difference to “is this a cat?”. But FC learns a separate weight for every pixel position (“what does pixel 0 look like”, “what does pixel 1 look like”), so it is position-sensitive: a cat learned in the top-left can’t be recognized in the bottom-right.
CNN solves all three perfectly—through convolution.
What Is Convolution
If you’ve ever adjusted photo filters (Photoshop, Lightroom, Snapseed), you’ve used convolution.
Convolution = a small window (kernel) slides over the image, computing one number per position.
Simplest example: blur filter.
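Input image: a small grid of grayscale pixel values.
Blur kernel (3×3 average): nine weights, each equal to 1/9.
Operation: slide the kernel over the image; at each position, multiply element-wise, then sum. E.g., the top-left output is the average of the nine pixels in the image’s top-left 3×3 patch.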
Final output is smoother (blurred) than input.
This is convolution.
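The sliding-window operation can be sketched in a few lines of NumPy. This is a minimal illustration, not the article's original figure: the 5×5 input is a made-up example, `conv2d_valid` is a hypothetical helper name, and no padding is used, so a 5×5 input gives a 3×3 output.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide `kernel` over `image` (no padding, stride 1); multiply element-wise, then sum."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i:i + kh, j:j + kw]   # the window under the kernel
            out[i, j] = np.sum(patch * kernel)  # one number per position
    return out

# Hypothetical 5×5 grayscale image: dark background, one bright pixel in the middle.
image = np.zeros((5, 5))
image[2, 2] = 9.0

blur_kernel = np.full((3, 3), 1.0 / 9.0)  # 3×3 average
blurred = conv2d_valid(image, blur_kernel)
print(blurred)  # every window contains the bright pixel, so each value is about 1.0
```

The single spike of brightness is spread evenly over its neighborhood, which is exactly what "blur" means.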
Different Kernels = Different “Filters”
Classic kernels:
Horizontal Edge Detection (Sobel-Y)
Intuition: top row minus bottom—if vertical change is strong (bright→dark), result is big. Detects horizontal edges.
Vertical Edge Detection (Sobel-X)
Same idea with left column minus right: a strong horizontal change gives a big result, so it detects vertical edges.
Sharpen
Center ×5, neighbors ×-1—enhances center vs neighbor contrast, edges “pop.”
Blur
Averages 3×3 neighbors, smoothing the image.
In classical image processing, these kernels are hand-designed.
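As a rough sketch, the classic kernels can be written out and tried on a toy image. The values below are the common textbook ones (sign conventions differ between references), the 6×6 test image is invented for illustration, and `apply_kernel` is a hypothetical helper built on PyTorch's `F.conv2d`.

```python
import torch
import torch.nn.functional as F

sobel_y = torch.tensor([[ 1.,  2.,  1.],
                        [ 0.,  0.,  0.],
                        [-1., -2., -1.]])   # top row minus bottom row
sobel_x = sobel_y.t()                        # left column minus right column
sharpen = torch.tensor([[ 0., -1.,  0.],
                        [-1.,  5., -1.],
                        [ 0., -1.,  0.]])   # center ×5, direct neighbors ×-1
blur = torch.full((3, 3), 1.0 / 9.0)         # 3×3 average

def apply_kernel(image, kernel):
    """Apply one 3×3 kernel to a 2-D image (no padding)."""
    out = F.conv2d(image[None, None], kernel[None, None])  # add batch and channel dims
    return out[0, 0]

# Hypothetical test image: bright top half, dark bottom half, i.e. one horizontal edge.
image = torch.zeros(6, 6)
image[:3, :] = 1.0

edge_h = apply_kernel(image, sobel_y)  # large values in the rows that straddle the edge
edge_v = apply_kernel(image, sobel_x)  # all zeros: there is no vertical edge
print(edge_h)
print(edge_v)
```

Only the horizontal-edge kernel responds, and only in the rows where bright meets dark.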
CNN’s Revolution—Letting Kernels Learn
CNN’s key insight:
Don’t write the numbers in kernels—let the model learn them from data.
Mathematically: treat kernels as trainable parameters, optimize via gradient descent.
- A CNN layer has dozens to hundreds of kernels
- Each kernel learns a different “feature detector”
- Shallow layers learn edges, textures; deep layers learn eyes, faces, wheels
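To make "let kernels learn" concrete, here is a toy experiment under stated assumptions: random 16×16 images stand in for data, the "labels" are the outputs of a hand-made edge kernel, and a single randomly initialized 3×3 kernel is trained with plain SGD to reproduce them. The learning rate and step count are arbitrary choices that happen to work for this toy problem.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# The behaviour we want the kernel to discover: a hand-designed edge detector.
target_kernel = torch.tensor([[ 1.,  2.,  1.],
                              [ 0.,  0.,  0.],
                              [-1., -2., -1.]])

images = torch.rand(64, 1, 16, 16)                     # toy "dataset"
targets = F.conv2d(images, target_kernel[None, None])  # outputs of the hand-made filter

conv = nn.Conv2d(1, 1, kernel_size=3, bias=False)      # 9 trainable numbers, random init
optimizer = torch.optim.SGD(conv.parameters(), lr=0.2)

initial_loss = F.mse_loss(conv(images), targets).item()
for step in range(500):
    optimizer.zero_grad()
    loss = F.mse_loss(conv(images), targets)
    loss.backward()    # gradient of the loss w.r.t. the 9 kernel values
    optimizer.step()   # nudge the kernel values downhill
final_loss = F.mse_loss(conv(images), targets).item()

learned_kernel = conv.weight.detach()[0, 0]
print(initial_loss, final_loss)
print(learned_kernel)  # should end up close to target_kernel
```

Nobody typed the edge-detector values into `conv`; gradient descent recovered them from input/output examples alone. A real CNN does the same thing, except the training signal comes from the task loss rather than from a known filter.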
Complete CNN Layer Pseudocode
import torch.nn as nn

class CNNBlock(nn.Module):
    def __init__(self, in_channels, out_channels):
        super().__init__()
        # Conv layer: out_channels kernels of size 3×3 (e.g. 64); padding=1 keeps height and width
        self.conv = nn.Conv2d(in_channels, out_channels, kernel_size=3, padding=1)
        # Activation
        self.relu = nn.ReLU()
        # Pooling to halve height and width
        self.pool = nn.MaxPool2d(2)

    def forward(self, x):
        x = self.conv(x)  # convolution
        x = self.relu(x)  # nonlinearity
        x = self.pool(x)  # pooling
        return x
This is CNN’s “lego brick”.
CNN Solved FC’s Three Problems
✅ Parameter Explosion → Parameter Sharing
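A 3×3 kernel has only 9 parameters per input channel. It slides over the whole image, so all positions share the same parameters.
CNN layer params = n_kernels × (in_channels × 9) + n_kernels biases. For 64 kernels on an RGB image that is 1,792 parameters, tens of thousands of times fewer than the roughly 150 million weights of a 1000-neuron FC layer on a 224×224 image.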
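The parameter counts can be checked directly. In this sketch the FC count is computed by arithmetic rather than by allocating a 150-million-weight layer, and the choice of 64 kernels is just an example.

```python
import torch.nn as nn

# Fully connected: each of the 224*224*3 inputs connects to each of 1000 neurons.
fc_inputs = 224 * 224 * 3
fc_params = fc_inputs * 1000 + 1000   # weights + biases (computed, not allocated)

# Convolution: 64 kernels, each 3×3 across 3 input channels, plus one bias per kernel.
conv = nn.Conv2d(in_channels=3, out_channels=64, kernel_size=3, padding=1)
conv_params = sum(p.numel() for p in conv.parameters())

print(fc_params)                 # 150529000
print(conv_params)               # 1792
print(fc_params // conv_params)  # 84000
```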
✅ Locality → Convolution Is Naturally Local
Each output position only looks at nearby 3×3. Local info captured precisely.
✅ Translation Invariance → Same Kernel Everywhere
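Learned features don’t depend on position: an “ear kernel” responds to ears wherever it scans. Strictly speaking, convolution is translation-equivariant (shift the ear and the response shifts with it); pooling on top turns that into approximate invariance.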
CNN = convolution + parameter sharing + translation invariance. Together these make it a natural fit for data with local structure whose meaning does not depend on position, with images as the typical case.
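The "same kernel everywhere" property can be demonstrated with a small experiment. This is a sketch under assumptions: a random kernel stands in for a learned one, and the "object" is a random 3×3 patch pasted at two positions on a blank 12×12 canvas.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
conv = nn.Conv2d(1, 1, kernel_size=3, bias=False)  # random kernel as a stand-in

patch = torch.rand(3, 3)                  # hypothetical "object"
top_left = torch.zeros(1, 1, 12, 12)
top_left[0, 0, 2:5, 2:5] = patch
bottom_right = torch.zeros(1, 1, 12, 12)
bottom_right[0, 0, 7:10, 7:10] = patch    # same object, moved 5 rows down and 5 columns right

with torch.no_grad():
    out_a = conv(top_left)[0, 0]          # 10×10 feature map
    out_b = conv(bottom_right)[0, 0]

# The response pattern is identical, just moved along with the object.
shifted = torch.roll(out_a, shifts=(5, 5), dims=(0, 1))
print(torch.allclose(shifted, out_b, atol=1e-6))

# After a global max (an extreme form of pooling), position no longer matters at all.
print(out_a.max().item(), out_b.max().item())
```

A fully connected layer gives no such guarantee, because each position has its own independent weights.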
CNN’s Standard Lego Bricks
A real CNN usually looks like:
Input image (3×224×224)
↓ [Conv 64 of 3×3] → 64×224×224
↓ [ReLU + Pool 2×2] → 64×112×112
↓ [Conv 128 of 3×3] → 128×112×112
↓ [ReLU + Pool 2×2] → 128×56×56
↓ [Conv 256 of 3×3] → 256×56×56
↓ [ReLU + Pool 2×2] → 256×28×28
↓ ... (more layers)
↓ [Flatten] → a long vector
↓ [Fully Connected] → 1000-class probability
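The shapes in the diagram can be verified by building the first three stages in PyTorch. This is only a shape check on a blank image; `stage` is a hypothetical helper name, and the elided layers, flatten, and classifier are left out.

```python
import torch
import torch.nn as nn

def stage(in_ch, out_ch):
    """One Conv -> ReLU -> Pool stage."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),  # keeps height and width
        nn.ReLU(),
        nn.MaxPool2d(2),                                      # halves height and width
    )

stages = nn.ModuleList([stage(3, 64), stage(64, 128), stage(128, 256)])

x = torch.zeros(1, 3, 224, 224)  # a batch containing one blank RGB image
shapes = []
with torch.no_grad():
    for s in stages:
        x = s(x)
        shapes.append(tuple(x.shape[1:]))  # drop the batch dimension

print(shapes)  # [(64, 112, 112), (128, 56, 56), (256, 28, 28)]
```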
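Each layer gradually sees “wider”: the kernels stay small, but because they are stacked (with pooling shrinking the map in between), neurons in deep layers end up with large “receptive fields” on the original image.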
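How fast the receptive field grows can be estimated with the standard recurrence: each layer adds (kernel_size − 1) times the current spacing between neighboring outputs, and each stride multiplies that spacing. The sketch below assumes the Conv 3×3 + Pool 2×2 stages from the diagram; `receptive_field` is a hypothetical helper name.

```python
def receptive_field(layers):
    """layers: list of (kernel_size, stride) pairs, input side first."""
    rf, jump = 1, 1  # jump = distance in input pixels between neighboring outputs
    for kernel_size, stride in layers:
        rf += (kernel_size - 1) * jump
        jump *= stride
    return rf

conv, pool = (3, 1), (2, 2)
print(receptive_field([conv]))            # 3
print(receptive_field([conv, pool] * 3))  # 22
```

After just three stages, one output value already depends on a 22×22 region of the input, even though no single kernel is larger than 3×3.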
A Few Important Concepts
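- Padding: pad zeros around the image edges so the output keeps the input size after convolution.
- Stride: how far the kernel moves per step (default 1; 2 means it lands on every other pixel, halving the output size).
- Pooling: collapse each 2×2 region into 1 value (max or average), halving height and width.
- Channel: color images have 3 channels (RGB); a kernel has one 3×3 slice per input channel, each slice convolves its own channel, and the results are summed into one output map.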
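Padding, stride, and kernel size combine into one output-size formula: floor((n + 2·padding − kernel_size) / stride) + 1 along each axis. A small sketch, assuming no dilation; `conv_output_size` is a hypothetical helper name.

```python
def conv_output_size(n, kernel_size, stride=1, padding=0):
    """Output length along one axis for a convolution or pooling window."""
    return (n + 2 * padding - kernel_size) // stride + 1

print(conv_output_size(224, kernel_size=3))                       # 222: no padding shrinks the map
print(conv_output_size(224, kernel_size=3, padding=1))            # 224: padding keeps the size
print(conv_output_size(224, kernel_size=3, padding=1, stride=2))  # 112: stride 2 halves it
print(conv_output_size(224, kernel_size=2, stride=2))             # 112: 2×2 pooling
```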
Classic CNN Architecture Evolution
LeNet (1998) - CNN’s ancestor
LeCun’s 5-layer CNN for handwritten digit recognition.
AlexNet (2012) - Ignited deep learning
8 layers. Proved “deep learning works.”
VGG (2014) - Small and deep
All kernels 3×3, depth 16-19. Simple and elegant.
GoogLeNet / Inception (2014) - “Multiple paths”
Parallel kernels of different sizes; outputs concatenated.
ResNet (2015) - Game-changing “residual connections”
Solved the “too deep to train” problem, making networks with 100 or even 1000 layers trainable.
ResNet’s key is skip connections:
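import torch.nn.functional as F

def residual_block(x, w1, w2):
    # w1, w2: 3×3 conv weights of shape (C, C, 3, 3), so the output shape matches x
    y = F.conv2d(x, w1, padding=1)  # first convolution
    y = F.relu(y)                   # nonlinearity
    y = F.conv2d(y, w2, padding=1)  # second convolution
    return x + y  # ← this "+" is the revolution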
This simple “+” lets gradients flow directly back from deep to shallow layers, making deep networks trainable.
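The gradient claim can be checked in a deliberately extreme case. The sketch below is a simplified block (no batch normalization, unlike the published ResNet) whose convolution weights are all set to zero, so the conv branch passes no signal and no gradient; the skip connection still does.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """conv -> ReLU -> conv, plus the skip connection (simplified sketch)."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)

    def forward(self, x):
        y = self.conv2(torch.relu(self.conv1(x)))
        return x + y  # skip connection

block = ResidualBlock(channels=4)
for p in block.parameters():
    nn.init.zeros_(p)  # worst case: the conv branch contributes nothing

x = torch.randn(1, 4, 8, 8, requires_grad=True)
out = block(x)
out.sum().backward()

print(torch.equal(out, x))                      # True: the block falls back to identity
print(torch.equal(x.grad, torch.ones_like(x)))  # True: the gradient arrives undiminished
```

In a plain stack without the skip connection, the same zeroed layers would block both the signal and the gradient entirely.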
Historical trivia: the ResNet paper’s famous plot, in which a plain 20-layer network beats a plain 56-layer one on both training and test error, revealed the “deeper = harder to train” paradox. Residual connections solved it.
EfficientNet (2019) - Balance depth, width, resolution
Systematic scaling method, best cost-efficiency.
Vision Transformer (2020) - Disrupted CNN
“If attention is so good, why not for images?” Split image into 16×16 patches, feed to Transformer—turns out it works.
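The patch-splitting step is easy to sketch. The snippet below only shows how a 224×224 RGB image becomes a sequence of flattened 16×16 patches; the learned projection, position embeddings, and the Transformer itself are omitted, and the random image is a placeholder.

```python
import torch

image = torch.rand(3, 224, 224)  # placeholder RGB image
patch = 16

# Cut non-overlapping 16×16 windows along height, then along width.
patches = image.unfold(1, patch, patch).unfold(2, patch, patch)  # (3, 14, 14, 16, 16)
# One row per patch: 14*14 = 196 patches, each flattened to 3*16*16 = 768 numbers.
tokens = patches.permute(1, 2, 0, 3, 4).reshape(-1, 3 * patch * patch)

print(tokens.shape)  # torch.Size([196, 768])
```

Each row plays the role a word token plays in text: the Transformer sees a sequence of 196 "visual words".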
But in actual industry deployments, CNN remains the workhorse: it tends to work better on small datasets, and is usually faster and more power-efficient at inference.
CNN Does More Than Classification
Besides classification, CNN is the foundation for:
- Object detection (YOLO series, Faster R-CNN)—box objects
- Semantic segmentation (U-Net)—per-pixel class labels
- Image generation (GAN)—create new images
- Super-resolution—make blurry sharp
- Style transfer—make your photo look like Van Gogh
- Medical imaging—CT/MRI analysis
- Self-driving perception—lane, vehicle, pedestrian detection
All depend on convolution.
One-line Summary
CNN = let kernels learn.
Before the Transformer revolution, the entire vision field was stacking CNNs. Today it’s still the default starting point for vision—until you prove Transformer is better for your specific task.
Want to “See” It
👀 CNN Convolution Scanner — play 6 preset kernels, see how each extracts different features.
- Read ResNet, VGG, EfficientNet papers
- Build a simple image classifier in PyTorch
- Understand why “image + Transformer” = ViT, “image + CNN” = ResNet—both are inductive bias choices
Recommended next: L3-07 · Word Embedding from Word2Vec to BERT — CNN handles pixels; word embeddings handle words.