HelloAI
L3 Chapter 6 🐣 🕒 15 min

CNN Convolution Principles: From Filters to ResNet

Before Transformer dominated, CNN ruled computer vision. Today it remains the default for image processing. This article explains "convolution" clearly.

HelloAI Editors
6/16/2026

In L0-02 we mentioned that in 2012 AlexNet, an 8-layer CNN, cut the ImageNet top-5 error to about 15%, versus about 26% for the runner-up, igniting the deep learning revolution.

That was CNN’s spotlight moment. Today, although Transformer has stolen attention in many fields, for processing images, CNN remains the default—its inductive bias is naturally suited for vision.

This article makes the core of CNN—convolution—crystal clear.

🎮 Recommended: first go play with CNN Convolution Scanner visualization for 5 minutes. Manually dragging a few kernels beats reading 1000 words.

Why Can’t We Use Fully-Connected for Images

The simplest neural network layer is Fully Connected (FC): every input connects to every output.

But FC for images has three fatal problems:

Problem 1: Parameter Explosion

A 224×224 color image has $224 \times 224 \times 3 = 150{,}528$ input values (pixels × color channels). If the first layer has 1000 neurons, it needs $150{,}528 \times 1000 \approx 150\text{M}$ weights.

For one layer!
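The count above is easy to verify. Below is a minimal sketch in plain Python; the helper name `fc_param_count` is made up for illustration, and it also counts one bias per neuron, which the rough estimate above leaves out.

```python
def fc_param_count(n_inputs: int, n_outputs: int) -> int:
    """One weight per input-output pair, plus one bias per output neuron."""
    return n_inputs * n_outputs + n_outputs


n_inputs = 224 * 224 * 3                 # height × width × RGB channels
params = fc_param_count(n_inputs, 1000)  # first layer with 1000 neurons

print(n_inputs)   # 150528
print(params)     # 150529000
```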

Problem 2: Losing “Locality”

Key info in images is local—a cat’s ears are in a small region. FC treats all pixels equally, shredding local structure.

Problem 3: No “Translation Invariance”

Whether the cat sits in the top-left or the bottom-right should make no difference to “is this a cat?”. But an FC layer learns a separate weight for every pixel position, so it is position-sensitive: a cat learned in the top-left cannot be recognized in the bottom-right.

CNN solves all three perfectly—through convolution.

What Is Convolution

If you’ve ever adjusted photo filters (Photoshop, Lightroom, Snapseed), you’ve used convolution.

Convolution = a small window (kernel) slides over the image, computing one number per position.

Simplest example: blur filter.

Input image:

$$\begin{bmatrix} 100 & 100 & 50 & 50 \\ 100 & 100 & 50 & 50 \\ 20 & 20 & 200 & 200 \\ 20 & 20 & 200 & 200 \end{bmatrix}$$

Blur kernel (3×3 average):

$$\frac{1}{9} \begin{bmatrix} 1 & 1 & 1 \\ 1 & 1 & 1 \\ 1 & 1 & 1 \end{bmatrix}$$

Operation: slide kernel over image; at each position multiply element-wise then sum. E.g., top-left output:

$$\frac{100+100+50 + 100+100+50 + 20+20+200}{9} \approx 82.2$$

Final output is smoother (blurred) than input.

This is convolution.
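The sliding-window operation can be sketched in a few lines of plain Python. This is an illustrative implementation, not an optimized one: the function name `conv2d_valid` is an assumption, it uses no padding and stride 1, and (like deep learning frameworks) it does not flip the kernel.

```python
def conv2d_valid(image, kernel):
    """Slide `kernel` over `image` (no padding, stride 1); multiply element-wise, then sum."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    output = []
    for top in range(out_h):
        row = []
        for left in range(out_w):
            total = 0.0
            for i in range(kh):
                for j in range(kw):
                    total += image[top + i][left + j] * kernel[i][j]
            row.append(total)
        output.append(row)
    return output


image = [
    [100, 100, 50, 50],
    [100, 100, 50, 50],
    [20, 20, 200, 200],
    [20, 20, 200, 200],
]
blur = [[1 / 9] * 3 for _ in range(3)]   # 3×3 average kernel

blurred = conv2d_valid(image, blur)
for row in blurred:
    print([round(value, 1) for value in row])
```

Without padding, the 4×4 input shrinks to a 2×2 output; the top-left entry is the 82.2 computed by hand above.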

Different Kernels = Different “Filters”

Classic kernels:

Horizontal Edge Detection (Sobel-Y)

$$\begin{bmatrix} -1 & -2 & -1 \\ 0 & 0 & 0 \\ 1 & 2 & 1 \end{bmatrix}$$

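Intuition: the (weighted) bottom row minus the top row. Where brightness changes sharply in the vertical direction, the result has a large magnitude: positive for dark→bright going down, negative for bright→dark. Detects horizontal edges.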

Vertical Edge Detection (Sobel-X)

$$\begin{bmatrix} -1 & 0 & 1 \\ -2 & 0 & 2 \\ -1 & 0 & 1 \end{bmatrix}$$

Similar idea—detects vertical edges.

Sharpen

$$\begin{bmatrix} 0 & -1 & 0 \\ -1 & 5 & -1 \\ 0 & -1 & 0 \end{bmatrix}$$

Center ×5, neighbors ×-1—enhances center vs neighbor contrast, edges “pop.”

Blur

$$\frac{1}{9} \begin{bmatrix} 1 & 1 & 1 \\ 1 & 1 & 1 \\ 1 & 1 & 1 \end{bmatrix}$$

Averages 3×3 neighbors, smoothing the image.

In classical image processing, these kernels are hand-designed.
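To see that the two Sobel kernels really respond to different edge directions, here is a small sketch on a made-up test image with a single horizontal edge. The helper `conv2d_valid` and the image values are assumptions chosen for illustration.

```python
def conv2d_valid(image, kernel):
    """Valid (no padding, stride 1) sliding-window sum of element-wise products."""
    kh, kw = len(kernel), len(kernel[0])
    return [
        [
            sum(image[top + i][left + j] * kernel[i][j]
                for i in range(kh) for j in range(kw))
            for left in range(len(image[0]) - kw + 1)
        ]
        for top in range(len(image) - kh + 1)
    ]


sobel_y = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]   # horizontal edges
sobel_x = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]   # vertical edges

# Dark top half, bright bottom half: one horizontal edge in the middle.
edge_image = [
    [0, 0, 0, 0],
    [0, 0, 0, 0],
    [10, 10, 10, 10],
    [10, 10, 10, 10],
]

horizontal_response = conv2d_valid(edge_image, sobel_y)
vertical_response = conv2d_valid(edge_image, sobel_x)

print(horizontal_response)  # [[40, 40], [40, 40]]
print(vertical_response)    # [[0, 0], [0, 0]]
```

Sobel-Y fires strongly on the horizontal edge, while Sobel-X stays at zero because nothing changes from left to right.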

CNN’s Revolution—Letting Kernels Learn

CNN’s key insight:

Don’t write the numbers in kernels—let the model learn them from data.

Mathematically: treat kernels as trainable parameters, optimize via gradient descent.

  • A CNN layer has dozens to hundreds of kernels
  • Each kernel learns a different “feature detector”
  • Shallow layers learn edges, textures; deep layers learn eyes, faces, wheels
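The idea of "kernels as trainable parameters" can be shown with a toy 1D example: start from an all-zero kernel and let gradient descent adjust it until its output matches that of a hidden hand-designed kernel. This is only a sketch under simplifying assumptions (one 1D kernel, a tiny made-up signal, mean squared error, a fixed learning rate); real CNNs learn many 2D kernels through a framework's automatic differentiation.

```python
def conv1d_valid(signal, kernel):
    """1D sliding-window sum of products (no padding, stride 1)."""
    k = len(kernel)
    return [
        sum(signal[i + j] * kernel[j] for j in range(k))
        for i in range(len(signal) - k + 1)
    ]


def mse(pred, target):
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target)


signal = [0.0, 1.0, 0.5, 0.0, 1.0, 1.0, 0.0, 0.5, 1.0, 0.0]
true_kernel = [-1.0, 0.0, 1.0]            # hand-designed edge detector, hidden from the learner
target = conv1d_valid(signal, true_kernel)

kernel = [0.0, 0.0, 0.0]                  # trainable parameters, initialized to zero
learning_rate = 0.1
initial_loss = mse(conv1d_valid(signal, kernel), target)

for _ in range(2000):
    pred = conv1d_valid(signal, kernel)
    errors = [p - t for p, t in zip(pred, target)]
    # d(loss)/d(kernel[j]) = (2 / M) * sum_i errors[i] * signal[i + j]
    grads = [
        2.0 / len(errors) * sum(e * signal[i + j] for i, e in enumerate(errors))
        for j in range(len(kernel))
    ]
    kernel = [w - learning_rate * g for w, g in zip(kernel, grads)]

final_loss = mse(conv1d_valid(signal, kernel), target)
print(initial_loss, final_loss)
print([round(w, 2) for w in kernel])
```

The loss drops as training proceeds; how closely the learned kernel ends up matching the hidden one depends on the data and the number of steps.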

Complete CNN Layer Pseudocode

import torch.nn as nn

class CNNBlock(nn.Module):
    def __init__(self, in_channels, out_channels):
        super().__init__()
        # Conv layer: out_channels kernels of size 3×3 (padding=1 keeps height and width)
        self.conv = nn.Conv2d(in_channels, out_channels, kernel_size=3, padding=1)
        # Activation
        self.relu = nn.ReLU()
        # Pooling to reduce size
        self.pool = nn.MaxPool2d(2)

    def forward(self, x):
        x = self.conv(x)   # convolution
        x = self.relu(x)   # nonlinearity
        x = self.pool(x)   # pooling
        return x

This is CNN’s “lego brick”.

CNN Solved FC’s Three Problems

✅ Parameter Explosion → Parameter Sharing

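A 3×3 kernel has only 9 parameters per input channel. It slides over the whole image, so all positions share these parameters.

CNN layer params = n_kernels × (in_channels × 9) + n_kernels biases. For 64 kernels on an RGB input that is 64 × 27 + 64 = 1,792, roughly 84,000 times fewer than the 150M of the FC layer above.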
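The saving from parameter sharing can be checked with a short calculation. The helper name `conv_param_count` is made up for illustration; it counts one k×k weight grid per (input channel, output kernel) pair plus one bias per kernel, and compares a 64-kernel layer on an RGB image with the 1000-neuron FC layer from earlier.

```python
def conv_param_count(in_channels: int, out_channels: int, kernel_size: int = 3) -> int:
    """Weights: one k×k grid per (input channel, kernel) pair; plus one bias per kernel."""
    weights = out_channels * in_channels * kernel_size * kernel_size
    biases = out_channels
    return weights + biases


conv_params = conv_param_count(in_channels=3, out_channels=64)
fc_params = 224 * 224 * 3 * 1000 + 1000   # FC layer on the same image, 1000 neurons

print(conv_params)               # 1792
print(fc_params // conv_params)  # 84000
```

The convolution count does not depend on the image size at all, which is exactly what parameter sharing means.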

✅ Locality → Convolution Is Naturally Local

Each output position only looks at nearby 3×3. Local info captured precisely.

✅ Translation Invariance → Same Kernel Everywhere

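Learned features don’t depend on position: an “ear kernel” responds to ears wherever they appear. Strictly speaking, convolution is translation-equivariant (shift the input and the feature map shifts with it); pooling on top is what makes the final answer approximately invariant.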
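A tiny 1D sketch makes the "same kernel everywhere" point concrete. The signals and kernel are made-up values: the same bump appears at two different positions, the response moves with it, and taking the global maximum (a stand-in for pooling) gives the same answer in both cases.

```python
def conv1d_valid(signal, kernel):
    """1D sliding-window sum of products (no padding, stride 1)."""
    k = len(kernel)
    return [
        sum(signal[i + j] * kernel[j] for j in range(k))
        for i in range(len(signal) - k + 1)
    ]


kernel = [1, 2, 1]                             # responds most strongly to the bump 1, 2, 1
bump_left = [0, 0, 1, 2, 1, 0, 0, 0, 0, 0]
bump_right = [0, 0, 0, 0, 0, 1, 2, 1, 0, 0]    # same bump, shifted 3 steps to the right

out_left = conv1d_valid(bump_left, kernel)
out_right = conv1d_valid(bump_right, kernel)

print(out_left)    # [1, 4, 6, 4, 1, 0, 0, 0]
print(out_right)   # [0, 0, 0, 1, 4, 6, 4, 1]
print(max(out_left) == max(out_right))   # True: the global max ignores position
```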

💡 CNN's core insight

CNN = convolution + parameter sharing + translation invariance. Together these make it a strong fit for data that has local structure and position-independent patterns, typically images.

CNN’s Standard Lego Bricks

A real CNN usually looks like:

Input image (3×224×224)
    ↓ [Conv 64 of 3×3] → 64×224×224
    ↓ [ReLU + Pool 2×2] → 64×112×112
    ↓ [Conv 128 of 3×3] → 128×112×112
    ↓ [ReLU + Pool 2×2] → 128×56×56
    ↓ [Conv 256 of 3×3] → 256×56×56
    ↓ [ReLU + Pool 2×2] → 256×28×28
    ↓ ... (more layers)
    ↓ [Flatten] → a long vector
    ↓ [Fully Connected] → 1000-class probability

Each layer “sees” a wider area than the one before: individual kernels are small, but stacking them gives deep neurons large “receptive fields”.
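How fast the receptive field grows can be estimated with the usual recurrence: each layer adds (kernel size − 1) × the current spacing between neighboring outputs, and each stride multiplies that spacing. The sketch below applies it to the first three Conv + Pool stages of the diagram above; the function name and the `(kernel_size, stride)` layer encoding are assumptions for illustration.

```python
def receptive_field(layers):
    """layers: list of (kernel_size, stride) pairs, ordered from input to output."""
    field = 1   # one output neuron initially sees one input pixel
    jump = 1    # distance in input pixels between neighboring outputs
    for kernel_size, stride in layers:
        field += (kernel_size - 1) * jump
        jump *= stride
    return field


conv3 = (3, 1)   # 3×3 convolution, stride 1
pool2 = (2, 2)   # 2×2 pooling, stride 2

print(receptive_field([conv3]))                                      # 3
print(receptive_field([conv3, pool2, conv3, pool2, conv3, pool2]))   # 22
```

After only three stages, one neuron already covers a 22×22 patch of the input, even though every kernel is just 3×3.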

A Few Important Concepts

  • Padding: pad zeros at the image edges to keep the size unchanged after convolution.
  • Stride: how far the kernel slides per step (default 1; 2 means every other pixel).
  • Pooling: collapse each 2×2 region into 1 value (max or average), halving the image's height and width.
  • Channel: color images have 3 RGB channels; each is convolved independently, then the results are summed.
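Padding, stride, and kernel size together determine the output size along each dimension: floor((size + 2 × padding − kernel) / stride) + 1. The helper below is an illustrative sketch of that formula (the function name is made up; dilation is assumed to be 1).

```python
def conv_output_size(size: int, kernel_size: int, stride: int = 1, padding: int = 0) -> int:
    """Output length along one dimension for a convolution or pooling window."""
    return (size + 2 * padding - kernel_size) // stride + 1


print(conv_output_size(224, 3, stride=1, padding=1))   # 224: padding=1 keeps the size
print(conv_output_size(224, 3))                        # 222: no padding shrinks by 2
print(conv_output_size(224, 2, stride=2))              # 112: 2×2 pooling halves the size
print(conv_output_size(224, 3, stride=2, padding=1))   # 112: stride-2 conv also halves it
```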

Classic CNN Architecture Evolution

LeNet (1998) - CNN’s ancestor

LeCun’s 5-layer CNN for handwritten digit recognition.

AlexNet (2012) - Ignited deep learning

8 layers. Proved “deep learning works.”

VGG (2014) - Small and deep

All kernels 3×3, depth 16-19. Simple and elegant.

GoogLeNet / Inception (2014) - “Multiple paths”

Parallel kernels of different sizes; outputs concatenated.

ResNet (2015) - Game-changing “residual connections”

Solved the “too deep to train” problem and made networks with 100 or even 1000 layers trainable.

ResNet’s key is skip connections:

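from torch.nn.functional import relu

def residual_block(x, conv1, conv2):
    # conv1 and conv2 are two separate conv layers that keep x's shape
    y = conv1(x)
    y = relu(y)
    y = conv2(y)
    return x + y   # ← this "+" is the revolution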

This simple “+” lets gradients flow directly back from deep to shallow layers, making deep networks trainable.
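Why the “+” helps can be seen in a deliberately simplified scalar model. If each block computes y = f(x), the gradient through it is f'(x); with a skip connection, y = x + f(x), the gradient is 1 + f'(x). The sketch assumes every block has the same small slope of 0.01, which is an illustrative number and not a measurement from a real network.

```python
depth = 50
local_slope = 0.01   # assumed derivative f'(x) of each block's conv path

plain_gradient = 1.0
residual_gradient = 1.0
for _ in range(depth):
    plain_gradient *= local_slope             # y = f(x)      →  dy/dx = f'(x)
    residual_gradient *= 1.0 + local_slope    # y = x + f(x)  →  dy/dx = 1 + f'(x)

print(plain_gradient)      # vanishingly small: the signal never reaches the early layers
print(residual_gradient)   # stays of order 1: the identity path carries the gradient
```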

Historical trivia: the ResNet paper’s famous plot, in which a plain 20-layer network beats a plain 56-layer one, revealed the “deeper = harder to train” paradox. Residual connections solved it.

EfficientNet (2019) - Balance depth, width, resolution

Systematic scaling method, best cost-efficiency.

Vision Transformer (2020) - Disrupted CNN

“If attention is so good, why not for images?” Split image into 16×16 patches, feed to Transformer—turns out it works.

But in many industry deployments, CNN remains the workhorse: it tends to work better on small datasets, run inference faster, and use less power.
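The patch-splitting step that ViT starts from can be sketched in plain Python. This is only an illustration of the 16×16 patching idea: it uses a single-channel stand-in image filled with dummy values, assumes the image size is divisible by the patch size, and leaves out the linear projection and position embeddings that follow in a real ViT.

```python
def split_into_patches(image, patch_size):
    """image: 2D list (H×W). Returns flattened patch_size×patch_size patches in row-major order."""
    height, width = len(image), len(image[0])
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            patch = [
                image[top + i][left + j]
                for i in range(patch_size)
                for j in range(patch_size)
            ]
            patches.append(patch)
    return patches


# Dummy 224×224 single-channel image; each value encodes its own position.
image = [[row * 224 + col for col in range(224)] for row in range(224)]
patches = split_into_patches(image, 16)

print(len(patches))      # 196 patches (14 × 14), i.e. 196 "tokens" for the Transformer
print(len(patches[0]))   # 256 values per patch (768 for an RGB image)
```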

CNN Does More Than Classification

Besides classification, CNN is the foundation for:

  • Object detection (YOLO series, Faster R-CNN)—box objects
  • Semantic segmentation (U-Net)—per-pixel class labels
  • Image generation (GAN)—create new images
  • Super-resolution—make blurry sharp
  • Style transfer—make your photo look like Van Gogh
  • Medical imaging—CT/MRI analysis
  • Self-driving perception—lane, vehicle, pedestrian detection

All depend on convolution.

One-line Summary

CNN = let kernels learn.

Before the Transformer revolution, the entire vision field was stacking CNNs. Today it’s still the default starting point for vision—until you prove Transformer is better for your specific task.

Want to “See” It

👀 CNN Convolution Scanner — play 6 preset kernels, see how each extracts different features.

🔬 What you unlocked
  • Read ResNet, VGG, EfficientNet papers
  • Build a simple image classifier in PyTorch
  • Understand why “image + Transformer” = ViT, “image + CNN” = ResNet—both are inductive bias choices

Recommended next: L3-07 · Word Embedding from Word2Vec to BERT — CNN handles pixels; word embeddings handle words.