L5 Chapter 3 🐣 🕒 13 min

ViT and CLIP: Teaching Transformers to See

Slice an image into patches and feed them to a Transformer: arguably the biggest paradigm shift in computer vision in 2020.

Alai

7/19/2026

In L3-06 we saw that CNNs ruled computer vision for 8 years (2012-2020). In L3-08 we saw Transformers sweep through NLP.

A natural question in 2020: can we use Transformers on images too?

The answer: yes, and the results are stunning.

Stop 1: The Core Idea of ViT

Transformers process sequences—but images are 2D grids. How do we bridge this?

ViT (Vision Transformer, 2020) used a remarkably simple trick:

Split the image into small patches, treat each patch as a token.

A 224×224 image
   ↓ split into patches of 16×16 pixels (a 14×14 grid, 196 patches total)
   ↓ flatten each patch into a 768-dim vector (16×16 pixels × 3 color channels)
   ↓ project to 768-dim via a linear layer
   ↓ prepend a [CLS] token and add learned position embeddings
   ↓ feed to a standard Transformer
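The pipeline above can be sketched with plain NumPy reshapes. This is a minimal illustration, not the reference implementation: the random image, the random `W_proj` matrix, and the zero `pos` array are stand-ins for real pixels and for parameters that a trained ViT would have learned. Note that each flattened patch has 16×16×3 = 768 values once the three color channels are counted.

```python
import numpy as np

def patchify(image, patch_size=16):
    """Split an (H, W, C) image into flattened, non-overlapping patches."""
    h, w, c = image.shape
    assert h % patch_size == 0 and w % patch_size == 0
    gh, gw = h // patch_size, w // patch_size
    # (gh, p, gw, p, c) -> (gh, gw, p, p, c) -> (gh * gw, p * p * c)
    patches = image.reshape(gh, patch_size, gw, patch_size, c)
    patches = patches.transpose(0, 2, 1, 3, 4)
    return patches.reshape(gh * gw, patch_size * patch_size * c)

rng = np.random.default_rng(0)
image = rng.random((224, 224, 3))                  # stand-in for a real RGB image
patches = patchify(image)                          # (196, 768): 196 patches of 16*16*3 values

W_proj = rng.standard_normal((768, 768)) * 0.02    # stand-in for the learned linear projection
pos = np.zeros((196, 768))                         # stand-in for learned position embeddings
tokens = patches @ W_proj + pos                    # (196, 768): one token per patch

print(patches.shape, tokens.shape)
```

The [CLS] token is left out here to keep the sketch short; it would add one extra row at the front of `tokens`.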

This copies NLP’s Transformer encoder almost verbatim, just replacing “words” with “patches”.

Why does this work?

CNN’s inductive biases:

Locality (neighbors matter)
Translation equivariance (the same filter is applied at every position)
Hierarchy (low-level → high-level)
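To make the first two biases concrete, here is a small sketch using a 1D signal and a wrap-around convolution (both are simplifications chosen for illustration; real CNNs use 2D convolutions with padding). The kernel only looks at 3 neighbours (locality), and because the same weights slide over every position, shifting the input simply shifts the output.

```python
import numpy as np

def circular_conv1d(x, kernel):
    """Slide one shared kernel over x with wrap-around at the edges."""
    n = len(x)
    return np.array([
        sum(kernel[j] * x[(i + j) % n] for j in range(len(kernel)))
        for i in range(n)
    ])

x = np.array([0.0, 0.0, 1.0, 3.0, 1.0, 0.0, 0.0, 0.0])
kernel = np.array([-1.0, 0.0, 1.0])   # edge-like filter that sees only 3 neighbours
shift = 2

conv_then_shift = np.roll(circular_conv1d(x, kernel), shift)
shift_then_conv = circular_conv1d(np.roll(x, shift), kernel)

print(np.allclose(conv_then_shift, shift_then_conv))  # True
```

A Transformer has no such guarantee built in: every patch can attend to every other patch, and position only enters through the position embeddings.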

ViT threw almost all of these away. Apart from the patch split itself, it has to learn spatial structure purely from data.

Question: without inductive biases, the model should be harder to train—why does it win?

Answer: with enough data, weaker inductive biases turn out to be an advantage. The model can learn patterns that CNN designers never hand-coded, instead of being limited by their assumptions.

Classic Finding

Pre-training data	CNN performance	ViT performance
ImageNet-1k (about 1.3M images)	Strong	Weaker than CNNs
JFT-300M (300M images)	Strong	Stronger (surpasses CNNs)

ViT surpasses CNNs at large data scale—the core rationale for all subsequent vision Transformers.

Stop 2: ViT in PyTorch

import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    def __init__(self, img_size=224, patch_size=16, in_channels=3, dim=768):
        super().__init__()
        self.n_patches = (img_size // patch_size) ** 2   # 14 * 14 = 196
        # A Conv2d with kernel_size == stride == patch_size splits the image into
        # non-overlapping patches and linearly projects each one in a single step
        self.proj = nn.Conv2d(in_channels, dim, kernel_size=patch_size, stride=patch_size)

    def forward(self, x):
        # x: (B, 3, 224, 224)
        x = self.proj(x)          # (B, 768, 14, 14)
        x = x.flatten(2)          # (B, 768, 196)
        x = x.transpose(1, 2)     # (B, 196, 768) ← Transformer input
        return x

class ViT(nn.Module):
    def __init__(self, img_size=224, patch_size=16, n_classes=1000, dim=768, depth=12, n_heads=12):
        super().__init__()
        self.patch_embed = PatchEmbedding(img_size, patch_size, dim=dim)
        n_tokens = self.patch_embed.n_patches + 1                     # 196 patches + 1 CLS
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))         # borrowed from BERT's [CLS]
        self.pos_embed = nn.Parameter(torch.zeros(1, n_tokens, dim))  # learned position embeddings
        nn.init.trunc_normal_(self.cls_token, std=0.02)
        nn.init.trunc_normal_(self.pos_embed, std=0.02)

        # ViT uses pre-norm blocks with GELU; the PyTorch defaults are post-norm with ReLU
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=n_heads, dim_feedforward=4*dim,
            activation="gelu", norm_first=True, batch_first=True
        )
        self.transformer = nn.TransformerEncoder(encoder_layer, num_layers=depth)
        self.norm = nn.LayerNorm(dim)                                 # final norm before the head
        self.head = nn.Linear(dim, n_classes)

    def forward(self, x):
        x = self.patch_embed(x)                            # (B, 196, 768)
        cls = self.cls_token.expand(x.size(0), -1, -1)     # one CLS token per image in the batch
        x = torch.cat([cls, x], dim=1)                     # (B, 197, 768)
        x = x + self.pos_embed
        x = self.transformer(x)
        return self.head(self.norm(x[:, 0]))               # use CLS token for classification

The whole ViT is this simple: a few dozen lines of code, because nn.TransformerEncoder does the heavy lifting.
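Short code does not mean a small model. As a rough back-of-the-envelope check, the parameter count of a ViT-Base/16 configuration (dim 768, depth 12, 1000 classes) can be tallied by hand. The sketch below assumes the standard layout: fused Q/K/V projections, two LayerNorms per block, and one final LayerNorm before the classification head.

```python
def linear(n_in, n_out):
    """Parameters in a dense layer: weight matrix plus bias."""
    return n_in * n_out + n_out

dim, depth, n_classes = 768, 12, 1000
patch, channels, n_tokens = 16, 3, 197

patch_embed = linear(patch * patch * channels, dim)   # the Conv2d is a linear map on flattened patches
cls_and_pos = dim + n_tokens * dim                    # CLS token + position embeddings
attention = linear(dim, 3 * dim) + linear(dim, dim)   # fused Q/K/V + output projection
mlp = linear(dim, 4 * dim) + linear(4 * dim, dim)     # two-layer feed-forward network
layer_norms = 2 * (2 * dim)                           # two LayerNorms, each with scale + shift
per_layer = attention + mlp + layer_norms
final_norm = 2 * dim                                  # assumption: final LayerNorm before the head
head = linear(dim, n_classes)

total = patch_embed + cls_and_pos + depth * per_layer + final_norm + head
print(f"{per_layer:,} per layer, {total:,} total")
# 7,087,872 per layer, 86,567,656 total
```

Almost all of the roughly 86M parameters sit in the 12 Transformer blocks; the patch embedding and the head are comparatively tiny.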

Stop 3: The CLIP Revolution

In 2021, OpenAI published an even more “shocking” paper: CLIP (Contrastive Language-Image Pre-training).

Core Idea

Instead of predicting a class label, CLIP pulls “an image” and “its text description” close together in a shared vector space.

                  Shared vector space
                        ↓
Image ──→ ViT ─────→ Image vector ←───── Same concept
                                          ↑
Text  ──→ Transformer ─→ Text vector ←─── Same concept

Training Data

400 million “image + description” pairs—scraped from the internet.

No manual labeling needed—alt text and captions are the labels.

Training Objective: Contrastive Learning

A batch contains N (image, text) pairs:

Image 1: An orange cat on a windowsill
Image 2: Three puppies playing
Image 3: Sunset at the beach
...

Text 1: "An orange cat on a windowsill"
Text 2: "Three puppies playing"
Text 3: "Sunset at the beach"
...

Training objective:

(Image 1, Text 1) vectors close
(Image 1, Text 2/3/…) vectors far
Same for (Image 2, Text 2), etc.

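Mathematically: build the N×N matrix of cosine similarities between every image and every text in the batch, then apply a cross-entropy loss along both the rows and the columns, so that the diagonal (true pairs) scores high and the off-diagonal (mismatched pairs) scores low.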
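The objective can be sketched in a few lines of NumPy. This is an illustrative version under simplifying assumptions: the temperature is fixed at 0.07 (CLIP learns it during training, starting from that value), and the embeddings are toy vectors rather than encoder outputs. The function name `clip_loss` is made up for this example.

```python
import numpy as np

def log_softmax(x, axis):
    x = x - x.max(axis=axis, keepdims=True)   # subtract the max for numerical stability
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def clip_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric contrastive loss over a batch of N matching (image, text) pairs."""
    image_emb = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = image_emb @ text_emb.T / temperature          # (N, N) scaled cosine similarities
    diag = np.arange(len(logits))
    loss_i2t = -log_softmax(logits, axis=1)[diag, diag].mean()   # each image picks its text
    loss_t2i = -log_softmax(logits, axis=0)[diag, diag].mean()   # each text picks its image
    return (loss_i2t + loss_t2i) / 2

emb = np.eye(4)                                       # toy batch: 4 perfectly separated concepts
matched = clip_loss(emb, emb)                         # pair i matches pair i
mismatched = clip_loss(emb, np.roll(emb, 1, axis=0))  # every text belongs to the wrong image
print(matched < mismatched)  # True
```

The loss is low when the diagonal of the similarity matrix dominates each row and column, and high when the true pairs are scored below the mismatched ones.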

What Can It Do (Zero-Shot!)

After training, CLIP can handle many classification tasks without any further training:

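# Pseudocode: `clip`, `cos_sim`, and `my_image` are placeholders, not a real library API
classes = ["a photo of a cat", "a photo of a dog", "a photo of a car"]
text_features = [clip.encode_text(c) for c in classes]      # one vector per class prompt

image_features = clip.encode_image(my_image)                 # one vector for the image
similarities = [cos_sim(image_features, t) for t in text_features]
predicted = classes[similarities.index(max(similarities))]   # most similar prompt wins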

Use any text description as a class—zero-shot classification.

This is arguably the second paradigm shift in computer vision after ImageNet.

Stop 4: CLIP’s Downstream Applications

CLIP’s encoders are extremely useful:

1. Text-to-image / Image-to-text Search

"a golden retriever at sunset"  →  CLIP text encoder  →  vector
all images  →  CLIP image encoder  →  vector database
→ cosine similarity → best matching image

Pinterest-style visual search and automatic album organization can both be built on this.
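The search flow above boils down to a nearest-neighbour lookup. The sketch below uses random vectors as stand-ins for real CLIP embeddings, and a brute-force matrix product instead of a real vector database; `normalize` and `search` are hypothetical helper names for this example.

```python
import numpy as np

def normalize(v):
    """Scale vectors to unit length so dot products equal cosine similarities."""
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

def search(query_vec, index, k=3):
    """Return the k (row, score) pairs in `index` most similar to the query."""
    scores = index @ normalize(query_vec)
    top = np.argsort(-scores)[:k]
    return [(int(i), float(scores[i])) for i in top]

rng = np.random.default_rng(0)
image_vecs = normalize(rng.standard_normal((1000, 512)))   # stand-in for 1000 image embeddings
# Stand-in for a text embedding that lands close to image 42
query = image_vecs[42] + 0.02 * rng.standard_normal(512)

results = search(query, image_vecs)
print(results[0][0])  # 42
```

In a real system the image embeddings are computed once and stored, so each query only costs one text-encoder pass plus the similarity lookup.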

2. Stable Diffusion’s “Ears”

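How does Stable Diffusion “understand” your prompt? It runs the prompt through CLIP’s text encoder, which produces a sequence of token embeddings, and feeds those to the diffusion model as “conditioning”.

Without a pretrained text encoder like CLIP’s, Stable Diffusion would have no way to connect words to pictures.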

3. Multimodal LLMs

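The vision encoders in multimodal LLMs are typically ViT-style, and often CLIP-trained. The architectures of closed models such as GPT-4V and Claude 3.5 are not public, but they are widely assumed to follow the same pattern.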

4. Zero-shot Detection and Segmentation

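Extending CLIP’s ability to “find every cat in this image”: GroundingDINO, SAM and other advanced vision models build on the same idea of steering a vision model with text or other prompts, though not necessarily on CLIP’s weights.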

Stop 5: Limitations of ViT and CLIP

ViT’s Problems

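Data hungry: worse than CNNs when trained from scratch on small datasets
Compute expensive: self-attention cost is O(N²) in the number of patches N, which explodes for high-res images
Weak inductive bias: gives up the locality and translation equivariance that CNNs get for free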
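The quadratic cost is easy to see with a little arithmetic. This sketch counts only the query-key pairs in one attention head of one layer for 16×16 patches; it ignores the [CLS] token and everything else in the model, so treat it as an order-of-magnitude illustration.

```python
def attention_cost(img_size, patch_size=16):
    """Return (number of patches, number of query-key pairs) for a square image."""
    n_tokens = (img_size // patch_size) ** 2
    return n_tokens, n_tokens ** 2

for size in (224, 448, 1024):
    n, pairs = attention_cost(size)
    print(f"{size}x{size}: {n} patches, {pairs:,} attention pairs")

# Doubling the resolution gives 4x the patches and 16x the attention pairs
n_224, pairs_224 = attention_cost(224)
n_448, pairs_448 = attention_cost(448)
print(n_448 // n_224, pairs_448 // pairs_224)  # 4 16
```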

CLIP’s Problems

Training data bias: web data has all kinds of biases (gender, race, culture)
Weak on fine-grained tasks: separates cat from dog easily, but struggles with closely related categories such as similar breeds, car models, or flower species
Depends on caption quality: alt text is often low quality
English-centric: weak on non-English text

Stop 6: After ViT and CLIP

Swin Transformer (2021)

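Adds CNN’s “hierarchical” idea back to ViT through progressive downsampling, and restricts attention to shifted local windows so that cost grows linearly with image size. This addresses ViT’s slowness at high resolution.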
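A quick comparison shows why local windows help. The numbers below assume Swin's default first-stage settings (4×4 pixel patches and 7×7 token windows) on a 224×224 image, and count only query-key pairs in a single attention layer.

```python
def global_pairs(n_tokens):
    """Every token attends to every token."""
    return n_tokens ** 2

def windowed_pairs(n_tokens, window=7):
    """Every token attends only to the window * window tokens in its own window."""
    return n_tokens * window * window

n_tokens = (224 // 4) ** 2            # 56 * 56 = 3136 tokens at the first stage
print(global_pairs(n_tokens))         # 9834496
print(windowed_pairs(n_tokens))       # 153664
print(global_pairs(n_tokens) // windowed_pairs(n_tokens))  # 64
```

With a fixed window size, the windowed cost scales with the number of tokens rather than its square.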

MAE (Masked Autoencoders, 2021)

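Kaiming He’s work. Like BERT, but for images: mask most of the patches (75% in the paper) and train a ViT to reconstruct the missing pixels, fully self-supervised. No need for CLIP-style image-text pairs, so the data is cheaper.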
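The masking step itself is simple. Here is a minimal sketch of random patch masking for a 14×14 grid of patches at MAE's 75% ratio; the function name `random_masking` and the fixed seed are choices made for this example, and the encoder and decoder are omitted.

```python
import numpy as np

def random_masking(n_patches, mask_ratio=0.75, seed=0):
    """Randomly choose which patch indices stay visible and which are masked."""
    rng = np.random.default_rng(seed)
    n_keep = int(n_patches * (1 - mask_ratio))
    perm = rng.permutation(n_patches)       # random order of patch indices
    visible = np.sort(perm[:n_keep])        # the encoder only sees these patches
    masked = np.sort(perm[n_keep:])         # the decoder must reconstruct these
    return visible, masked

visible, masked = random_masking(196)
print(len(visible), len(masked))  # 49 147
```

Because the encoder only processes the visible quarter of the patches, pre-training is also cheaper per image than running a full ViT.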

SAM (Segment Anything, 2023)

Meta’s work. Universal image segmentation: use prompts (points, boxes, and experimentally text) to segment any object. Foundation architecture: ViT.

DINO / DINOv2

Self-supervised ViT, with no labels needed at all. DINOv2 (2023) produces some of the strongest general-purpose visual features available.

Stop 7: Today’s Vision AI Stack

Foundation: ViT / CLIP (encoder)
   ↓
Multimodal LLM: GPT-4V / Claude 3.5 / Gemini
   ↓
Specific Tasks:
- Detection: GroundingDINO
- Segmentation: SAM, Mask2Former
- Generation: Stable Diffusion, Imagen
- Editing: InstructPix2Pix
- Video: Sora

Most of this stack is built on ViT/CLIP or borrows directly from their ideas.

These two papers (2020-2021) laid the foundation of modern vision AI—as impactful as Transformer was in NLP.

Companion Code

Try OpenCLIP:

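import open_clip
import torch
from PIL import Image

model, _, preprocess = open_clip.create_model_and_transforms('ViT-L-14', pretrained='laion2b_s32b_b82k')
model.eval()  # inference mode
tokenizer = open_clip.get_tokenizer('ViT-L-14')

image = preprocess(Image.open("photo.jpg")).unsqueeze(0)   # add a batch dimension
text = tokenizer(["a photo of a cat", "a photo of a dog", "a sunset"])

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    # L2-normalize so the dot product is a cosine similarity
    image_features = image_features / image_features.norm(dim=-1, keepdim=True)
    text_features = text_features / text_features.norm(dim=-1, keepdim=True)
    # Scale by 100 (roughly CLIP's learned temperature) before the softmax
    probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

print(probs)  # e.g. [[0.05, 0.92, 0.03]] for a dog photo (illustrative values)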

A few lines of code for “image classification with any class”, something that was hard to imagine before CLIP.

💡 An observation

Transformers have devoured every modality:

Text ✓
Image ✓ (ViT)
Speech ✓ (Whisper / SeamlessM4T)
Video ✓ (Video Transformer, Sora)
Protein ✓ (AlphaFold 2)
Decisions ✓ (Decision Transformer)
Robotics ✓ (RT-2)

“Attention is all you need”—the title became a prophecy.

Next recommended: L5-04 CLIP Deep Dive or L4-04 Building Agents.