ViT and CLIP: Teaching Transformers to See
Slice an image into patches and feed them to a Transformer: arguably the biggest paradigm shift in computer vision in 2020.
In L3-06 we saw CNNs ruled computer vision for 8 years (2012-2020). In L3-08 we saw Transformers sweep through NLP.
A natural question in 2020: can we use Transformers on images too?
The answer: yes, and the results are stunning.
Stop 1: The Core Idea of ViT
Transformers process sequences—but images are 2D grids. How do we bridge this?
ViT (Vision Transformer, 2020) used a remarkably simple trick:
Split the image into small patches, treat each patch as a token.
A 224×224 RGB image
↓ split into 16×16-pixel patches (14×14 = 196 patches)
↓ flatten each patch into a 768-dim vector (16×16×3)
↓ project to the model width (768 for ViT-Base) via a linear layer
↓ prepend a [CLS] token and add learned position embeddings
↓ feed to a standard Transformer encoder
ViT copies NLP’s Transformer encoder almost verbatim, just replacing “words” with “patches”.
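The shape bookkeeping in the steps above can be checked directly. The following is a minimal sketch, assuming an RGB input (so each 16×16 patch holds 16×16×3 = 768 values) and a 224×224 image; `patchify` is a hypothetical helper name used for illustration, not a library function.

```python
import torch


def patchify(images: torch.Tensor, patch_size: int = 16) -> torch.Tensor:
    """Cut (B, C, H, W) images into flattened, non-overlapping patches.

    Returns a tensor of shape (B, num_patches, C * patch_size * patch_size).
    """
    b, c, h, w = images.shape
    assert h % patch_size == 0 and w % patch_size == 0, "image size must divide evenly"
    gh, gw = h // patch_size, w // patch_size  # patch grid: 14 x 14 for 224 / 16
    # Split each spatial axis into (grid index, offset inside the patch).
    x = images.reshape(b, c, gh, patch_size, gw, patch_size)
    # Bring the two grid axes to the front, followed by the per-patch values.
    x = x.permute(0, 2, 4, 1, 3, 5)
    return x.reshape(b, gh * gw, c * patch_size * patch_size)


images = torch.randn(2, 3, 224, 224)
patches = patchify(images)
print(patches.shape)  # torch.Size([2, 196, 768])
```

Each row of `patches` is one "token" before projection. The linear layer, the [CLS] token, and the position embeddings are applied on top of this sequence.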
Why does this work?
CNN’s inductive biases:
- Locality (neighbors matter)
- Translation equivariance (the same filter detects a pattern wherever it appears)
- Hierarchy (low-level → high-level)
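ViT drops almost all of these. Only the patch grid and the position embeddings carry spatial structure; everything else is learned from data.
Question: with weaker inductive biases the model should be harder to train, so why does it win?
Answer: with enough data, weaker inductive biases become an advantage. The model can learn patterns that the priors hand-coded into CNNs do not express, but it needs far more examples to get there.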
Classic Finding
| Pre-training data | CNN performance | ViT performance |
|---|---|---|
| ImageNet (1.3M images) | Strong | Weaker than comparable CNNs |
| JFT-300M (300M images) | Strong | Matches or surpasses CNNs |
ViT surpasses CNNs at large data scale—the core rationale for all subsequent vision Transformers.
Stop 2: ViT in PyTorch
import torch
import torch.nn as nn


class PatchEmbedding(nn.Module):
    def __init__(self, img_size=224, patch_size=16, in_channels=3, dim=768):
        super().__init__()
        self.n_patches = (img_size // patch_size) ** 2  # 196
        # Use Conv2d for patch splitting + projection (two birds, one stone)
        self.proj = nn.Conv2d(in_channels, dim, kernel_size=patch_size, stride=patch_size)

    def forward(self, x):
        # x: (B, 3, 224, 224)
        x = self.proj(x)        # (B, 768, 14, 14)
        x = x.flatten(2)        # (B, 768, 196)
        x = x.transpose(1, 2)   # (B, 196, 768) ← Transformer input
        return x


class ViT(nn.Module):
    def __init__(self, n_classes=1000, dim=768, depth=12, n_heads=12):
        super().__init__()
        self.patch_embed = PatchEmbedding(dim=dim)  # keep the patch width in sync with the model width
        n_tokens = self.patch_embed.n_patches + 1   # 196 patches + 1 CLS
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))  # borrowed from BERT's [CLS]
        self.pos_embed = nn.Parameter(torch.zeros(1, n_tokens, dim))  # learned position embeddings
        # ViT uses pre-norm blocks with GELU; PyTorch's defaults are post-norm with ReLU
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=n_heads, dim_feedforward=4 * dim,
            activation="gelu", norm_first=True, batch_first=True,
        )
        self.transformer = nn.TransformerEncoder(encoder_layer, num_layers=depth)
        self.norm = nn.LayerNorm(dim)  # final LayerNorm before the classification head
        self.head = nn.Linear(dim, n_classes)

    def forward(self, x):
        x = self.patch_embed(x)                         # (B, 196, 768)
        cls = self.cls_token.expand(x.size(0), -1, -1)  # (B, 1, 768)
        x = torch.cat([cls, x], dim=1)                  # (B, 197, 768)
        x = x + self.pos_embed
        x = self.transformer(x)
        return self.head(self.norm(x[:, 0]))            # use CLS token for classification
The whole ViT is this simple: a few dozen lines of code.
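The code comment about Conv2d handling patch splitting and projection in one step can be checked numerically. The sketch below, which assumes the same 16×16 patches and 768-dim width as above, copies the convolution's weights into a plain linear layer and compares the two paths. The variable names are illustrative only.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
patch, dim = 16, 768
conv = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)

# Build a linear layer that shares the conv's weights:
# each conv filter of shape (3, 16, 16) becomes one row of a (768, 768) matrix.
linear = nn.Linear(3 * patch * patch, dim)
with torch.no_grad():
    linear.weight.copy_(conv.weight.reshape(dim, -1))
    linear.bias.copy_(conv.bias)

x = torch.randn(1, 3, 224, 224)

# Path A: strided convolution, then flatten the 14 x 14 grid into a sequence.
out_conv = conv(x).flatten(2).transpose(1, 2)  # (1, 196, 768)

# Path B: cut patches explicitly with unfold, then apply the linear layer.
patches = nn.functional.unfold(x, kernel_size=patch, stride=patch)  # (1, 768, 196)
out_linear = linear(patches.transpose(1, 2))  # (1, 196, 768)

# The two paths should agree up to floating-point rounding.
same = torch.allclose(out_conv, out_linear, atol=1e-4)
print(same)
```

The convolution is usually preferred in practice because it avoids materializing the intermediate patch tensor.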
Stop 3: The CLIP Revolution
In 2021, OpenAI published an even more “shocking” paper: CLIP (Contrastive Language-Image Pre-training).
Core Idea
CLIP does not train a classifier. Instead, it learns to bring an image and its text description close together in a shared vector space.
Image ──→ ViT ──────────→ Image vector ──┐
                                          ├──→ Shared vector space (same concept → nearby vectors)
Text ───→ Transformer ──→ Text vector ───┘
Training Data
400 million “image + description” pairs—scraped from the internet.
No manual labeling needed—alt text and captions are the labels.
Training Objective: Contrastive Learning
A batch contains N (image, text) pairs:
Image 1: An orange cat on a windowsill
Image 2: Three puppies playing
Image 3: Sunset at the beach
...
Text 1: "An orange cat on a windowsill"
Text 2: "Three puppies playing"
Text 3: "Sunset at the beach"
...
Training objective:
- (Image 1, Text 1) vectors close
- (Image 1, Text 2/3/…) vectors far
- Same for (Image 2, Text 2), etc.
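Mathematically: build the N×N matrix of cosine similarities between every image and every text in the batch, then maximize the diagonal entries (true pairs) and minimize the off-diagonal entries (mismatched pairs) with a symmetric cross-entropy loss.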
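The objective can be sketched in a few lines of PyTorch. This is a simplified illustration, not CLIP's training code: it assumes the embeddings are already computed, and it uses a fixed temperature of 0.07 where CLIP learns the temperature during training. The function name `clip_contrastive_loss` is hypothetical.

```python
import torch
import torch.nn.functional as F


def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric cross-entropy over an N x N image-text similarity matrix."""
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.T / temperature  # (N, N) scaled cosine similarities
    # The matching text for image i sits at column i, so the targets are 0..N-1.
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_i2t = F.cross_entropy(logits, targets)    # each image picks its text
    loss_t2i = F.cross_entropy(logits.T, targets)  # each text picks its image
    return (loss_i2t + loss_t2i) / 2


torch.manual_seed(0)
text_emb = torch.randn(8, 512)
# Aligned batch: image i is a slightly noisy copy of text i.
matched = clip_contrastive_loss(text_emb + 0.01 * torch.randn(8, 512), text_emb)
# Misaligned batch: image i actually matches text i - 1.
shuffled = clip_contrastive_loss(text_emb.roll(1, dims=0), text_emb)
print(matched.item() < shuffled.item())
```

The loss is low when the diagonal dominates each row and each column, and high when the true pairs are pushed off the diagonal.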
What Can It Do (Zero-Shot!)
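After training, CLIP can be applied to a new classification task without further training:
import torch.nn.functional as F
# clip, tokenize, and my_image are placeholders: a loaded CLIP model, its tokenizer, a preprocessed image tensor
classes = ["a photo of a cat", "a photo of a dog", "a photo of a car"]
text_features = F.normalize(clip.encode_text(tokenize(classes)), dim=-1)  # (3, D), unit length
image_features = F.normalize(clip.encode_image(my_image), dim=-1)         # (1, D), unit length
similarities = image_features @ text_features.T                           # cosine similarities, (1, 3)
predicted = classes[similarities.argmax(dim=-1).item()]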
Use any text description as a class—zero-shot classification.
This is arguably the second paradigm shift in computer vision after ImageNet.
Stop 4: CLIP’s Downstream Applications
CLIP’s encoders are extremely useful:
1. Text-to-image / Image-to-text Search
"a golden retriever at sunset" → CLIP text encoder → vector
all images → CLIP image encoder → vector database
→ cosine similarity → best matching image
Pinterest-style visual search and automatic album organization can be built on exactly this pattern.
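The retrieval step itself is only a normalized matrix product followed by a top-k. The sketch below uses random vectors as stand-ins for real CLIP embeddings, so the numbers are purely illustrative; `search` is a hypothetical helper, and a production system would typically replace the brute-force product with an approximate nearest-neighbor index.

```python
import torch
import torch.nn.functional as F


def search(query_vec, image_index, k=3):
    """Return (scores, indices) of the k images most similar to the query."""
    query = F.normalize(query_vec, dim=-1)
    index = F.normalize(image_index, dim=-1)
    scores = index @ query  # cosine similarity per image, shape (num_images,)
    return torch.topk(scores, k)


torch.manual_seed(0)
image_index = torch.randn(1000, 512)  # stand-in for 1000 CLIP image embeddings
# Stand-in for a text embedding that describes image 42.
query_vec = image_index[42] + 0.1 * torch.randn(512)

scores, indices = search(query_vec, image_index)
print(indices.tolist(), scores.tolist())
```

Because both sides are unit-normalized, the dot product equals the cosine similarity, and the image embeddings can be computed once and reused for every query.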
2. Stable Diffusion’s “Ears”
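How does Stable Diffusion “understand” your prompt? It uses CLIP’s text encoder to turn the prompt into a sequence of token embeddings, which the diffusion model attends to as “conditioning” at every denoising step.
Without a text encoder like CLIP’s, the image generator would have no way to follow the prompt.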
3. Multimodal LLMs
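Multimodal LLMs such as GPT-4V and Claude 3.5 need a vision encoder to turn images into tokens. Their architectures are not public, but open multimodal LLMs such as LLaVA use a CLIP-trained ViT for this, and ViT-style encoders are the standard choice.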
4. Zero-shot Detection and Segmentation
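Extending CLIP’s text-driven recognition to “find every cat in this image”: open-vocabulary detectors such as GroundingDINO and promptable segmenters such as SAM build on the same idea of steering a vision model with a prompt instead of a fixed label set.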
Stop 5: Limitations of ViT and CLIP
ViT’s Problems
- Data hungry: worse than CNN on small datasets
- Compute expensive: self-attention cost is O(N²) in the number of patches N, so it grows rapidly for high-res images
- Weak inductive bias: gives up CNN’s built-in locality and translation equivariance
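A quick calculation makes the quadratic cost concrete. This sketch assumes 16×16 patches plus one [CLS] token and counts the entries of a single attention matrix (one head, one layer). `attention_cost` is a hypothetical helper for illustration.

```python
def attention_cost(image_size, patch_size=16):
    """Return (tokens, attention-matrix entries per head) for a square image."""
    tokens = (image_size // patch_size) ** 2 + 1  # patches plus the CLS token
    return tokens, tokens * tokens


for size in (224, 448, 896):
    tokens, entries = attention_cost(size)
    print(f"{size}x{size}: {tokens} tokens, {entries:,} attention entries per head")
# 224x224: 197 tokens, 38,809 attention entries per head
# 448x448: 785 tokens, 616,225 attention entries per head
# 896x896: 3137 tokens, 9,840,769 attention entries per head
```

Doubling the side length roughly quadruples the token count and multiplies the attention matrix by about 16.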
CLIP’s Problems
- Training data bias: web data has all kinds of biases (gender, race, culture)
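- Poor fine-grained recognition: separates cat from dog easily, but struggles with closely related classes such as car models, aircraft variants, or flower species
- Depends on caption quality: alt text is often low quality
- English-centric: the original model was trained mostly on English text, so it is weak on other languages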
Stop 6: After ViT and CLIP
Swin Transformer (2021)
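Adds CNN’s “hierarchical” idea back to ViT: progressive downsampling plus attention restricted to local, shifted windows. Attention cost then grows linearly with image size instead of quadratically, which addresses ViT’s slowness at high resolution.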
MAE (Masked Autoencoders, 2021)
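Kaiming He’s work. Like BERT, it masks most of the input (75% of the patches by default) and trains the ViT to reconstruct the missing pixels, a self-supervised objective. It needs no CLIP-style image-text pairs, so the training data is cheaper to collect.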
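The masking step can be sketched as follows. This is a simplified illustration rather than the reference implementation: it assumes patch tokens of shape (B, N, D) and the 75% mask ratio used as the default in the MAE paper, and it shows only how visible patches are selected, not the decoder or the reconstruction loss. `random_masking` is a hypothetical helper name.

```python
import torch


def random_masking(tokens, mask_ratio=0.75):
    """Keep a random subset of patch tokens for each sample.

    tokens: (B, N, D). Returns (visible, mask) where visible is (B, n_keep, D)
    and mask is a (B, N) boolean tensor that is True for hidden patches.
    """
    b, n, d = tokens.shape
    n_keep = int(n * (1 - mask_ratio))
    # Sorting random noise yields an independent random permutation per sample.
    noise = torch.rand(b, n, device=tokens.device)
    keep_idx = noise.argsort(dim=1)[:, :n_keep]  # (B, n_keep)
    visible = torch.gather(tokens, 1, keep_idx.unsqueeze(-1).expand(-1, -1, d))
    mask = torch.ones(b, n, dtype=torch.bool, device=tokens.device)
    mask[torch.arange(b, device=tokens.device).unsqueeze(1), keep_idx] = False
    return visible, mask


tokens = torch.randn(4, 196, 768)
visible, mask = random_masking(tokens)
print(visible.shape, mask.sum(dim=1).tolist())
```

Only the visible tokens (49 of 196 here) pass through the encoder, which is a large part of why this style of pre-training is cheap.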
SAM (Segment Anything, 2023)
Meta’s work. Universal image segmentation—use prompts (points, boxes, text) to segment any object. Foundation architecture: ViT.
DINO / DINOv2
Self-supervised ViT: no labels needed at all. DINOv2 (2023) is among the strongest general-purpose visual feature extractors.
Stop 7: Today’s Vision AI Stack
Foundation: ViT / CLIP (encoder)
↓
Multimodal LLM: GPT-4V / Claude 3.5 / Gemini
↓
Specific Tasks:
- Detection: GroundingDINO
- Segmentation: SAM, Mask2Former
- Generation: Stable Diffusion, Imagen
- Editing: InstructPix2Pix
- Video: Sora
Nearly every layer of this stack builds on ViT, CLIP, or both.
These two papers (2020-2021) laid the foundation of modern vision AI—as impactful as Transformer was in NLP.
Companion Code
Try OpenCLIP:
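import open_clip
import torch
from PIL import Image

model, _, preprocess = open_clip.create_model_and_transforms('ViT-L-14', pretrained='laion2b_s32b_b82k')
model.eval()  # inference mode: disables training-only behavior
tokenizer = open_clip.get_tokenizer('ViT-L-14')

image = preprocess(Image.open("photo.jpg")).unsqueeze(0)  # (1, 3, 224, 224)
text = tokenizer(["a photo of a cat", "a photo of a dog", "a sunset"])

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    # Normalize so the dot product is a cosine similarity
    image_features = image_features / image_features.norm(dim=-1, keepdim=True)
    text_features = text_features / text_features.norm(dim=-1, keepdim=True)
    # Scale by 100 (roughly CLIP's learned logit scale) before the softmax
    similarity = (100.0 * image_features @ text_features.T).softmax(dim=-1)

print(similarity)  # e.g. tensor([[0.05, 0.92, 0.03]]) for a dog photo (illustrative values)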
A few lines of code for “image classification with any class”—unimaginable 3 years ago.
Transformers have devoured every modality:
- Text ✓
- Image ✓ (ViT)
- Speech ✓ (Whisper / SeamlessM4T)
- Video ✓ (Video Transformer, Sora)
- Protein ✓ (AlphaFold 2)
- Decisions ✓ (Decision Transformer)
- Robotics ✓ (RT-2)
“Attention is all you need”—the title became a prophecy.
Next recommended: L5-04 CLIP Deep Dive or L4-04 Building Agents.