L5 Chapter 1 🐣 🕒 13 min

Multimodal Overview: How AI "Sees, Hears, and Reads" at Once

GPT-4o recognizes your sketch, Sora generates video—how does "multimodal" AI work? This article opens the panoramic view.

HelloAI Editors

7/7/2026

L0-L4 have been about pure language AI. But the real world isn’t just text: there are also images, sound, and video.

Multimodal AI means letting one model handle multiple types of information. GPT-4o, Claude 3.5, Gemini, and Sora are all representative examples.

The L5 path covers these. This opener builds the complete map of multimodal AI.

What Is a “Modality”

Modality = a form of information. Common ones:

Modality	Data form
Text	A string of characters
Image	Pixel matrix (H × W × 3)
Video	Frame sequence (T × H × W × 3)
Audio	Waveform (time series)
3D	Point clouds, meshes
Sensor	LiDAR, IMU

Multimodal AI = AI that handles multiple modalities at once. Examples:

Image + Text: image captioning, visual Q&A
Text → Image: DALL·E, Midjourney
Text → Video: Sora
Audio → Text: Whisper
Text → Audio: TTS
Image + Audio + Text: AI video editing

Core Idea: All Modalities Become Vectors

Humans don’t consciously think “am I using my language module or my vision module?”; the brain integrates everything. AI borrows this trick: map every modality to vectors in one shared space, then process them with the same Transformer.

Word "cat"    →  encoder  →  [0.3, 0.8, -0.2, ...]  ← 768 dim
Image 🐱      →  encoder  →  [0.31, 0.79, -0.18, ...] ← 768 dim (similar!)
Cat meow      →  encoder  →  [0.29, 0.82, -0.21, ...] ← 768 dim (similar!)

This is the essence of “multimodal”: align the “semantics” of different modalities in the same vector space.
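To make "close in the same vector space" concrete, here is a minimal sketch that compares the illustrative vectors from the diagram using cosine similarity. The numbers are the truncated toy values shown above, plus a made-up "car" vector added here for contrast. None of them are real encoder outputs.

```python
import math

def cosine(a, b):
    """Cosine similarity: 1.0 means same direction, 0.0 means orthogonal."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy 3-dim stand-ins for the 768-dim vectors in the diagram above.
word_cat = [0.3, 0.8, -0.2]
image_cat = [0.31, 0.79, -0.18]
meow_cat = [0.29, 0.82, -0.21]
# Hypothetical unrelated concept, invented here for contrast.
word_car = [-0.7, 0.1, 0.6]

print("cat word vs cat image:", round(cosine(word_cat, image_cat), 3))
print("cat word vs cat meow: ", round(cosine(word_cat, meow_cat), 3))
print("cat word vs car word: ", round(cosine(word_cat, word_car), 3))
```

The three "cat" vectors score very close to 1 against each other, while the unrelated vector scores below 0. That gap is what an aligned embedding space is supposed to provide.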

2021 · CLIP’s Breakthrough

CLIP (Contrastive Language-Image Pre-training, OpenAI 2021) is the foundational work of modern multimodal AI.

Core Idea

Train two encoders simultaneously:

Image encoder (ViT) turns images into vectors
Text encoder (Transformer) turns text into vectors

Training goal: make “matching image-text pairs” close in vector space, non-matching pairs far.

Training data: 400M "image + description" pairs

In a batch of 1000 pairs:
(Image1, "An orange cat")
(Image2, "Beach sunset")
(Image3, "Code screenshot")
...

Make (Image1's vector) ↔ ("An orange cat"'s vector) cosine similarity → 1
Make (Image1's vector) ↔ ("Beach sunset"'s vector) cosine similarity → 0

This is contrastive learning—positives close, negatives far.
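A minimal sketch of this objective, assuming a symmetric cross-entropy over the batch similarity matrix, which is how the CLIP paper describes its loss. The 2-dim embeddings and the fixed temperature are toy assumptions; real CLIP learns the temperature and trains on very large batches.

```python
import math

def normalize(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def cross_entropy_row(logits, target):
    """-log softmax(logits)[target], computed in a numerically stable way."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_sum - logits[target]

def clip_loss(image_vecs, text_vecs, temperature=0.07):
    """Symmetric contrastive loss: image i should match text i."""
    imgs = [normalize(v) for v in image_vecs]
    txts = [normalize(v) for v in text_vecs]
    n = len(imgs)
    # sim[i][j] = cosine(image i, text j) / temperature
    sim = [[sum(a * b for a, b in zip(imgs[i], txts[j])) / temperature
            for j in range(n)] for i in range(n)]
    # Image -> text: each row should pick its own column.
    loss_i2t = sum(cross_entropy_row(sim[i], i) for i in range(n)) / n
    # Text -> image: each column should pick its own row.
    loss_t2i = sum(cross_entropy_row([sim[i][j] for i in range(n)], j)
                   for j in range(n)) / n
    return (loss_i2t + loss_t2i) / 2

aligned_images = [[1.0, 0.0], [0.0, 1.0]]
aligned_texts = [[0.9, 0.1], [0.1, 0.9]]
shuffled_texts = [[0.1, 0.9], [0.9, 0.1]]  # captions swapped

print("loss, matched pairs:", round(clip_loss(aligned_images, aligned_texts), 4))
print("loss, swapped pairs:", round(clip_loss(aligned_images, shuffled_texts), 4))
```

The loss is near zero when each image sits next to its own caption and large when the captions are swapped. Training moves both encoders in whichever direction lowers it.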

What CLIP Can Do After Training

1. Zero-shot Image Classification

No further training needed—any classes work:

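import numpy as np

# encode_text / encode_image stand in for a trained CLIP model's two encoders
labels = ["a cat", "a dog", "a car", "a house"]
text_vecs = np.stack([encode_text(l) for l in labels])
image_vec = encode_image(my_image)
# Cosine similarity between the image vector and every label vector
similarities = (text_vecs @ image_vec) / (
    np.linalg.norm(text_vecs, axis=1) * np.linalg.norm(image_vec)
)
predicted = labels[int(np.argmax(similarities))]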

This is the post-ImageNet paradigm revolution for image classification—no more “label thousands of images per class.”

2. Text-to-Image / Image-to-Text Search

Store all images as vectors—query via text (or vice versa). Pinterest-style search uses this.
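The search use case can be sketched as a nearest-neighbor lookup over stored image vectors. The file names and 3-dim vectors below are invented placeholders. In practice the vectors would come from CLIP's encoders, and a vector index would replace the linear scan.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

# Hypothetical precomputed image embeddings (placeholder values).
image_index = {
    "orange_cat.jpg": [0.9, 0.1, 0.0],
    "beach_sunset.jpg": [0.1, 0.9, 0.1],
    "code_screenshot.png": [0.0, 0.1, 0.9],
}

def search(query_vec, index, top_k=2):
    """Return the top_k (name, score) pairs ranked by cosine similarity."""
    scored = [(name, cosine(query_vec, vec)) for name, vec in index.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]

# Placeholder for the text encoder's output on a query like "a sunset over the sea".
query_vec = [0.2, 0.8, 0.0]
results = search(query_vec, image_index)
for name, score in results:
    print(f"{name}: {score:.3f}")
```

Searching by image works the same way: encode the query image instead of the query text and scan the same index.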

3. Foundation for Downstream Tasks

CLIP encoder outputs vectors that can feed other models—e.g., Stable Diffusion uses CLIP to encode prompts.

CLIP’s influence: many of today’s multimodal systems build on CLIP or on CLIP-style contrastive encoders.

2022 · DALL·E 2 / Stable Diffusion

Text-to-image breakthroughs.

Stable Diffusion’s Core Components

Text prompt
   ↓
CLIP Text Encoder ──→ text vector
                       ↓
Random noise image ──→ U-Net (50 denoising steps) ←─ uses text vector as condition
                       ↓
                  Clean image

L5-02 covers diffusion math in detail. Intuition:

Take an image and add noise until completely random
Train a “denoiser”: given a noisy image and a text prompt, predict the original (in practice, usually the noise that was added)
At generation, reverse: start from random noise, gradually denoise to clean image
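The steps above can be sketched with the standard DDPM-style forward formula x_t = sqrt(ᾱ_t)·x_0 + sqrt(1 − ᾱ_t)·ε. This is a toy on a 4-value "image" with an assumed linear noise schedule. The "denoiser" is replaced by an oracle that already knows the noise, so it illustrates the bookkeeping only, not a trained model. Real samplers also walk back over many small steps rather than one jump.

```python
import math
import random

def alpha_bar_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Cumulative product of (1 - beta_t) for a linear beta schedule."""
    alpha_bars, running = [], 1.0
    for t in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * t / (num_steps - 1)
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars

def add_noise(x0, noise, alpha_bar):
    """Forward process: blend the clean signal with Gaussian noise."""
    return [math.sqrt(alpha_bar) * x + math.sqrt(1.0 - alpha_bar) * e
            for x, e in zip(x0, noise)]

def recover_x0(xt, predicted_noise, alpha_bar):
    """Invert the forward formula given a noise prediction."""
    return [(x - math.sqrt(1.0 - alpha_bar) * e) / math.sqrt(alpha_bar)
            for x, e in zip(xt, predicted_noise)]

random.seed(0)
alpha_bars = alpha_bar_schedule(1000)
x0 = [0.5, -0.25, 1.0, 0.0]  # toy "image"
noise = [random.gauss(0.0, 1.0) for _ in x0]

t = 500
xt = add_noise(x0, noise, alpha_bars[t])
# Oracle denoiser: a trained network would have to *predict* this noise.
x0_hat = recover_x0(xt, noise, alpha_bars[t])

print("signal fraction left at final step:", round(math.sqrt(alpha_bars[-1]), 4))
print("max reconstruction error:", max(abs(a - b) for a, b in zip(x0, x0_hat)))
```

By the last step almost none of the original signal remains, which is why generation can start from pure random noise.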

Companion visualization: Diffusion Denoising — watch 50 steps from noise to image.

2023 · GPT-4V and “Native Multimodal” LLMs

CLIP is “two separate encoders”—text and image only align at the end.

GPT-4V, Claude 3, Gemini go further—native multimodal:

Image → ViT → a sequence of image tokens
Text → Tokenizer → a sequence of text tokens
       ↓
Merged into the same Transformer
       ↓
Outputs text
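A minimal sketch of the first step in this diagram, assuming ViT-style fixed-size patches: the image is cut into a grid, each patch is flattened into one "image token", and the result joins the text tokens in a single sequence. The whitespace tokenizer and the modality tags are simplifications invented for illustration. Real models project patches through learned layers and use a subword tokenizer.

```python
def patchify(image, patch):
    """Split an H x W grid (list of rows) into flattened patch x patch tiles."""
    height, width = len(image), len(image[0])
    assert height % patch == 0 and width % patch == 0, "size must divide evenly"
    tokens = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            tile = [image[r][c]
                    for r in range(top, top + patch)
                    for c in range(left, left + patch)]
            tokens.append(tile)
    return tokens

# Toy 4x4 single-channel "image"; pixel value = row * 4 + col.
image = [[r * 4 + c for c in range(4)] for r in range(4)]
image_tokens = patchify(image, patch=2)

# Stand-in tokenizer: real systems use subword tokenizers.
text_tokens = "what animal is this".split()

# One mixed sequence, tagged by modality, ready for a single Transformer.
sequence = [("image", t) for t in image_tokens] + [("text", t) for t in text_tokens]

print(len(image_tokens), "image tokens +", len(text_tokens), "text tokens")
print("first image token:", image_tokens[0])
```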

The entire model doesn’t separate “vision” and “language”—it treats images as a special kind of “token sequence,” mixed with text tokens.

This lets the model:

Answer questions about images
Fix bugs from code screenshots
Read handwriting
Understand charts and tables
Watch video and answer (GPT-4o real-time vision)

Training Data: Image-Text Pairs

Training needs massive numbers of “image + text description” pairs:

Web image alt text
Textbook figures + captions
Screenshots + annotations
Comics + dialogue

Data quality >>> quantity. GPT-4V is estimated to use tens of millions of image-text pairs (vs. CLIP’s 400M) of higher quality, though OpenAI has not published the actual figure.

2024-2026 · Video Generation (Sora-class)

Extend diffusion to the time dimension:

Single image: (H, W, 3)
Video: (T, H, W, 3)

Sora’s approach: split video into “spatiotemporal patches”, process with Transformer:

Video → split into 4D patches → token sequence → Transformer → denoise
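To get a feel for the sequence lengths involved, this sketch counts spatiotemporal patches for a short clip. The patch size (4 frames × 16 × 16 pixels) and the clip dimensions are illustrative assumptions. Sora's actual patching is not public in detail; according to OpenAI's technical report it operates on a compressed latent rather than raw pixels.

```python
def count_spacetime_patches(frames, height, width, pt, ph, pw):
    """Number of (pt x ph x pw) patches tiling a (frames, height, width) clip."""
    if frames % pt or height % ph or width % pw:
        raise ValueError("clip dimensions must be divisible by the patch size")
    return (frames // pt) * (height // ph) * (width // pw)

# Assumed example: 6 seconds at 24 fps, 256 x 256 pixels.
frames, height, width = 6 * 24, 256, 256

video_tokens = count_spacetime_patches(frames, height, width, pt=4, ph=16, pw=16)
frame_tokens = count_spacetime_patches(1, height, width, pt=1, ph=16, pw=16)

print("tokens for one frame:", frame_tokens)
print("tokens for the clip: ", video_tokens)
print("ratio:", video_tokens // frame_tokens)
```

Under these assumptions a 6-second clip needs 36 times as many tokens as a single frame of the same resolution.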

This lets the model:

Maintain temporal coherence (a cat doesn’t suddenly become a dog)
Approximate physics (a thrown ball falls, water flows)
Respond to text commands (“a white cat looking at rain from a windowsill”)

Current SOTA: Sora, Runway Gen-3, Kling, Veo, with quality improving monthly. For clips up to about 6 seconds, even complex scenes look very real.

2025-2026 · Audio Multimodal

GPT-4o introduced “native audio”: instead of an “ASR → LLM → TTS” pipeline, audio tokens are fed directly to the LLM:

Audio waveform → split into audio tokens → LLM → directly outputs audio tokens → waveform

Effect:

Average latency of about 320 ms (vs. several seconds with the earlier pipeline)
Preserves emotion (laughter, gasps)
Natural interruption (like real phone calls)

ChatGPT Voice Mode and Claude 4 Voice are both based on this.

5 Major Multimodal Challenges

1. Data Alignment

Finding massive amounts of “the same concept in different modalities” (say, the same cat’s image, description, and meow) is hard.

2. Compute Cost

Each image patch is a token: with 16×16-pixel patches, a 1024×1024 image becomes 4096 tokens, and video needs tens of times more. Because Transformer self-attention scales as O(n²) in sequence length, multimodal compute cost explodes.
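The arithmetic behind that figure, as a small sketch. It assumes 16 × 16 pixel patches (the patch size that yields 4096 tokens for a 1024 × 1024 image) and counts token pairs as a rough proxy for self-attention cost. It ignores everything else in the model.

```python
def image_tokens(height, width, patch=16):
    """Patch tokens for an image, assuming non-overlapping square patches."""
    return (height // patch) * (width // patch)

def attention_pairs(num_tokens):
    """Full self-attention compares every token with every token: n^2 pairs."""
    return num_tokens ** 2

for side in (256, 512, 1024):
    n = image_tokens(side, side)
    print(f"{side}x{side}: {n} tokens, {attention_pairs(n):,} attention pairs")
```

Doubling the side length quadruples the token count and multiplies the number of attention pairs by 16.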

3. Safety

Safety alignment for text is well studied, but alignment for images, video, and audio is much harder:

How to avoid NSFW images?
Defense against deepfakes?
Voice cloning scams?

4. Evaluation

LLM eval is already hard. Multimodal eval is harder:

How to objectively judge “AI describes image correctly”?
How to evaluate “generated video quality”?

5. Long Tail

Text: you can find data on almost any topic. Video: long-tail scenarios (special angles, rare objects) have sparse data.

L5 Path Overview

The L5 series will go deep on each topic:

#	Topic
L5-02	Diffusion math (with visualization companion)
L5-03	ViT and image Transformer
L5-04	Whisper / speech recognition
L5-05	TTS: from concatenation to neural synthesis
L5-06	Text-to-video: Sora path
L5-07	3D generation
L5-08	AI for Science (protein structure, chemistry)

Future Direction

2026 trend: “modality fusion” is becoming “unified models.”

Not “image model + language model” combinations, but one model handling all modalities natively.

GPT-4o already does “native multimodal”—next might be “native seven-modality” (text, image, video, audio, 3D, sensor, robotic action).

The AI of 5 years from now probably won’t separate modalities at all; there will only be “unified intelligence”.

Next: “Diffusion Math: From Adding Noise to Generation” — truly understand the core of Stable Diffusion / Sora.