Multimodal Overview: How AI "Sees, Hears, and Reads" at Once
GPT-4o recognizes your sketch and Sora generates video, but how does "multimodal" AI actually work? This article gives the big-picture view.
L0 through L4 covered pure language AI. But the real world isn’t just text: there are images, sound, and video.
Multimodal AI means letting one model handle multiple types of information. GPT-4o, Claude 3.5, Gemini, and Sora are all representative examples.
The L5 path covers these. This opening article builds a complete map of multimodal AI.
What Is a “Modality”
Modality = a form of information. Common ones:
| Modality | Data form |
|---|---|
| Text | A string of characters |
| Image | Pixel matrix (H × W × 3) |
| Video | Frame sequence (T × H × W × 3) |
| Audio | Waveform (time series) |
| 3D | Point clouds, meshes |
| Sensor | LiDAR, IMU |
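To make the table concrete, here is a small sketch that counts how many raw numbers each modality holds. The specific sizes (a 224×224 image, 16 frames, 2 seconds of 16 kHz audio) are illustrative assumptions, not values taken from any particular model.

```python
from math import prod

# Illustrative shapes only; the specific sizes are assumptions.
shapes = {
    "image": (224, 224, 3),        # H x W x 3 (RGB)
    "video": (16, 224, 224, 3),    # T x H x W x 3
    "audio": (16_000 * 2,),        # 2 seconds of mono audio sampled at 16 kHz
}

# Number of raw values = product of the dimensions.
num_values = {name: prod(shape) for name, shape in shapes.items()}

for name, count in num_values.items():
    print(f"{name}: shape={shapes[name]}, values={count:,}")
```

Even a short video clip holds many times more raw values than a single image, which is one reason video models are so expensive.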
Multimodal AI = AI that handles several modalities at once. Examples:
- Image + Text → Text: image captioning, visual Q&A
- Text → Image: DALL·E, Midjourney
- Text → Video: Sora
- Audio → Text: Whisper
- Text → Audio: TTS
- Image + Audio + Text: AI video editing
Core Idea: All Modalities Become Vectors
Humans don’t stop to think “am I using my language module or my vision module?” because the brain processes everything in a unified way. AI borrows this trick: map every modality to vectors in the same space, then process them all with the same Transformer.
Word "cat" → encoder → [0.3, 0.8, -0.2, ...] ← 768 dim
Image 🐱 → encoder → [0.31, 0.79, -0.18, ...] ← 768 dim (similar!)
Cat meow → encoder → [0.29, 0.82, -0.21, ...] ← 768 dim (similar!)
This is the essence of “multimodal”: align the “semantics” of different modalities in the same vector space.
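"Aligned in the same vector space" can be checked with cosine similarity. The sketch below uses only the first three coordinates of the illustrative vectors above (real embeddings have hundreds of dimensions), and the `unrelated` vector is a made-up example added for contrast.

```python
import math

def cosine(a, b):
    """Cosine similarity: 1 means same direction, 0 means orthogonal."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Truncated illustrative embeddings for the same concept in three modalities.
word_cat = [0.3, 0.8, -0.2]
image_cat = [0.31, 0.79, -0.18]
meow = [0.29, 0.82, -0.21]
unrelated = [-0.7, 0.1, 0.6]  # hypothetical vector for an unrelated concept

print("word vs image:", round(cosine(word_cat, image_cat), 4))
print("word vs meow: ", round(cosine(word_cat, meow), 4))
print("word vs other:", round(cosine(word_cat, unrelated), 4))
```

The three "cat" vectors score close to 1 against each other, while the unrelated vector scores below 0.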
2021 · CLIP’s Breakthrough
CLIP (Contrastive Language-Image Pre-training, OpenAI 2021) is a foundational work in multimodal AI.
Core Idea
Train two encoders simultaneously:
- Image encoder (ViT) turns images into vectors
- Text encoder (Transformer) turns text into vectors
Training goal: pull matching image-text pairs close together in vector space and push non-matching pairs far apart.
Training data: 400M "image + description" pairs
In a batch of 1000 pairs:
(Image1, "An orange cat")
(Image2, "Beach sunset")
(Image3, "Code screenshot")
...
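Push (Image1's vector) ↔ ("An orange cat"'s vector) cosine similarity up, toward 1
Push (Image1's vector) ↔ ("Beach sunset"'s vector) cosine similarity down, toward 0
This is contrastive learning: matching pairs are pulled together, and every other pairing in the batch serves as a negative that is pushed apart.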
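The objective can be sketched as a symmetric cross-entropy over the batch's similarity matrix. This is a minimal pure-Python illustration, not CLIP's actual training code: the function names are made up here, the temperature of 0.07 is an assumed fixed value, and the 2-D vectors stand in for real encoder outputs.

```python
import math

def normalize(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def cross_entropy_row(logits, target):
    # Numerically stable -log(softmax(logits)[target]).
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum - logits[target]

def clip_style_loss(image_vecs, text_vecs, temperature=0.07):
    imgs = [normalize(v) for v in image_vecs]
    txts = [normalize(v) for v in text_vecs]
    n = len(imgs)
    # logits[i][j] = cosine similarity of image i and text j, scaled by temperature.
    logits = [[sum(a * b for a, b in zip(imgs[i], txts[j])) / temperature
               for j in range(n)] for i in range(n)]
    # Each image should pick its own text (rows), and each text its own image (columns).
    image_to_text = sum(cross_entropy_row(logits[i], i) for i in range(n)) / n
    text_to_image = sum(cross_entropy_row([logits[i][j] for i in range(n)], j)
                        for j in range(n)) / n
    return (image_to_text + text_to_image) / 2

aligned = clip_style_loss([[1, 0], [0, 1]], [[1, 0], [0, 1]])
shuffled = clip_style_loss([[1, 0], [0, 1]], [[0, 1], [1, 0]])
print("aligned pairs loss: ", aligned)
print("shuffled pairs loss:", shuffled)
```

The loss is near zero when each image already matches its own caption and large when the captions are swapped, which is the signal that drives the two encoders into alignment.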
What CLIP Can Do After Training
1. Zero-shot Image Classification
No further training is needed, and any set of class labels works:
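import math

def cosine(a, b):
    return sum(x * y for x, y in zip(a, b)) / (math.hypot(*a) * math.hypot(*b))

# encode_text, encode_image, and my_image are placeholders for a loaded CLIP model and your input
labels = ["a cat", "a dog", "a car", "a house"]
text_vecs = [encode_text(label) for label in labels]
image_vec = encode_image(my_image)
# Pick the label whose text vector is closest to the image vector
similarities = [cosine(image_vec, t) for t in text_vecs]
predicted = labels[max(range(len(labels)), key=lambda i: similarities[i])]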
This is a paradigm shift from ImageNet-style image classification: no more labeling thousands of images per class.
2. Text-to-Image / Image-to-Text Search
Store every image as a vector, then query with text (or vice versa). Pinterest-style visual search works along these lines.
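A text-to-image search along these lines can be sketched as a nearest-neighbor lookup. The file names and 3-D vectors below are hypothetical stand-ins for embeddings that the image and text encoders would produce, and a real system would use an approximate nearest-neighbor index rather than a full scan.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

# Hypothetical precomputed image embeddings (normally produced by the image encoder).
image_index = {
    "cat.jpg": [0.9, 0.1, 0.0],
    "beach.jpg": [0.1, 0.9, 0.1],
    "code.png": [0.0, 0.1, 0.9],
}

def search(query_vec, index, top_k=2):
    """Return the top_k image names ranked by cosine similarity to the query."""
    scored = [(cosine(query_vec, vec), name) for name, vec in index.items()]
    scored.sort(reverse=True)
    return [name for _, name in scored[:top_k]]

# Hypothetical text embedding for the query "sunset at the beach".
query = [0.2, 0.8, 0.1]
results = search(query, image_index)
print(results)
```

Swapping the roles (an image vector as the query against an index of text vectors) gives image-to-text search with the same code.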
3. Foundation for Downstream Tasks
CLIP’s encoders output vectors that can feed other models. For example, Stable Diffusion uses a CLIP text encoder to encode prompts.
CLIP’s influence: many of today’s multimodal systems build on CLIP or its variants.
2022 · DALL·E 2 / Stable Diffusion
Text-to-image breakthroughs.
Stable Diffusion’s Core Components
Text prompt
↓
CLIP Text Encoder ──→ text vector
↓
Random latent noise ──→ U-Net (50 denoising steps) ←─ uses text vector as condition
↓
VAE decoder ──→ clean image
L5-02 covers diffusion math in detail. Intuition:
- Take an image and add noise step by step until it is completely random
- Train a “denoiser”: given a noisy image and a text prompt, predict the original
- At generation time, run it in reverse: start from random noise and gradually denoise to a clean image
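The noising step has a simple closed form in the common DDPM formulation, which is assumed here for illustration. The sketch below uses a tiny 4-value "image", a linear noise schedule chosen arbitrarily, and the true noise in place of a trained denoiser's prediction, so it demonstrates the arithmetic rather than a working generator. Predicting the noise and predicting the original are interchangeable here, since either one determines the other.

```python
import math
import random

random.seed(0)
T = 50
# Linear noise schedule (an assumption; real models use tuned schedules).
betas = [1e-4 + (0.2 - 1e-4) * t / (T - 1) for t in range(T)]

# alpha_bars[t] = fraction of the original signal's variance left after t+1 steps.
alpha_bars = []
running = 1.0
for beta in betas:
    running *= 1.0 - beta
    alpha_bars.append(running)

def add_noise(x0, t):
    """Jump straight to step t: x_t = sqrt(a_bar) * x0 + sqrt(1 - a_bar) * eps."""
    eps = [random.gauss(0.0, 1.0) for _ in x0]
    a = alpha_bars[t]
    xt = [math.sqrt(a) * x + math.sqrt(1.0 - a) * e for x, e in zip(x0, eps)]
    return xt, eps

def recover_x0(xt, eps_pred, t):
    """Invert the formula above, given a noise estimate from a denoiser."""
    a = alpha_bars[t]
    return [(x - math.sqrt(1.0 - a) * e) / math.sqrt(a) for x, e in zip(xt, eps_pred)]

x0 = [0.5, -0.25, 1.0, 0.0]          # a tiny 4-"pixel" image
xt, eps = add_noise(x0, t=30)
x0_hat = recover_x0(xt, eps, t=30)   # the true noise stands in for a perfect prediction
print("signal left at last step:", alpha_bars[-1])
print("recovered:", [round(v, 6) for v in x0_hat])
```

In training, the denoiser sees `xt`, the step index, and the text condition, and is penalized for the gap between its noise estimate and `eps`.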
Companion visualization: Diffusion Denoising — watch 50 steps from noise to image.
2023 · GPT-4V and “Native Multimodal” LLMs
CLIP uses two separate encoders, so text and image only meet at the very end, when their output vectors are compared.
GPT-4V, Claude 3, and Gemini go further and are described as natively multimodal:
Image → ViT → a sequence of image tokens
Text → Tokenizer → a sequence of text tokens
↓
Merged into the same Transformer
↓
Outputs text
The entire model doesn’t separate “vision” and “language”—it treats images as a special kind of “token sequence,” mixed with text tokens.
This lets the model:
- Answer questions about images
- Fix bugs from code screenshots
- Read handwriting
- Understand charts and tables
- Watch video and answer (GPT-4o real-time vision)
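The "images as token sequences" idea can be sketched with placeholder tokens. The 16-pixel patch size follows the common ViT setup. The `<image_start>` and `<image_end>` markers and the whitespace tokenizer are assumptions made for this illustration, not any vendor's actual format.

```python
def image_to_patch_tokens(height, width, patch=16):
    """Return one placeholder token per non-overlapping patch (ViT-style)."""
    rows, cols = height // patch, width // patch
    return [f"<img_{r}_{c}>" for r in range(rows) for c in range(cols)]

def build_sequence(image_tokens, text):
    # Whitespace splitting stands in for a real subword tokenizer.
    return ["<image_start>"] + image_tokens + ["<image_end>"] + text.split()

image_tokens = image_to_patch_tokens(224, 224)
sequence = build_sequence(image_tokens, "What is in this picture?")

print("image tokens:", len(image_tokens))
print("total sequence length:", len(sequence))
print("tail of sequence:", sequence[-7:])
```

In a real model each placeholder would be an embedding vector produced by the vision encoder, and the Transformer would attend across image and text positions alike.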
Training Data: Image-Text Pairs
Training needs a massive number of “image + text description” pairs:
- Web image alt text
- Textbook figures + captions
- Screenshots + annotations
- Comics + dialogue
Data quality matters far more than quantity. GPT-4V is estimated to have used tens of millions of image-text pairs (vs. CLIP’s 400M) of higher quality, though OpenAI has not published the figure.
2024-2026 · Video Generation (Sora-class)
Extend diffusion to the time dimension:
- Single image: (H, W, 3)
- Video: (T, H, W, 3)
Sora’s approach: split the video into “spatiotemporal patches” and process them with a Transformer:
Video → split into spatiotemporal patches → token sequence → Transformer → denoise
This lets the model:
- Maintain temporal coherence (a cat doesn’t suddenly become a dog)
- Model basic physics (a thrown ball falls, water flows)
- Respond to text commands (“a white cat looking at rain from a windowsill”)
Current leading systems include Sora, Runway Gen-3, Kling, and Veo, and quality is improving month by month. For clips of about 6 seconds or less, even complex scenes can look very realistic.
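To get a feel for spatiotemporal patches, the sketch below counts how many tokens a clip becomes when it is cut into blocks spanning a few frames and a small pixel area. The patch sizes and clip dimensions are hypothetical. Sora's real values are not public, and it is reported to patchify a compressed latent video rather than raw pixels.

```python
def count_spacetime_patches(frames, height, width, patch_t=4, patch_h=16, patch_w=16):
    """Number of non-overlapping (patch_t x patch_h x patch_w) blocks in a video."""
    return (frames // patch_t) * (height // patch_h) * (width // patch_w)

# Hypothetical clip: 64 frames at 256 x 256 resolution.
video_tokens = count_spacetime_patches(frames=64, height=256, width=256)

# The same settings applied to a single 256 x 256 frame, for comparison.
frame_tokens = count_spacetime_patches(frames=4, height=256, width=256)

print("tokens for the clip:", video_tokens)
print("tokens for one 4-frame slab:", frame_tokens)
```

Because each token covers several frames, the Transformer sees motion inside a patch as well as across patches, which helps with temporal coherence.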
2025-2026 · Audio Multimodal
GPT-4o introduced “native audio”: instead of an “ASR → LLM → TTS” pipeline, audio tokens are fed directly to the LLM:
Audio waveform → split into audio tokens → LLM → directly outputs audio tokens → waveform
Effect:
- Average latency of about 320 ms (vs. roughly 3 to 5 seconds with the earlier pipeline)
- Preserves emotion (laughter, gasps)
- Natural interruption (like real phone calls)
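ChatGPT’s Advanced Voice Mode is based on this. Claude’s voice mode is sometimes described the same way, though its architecture has not been publicly documented.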
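How a waveform might become discrete audio tokens can be sketched with vector quantization: cut the signal into short frames and replace each frame with the index of its nearest codebook entry. GPT-4o's actual tokenizer is not public, so this is only a generic illustration. The 3-entry codebook and 4-sample frames are toy assumptions, whereas real neural codecs learn large codebooks over encoded features.

```python
def frame_signal(samples, frame_size):
    """Split a waveform into consecutive non-overlapping frames."""
    last_start = len(samples) - frame_size
    return [samples[i:i + frame_size] for i in range(0, last_start + 1, frame_size)]

def nearest_code(frame, codebook):
    """Index of the codebook entry with the smallest squared distance to the frame."""
    def dist(code):
        return sum((a - b) ** 2 for a, b in zip(frame, code))
    return min(range(len(codebook)), key=lambda i: dist(codebook[i]))

# Toy codebook: silence, positive plateau, negative plateau.
codebook = [[0.0] * 4, [1.0] * 4, [-1.0] * 4]
waveform = [0.1, 0.0, -0.1, 0.0, 0.9, 1.1, 1.0, 0.8, -1.0, -0.9, -1.2, -1.0]

tokens = [nearest_code(frame, codebook) for frame in frame_signal(waveform, 4)]
print(tokens)  # → [0, 1, 2]
```

Once audio is a token sequence like this, the same Transformer that handles text tokens can read and emit it, which is what removes the separate ASR and TTS stages.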
5 Major Multimodal Challenges
1. Data Alignment
Finding large amounts of data that show the same concept in different modalities (say, the same cat’s image, description, and meow) is hard.
2. Compute Cost
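Each image patch becomes a token: with 16×16-pixel patches, a 1024×1024 image is (1024/16)² = 4096 tokens. Video needs tens of times more. Because self-attention in a Transformer scales as O(n²) in the number of tokens, multimodal compute cost grows explosively.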
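The arithmetic behind that claim is easy to check. The sketch below assumes 16×16-pixel patches and counts pairwise attention interactions as n², ignoring constants such as the number of heads and layers.

```python
def num_patch_tokens(height, width, patch=16):
    """Tokens produced by cutting an image into non-overlapping square patches."""
    return (height // patch) * (width // patch)

def attention_pairs(n_tokens):
    """Self-attention compares every token with every token: n^2 pairs."""
    return n_tokens * n_tokens

large_tokens = num_patch_tokens(1024, 1024)   # 64 * 64 patches
small_tokens = num_patch_tokens(256, 256)     # 16 * 16 patches
ratio = attention_pairs(large_tokens) // attention_pairs(small_tokens)

print("1024x1024 image tokens:", large_tokens)
print("256x256 image tokens:  ", small_tokens)
print("attention cost ratio:  ", ratio)
```

A 4× increase in side length gives 16× more tokens but 256× more attention pairs, which is why resolution and clip length are such expensive knobs.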
3. Safety
Safety alignment for text is well studied, but for images, video, and audio it is much harder:
- How do we prevent NSFW image generation?
- How do we defend against deepfakes?
- How do we stop voice-cloning scams?
4. Evaluation
Evaluating LLMs is already hard. Evaluating multimodal models is harder:
- How do we objectively judge whether an AI described an image correctly?
- How do we measure the quality of a generated video?
5. Long Tail
For text, you can find data on almost any topic. For video, long-tail scenarios (unusual camera angles, rare objects) have very little data.
L5 Path Overview
The L5 series goes deep on each topic:
| # | Topic |
|---|---|
| L5-02 | Diffusion math (with visualization companion) |
| L5-03 | ViT and image Transformer |
| L5-04 | Whisper / speech recognition |
| L5-05 | TTS: from concatenation to neural synthesis |
| L5-06 | Text-to-video: Sora path |
| L5-07 | 3D generation |
| L5-08 | AI for Science (protein structure, chemistry) |
Future Direction
2026 trend: “modality fusion” is becoming “unified models.”
Not “image model + language model” combinations, but one model handling all modalities natively.
GPT-4o already does “native multimodal”; the next step might be “native seven-modality” (text, image, video, audio, 3D, sensor, robotic action).
Five years from now, AI probably won’t separate modalities at all; there will only be “unified intelligence.”
Next: “Diffusion Math: From Adding Noise to Generation”, where we dig into the core of Stable Diffusion and Sora.