GPU Crash Course: Why AI Can't Live Without It
What are the A100, H100, and B200? Why does AI run on GPUs rather than CPUs? Let's open the hardware black box.
L0-L6 covered “how AI works”. L7 shifts perspective—what hardware AI actually runs on in the physical world, how to deploy it, how to control costs.
First stop: GPU—the physical foundation of the AI revolution.
Stop 1: CPU vs GPU
Your laptop has a CPU—why doesn’t AI use it?
CPU: A Few “All-Rounders”
Typical CPU (e.g., M3 Pro):
- 10-14 cores
- Each core is extremely capable (branch prediction, cache, complex instructions)
- Suited for serial, diverse tasks (OS, browsers, game AI)
4 cores doing 4 different things simultaneously:
Core 1: Rendering webpage
Core 2: Playing music
Core 3: Background Slack sync
Core 4: System processes
GPU: Massive “Specialists”
Typical GPU (e.g., H100):
- 16,896 CUDA cores (SXM version; the full GH100 die has 18,432)
- Each core is simple (basically just parallel arithmetic)
- Suited for massively parallel, repetitive tasks
16,896 cores doing different parts of the same task:
Cores 1-512: Compute matrix rows 1-512
Cores 513-1024: Compute rows 513-1024
... all simultaneously
Why Neural Networks Suit GPUs
Neural network computation is almost entirely large matrix operations—
y = W x + b ← W might be 1000×1000 = 1M multiplications
1M multiplications:
- CPU: 14 cores serial—slow
- GPU: tens of thousands of cores in parallel—hundreds of times faster
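A data point: one widely cited estimate puts training GPT-3 on a single V100 GPU at about 355 years, and a CPU would be slower still. On a cluster of 1,024 A100s the estimate drops to roughly 34 days.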
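The row-level independence behind y = W x + b can be sketched in a few lines of plain Python. This is only an illustration of why the work parallelizes, not how a GPU kernel is written; the helper names `affine_rows` and `multiplication_count` are made up for this example.

```python
def affine_rows(W, x, b):
    """Compute y = W x + b one output row at a time.

    Each row's dot product depends only on that row of W and on x,
    so the rows could be handed to different cores in any order.
    """
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) + b_i
            for row, b_i in zip(W, b)]


def multiplication_count(n_rows, n_cols):
    """Scalar multiplications needed for one matrix-vector product."""
    return n_rows * n_cols


W = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
x = [10.0, 1.0]
b = [0.5, 0.5, 0.5]

y = affine_rows(W, x, b)
# Processing the rows in reverse order yields the same values:
# no row has to wait for another row's result.
y_reversed = affine_rows(W[::-1], x, b[::-1])[::-1]

print(y)
print(y == y_reversed)
print(multiplication_count(1000, 1000))
```

Because no row depends on another, a GPU can assign each row (or each tile of rows) to its own group of cores and run them all at once.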
Stop 2: NVIDIA’s Dominance
Why does AI = NVIDIA?
CUDA Ecosystem
NVIDIA released CUDA in 2007—a GPU programming interface designed specifically for general-purpose computing.
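By 2026, almost every major AI framework (PyTorch, TensorFlow, JAX) treats CUDA as its primary GPU backend, with NVIDIA's own CUDA-X libraries (cuDNN, cuBLAS, and others) sitting underneath.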
AMD and Intel have GPUs too, but their ecosystems are far less mature than CUDA—this is NVIDIA’s moat.
NVIDIA GPU Generations
| Model | Year | VRAM | TFLOPS (FP16 Tensor, dense) | Position |
|---|---|---|---|---|
| V100 | 2017 | 32GB | 125 | First GPU with Tensor Cores; used to train GPT-3 |
| A100 | 2020 | 40/80GB | 312 | Workhorse of 2020-2022; reportedly used to train GPT-4 |
| H100 | 2022 | 80GB | 989 | Mainstream training card since 2023 |
| H200 | 2024 | 141GB | 989 | H100 upgrade (more memory) |
| B100/B200 | 2025 | 192GB | 2000+ | Current king |
| GB200 NVL72 | 2025 | cluster | 100,000+ (72 GPUs) | Full-rack solution |
Each generation brings 2-3× performance.
Consumer vs Datacenter
| Category | Example | VRAM | Price |
|---|---|---|---|
| Consumer | RTX 4090 | 24GB | $2,000 |
| Consumer | RTX 5090 | 32GB | $3,000 |
| Datacenter | A100 80GB | 80GB | ~$15,000 |
| Datacenter | H100 | 80GB | ~$30,000 |
| Datacenter | B200 | 192GB | ~$60,000 |
Why are datacenter cards so expensive?
- Huge VRAM (required for big models)
- NVLink high-speed interconnect (fast multi-card communication)
- Long-term stability (24/7 training for months)
- Strict QA and warranty
An 8-card H100 server is ~$300K—this is the physical source of “training big models is expensive”.
Stop 3: GPU Internal Components
Some terminology to learn:
CUDA Cores
The most basic processing unit—does arithmetic.
Tensor Cores
Units dedicated to accelerating matrix operations, introduced in 2017 with the Volta architecture. Almost all neural network compute uses them. On the same H100, Tensor Cores deliver roughly 7-15× the throughput of the regular CUDA cores, depending on which precisions are compared.
VRAM
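Stores data and model parameters. The deep learning bottleneck is usually VRAM, not compute:
Training a 7B model with Adam in mixed precision needs roughly 12-16 bytes per parameter for weights, gradients, and optimizer state, i.e. about 84-112 GB before activations. A single A100 80GB therefore only fits it with memory-saving tricks, and a 70B model needs model parallelism across multiple H100s.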
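Where a figure like this comes from can be sketched with per-parameter byte counts. The breakdown below is a common rule of thumb for mixed-precision training with Adam, not a measurement; the function name and defaults are assumptions for illustration, and activations are deliberately left out.

```python
def training_memory_gb(n_params, weight_bytes=2, grad_bytes=2,
                       optimizer_bytes=8, master_weight_bytes=4):
    """Rough VRAM for weights + gradients + optimizer state, in GB.

    Assumed defaults (rule of thumb, not a vendor spec):
      - fp16/bf16 weights:           2 bytes per parameter
      - fp16/bf16 gradients:         2 bytes per parameter
      - Adam moment estimates, fp32: 8 bytes per parameter
      - fp32 master weight copy:     4 bytes per parameter
    Activations are NOT included; they depend on batch size and sequence length.
    """
    bytes_per_param = (weight_bytes + grad_bytes
                       + optimizer_bytes + master_weight_bytes)
    return n_params * bytes_per_param / 1e9


with_master_copy = training_memory_gb(7e9)
without_master_copy = training_memory_gb(7e9, master_weight_bytes=0)

print(f"7B, 16 bytes/param: {with_master_copy:.0f} GB")
print(f"7B, 12 bytes/param: {without_master_copy:.0f} GB")
```

Either way the result sits at or above the capacity of one 80GB card before a single activation is stored, which is why optimizer sharding, offloading, and LoRA-style methods matter so much in practice.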
NVLink / NVSwitch
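High-speed links between cards, several times faster than PCIe (on H100, 900 GB/s of NVLink bandwidth versus roughly 128 GB/s for a PCIe Gen5 x16 slot). In multi-card training, gradients are synchronized between cards over these links, so NVLink bandwidth sets the ceiling on how well training scales.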
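To get a feel for why the interconnect matters, here is an idealized back-of-the-envelope calculation. The bandwidth constants are assumed aggregate figures for H100-generation NVLink and a PCIe Gen5 x16 slot, and a real all-reduce moves more data than one plain copy, so treat the outputs as lower bounds. The exact ratio depends on which generations are compared.

```python
NVLINK_BYTES_PER_S = 900e9   # assumed: H100-generation NVLink, aggregate
PCIE5_BYTES_PER_S = 128e9    # assumed: PCIe Gen5 x16, aggregate


def transfer_seconds(n_bytes, bytes_per_second):
    """Ideal time to move n_bytes over a link, ignoring latency and protocol overhead."""
    return n_bytes / bytes_per_second


# fp16 gradients of a 7B-parameter model: 2 bytes per parameter.
gradient_bytes = 7e9 * 2

nvlink_s = transfer_seconds(gradient_bytes, NVLINK_BYTES_PER_S)
pcie_s = transfer_seconds(gradient_bytes, PCIE5_BYTES_PER_S)

print(f"NVLink: {nvlink_s * 1000:.1f} ms per sync")
print(f"PCIe:   {pcie_s * 1000:.1f} ms per sync")
print(f"ratio:  {pcie_s / nvlink_s:.1f}x")
```

This cost is paid on every training step, so a slower link translates directly into GPUs sitting idle.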
Stop 4: Memory Hierarchy (Most Important Engineering Concept)
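A GPU has several tiers of memory with very different speeds and sizes (fastest at the top):
| Level | Typical size | Speed |
|---|---|---|
| Registers | ~256 KB per SM | ultra fast |
| Shared memory | ~96-228 KB per SM | fast |
| L2 cache | 40-50 MB | somewhat fast |
| VRAM (HBM/GDDR) | 80-192 GB | slow |
| CPU memory over PCIe | system RAM | ultra slow |
The lower the tier, the larger and slower it is.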
A counter-intuitive fact: GPU compute grows much faster than memory speed. Result: today’s GPUs often “compute too fast, waiting on memory”—this is called memory-bound.
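Whether a given operation is memory-bound can be estimated with a roofline-style check: compare its arithmetic intensity (FLOPs per byte moved) with the ratio of peak compute to memory bandwidth. The sketch below takes the 989 TFLOPS figure from the table above and assumes 3.35 TB/s of HBM bandwidth for an SXM H100; the function names are made up for illustration, and the byte counts assume ideal data reuse.

```python
H100_PEAK_FLOPS = 989e12        # FP16 Tensor Core figure used in this article
H100_HBM_BYTES_PER_S = 3.35e12  # assumed HBM bandwidth of an SXM H100


def arithmetic_intensity(flops, bytes_moved):
    """FLOPs performed per byte read from or written to VRAM."""
    return flops / bytes_moved


def is_memory_bound(flops, bytes_moved,
                    peak_flops=H100_PEAK_FLOPS,
                    bandwidth=H100_HBM_BYTES_PER_S):
    """True if the operation's intensity is below the hardware's ridge point."""
    ridge_point = peak_flops / bandwidth  # FLOPs per byte the GPU can sustain
    return arithmetic_intensity(flops, bytes_moved) < ridge_point


n = 4096
fp16_bytes = 2

# Elementwise add of two n*n fp16 tensors: 1 FLOP per element,
# but 3 values per element travel to or from VRAM (2 reads, 1 write).
add_flops = n * n
add_bytes = 3 * fp16_bytes * n * n

# n x n matrix multiply: 2 * n^3 FLOPs; with ideal reuse, A and B are
# read once and C is written once.
matmul_flops = 2 * n ** 3
matmul_bytes = 3 * fp16_bytes * n * n

print("elementwise add memory-bound:", is_memory_bound(add_flops, add_bytes))
print("matmul memory-bound:", is_memory_bound(matmul_flops, matmul_bytes))
```

Under these assumptions the elementwise add is memory-bound while the large matmul is not, which is why fusing small operations together and avoiding round trips to VRAM pays off.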
The Essence of Flash Attention
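In L3-05 we learned the Attention formula:
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$
Traditional implementation:
- Compute $QK^T$, store to VRAM ($N \times N$, large)
- Compute softmax
- Multiply by V
VRAM reads/writes scale as $O(N^2)$, which is super slow.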
Flash Attention (2022) trick: process the inputs in tiles and keep intermediate results in the fast on-chip shared memory instead of writing them to slow VRAM. Result: the same attention computation, 2-5× faster.
This is why “long context” suddenly became feasible in 2023-2024—Flash Attention broke the memory bottleneck.
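The core idea, computing exact attention block by block without ever holding all the scores at once, can be sketched in plain Python for a single query row. This is a simplified illustration of the online-softmax tiling that Flash Attention builds on, not the actual CUDA kernel; the function names and the toy data are made up for this example.

```python
import math


def naive_attention_row(q, K, V):
    """Reference: materialize all N scores, then softmax, then weight V."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in K]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [sum(w * v[j] for w, v in zip(weights, V)) / total
            for j in range(len(V[0]))]


def tiled_attention_row(q, K, V, block_size=2):
    """Same result, but only one block of scores exists at any time."""
    scale = 1.0 / math.sqrt(len(q))
    running_max = float("-inf")   # largest score seen so far
    running_sum = 0.0             # softmax denominator, relative to running_max
    acc = [0.0] * len(V[0])       # unnormalized weighted sum of V rows

    for start in range(0, len(K), block_size):
        k_block = K[start:start + block_size]
        v_block = V[start:start + block_size]
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in k_block]

        new_max = max(running_max, max(scores))
        # Rescale what was accumulated under the old maximum
        # (evaluates to 0.0 on the first block, when nothing is accumulated yet).
        correction = math.exp(running_max - new_max)
        running_sum *= correction
        acc = [a * correction for a in acc]

        for s, v in zip(scores, v_block):
            w = math.exp(s - new_max)
            running_sum += w
            for j in range(len(acc)):
                acc[j] += w * v[j]
        running_max = new_max

    return [a / running_sum for a in acc]


q = [1.0, 0.5]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 2.0], [0.5, 0.5]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0], [9.0, 10.0]]

reference = naive_attention_row(q, K, V)
tiled = tiled_attention_row(q, K, V, block_size=2)
print(all(abs(a - b) < 1e-9 for a, b in zip(reference, tiled)))
```

On a real GPU the per-block scores and running statistics live in shared memory and registers, so the N × N score matrix never has to be written to VRAM.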
Stop 5: How to Choose a GPU
By scenario:
Learning
- Google Colab free T4: enough for demos
- RTX 3090 / 4090: local dev, can train LoRA on 7B models
Individual / Small Team
- 2-4× 4090s: home-built lab
- Rent GPU cloud (Vast.ai, Lambda Labs): pay-per-use, save capex
Mid-Size Company AI Project
- 8-card A100/H100 server: build one + a few cloud
- AWS/GCP/Azure cloud: dynamic scaling
Large LLM Training
- 1000+ H100 cluster: typically built with NVIDIA, Supermicro, etc.
- Dedicated cloud: OpenAI-Microsoft, Anthropic-AWS have dedicated compute commitments
Stop 6: Training vs Inference
GPUs serve two scenarios:
Training
- Needs: massive compute + large VRAM + fast interconnect
- Duration: days to months
- Priority: throughput (samples per second)
- Suited: H100, A100, B200 “training cards”
Inference
- Needs: low latency + high throughput + low cost
- Duration: milliseconds-seconds per query
- Priority: response time + cost per query
- Suited: A10, L40S, RTX 4090, even smaller cards
Key difference:
- Training: a bounded, one-off job, e.g. 1 ultra-expensive card × 8 hours for a small run
- Inference: an always-on service, e.g. 100 cheap cards × 24 hours, every day
Cost structures are completely different.
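A tiny calculation makes the difference concrete. The hourly rates below are purely hypothetical placeholders, not quotes from any provider; only the card counts and hours come from the example above.

```python
def gpu_cost_usd(cards, hours, usd_per_card_hour):
    """Total rental cost for a number of cards over a number of hours."""
    return cards * hours * usd_per_card_hour


# Hypothetical rates, for illustration only.
EXPENSIVE_CARD_RATE = 4.0   # USD per card-hour (assumed)
CHEAP_CARD_RATE = 0.5       # USD per card-hour (assumed)

training_run = gpu_cost_usd(cards=1, hours=8, usd_per_card_hour=EXPENSIVE_CARD_RATE)
inference_month = gpu_cost_usd(cards=100, hours=24 * 30, usd_per_card_hour=CHEAP_CARD_RATE)

print(f"one training run:      ${training_run:,.0f}")
print(f"one month of serving:  ${inference_month:,.0f}")
```

Even with cards that cost a fraction as much per hour, the always-on serving fleet dominates the bill, which is why inference optimization focuses on cost per query.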
Stop 7: Alternatives
NVIDIA isn’t the only option:
AMD GPU (MI300X)
- 192GB VRAM (larger than H100)
- Performance close to H100
- 20% cheaper
- Downside: no native CUDA support, so CUDA-specific code has to be ported to ROCm
AMD has invested heavily in ROCm recently, but compatibility still lags. Meta and other big companies have started mixing AMD and NVIDIA.
Google TPU
Google’s in-house chip, optimized for TensorFlow / JAX—
- Google Cloud exclusive
- Gemini and similar trained on TPUs
- Single TPU underperforms H100, but cluster efficiency is high
Domestic Alternatives (China)
2024-2026, due to export controls:
- Huawei Ascend series: 910B / 910C, domestic leader
- Enflame (Suiyuan), Moore Threads, Cambricon, etc.
- NVIDIA H20 (export-compliant version): below H100 but usable
Major Chinese companies are accelerating adaptation; domestic substitution over the next 5 years is a clear direction.
Stop 8: Cost Estimates
A simplified model:
Training LLMs
| Model | Estimated cost |
|---|---|
| 7B pretraining | $100K |
| 70B pretraining | $10M |
| GPT-4 scale (>1T) | $100M+ |
LLM Inference
| Model | Cost per 1M tokens |
|---|---|
| GPT-4o API | $15 (output) |
| Claude 3.5 API | $15 (output) |
| Self-hosted (70B) | ~$3 (output) |
| Self-hosted (7B) | ~$0.3 (output) |
By these figures, self-hosted inference is roughly 5× (70B) to 50× (7B) cheaper per output token, but it needs an engineering team to run it.
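A self-hosted per-token price follows from two quantities: what the GPUs cost per hour and how many tokens they generate per second. The sketch below uses hypothetical numbers for both (they are not benchmarks), and the function name is made up for illustration.

```python
def usd_per_million_tokens(gpu_usd_per_hour, n_gpus, tokens_per_second):
    """Serving cost per 1M generated tokens, assuming the GPUs are kept busy."""
    tokens_per_hour = tokens_per_second * 3600
    return gpu_usd_per_hour * n_gpus / tokens_per_hour * 1_000_000


# Hypothetical deployment: 4 rented GPUs at $2.50/hour each,
# producing 1,000 tokens/second in aggregate across batched requests.
cost = usd_per_million_tokens(gpu_usd_per_hour=2.5, n_gpus=4, tokens_per_second=1000)
print(f"${cost:.2f} per 1M tokens")

# The same hardware at half the utilization costs twice as much per token.
half_utilized = usd_per_million_tokens(gpu_usd_per_hour=2.5, n_gpus=4, tokens_per_second=500)
print(f"${half_utilized:.2f} per 1M tokens at half the throughput")
```

The formula shows why utilization is the hidden variable: an idle self-hosted cluster still bills by the hour, so the savings only appear when traffic keeps the cards busy.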
VRAM is more expensive than compute. GPU design’s biggest cost driver is high-bandwidth HBM memory—this is why an 80GB card is 2× the price of a 40GB card. For learning LLM systems optimization, always focus on VRAM first—every trick that fits a 70B model on one card (quantization, CPU offloading, PagedAttention) is drilling into this point.
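Why quantization is the first tool people reach for becomes obvious from the weight footprint alone. The sketch below counts only the weights (no KV cache, no activations) and assumes an 80GB card; the function name is made up for illustration.

```python
def weight_memory_gb(n_params, bits_per_weight):
    """VRAM needed just to hold the model weights, in GB."""
    return n_params * bits_per_weight / 8 / 1e9


CARD_VRAM_GB = 80  # assumed single-card capacity (A100/H100 80GB class)

for label, bits in [("fp16", 16), ("int8", 8), ("int4", 4)]:
    size = weight_memory_gb(70e9, bits)
    verdict = "fits" if size < CARD_VRAM_GB else "does not fit"
    print(f"70B @ {label}: {size:.0f} GB -> {verdict} on one {CARD_VRAM_GB}GB card")
```

In practice the KV cache and activations need room too, so the headroom left after loading the weights decides how long a context and how large a batch the card can actually serve.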
Next: “Distributed Training: DP / DDP / FSDP / Tensor Parallel” — when one card doesn’t fit, how to train with multiple cards.