L7 Chapter 1 🐣 🕒 11 min

GPU Crash Course: Why AI Can't Live Without It

What are the A100, H100, and B200? Why does AI use GPUs rather than CPUs? Open the hardware black box.

Alai

7/14/2026

L0-L6 covered “how AI works”. L7 shifts perspective—what hardware AI actually runs on in the physical world, how to deploy it, how to control costs.

First stop: GPU—the physical foundation of the AI revolution.

Stop 1: CPU vs GPU

Your laptop has a CPU—why doesn’t AI use it?

CPU: A Few “All-Rounders”

Typical CPU (e.g., M3 Pro):

10-14 cores
Each core is extremely capable (branch prediction, cache, complex instructions)
Suited for serial, diverse tasks (OS, browsers, game AI)

4 cores doing 4 different things simultaneously:
Core 1: Rendering webpage
Core 2: Playing music
Core 3: Background Slack sync
Core 4: System processes

GPU: Massive “Specialists”

Typical GPU (e.g., H100):

Up to 18,432 CUDA cores (the full GH100 die; the shipping H100 SXM enables 16,896)
Each core is simple (basically just parallel arithmetic)
Suited for massively parallel, repetitive tasks

18,432 cores working on different parts of the same task:
Cores 1-512: Compute matrix rows 1-512
Cores 513-1024: Compute rows 513-1024
... all simultaneously
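The row split above can be sketched in a few lines of plain Python. This is only an illustration of the partitioning idea: the function names and the tiny matrix are made up for the example, and the chunks run one after another here, whereas a GPU would run them at the same time.

```python
def matvec_rows(W, x, rows):
    """Compute the output entries for one slice of rows (one worker's share)."""
    return [sum(w * xi for w, xi in zip(W[r], x)) for r in rows]


def partitioned_matvec(W, x, num_workers):
    """Split the rows of W into contiguous chunks, one per hypothetical worker."""
    n = len(W)
    chunk = (n + num_workers - 1) // num_workers  # ceiling division
    y = []
    for start in range(0, n, chunk):
        # Each chunk depends only on its own rows of W and on x,
        # so the chunks could all be computed simultaneously.
        y.extend(matvec_rows(W, x, range(start, min(start + chunk, n))))
    return y


W = [[1, 2], [3, 4], [5, 6], [7, 8]]
x = [1, 1]
print(partitioned_matvec(W, x, num_workers=2))  # [3, 7, 11, 15]
```

No chunk needs a result from any other chunk, which is exactly the property that lets thousands of simple cores work at once.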

Why Neural Networks Suit GPUs

Neural network computation is almost entirely large matrix operations—

y = W x + b    ← W might be 1000×1000 = 1M multiplications

1M multiplications:

CPU: about a dozen cores, each working through its share step by step, so it is slow
GPU: thousands of cores in parallel, often tens to hundreds of times faster

A data point: even on a single V100 GPU, training GPT-3 would take an estimated 355 years (a CPU would be far slower still). On a cluster of about 1,000 A100s: roughly 34 days.

Stop 2: NVIDIA’s Dominance

Why does AI = NVIDIA?

CUDA Ecosystem

NVIDIA released CUDA in 2007—a GPU programming interface designed specifically for general-purpose computing.

By 2026, almost every major AI framework (PyTorch, TensorFlow, JAX) targets CUDA first, backed by NVIDIA's CUDA-X libraries such as cuDNN, cuBLAS, and NCCL.

AMD and Intel have GPUs too, but their ecosystems are far less mature than CUDA—this is NVIDIA’s moat.

NVIDIA GPU Generations

Model	Year	VRAM	TFLOPS (FP16 Tensor Core, dense)	Position
V100	2017	16/32GB	125	First Tensor Core card; used to train GPT-3
A100	2020	40/80GB	312	Workhorse of 2020-2022; reportedly used to train GPT-4
H100	2022	80GB	989	Mainstream LLM training card from 2023
H200	2024	141GB	989	H100 upgrade (more and faster memory)
B100/B200	2025	192GB	2000+	Current king
GB200 NVL72	2025	cluster	100,000+ (72 GPUs)	Full-rack solution

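Each new architecture (Volta → Ampere → Hopper → Blackwell) brings roughly 2-3× the FP16 throughput; the H200 is a memory upgrade, not a compute upgrade.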
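As a quick check on that multiple, the ratios can be computed straight from the table. This is a small sketch: the dictionary copies the FP16 column above and treats the B200's "2000+" as exactly 2000, which is a simplifying assumption.

```python
# Peak FP16 TFLOPS copied from the table above ("2000+" is taken as 2000).
fp16_tflops = {"V100": 125, "A100": 312, "H100": 989, "B200": 2000}


def generation_speedups(tflops):
    """Return (older, newer, ratio) for each consecutive pair, in insertion order."""
    names = list(tflops)
    return [(a, b, tflops[b] / tflops[a]) for a, b in zip(names, names[1:])]


for older, newer, ratio in generation_speedups(fp16_tflops):
    print(f"{older} -> {newer}: {ratio:.1f}x")
# V100 -> A100: 2.5x
# A100 -> H100: 3.2x
# H100 -> B200: 2.0x
```

These are peak numbers only; real training speedups also depend on memory bandwidth and interconnect.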

Consumer vs Datacenter

Category	Example	VRAM	Price
Consumer	RTX 4090	24GB	~$2,000 (launch MSRP $1,599)
Consumer	RTX 5090	32GB	~$3,000 (launch MSRP $1,999)
Datacenter	A100 80GB	80GB	~$15,000
Datacenter	H100	80GB	~$30,000
Datacenter	B200	192GB	~$60,000

Why are datacenter cards so expensive?

Huge VRAM (required for big models)
NVLink high-speed interconnect (fast multi-card communication)
Long-term stability (24/7 training for months)
Strict QA and warranty

An 8-card H100 server is ~$300K—this is the physical source of “training big models is expensive”.

Stop 3: GPU Internal Components

Some terminology to learn:

CUDA Cores

The most basic processing unit—does arithmetic.

Tensor Cores

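Units built specifically to accelerate matrix multiply-accumulate, introduced in 2017 with the Volta architecture. Almost all neural network compute runs on them. On the same H100, matrix math on Tensor Cores is roughly 7-15× faster than on the general-purpose CUDA cores, depending on the precision compared.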

VRAM

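Stores data and model parameters. The bottleneck in deep learning is usually VRAM, not compute:

Training a 7B model with Adam in mixed precision takes roughly 16 bytes per parameter (weights, gradients, and optimizer state), about 112GB before activations. That overflows a single 80GB A100 or H100 unless you use tricks such as 8-bit optimizers, offloading, or LoRA. A 70B model has to be sharded across many cards with model parallelism.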
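A back-of-envelope estimate makes the VRAM pressure concrete. The sketch below assumes the common mixed-precision Adam accounting of 16 bytes per parameter (2 for FP16 weights, 2 for FP16 gradients, 12 for FP32 master weights plus two Adam moments). That figure is an assumption brought in for illustration, it ignores activations, and leaner setups (8-bit optimizers, LoRA) come in well below it.

```python
def training_memory_gb(num_params, bytes_per_param=16):
    """Rough VRAM for weights + gradients + optimizer state, ignoring activations.

    bytes_per_param=16 is the assumed mixed-precision Adam accounting:
    2 (FP16 weights) + 2 (FP16 grads) + 12 (FP32 master copy, momentum, variance).
    """
    return num_params * bytes_per_param / 1e9


for name, n in [("7B", 7e9), ("70B", 70e9)]:
    print(f"{name}: ~{training_memory_gb(n):.0f} GB before activations")
# 7B: ~112 GB before activations
# 70B: ~1120 GB before activations
```

Changing `bytes_per_param` is a quick way to see how much a different optimizer or precision choice would save.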

NVLink / NVSwitch

High-speed links between cards. On an H100, NVLink provides 900 GB/s per GPU, roughly 7× a PCIe Gen5 x16 slot. In multi-card training, gradient and activation sync between cards travels over these links, so NVLink bandwidth sets the scaling ceiling.
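To get a feel for why link bandwidth matters, here is a rough timing sketch for moving one full set of FP16 gradients of a 7B model. The bandwidth figures are assumed nominal peaks (900 GB/s for H100 NVLink, 128 GB/s for PCIe Gen5 x16), which puts the ratio at about 7×; the exact multiple depends on which generations are compared. Real all-reduce traffic patterns and protocol overhead are ignored.

```python
def transfer_seconds(num_bytes, bandwidth_gb_per_s):
    """Ideal transfer time at a given peak bandwidth (no latency, no overhead)."""
    return num_bytes / (bandwidth_gb_per_s * 1e9)


grad_bytes = 7e9 * 2  # 7B parameters, 2 bytes each (FP16 gradients)

# Assumed nominal peak bandwidths, in GB/s.
nvlink_s = transfer_seconds(grad_bytes, 900)
pcie_s = transfer_seconds(grad_bytes, 128)

print(f"NVLink: {nvlink_s * 1000:.1f} ms per sync")
print(f"PCIe:   {pcie_s * 1000:.1f} ms per sync")
print(f"Ratio:  {pcie_s / nvlink_s:.1f}x")
```

Since this sync happens every training step, a difference of tens of milliseconds per step adds up over millions of steps.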

Stop 4: Memory Hierarchy (Most Important Engineering Concept)

A GPU has several tiers of memory with very different speeds:

Speed
↑
Registers              256 KB/SM           ultra fast    ←
Shared memory / L1     up to ~228 KB/SM    fast          ←
L2 cache               40-50 MB            somewhat fast
VRAM (HBM/GDDR)        80-192 GB           slow
PCIe to CPU memory     system RAM          ultra slow

↓ lower = larger and slower

A counter-intuitive fact: GPU compute grows much faster than memory speed. Result: today’s GPUs often “compute too fast, waiting on memory”—this is called memory-bound.
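One way to make "memory-bound" concrete is to compare an operation's arithmetic intensity (FLOPs per byte moved) with the ratio of the GPU's peak compute to its memory bandwidth. The sketch below uses the 989 TFLOPS figure from the table above and assumes 3.35 TB/s of HBM bandwidth for an H100 SXM. The function name and the two example workloads are illustrative.

```python
def is_memory_bound(flops, bytes_moved, peak_tflops=989, bandwidth_tb_per_s=3.35):
    """True if the op does fewer FLOPs per byte than the GPU can sustain.

    Defaults are assumed H100 SXM peaks: 989 TFLOPS FP16, 3.35 TB/s HBM.
    """
    intensity = flops / bytes_moved          # FLOPs per byte for this op
    ridge = peak_tflops / bandwidth_tb_per_s  # FLOPs per byte the GPU can feed
    return intensity < ridge


n = 4096

# Elementwise add of two FP16 vectors: 1 FLOP per element, 6 bytes moved
# (read two 2-byte values, write one).
print(is_memory_bound(flops=n, bytes_moved=6 * n))               # True

# n x n FP16 matmul: 2*n^3 FLOPs, three n x n matrices of 2 bytes each.
print(is_memory_bound(flops=2 * n**3, bytes_moved=3 * n * n * 2))  # False
```

With these assumed peaks the GPU needs roughly 300 FLOPs per byte fetched to stay busy, so anything that touches lots of data but does little math with it ends up waiting on memory.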

The Essence of Flash Attention

In L3-05 we learned the Attention formula:

$$\text{Attention}(Q, K, V) = \text{softmax}\left(QK^T / \sqrt{d}\right) V$$

Traditional implementation:

Compute $QK^T$ and store it to VRAM (an $n \times n$ matrix)
Compute the softmax
Multiply by $V$

VRAM reads and writes scale as $n^2$, which is very slow for long sequences.
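A quick calculation shows how fast the $n^2$ term grows. The sketch assumes FP16 scores (2 bytes each) and counts a single attention head for a single sequence; real models multiply this by the number of heads and the batch size.

```python
def score_matrix_gib(seq_len, bytes_per_elem=2):
    """Size of one n x n attention score matrix in GiB (FP16 assumed by default)."""
    return seq_len * seq_len * bytes_per_elem / 2**30


for n in (4096, 32768, 131072):
    print(f"n = {n:>6}: {score_matrix_gib(n):g} GiB per head")
# n =   4096: 0.03125 GiB per head
# n =  32768: 2 GiB per head
# n = 131072: 32 GiB per head
```

Going from 4K to 128K tokens is a 32× longer sequence but a 1024× larger score matrix.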

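Flash Attention (2022) trick: process Q, K, and V in small tiles that fit in fast on-chip “shared memory”, keep a running softmax per row, and never write the full $n \times n$ score matrix to slow VRAM. Result: the same exact attention output, 2-5× faster, with memory use that grows linearly in $n$ instead of quadratically.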
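The core numerical idea can be sketched in pure Python for a single query row. This is a simplified illustration of the running ("online") softmax, not the real kernel: actual Flash Attention is a fused GPU kernel that tiles Q as well, and the function names and block size here are made up for the example.

```python
import math


def attention_row_naive(q, K, V):
    """Standard attention for one query: materializes all n scores at once."""
    d = len(q)
    scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in K]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    return [sum(w * v[j] for w, v in zip(weights, V)) / total
            for j in range(len(V[0]))]


def attention_row_tiled(q, K, V, block=2):
    """Same result, but only one block of scores exists at any time."""
    d = len(q)
    running_max = float("-inf")
    running_sum = 0.0
    acc = [0.0] * len(V[0])  # unnormalized weighted sum of V rows
    for start in range(0, len(K), block):
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d)
                  for k in K[start:start + block]]
        new_max = max(running_max, max(scores))
        # Rescale what was accumulated so far to the new maximum
        # (exp(-inf) is 0.0, so the first block starts from zero).
        scale = math.exp(running_max - new_max)
        running_sum *= scale
        acc = [a * scale for a in acc]
        for s, v in zip(scores, V[start:start + block]):
            p = math.exp(s - new_max)
            running_sum += p
            acc = [a + p * vj for a, vj in zip(acc, v)]
        running_max = new_max
    return [a / running_sum for a in acc]


q = [1.0, 0.5]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -1.0], [2.0, 0.3]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0], [9.0, 10.0]]

naive = attention_row_naive(q, K, V)
tiled = attention_row_tiled(q, K, V, block=2)
print(all(abs(a - b) < 1e-9 for a, b in zip(naive, tiled)))  # True
```

The tiled version never holds more than `block` scores at a time, which is what lets the real kernel keep its working set in on-chip memory.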

This is a major reason “long context” became practical in 2023-2024: Flash Attention removed the quadratic memory bottleneck in attention.

Stop 5: How to Choose a GPU

By scenario:

Learning

Google Colab free T4: enough for demos
RTX 3090 / 4090: local dev, can train LoRA on 7B models

Individual / Small Team

2-4× 4090s: home-built lab
Rent GPU cloud (Vast.ai, Lambda Labs): pay-per-use, save capex

Mid-Size Company AI Project

8-card A100/H100 server: own one for the steady workload and rent cloud GPUs for peaks
AWS/GCP/Azure cloud: dynamic scaling

Large LLM Training

1000+ H100 cluster: typically built with NVIDIA, Supermicro, etc.
Dedicated cloud: OpenAI-Microsoft, Anthropic-AWS have dedicated compute commitments

Stop 6: Training vs Inference

GPUs serve two scenarios:

Training

Needs: massive compute + large VRAM + fast interconnect
Duration: days to months
Priority: throughput (samples per second)
Suited: H100, A100, B200 “training cards”

Inference

Needs: low latency + high throughput + low cost
Duration: milliseconds-seconds per query
Priority: response time + cost per query
Suited: A10, L40S, RTX 4090, even smaller cards

Key difference:

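Training: a cluster of expensive cards running flat out for days to months, then done (a one-off cost)
Inference: a fleet of cheaper cards running 24/7 for as long as the product is live (a recurring cost)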

Cost structures are completely different.
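A toy calculation shows how the two cost shapes compare. Every number below (GPU counts, durations, hourly prices) is a hypothetical placeholder chosen for illustration, not a quote from any provider.

```python
def gpu_cost(num_gpus, hours, price_per_gpu_hour):
    """Total rental cost in USD for a fleet of GPUs over a period."""
    return num_gpus * hours * price_per_gpu_hour


# Hypothetical training run: 64 high-end cards for two weeks at $3.00/GPU-hour.
training_once = gpu_cost(num_gpus=64, hours=14 * 24, price_per_gpu_hour=3.0)

# Hypothetical serving fleet: 16 cheaper cards, 24/7, at $1.00/GPU-hour.
inference_monthly = gpu_cost(num_gpus=16, hours=30 * 24, price_per_gpu_hour=1.0)

print(f"Training (one-off):    ${training_once:,.0f}")
print(f"Inference (per month): ${inference_monthly:,.0f}")
print(f"Months until serving costs pass training: {training_once / inference_monthly:.1f}")
```

Under these made-up numbers the serving bill overtakes the training bill in under six months, which is why inference cost per query gets so much engineering attention.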

Stop 7: Alternatives

NVIDIA isn’t the only option:

AMD GPU (MI300X)

192GB VRAM (larger than H100)
Performance close to H100
Reportedly around 20% cheaper
Downside: not CUDA-compatible. PyTorch runs on ROCm, but custom CUDA kernels and CUDA-only libraries have to be ported

AMD has invested heavily in ROCm recently, but compatibility still lags. Meta and other big companies have started mixing AMD and NVIDIA.

Google TPU

Google’s in-house chip, optimized for TensorFlow / JAX—

Google Cloud exclusive
Gemini and similar trained on TPUs
Single TPU underperforms H100, but cluster efficiency is high

Domestic Alternatives (China)

2024-2026, due to export controls:

Huawei Ascend series: 910B / 910C, domestic leader
Enflame (Suiyuan), Moore Threads, Cambricon, etc.
NVIDIA H20 (export-compliant version): well below H100 in compute but usable

Large Chinese companies are accelerating their adaptation work; the shift toward domestic substitution over the next five years looks clear.

Stop 8: Cost Estimates

A simplified model:

Training LLMs

Model	Estimated cost
7B pretraining	$50K-$100K
70B pretraining	$1M-$10M
GPT-4 scale (>1T)	$50M-$100M+

LLM Inference

Model	Cost per 1M tokens
GPT-4o API	$5 input / $15 output
Claude 3.5 API	$3 input / $15 output
Self-hosted (70B)	~$1 input / ~$3 output
Self-hosted (7B)	~$0.1 input / ~$0.3 output

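By these estimates, self-hosting a 70B model is roughly 3-5× cheaper per token than the APIs above, and a 7B model is cheaper still, but only at high utilization and with an engineering team to run it.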
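The self-hosted figures depend heavily on how busy the GPUs are. A minimal sketch of the arithmetic follows; the hourly price, GPU count, and throughput are hypothetical inputs rather than measurements.

```python
def cost_per_million_tokens(gpu_hourly_usd, num_gpus, tokens_per_second, utilization=1.0):
    """USD per 1M tokens for a self-hosted deployment.

    tokens_per_second is the aggregate throughput at full load;
    utilization is the fraction of time the fleet is actually serving traffic.
    """
    tokens_per_hour = tokens_per_second * 3600 * utilization
    return gpu_hourly_usd * num_gpus / tokens_per_hour * 1e6


# Hypothetical 70B deployment: 4 GPUs at $2.50/hour, 2,000 tokens/s in aggregate.
busy = cost_per_million_tokens(2.5, 4, 2000, utilization=1.0)
quiet = cost_per_million_tokens(2.5, 4, 2000, utilization=0.3)

print(f"Fully utilized: ${busy:.2f} per 1M tokens")
print(f"30% utilized:   ${quiet:.2f} per 1M tokens")
```

The same hardware costs over three times more per token when it sits idle 70% of the time, so the savings only hold if traffic keeps the fleet busy.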

💡 An engineering law

VRAM is the scarcer resource. High-bandwidth HBM memory is one of the biggest cost drivers in a datacenter GPU, which is part of why the higher-capacity version of a card carries a large premium. When learning LLM systems optimization, always look at VRAM first: every trick that fits a 70B model on one card or stretches memory further (quantization, CPU offloading, PagedAttention) is attacking this constraint.
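Quantization is the easiest of these tricks to put into numbers. The sketch below counts weight memory only; the KV cache and activations need additional room, so treat the results as a lower bound.

```python
def weight_memory_gb(num_params, bits_per_param):
    """Memory needed just to hold the model weights, in GB."""
    return num_params * bits_per_param / 8 / 1e9


for label, bits in [("FP16", 16), ("INT8", 8), ("INT4", 4)]:
    gb = weight_memory_gb(70e9, bits)
    fits = "fits" if gb < 80 else "does not fit"
    print(f"70B @ {label}: {gb:.0f} GB, {fits} on one 80GB card")
# 70B @ FP16: 140 GB, does not fit on one 80GB card
# 70B @ INT8: 70 GB, fits on one 80GB card
# 70B @ INT4: 35 GB, fits on one 80GB card
```

At INT8 the weights fit with only about 10GB to spare, which in practice leaves little room for the KV cache; 4-bit is the more comfortable single-card option.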

Next: “Distributed Training: DP / DDP / FSDP / Tensor Parallel” — when one card doesn’t fit, how to train with multiple cards.