HelloAI
L7 Chapter 1 🐣 🕒 12 min

GPU Crash Course: Why AI Can't Live Without It

What is A100, H100, B200? Why does AI use GPUs not CPUs? Open the hardware black box.

Alai
7/14/2026

L0-L6 covered “how AI works”. L7 shifts perspective—what hardware AI actually runs on in the physical world, how to deploy it, how to control costs.

First stop: GPU—the physical foundation of the AI revolution.

Stop 1: CPU vs GPU

Your laptop has a CPU—why doesn’t AI use it?

CPU: A Few “All-Rounders”

Typical CPU (e.g., M3 Pro):

  • 11-12 cores
  • Each core is extremely capable (branch prediction, cache, complex instructions)
  • Suited for serial, diverse tasks (OS, browsers, game AI)
4 cores doing 4 different things simultaneously:
Core 1: Rendering webpage
Core 2: Playing music
Core 3: Background Slack sync
Core 4: System processes

GPU: Massive “Specialists”

Typical GPU (e.g., H100):

  • 16,896 CUDA cores (SXM version)
  • Each core is simple (basically just parallel arithmetic)
  • Suited for massively parallel, repetitive tasks
16,896 cores doing different parts of the same task:
Cores 1-512: Compute matrix rows 1-512
Cores 513-1024: Compute rows 513-1024
... all simultaneously

Why Neural Networks Suit GPUs

Neural network computation is almost entirely large matrix operations

y = W x + b    ← W might be 1000×1000 = 1M multiplications

1M multiplications:

  • CPU: a dozen or so cores, each working through its share one step at a time: slow
  • GPU: well over ten thousand cores working in parallel, often tens to hundreds of times faster
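To make the "split the rows across cores" idea concrete, here is a minimal NumPy sketch. It runs on the CPU and only illustrates why the work divides so cleanly; the function name `matvec_by_row_blocks` and the block size of 512 are illustrative choices, not anything a GPU library exposes.

```python
import numpy as np

def matvec_by_row_blocks(W, x, b, block_size):
    """Compute y = W x + b one block of rows at a time.

    Each block only needs its own rows of W, so on a GPU every block
    (in fact every row) could be handed to a different group of cores.
    """
    n_rows = W.shape[0]
    y = np.empty(n_rows, dtype=W.dtype)
    for start in range(0, n_rows, block_size):
        stop = min(start + block_size, n_rows)
        y[start:stop] = W[start:stop] @ x + b[start:stop]
    return y

rng = np.random.default_rng(0)
W = rng.standard_normal((1000, 1000))
x = rng.standard_normal(1000)
b = rng.standard_normal(1000)

y_blocks = matvec_by_row_blocks(W, x, b, block_size=512)
y_full = W @ x + b

num_multiplications = W.shape[0] * W.shape[1]  # one multiply per weight
print(num_multiplications)            # 1000000
print(np.allclose(y_blocks, y_full))  # True
```

No block ever waits on another block's result, which is exactly the property that lets thousands of simple cores work at once.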

A commonly cited data point: training GPT-3 on a single V100 GPU would take an estimated 355 years, and a CPU would be far slower still. On a cluster of about a thousand A100s: roughly 34 days.

Stop 2: NVIDIA’s Dominance

Why does AI = NVIDIA?

CUDA Ecosystem

NVIDIA released CUDA in 2007—a GPU programming interface designed specifically for general-purpose computing.

By 2026, almost every major AI framework (PyTorch, TensorFlow, JAX) treats CUDA as its primary GPU backend, and NVIDIA’s own CUDA-X libraries (cuDNN, cuBLAS, NCCL) sit underneath them.

AMD and Intel have GPUs too, but their ecosystems are far less mature than CUDA—this is NVIDIA’s moat.

NVIDIA GPU Generations

Model          Year   VRAM       TFLOPS (FP16)                 Position
V100           2017   32GB       125                           First Tensor Core card; used to train GPT-3
A100           2020   40/80GB    312                           Workhorse of the 2020-2022 training wave
H100           2022   80GB       989                           Standard card for frontier training since 2023
H200           2024   141GB      989                           H100 upgrade (more memory)
B100/B200      2025   192GB      2000+                         Current king
GB200 NVL72    2025   cluster    100,000+ (72 GPUs combined)   Full-rack solution

Each generation brings 2-3× performance.
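A quick check of that claim against the FP16 figures in the table, as a small Python sketch. The B200 entry uses 2000 as a lower bound, since the table only says "2000+".

```python
# FP16 TFLOPS per generation, taken from the table above.
fp16_tflops = {"V100": 125, "A100": 312, "H100": 989, "B200": 2000}

names = list(fp16_tflops)
speedups = {
    f"{prev}->{cur}": round(fp16_tflops[cur] / fp16_tflops[prev], 2)
    for prev, cur in zip(names, names[1:])
}
print(speedups)
# {'V100->A100': 2.5, 'A100->H100': 3.17, 'H100->B200': 2.02}
```

The ratios land between roughly 2× and 3× per generation, and they compound: V100 to B200 is a 16× jump in about eight years.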

Consumer vs Datacenter

Category      Example      VRAM     Price
Consumer      RTX 4090     24GB     $2,000
Consumer      RTX 5090     32GB     $3,000
Datacenter    A100 80GB    80GB     ~$15,000
Datacenter    H100         80GB     ~$30,000
Datacenter    B200         192GB    ~$60,000

Why are datacenter cards so expensive?

  • Huge VRAM (required for big models)
  • NVLink high-speed interconnect (fast multi-card communication)
  • Long-term stability (24/7 training for months)
  • Strict QA and warranty

An 8-card H100 server is ~$300K—this is the physical source of “training big models is expensive”.

Stop 3: GPU Internal Components

Some terminology to learn:

CUDA Cores

The most basic processing unit—does arithmetic.

Tensor Cores

Specifically accelerate matrix operations—introduced in 2017 with Volta architecture. Almost all neural network compute uses them. On the same H100, Tensor Cores are 8× faster than CUDA Cores.

VRAM

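Stores data and model parameters. The bottleneck in deep learning is usually VRAM, not compute:

Full training of a 7B model with Adam in mixed precision needs roughly 16 bytes per parameter for weights, gradients, and optimizer state: about 112GB before activations. That does not fit on a single 80GB A100 without memory-saving techniques such as 8-bit optimizers, LoRA, or sharding across cards. A 70B model needs several H100s plus model parallelism.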
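The "params + optimizer state" part can be estimated on the back of an envelope. The sketch below assumes one common setup, mixed-precision training with Adam, where each parameter costs about 16 bytes; that breakdown is an assumption for illustration, and leaner setups (8-bit optimizers, pure bf16, LoRA) come out lower. Activations are not included.

```python
# Assumed per-parameter byte costs for mixed-precision Adam training.
BYTES_PER_PARAM = {
    "fp16 weights": 2,
    "fp16 gradients": 2,
    "fp32 master weights": 4,
    "fp32 Adam momentum": 4,
    "fp32 Adam variance": 4,
}

def training_memory_gb(n_params, bytes_per_param=sum(BYTES_PER_PARAM.values())):
    """VRAM for weights, gradients and optimizer state (activations excluded)."""
    return n_params * bytes_per_param / 1e9

print(training_memory_gb(7_000_000_000))   # 112.0
print(training_memory_gb(70_000_000_000))  # 1120.0
```

Under these assumptions a 70B model needs over a terabyte of state, which is why it has to be spread across many cards.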

NVLink

High-speed communication between cards, roughly 7-10× faster than PCIe depending on the generation. In multi-card training, data sync between cards happens here; NVLink determines the ceiling.
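A rough feel for why the interconnect matters, as a Python sketch. The bandwidth numbers are assumptions for illustration: about 64 GB/s for a PCIe 5.0 x16 link in one direction, and a 10× ratio for NVLink in line with the text. A real gradient sync (all-reduce) moves more than one copy of the data, so treat this as a lower bound per step.

```python
def transfer_seconds(num_bytes, bandwidth_gb_per_s):
    """Time to move num_bytes over a link with the given bandwidth in GB/s."""
    return num_bytes / (bandwidth_gb_per_s * 1e9)

# Assumed numbers, for illustration only.
PCIE_GB_PER_S = 64                    # roughly PCIe 5.0 x16, one direction
NVLINK_GB_PER_S = 10 * PCIE_GB_PER_S  # applying a 10x ratio

grad_bytes = 7_000_000_000 * 2        # fp16 gradients of a 7B model = 14 GB

pcie_time = transfer_seconds(grad_bytes, PCIE_GB_PER_S)
nvlink_time = transfer_seconds(grad_bytes, NVLINK_GB_PER_S)
print(f"PCIe:   {pcie_time:.3f} s per sync")
print(f"NVLink: {nvlink_time:.3f} s per sync")
```

That gap is paid on every training step, so over millions of steps it decides whether the GPUs spend their time computing or waiting.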

Stop 4: Memory Hierarchy (Most Important Engineering Concept)

A GPU has several levels of memory with very different speeds:

Level                  Capacity           Speed
Registers              256 KB/SM          ultra fast    ←
Shared memory / L1     up to 228 KB/SM    fast          ←
L2 cache               40-50 MB           somewhat fast
VRAM (HBM/GDDR)        80-192 GB          slow
PCIe to CPU memory     system RAM         ultra slow

↓ lower = larger and slower

A counter-intuitive fact: GPU compute has grown much faster than memory bandwidth. Result: today’s GPUs often “compute too fast” and sit waiting on memory. This is called memory-bound.
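One way to tell whether an operation is memory-bound is to compare its arithmetic intensity (FLOPs per byte moved) with the ratio of the card's peak compute to its memory bandwidth. The sketch below uses the H100's 989 TFLOPS from the table in Stop 2 and assumes an HBM bandwidth of 3.35 TB/s for the SXM version; the byte count is idealized (every matrix read or written exactly once).

```python
PEAK_FLOPS = 989e12      # H100 FP16 peak, from the generations table
MEM_BANDWIDTH = 3.35e12  # bytes/s, assumed HBM bandwidth of an H100 SXM

def matmul_intensity(m, n, k, bytes_per_elem=2):
    """FLOPs per byte for an (m x k) @ (k x n) matmul in fp16."""
    flops = 2 * m * n * k                                # multiply + add
    bytes_moved = bytes_per_elem * (m * k + k * n + m * n)
    return flops / bytes_moved

def is_memory_bound(intensity):
    """True if the GPU can compute faster than memory can feed it."""
    return intensity < PEAK_FLOPS / MEM_BANDWIDTH

matvec = matmul_intensity(4096, 1, 4096)       # one token at a time
big_matmul = matmul_intensity(4096, 4096, 4096)

print(round(PEAK_FLOPS / MEM_BANDWIDTH))  # 295 FLOPs per byte needed
print(is_memory_bound(matvec))            # True
print(is_memory_bound(big_matmul))        # False
```

A matrix-vector product does about 1 FLOP per byte, far below the roughly 295 the card needs to stay busy, which is why single-request inference tends to be memory-bound while large batched matmuls are not.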

The Essence of Flash Attention

In L3-05 we learned the Attention formula:

$\text{softmax}(QK^T / \sqrt{d})\, V$

Traditional implementation:

  1. Compute $QK^T$, store it to VRAM (size $n^2$)
  2. Compute softmax
  3. Multiply by V

VRAM reads/writes scale as $n^2$, which is super slow.

Flash Attention (2022) trick: process the computation in small tiles and keep the intermediate results in the fast “shared memory” instead of writing the n × n matrix to slow VRAM. Result: same attention computation, 2-5× faster.
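The core idea can be sketched in NumPy: walk over K and V in blocks and keep a running softmax, so the full n × n score matrix never exists at once. This is a simplified illustration, not the real kernel: it runs on the CPU, tiles only K and V (the real implementation also tiles Q and runs inside on-chip memory), and the function names and block size are illustrative.

```python
import numpy as np

def naive_attention(Q, K, V):
    """Reference version: materializes the full n x n score matrix."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)
    scores -= scores.max(axis=1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ V

def tiled_attention(Q, K, V, block_size=64):
    """Same result, but only an n x block_size slice of scores exists at a time."""
    n, d = Q.shape
    out = np.zeros((n, V.shape[1]))
    row_max = np.full(n, -np.inf)   # running max of each row's scores
    row_sum = np.zeros(n)           # running softmax denominator
    for start in range(0, K.shape[0], block_size):
        Kb = K[start:start + block_size]
        Vb = V[start:start + block_size]
        scores = Q @ Kb.T / np.sqrt(d)
        new_max = np.maximum(row_max, scores.max(axis=1))
        scale = np.exp(row_max - new_max)         # rescale earlier blocks
        p = np.exp(scores - new_max[:, None])
        row_sum = row_sum * scale + p.sum(axis=1)
        out = out * scale[:, None] + p @ Vb
        row_max = new_max
    return out / row_sum[:, None]

rng = np.random.default_rng(0)
Q = rng.standard_normal((256, 32))
K = rng.standard_normal((256, 32))
V = rng.standard_normal((256, 32))
print(np.allclose(naive_attention(Q, K, V), tiled_attention(Q, K, V)))  # True
```

The rescaling step is what makes this work: whenever a later block contains a larger score, the earlier partial sums are shrunk to match, so the final answer equals the ordinary softmax.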

This is why “long context” suddenly became feasible in 2023-2024—Flash Attention broke the memory bottleneck.

Stop 5: How to Choose a GPU

By scenario:

Learning

  • Google Colab free T4: enough for demos
  • RTX 3090 / 4090: local dev, can train LoRA on 7B models

Individual / Small Team

  • 2-4× 4090s: home-built lab
  • Rent GPU cloud (Vast.ai, Lambda Labs): pay-per-use, save capex

Mid-Size Company AI Project

  • 8-card A100/H100 server: buy one and supplement it with cloud capacity
  • AWS/GCP/Azure cloud: dynamic scaling

Large LLM Training

  • 1000+ H100 cluster: typically built with NVIDIA, Supermicro, etc.
  • Dedicated cloud: OpenAI-Microsoft, Anthropic-AWS have dedicated compute commitments

Stop 6: Training vs Inference

GPUs serve two scenarios:

Training

  • Needs: massive compute + large VRAM + fast interconnect
  • Duration: days to months
  • Priority: throughput (samples per second)
  • Suited: H100, A100, B200 “training cards”

Inference

  • Needs: low latency + high throughput + low cost
  • Duration: milliseconds-seconds per query
  • Priority: response time + cost per query
  • Suited: A10, L40S, RTX 4090, even smaller cards

Key difference:

  • Training: a one-time bill, expensive cards running flat out for a fixed period
  • Inference: a recurring bill, cheaper cards serving traffic 24 hours a day

Cost structures are completely different.

Stop 7: Alternatives

NVIDIA isn’t the only option:

AMD GPU (MI300X)

  • 192GB VRAM (larger than H100)
  • Performance close to H100
  • 20% cheaper
  • Downside: not compatible with the CUDA ecosystem, so CUDA-specific code has to be ported

AMD has invested heavily in ROCm recently, but compatibility still lags. Meta and other big companies have started mixing AMD and NVIDIA.

Google TPU

Google’s in-house chip, optimized for TensorFlow / JAX

  • Google Cloud exclusive
  • Gemini and similar trained on TPUs
  • Single TPU underperforms H100, but cluster efficiency is high

Domestic Alternatives (China)

2024-2026, due to export controls:

  • Huawei Ascend series: 910B / 910C, domestic leader
  • Enflame (Suiyuan), Moore Threads, Cambricon, etc.
  • NVIDIA H20 (export version): below H100 but usable

Chinese tech giants are accelerating adaptation, and domestic substitution looks like the clear direction over the next 5 years.

Stop 8: Cost Estimates

A simplified model:

Training LLMs

Model                 Estimated cost
7B pretraining        $50K - $100K
70B pretraining       $1M - $10M
GPT-4 scale (>1T)     $50M - $100M+
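Where might numbers like these come from? A common rule of thumb puts training compute at about 6 FLOPs per parameter per token. The sketch below applies it with assumed inputs that are not from this lesson: 1T tokens for the 7B model, 2T for the 70B model, an H100 running at 40% of its 989 TFLOPS peak, and a rental price of $2.50 per GPU-hour. Change any of these and the result moves accordingly.

```python
H100_PEAK_FLOPS = 989e12   # FP16 peak from the generations table

def training_cost_usd(n_params, n_tokens, utilization=0.4, price_per_gpu_hour=2.5):
    """Rough pretraining cost using the ~6 * params * tokens FLOPs rule of thumb."""
    total_flops = 6 * n_params * n_tokens
    effective_flops_per_s = H100_PEAK_FLOPS * utilization
    gpu_hours = total_flops / effective_flops_per_s / 3600
    return gpu_hours * price_per_gpu_hour

cost_7b = training_cost_usd(7e9, 1e12)    # assumed: 1T training tokens
cost_70b = training_cost_usd(70e9, 2e12)  # assumed: 2T training tokens
print(f"7B:  ${cost_7b:,.0f}")
print(f"70B: ${cost_70b:,.0f}")
```

With these assumptions both results fall inside the ranges in the table, which suggests the table reflects raw compute cost for a single successful run, before failed runs, experiments, and staff.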

LLM Inference

Model                 Cost per 1M tokens
GPT-4o API            $5 input / $15 output
Claude 3.5 API        $3 input / $15 output
Self-hosted (70B)     ~$1 input / $3 output
Self-hosted (7B)      ~$0.1 input / $0.3 output

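Self-hosted inference can be several times cheaper per token, and 10× or more with smaller models, but only if the GPUs stay busy and you have an engineering team to run them.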
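The comparison can be worked through with a small Python sketch. The API prices come from the table above; the self-hosted inputs ($2.50 per GPU-hour, 2,500 tokens per second of aggregate throughput) are hypothetical values chosen for illustration, not measurements.

```python
def api_cost_usd(input_tokens, output_tokens, input_price, output_price):
    """API bill, with prices given in dollars per 1M tokens."""
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000

def self_hosted_cost_per_million(gpu_price_per_hour, tokens_per_second, utilization=1.0):
    """Dollars per 1M tokens when you pay for the GPU whether it is busy or not."""
    tokens_per_hour = tokens_per_second * 3600 * utilization
    return gpu_price_per_hour / tokens_per_hour * 1_000_000

# 800K input + 200K output tokens at the GPT-4o prices from the table.
print(api_cost_usd(800_000, 200_000, input_price=5, output_price=15))  # 7.0

# Hypothetical self-hosted setup: fully busy vs. busy 10% of the time.
busy = self_hosted_cost_per_million(2.5, 2500)
mostly_idle = self_hosted_cost_per_million(2.5, 2500, utilization=0.1)
print(f"${busy:.2f} vs ${mostly_idle:.2f} per 1M tokens")
```

The second comparison is the catch: an idle GPU still costs money, so the per-token advantage of self-hosting shrinks quickly when traffic is low.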

💡 An engineering law

VRAM is more expensive than compute. GPU design’s biggest cost driver is high-bandwidth HBM memory—this is why an 80GB card is 2× the price of a 40GB card. For learning LLM systems optimization, always focus on VRAM first—every trick that fits a 70B model on one card (quantization, CPU offloading, PagedAttention) is drilling into this point.
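Quantization is the easiest of these tricks to put into numbers. The sketch below counts weight memory only, using the standard byte sizes for each precision; the KV cache and activations need extra room on top, so "fits" here is a necessary condition, not a guarantee.

```python
BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}

def weight_memory_gb(n_params, precision):
    """VRAM needed just to hold the model weights."""
    return n_params * BYTES_PER_PARAM[precision] / 1e9

def fits_on_card(n_params, precision, vram_gb):
    """True if the weights alone fit in the given amount of VRAM."""
    return weight_memory_gb(n_params, precision) <= vram_gb

for precision in BYTES_PER_PARAM:
    size = weight_memory_gb(70_000_000_000, precision)
    print(precision, size, fits_on_card(70_000_000_000, precision, vram_gb=80))
# fp16 140.0 False
# int8 70.0 True
# int4 35.0 True
```

Going from fp16 to int4 turns a two-card model into one that fits on a single 80GB card with room left for the KV cache.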

Next: “Distributed Training: DP / DDP / FSDP / Tensor Parallel” — when one card doesn’t fit, how to train with multiple cards.