Attention Explained: From Intuition to Complete Derivation
Attention is all you need. This article takes you from "what is it actually doing" to "every line of its formula"—the core of Transformer, fully digested.
In 2017, the brashly titled paper Attention Is All You Need changed everything.
L0-05 taught you how to use attention through prompting. L1-02 showed you what Q, K, V matrices do. This article closes the last mile—not a single line of formula left vague.
After this, you’ll truly “see” the core of Transformer.
🎮 Strongly recommended: go play the Attention Live visualization for 5 minutes first. Reading this with the heatmaps in mind doubles the effect.
Station 1: What Problem Does Attention Solve
Back to a concrete example. A sentence:
“The cat chased the ball, because it was curious.”
Who does “it” refer to? Your brain reads “it” and instantly knows it’s “the cat.” But getting there required looking back through the sentence to find the referent.
How RNN Did It (poorly)
Before 2014, the mainstream approach was the RNN:
It reads left to right; each word compresses “itself + prior meaning” into a hidden state and passes it to the next word.
Problem: information decays in transit. By the time you read “it”, “the cat” is diluted in the Nth hidden state.
How Attention Does It (well)
Each word, when processed, can “directly look at” all other words in the sentence, deciding which to focus on based on relevance.
Reading “it”—
- Look at “cat”—relevance 0.62 ✓
- Look at “chased”—relevance 0.08
- Look at “ball”—relevance 0.14
- …
“cat” gets the highest weight—so “it” “borrows” semantics from “cat”.
This is the essence of attention: directed information flow.
Station 2: What Are Q, K, V
To make attention computable, we give each word three vectors:
- Q (Query): what the current word “asks”: who should I attend to?
- K (Key): each word’s “ID card”—telling others “who I am, what I represent”
- V (Value): each word’s actual “carried information”—if you attend to me, take this content
Library analogy:
- You ask a question (Q)
- Each book has a title (K)
- Find the best matches, copy their contents (V) into your answer
Key question: where do Q, K, V come from?
Answer: linear projection of input word vectors.
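Each word starts as an embedding vector (e.g., 768-dim). We use three trainable matrices to project it into Q, K, V:
$$Q = XW^Q, \quad K = XW^K, \quad V = XW^V$$
Where:
- $X$ is $n \times d_{\text{model}}$ (one row per word, e.g., $d_{\text{model}} = 768$)
- $W^Q, W^K, W^V$ are all $d_{\text{model}} \times d_k$ (projecting to a smaller dim)
- The results $Q, K, V$ are each $n \times d_k$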
These three matrices are learned during training. Together with the output projection, they are essentially the only trainable weights inside an attention layer.
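As a minimal sketch of the projection step, assume a toy sentence of 5 tokens, a 768-dim embedding, and a per-word Q/K/V size of 64. The random tensors and the names `X`, `W_Q`, `W_K`, `W_V` are illustrative stand-ins for trained values, not weights from any real model.

```python
import torch

torch.manual_seed(0)

n, d_model, d_k = 5, 768, 64          # toy sizes, chosen only for illustration

X = torch.randn(n, d_model)           # one embedding row per token
W_Q = torch.randn(d_model, d_k)       # in a real model these three are learned
W_K = torch.randn(d_model, d_k)
W_V = torch.randn(d_model, d_k)

Q = X @ W_Q                           # (n, d_k): one query per token
K = X @ W_K                           # (n, d_k): one key per token
V = X @ W_V                           # (n, d_k): one value per token

print(Q.shape, K.shape, V.shape)
```

Every token gets its own row in each of the three results, all produced from the same input by different matrices.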
Station 3: Attention Calculation (Core Formula)
OK, every word has its Q, K, V. How do we compute attention?
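The complete formula (memorize it, this is arguably the most important single line in modern AI):
$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$
Let's unpack it step by step.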
Step 1: Raw scores
Stack all Q’s together (a matrix, each row a word’s Q), and all K’s together, then multiply:
$$QK^T$$
For $n$ words, the result is an $n \times n$ matrix: the entry in row i, column j is the raw score for “how much word i should attend to word j”.
Mathematically, each entry is a dot product. Remember L1-02’s “bigger dot product = more aligned directions”? Here we’re using the dot product to measure Q-K similarity.
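A tiny hand-checkable sketch of the raw-score step, using three made-up tokens with 2-dim queries and keys (the numbers are hypothetical, picked so the dot products are easy to verify by eye):

```python
import torch

# Three toy tokens with 2-dim queries and keys (hand-picked values).
Q = torch.tensor([[1.0, 0.0],
                  [0.0, 1.0],
                  [1.0, 1.0]])
K = torch.tensor([[1.0, 0.0],
                  [0.0, 2.0],
                  [-1.0, 0.0]])

scores = Q @ K.T    # scores[i, j] = dot(Q[i], K[j])
print(scores)
# tensor([[ 1.,  0., -1.],
#         [ 0.,  2.,  0.],
#         [ 1.,  2., -1.]])
```

Row 0 scores highest against key 0 (same direction) and lowest against key 2 (opposite direction), which is the "aligned directions" intuition in matrix form.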
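Step 2: Scale by $\sqrt{d_k}$
Why divide by $\sqrt{d_k}$ ($d_k$ is Q/K’s dim, e.g., 64)?
Math reason: each dot product is a sum of $d_k$ terms. A bigger $d_k$ means a bigger variance: if the entries of Q and K are independent with mean 0 and variance 1, the dot product has variance $d_k$. With scores that large, softmax becomes too extreme (one weight near 1, the rest near 0), and gradients vanish.
Dividing by $\sqrt{d_k}$ brings the variance back to about 1.
This is an engineering detail that doesn’t change the intuition, but it is critical for training.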
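The variance argument can be checked empirically. The sketch below is plain Python and assumes the idealized setting where every entry of q and k is an independent standard normal; the helper name `dot_product_variance` is made up for this illustration.

```python
import random
import statistics

random.seed(0)

def dot_product_variance(d_k, scale, trials=2000):
    """Sample variance of q.k (optionally divided by sqrt(d_k)) for random q, k."""
    samples = []
    for _ in range(trials):
        q = [random.gauss(0, 1) for _ in range(d_k)]
        k = [random.gauss(0, 1) for _ in range(d_k)]
        dot = sum(a * b for a, b in zip(q, k))
        samples.append(dot / (d_k ** 0.5) if scale else dot)
    return statistics.pvariance(samples)

for d_k in (4, 64):
    raw = dot_product_variance(d_k, scale=False)
    scaled = dot_product_variance(d_k, scale=True)
    print(f"d_k={d_k:3d}  raw variance ~ {raw:6.2f}  scaled variance ~ {scaled:4.2f}")
```

Under these assumptions the raw variance should land near $d_k$ and the scaled variance near 1, whatever the dimension. Real Q and K entries are not exactly independent unit normals, so treat this as intuition rather than a guarantee.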
Step 3: Softmax normalization
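Apply softmax row-wise, making each row sum to 1. Writing $s_{ij}$ for the scaled score of word i on word j:
$$A_{ij} = \frac{e^{s_{ij}}}{\sum_{m=1}^{n} e^{s_{im}}}$$
Each row is now a probability distribution, telling you how word i allocates its “attention budget” across the $n$ words.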
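To make the normalization concrete, here is softmax for a single row written out in plain Python so the formula stays visible. The scores are hypothetical, and subtracting the row maximum is a standard numerical-stability trick that does not change the result.

```python
import math

def softmax(row):
    """Turn a list of raw scores into positive weights that sum to 1."""
    m = max(row)                              # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in row]
    total = sum(exps)
    return [e / total for e in exps]

scaled_scores = [2.0, 0.5, 1.0, 0.1]          # hypothetical scaled scores for one query word
weights = softmax(scaled_scores)

print([round(w, 3) for w in weights])         # largest score gets the largest weight
print(sum(weights))                           # 1.0 up to floating-point rounding
```

The ordering of the scores is preserved, but the gaps are exaggerated: softmax is exponential, so a modest lead in score becomes a large lead in weight.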
Step 4: Weighted sum of V
Each word’s output = weighted average of all V’s, using the attention weights.
The most relevant word’s V has the highest weight—its info contributes most.
Every word takes a weighted average of all words’ V (its own included), where the weights come from Q-K similarity, scaled and then normalized by softmax. That’s it.
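A hand-checkable sketch of the weighted sum, with made-up attention weights and value vectors chosen so the result is obvious:

```python
import torch

# Attention weights for 2 query words over 3 words (each row sums to 1).
A = torch.tensor([[1.0, 0.0, 0.0],
                  [0.5, 0.5, 0.0]])
# One 2-dim value vector per word.
V = torch.tensor([[10.0, 0.0],
                  [0.0, 10.0],
                  [5.0, 5.0]])

output = A @ V
print(output)
# tensor([[10.,  0.],
#         [ 5.,  5.]])
```

The first query puts all its weight on word 0 and simply copies that word's value; the second splits its weight evenly and ends up at the midpoint of the first two values.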
Station 4: 30 Lines of PyTorch
```python
import torch
import torch.nn.functional as F


def scaled_dot_product_attention(Q, K, V, mask=None):
    """
    Q, K, V: (batch, n_tokens, d_k)
    returns: (batch, n_tokens, d_v) and attention weights
    """
    d_k = Q.size(-1)
    # Step 1: Q · K^T
    scores = torch.matmul(Q, K.transpose(-2, -1))  # (batch, n, n)
    # Step 2: scale
    scores = scores / (d_k ** 0.5)
    # Step 3 (optional): causal mask for generation
    if mask is not None:
        scores = scores.masked_fill(mask == 0, -1e9)
    # Step 4: softmax
    weights = F.softmax(scores, dim=-1)  # (batch, n, n)
    # Step 5: weighted V
    output = torch.matmul(weights, V)  # (batch, n, d_v)
    return output, weights


# Try it
batch, n, d = 1, 10, 64
Q = K = V = torch.randn(batch, n, d)
output, attn = scaled_dot_product_attention(Q, K, V)
print(output.shape)       # torch.Size([1, 10, 64])
print(attn.shape)         # torch.Size([1, 10, 10])
print(attn[0, 5].sum())   # ≈ 1.0 (each row sums to 1)
```
These 30 lines are the heart of Transformer. The rest is engineering details.
Station 5: Multi-Head Attention
But in practice we don’t use just one attention head; we run many in parallel (GPT-3 uses 96 heads per layer, PaLM 48).
Why? One head can only learn one “attention pattern”.
The visualization showed you:
- Some heads look at “previous word”
- Some look at “sentence start”
- Some look at “similar words”
- Some do “coreference resolution”
Running N parallel attentions lets the model capture N different dependency patterns simultaneously.
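Math:
$$\text{head}_i = \text{Attention}(XW_i^Q, XW_i^K, XW_i^V), \quad \text{MultiHead}(X) = \text{Concat}(\text{head}_1, \dots, \text{head}_h)\,W^O$$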
```python
def multi_head_attention(X, W_Q, W_K, W_V, W_O, n_heads=8):
    # X: (batch, n, d_model); W_Q, W_K, W_V, W_O: (d_model, d_model) trainable weights
    batch, n, d_model = X.shape
    d_k = d_model // n_heads  # each head's dim
    # Each head has its own W^Q, W^K, W^V, stored here as column slices of one big matrix
    Q = X @ W_Q  # (batch, n, d_model)
    K = X @ W_K
    V = X @ W_V
    # Split into n_heads
    Q = Q.view(batch, n, n_heads, d_k).transpose(1, 2)  # (batch, n_heads, n, d_k)
    K = K.view(batch, n, n_heads, d_k).transpose(1, 2)
    V = V.view(batch, n, n_heads, d_k).transpose(1, 2)
    # Each head does attention independently (function from Station 4)
    out, _ = scaled_dot_product_attention(Q, K, V)
    # Concatenate all heads' outputs
    out = out.transpose(1, 2).contiguous().view(batch, n, d_model)
    # Final linear layer to mix
    return out @ W_O
```
This is the core of nn.MultiheadAttention.
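For comparison, here is a short usage sketch of PyTorch's built-in `nn.MultiheadAttention`. The sizes are arbitrary toy values, and the layer is freshly initialized rather than trained, so only the shapes are meaningful.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

batch, n, d_model, n_heads = 2, 10, 512, 8

# batch_first=True makes the expected input shape (batch, n, d_model).
mha = nn.MultiheadAttention(embed_dim=d_model, num_heads=n_heads, batch_first=True)

X = torch.randn(batch, n, d_model)

# Self-attention: the same tensor is passed as query, key and value.
out, weights = mha(X, X, X)

print(out.shape)       # (batch, n, d_model)
print(weights.shape)   # (batch, n, n): averaged over heads by default
```

The returned weights are averaged across the heads by default, so per-head patterns like the ones in the visualization are not visible in this output.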
Station 6: Self-Attention vs Cross-Attention
So far Q, K, V all came from the same input—called self-attention.
But Transformer uses different patterns in different places:
| Location | Q from | K, V from |
|---|---|---|
| Encoder self-attn | encoder input | encoder input |
| Decoder self-attn (with mask) | decoder input | decoder input |
| Decoder cross-attn | decoder input | encoder output |
The last—cross-attention—lets the decoder look back at encoder-encoded info during generation. This is the heart of Seq2Seq and translation models.
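A minimal single-head sketch of cross-attention, assuming 7 source tokens on the encoder side and 3 target tokens on the decoder side. The tensors are random stand-ins and the names `decoder_Q`, `encoder_K`, `encoder_V` are illustrative.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

n_src, n_tgt, d_k = 7, 3, 16     # toy sizes: 7 source tokens, 3 target tokens

decoder_Q = torch.randn(n_tgt, d_k)   # queries come from the decoder side
encoder_K = torch.randn(n_src, d_k)   # keys and values come from the encoder output
encoder_V = torch.randn(n_src, d_k)

weights = F.softmax(decoder_Q @ encoder_K.T / d_k ** 0.5, dim=-1)   # (n_tgt, n_src)
output = weights @ encoder_V                                         # (n_tgt, d_k)

print(weights.shape)   # one row per target token, one column per source token
print(output.shape)
```

The only change from self-attention is where the inputs come from: the weight matrix is no longer square, because each target token distributes its attention over the source sentence.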
Station 7: Causal Mask
Autoregressive models like GPT have a constraint: when generating word i, you can’t look at words i+1 and later—otherwise it’s cheating.
Implementation: set scores at “future positions” to $-\infty$ before softmax:
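```python
import torch

n = 4
scores = torch.randn(n, n)  # stand-in for the scaled Q · K^T scores
# Upper triangular mask (excluding diagonal): True marks future positions
mask = torch.triu(torch.ones(n, n), diagonal=1).bool()
scores.masked_fill_(mask, float('-inf'))
```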
After softmax, future positions have weight 0—the model only sees what’s been generated.
This is how GPT works—generate one token at a time, only see the past.
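One way to convince yourself the mask works is to change a later token and confirm that earlier outputs do not move. The sketch below is a simplified single-head version that uses the input directly as Q, K and V (no projections); `causal_attention` is a name made up for this example.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

def causal_attention(Q, K, V):
    """Single-head attention in which position i only sees positions 0..i."""
    n, d_k = Q.shape
    scores = Q @ K.T / d_k ** 0.5
    future = torch.triu(torch.ones(n, n), diagonal=1).bool()   # True above the diagonal
    weights = F.softmax(scores.masked_fill(future, float("-inf")), dim=-1)
    return weights @ V, weights

n, d_k = 5, 8
X = torch.randn(n, d_k)          # for simplicity, use X directly as Q, K and V
out, weights = causal_attention(X, X, X)

# Change only the last token and recompute.
X2 = X.clone()
X2[-1] = torch.randn(d_k)
out2, _ = causal_attention(X2, X2, X2)

print(weights)                                   # zeros above the diagonal
print(torch.allclose(out[:-1], out2[:-1]))       # True: earlier positions are unaffected
```

The first row of the weight matrix is always `[1, 0, 0, ...]`: the first token has nothing to look at except itself.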
Common Misconceptions
“Attention is just a weighted sum”: not quite
More accurately: attention lets each position “shop for” information from other positions. The weight is just a byproduct.
“Higher Q-K similarity = higher weight”: right, but with a caveat
The similarity is measured by dot product—you need to measure it in the right “projection space” for it to mean anything. That’s why Q and K have their own projection matrices.
“Aren’t multi-heads redundant?”: they can be
Research suggests that of BERT’s 12 heads per layer, only about 4-6 do most of the work; the rest can often be pruned with little loss. How much of this redundancy is actually needed remains an open question.
One-Line Summary
Attention = directed information flow + softmax-weighted fusion.
Its revolution: any two positions can communicate directly—regardless of how many words apart.
RNN is a relay race. Attention is a stadium broadcast.
Want to “See” It
👀 Attention Live visualization — play 4 attention heads, see attention patterns under different tasks.
After this article, you should be able to:
- Read attention sections in BERT, GPT papers
- Hand-write a Transformer block in PyTorch
- Speak confidently in “attention pattern” discussions
Next: build a minimal Transformer end-to-end.
Next: “CNN Convolution Principles: From Filters to ResNet” — the king of vision before attention.