L3 Chapter 5 🐣 🕒 12 min

Attention Explained: From Intuition to Complete Derivation

Attention is all you need. This article takes you from "what is it actually doing" to "every line of its formula"—the core of Transformer, fully digested.

HelloAI Editors

6/15/2026

In 2017 the brash-titled paper Attention Is All You Need changed everything.

L0-05 taught you how to use attention through prompting. L1-02 showed you what Q, K, V matrices do. This article closes the last mile—not a single line of formula left vague.

After this, you’ll truly “see” the core of Transformer.

🎮 Strongly recommended: go play the Attention Live visualization for 5 minutes first. Reading this with the heatmaps in mind doubles the effect.

Station 1: What Problem Does Attention Solve

Back to a concrete example. A sentence:

“The cat chased the ball, because it was curious.”

Who does “it” refer to? Your brain reads “it” and instantly knows it’s “the cat.” But to get there, you had to look back through the sentence and find the referent.

How RNN Did It (poorly)

Before 2014, the mainstream approach was the RNN:

Read left-to-right, each word compresses “itself + prior meaning” into a hidden state, passes it to the next word.

Problem: information decays in transit. By the time you read “it”, “the cat” is diluted in the Nth hidden state.

How Attention Does It (well)

Each word, when processed, can “directly look at” all other words in the sentence, deciding which to focus on based on relevance.

Reading “it”—

Look at “cat”—relevance 0.62 ✓
Look at “chased”—relevance 0.08
Look at “ball”—relevance 0.14
…

“cat” gets the highest weight—so “it” “borrows” semantics from “cat”.

This is the essence of attention: directed information flow.

Station 2: What Are Q, K, V

To make attention computable, we give each word three vectors:

Q (Query): what current word “asks”—who should I attend to?
K (Key): each word’s “ID card”—telling others “who I am, what I represent”
V (Value): each word’s actual “carried information”—if you attend to me, take this content

Library analogy:

You ask a question (Q)
Each book has a title (K)
Find the best matches, copy their contents (V) into your answer

Key question: where do Q, K, V come from?

Answer: linear projection of input word vectors.

Each word starts as an embedding vector $x_i$ (e.g., 768-dim). We use three trainable matrices to project it into Q, K, V:

$$Q_i = x_i W^Q, \quad K_i = x_i W^K, \quad V_i = x_i W^V$$

Where:

$x_i$ is $1 \times 768$
$W^Q, W^K, W^V$ are all $768 \times 64$ (projecting to a smaller dim)
Result $Q_i, K_i, V_i$ are each $1 \times 64$

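These three matrices $W^Q, W^K, W^V$ are learned during training. Together with the output projection, they are the main trainable weights inside an attention layer (the rest of the Transformer, such as the embeddings and feed-forward layers, has its own).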
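To make the shapes concrete, here is a minimal sketch of the projection in plain Python. It uses toy sizes (a 4-dim embedding projected down to 2 dims) rather than 768 and 64. The helpers `random_matrix` and `project` and the random weights are illustrative assumptions standing in for trained matrices.

```python
import random

random.seed(0)

d_model, d_k = 4, 2  # toy sizes; the article's example uses 768 and 64


def random_matrix(rows, cols):
    """Stand-in for a trained weight matrix (hypothetical random values)."""
    return [[random.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


def project(x, W):
    """Row vector times matrix: (1 x d_model) @ (d_model x d_k) -> (1 x d_k)."""
    return [sum(x[r] * W[r][c] for r in range(len(x))) for c in range(len(W[0]))]


x_i = [0.5, -1.0, 0.25, 2.0]  # one word's embedding, 1 x d_model
W_Q = random_matrix(d_model, d_k)
W_K = random_matrix(d_model, d_k)
W_V = random_matrix(d_model, d_k)

Q_i, K_i, V_i = project(x_i, W_Q), project(x_i, W_K), project(x_i, W_V)
print(len(Q_i), len(K_i), len(V_i))  # 2 2 2
```

The same input vector yields three different vectors because each projection has its own weights.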

Station 3: Attention Calculation (Core Formula)

OK, every word has its Q, K, V. How do we compute attention?

The complete formula—memorize it, this is the most important single line in AI history:

$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right) V$$

We unpack step by step.

Step 1: Raw scores $QK^T$

Stack all Q’s together (a matrix, each row a word’s Q), all K’s together.

$$\text{scores} = Q K^T$$

For $n$ words, result is an $n \times n$ matrix—each entry $[i,j]$ is “how much word i should attend to word j” as a raw score.

Mathematically, each entry is a dot product. Remember L1-02’s “bigger dot product = more aligned directions”? Here the dot product measures how similar a Q is to a K.
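As a small worked example of Step 1, the sketch below computes the score matrix for three tokens with $d_k = 2$ in plain Python. The numbers and the helper `dot` are made up for illustration.

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


# Three tokens, d_k = 2 (made-up numbers for illustration)
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 2.0], [-1.0, 0.0]]

# scores[i][j] = Q_i . K_j, i.e. entry [i, j] of Q K^T
scores = [[dot(q, k) for k in K] for q in Q]
for row in scores:
    print(row)
```

Row 1 gives its largest score (2.0) to token 1, whose key points in the same direction as that query. The negative score in row 0 comes from a key pointing the opposite way.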

Step 2: Scale by $\sqrt{d_k}$

Why divide by $\sqrt{d_k}$ ( $d_k$ is Q/K’s dim, e.g., 64)?

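Math reason: each dot product is a sum of $d_k$ terms. If the components of Q and K are roughly independent with mean 0 and variance 1, that sum has variance $d_k$, so a bigger $d_k$ means bigger scores. Big scores push softmax to an extreme (one weight near 1, the rest near 0), where its gradient is close to zero and learning stalls.

Dividing by $\sqrt{d_k}$ brings the variance back to about 1.

It looks like a small engineering detail, but stable training depends on it.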
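The variance argument can be checked with a quick simulation. This is a sketch under the assumption that the components of Q and K are independent standard normal values, which real trained projections only approximate. The function name `dot_product_variance` is hypothetical.

```python
import math
import random

random.seed(0)


def dot_product_variance(d_k, n_samples=2000, scale=False):
    """Empirical variance of q.k for q, k with i.i.d. standard normal entries."""
    values = []
    for _ in range(n_samples):
        q = [random.gauss(0.0, 1.0) for _ in range(d_k)]
        k = [random.gauss(0.0, 1.0) for _ in range(d_k)]
        s = sum(a * b for a, b in zip(q, k))
        values.append(s / math.sqrt(d_k) if scale else s)
    mean = sum(values) / n_samples
    return sum((v - mean) ** 2 for v in values) / n_samples


raw_var = dot_product_variance(64)
scaled_var = dot_product_variance(64, scale=True)
print(round(raw_var, 1), round(scaled_var, 2))  # roughly 64 and roughly 1
```

With $d_k = 64$, unscaled scores typically fall in a range of about plus or minus 8, which is already enough to make softmax nearly one-hot.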

Step 3: Softmax normalization

Apply softmax row-wise, making each row sum to 1:

$$\text{attention\_weights}[i,j] = \frac{\exp(\text{scores}[i,j])}{\sum_k \exp(\text{scores}[i,k])}$$

Each row is now a probability distribution—telling you how word i allocates its “attention budget” across words.
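Here is a minimal plain-Python sketch of the row-wise softmax, applied to made-up scores. Subtracting the row maximum before exponentiating is a common numerical-stability trick that this article does not cover. It does not change the result, because the same factor cancels in numerator and denominator.

```python
import math


def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(s - m) for s in row]
    total = sum(exps)
    return [e / total for e in exps]


# Made-up scores for two query positions over three tokens
scores = [[1.0, 0.0, -1.0],
          [0.0, 2.0, 0.0]]
attention_weights = [softmax(row) for row in scores]
for row in attention_weights:
    # Print the weights and their sum (each row sums to 1)
    print([round(w, 3) for w in row], round(sum(row), 6))
```

The highest score in each row gets the largest share of that row's attention budget, and no weight is ever negative.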

Step 4: Weighted sum of V

$$\text{output} = \text{attention\_weights} \cdot V$$

Each word’s output = weighted average of all V’s, using the attention weights.

The most relevant word’s V has the highest weight—its info contributes most.
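For one query position, Step 4 is just a weighted average of the rows of V. A small sketch with made-up weights and values:

```python
# One query position attending over three tokens (made-up numbers)
weights = [0.7, 0.2, 0.1]  # one row of attention weights, sums to 1
V = [[1.0, 0.0],           # V of token 0
     [0.0, 1.0],           # V of token 1
     [4.0, 4.0]]           # V of token 2

d_v = len(V[0])
# output[c] = sum over tokens j of weights[j] * V[j][c]
output = [sum(w * v[c] for w, v in zip(weights, V)) for c in range(d_v)]
print([round(o, 2) for o in output])  # [1.1, 0.6]
```

Token 0 dominates the first output component because it carries the largest weight. Token 2 still leaks in through its large values despite a weight of only 0.1.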

💡 One line summary

Every word takes a weighted average of all words’ V (its own included), where the weights are the Q-K similarities, scaled and normalized by softmax. That’s it.

Station 4: 30 Lines of PyTorch

import torch
import torch.nn.functional as F

def scaled_dot_product_attention(Q, K, V, mask=None):
    """
    Q, K, V: (batch, n_tokens, d_k)
    returns: (batch, n_tokens, d_v) and attention weights
    """
    d_k = Q.size(-1)

    # Step 1: Q · K^T
    scores = torch.matmul(Q, K.transpose(-2, -1))   # (batch, n, n)

    # Step 2: scale
    scores = scores / (d_k ** 0.5)

    # Step 3 (optional): mask, e.g. a causal mask for generation.
    # Convention here: positions where mask == 0 are hidden.
    if mask is not None:
        scores = scores.masked_fill(mask == 0, -1e9)

    # Step 4: softmax
    weights = F.softmax(scores, dim=-1)   # (batch, n, n)

    # Step 5: weighted V
    output = torch.matmul(weights, V)     # (batch, n, d_v)

    return output, weights


# Try it
batch, n, d = 1, 10, 64
Q = K = V = torch.randn(batch, n, d)

output, attn = scaled_dot_product_attention(Q, K, V)
print(output.shape)   # torch.Size([1, 10, 64])
print(attn.shape)     # torch.Size([1, 10, 10])
print(attn[0, 5].sum())  # ≈ 1.0 (each row sums to 1)

These 30 lines are the heart of Transformer. The rest is engineering details.

Station 5: Multi-Head Attention

But in practice we don’t use just one attention head. We run many in parallel (GPT-3 175B uses 96 per layer, PaLM 540B uses 48).

Why? A single head produces only one set of attention weights per position, so it tends to specialize in one “attention pattern”.

The visualization showed you:

Some heads look at “previous word”
Some look at “sentence start”
Some look at “similar words”
Some do “coreference resolution”

Running N parallel attentions lets the model capture N different dependency patterns simultaneously.

In code:

def multi_head_attention(X, W_Q, W_K, W_V, W_O, n_heads=8):
    """
    X: (batch, n, d_model)
    W_Q, W_K, W_V, W_O: (d_model, d_model) weight tensors
    """
    batch, n, d_model = X.shape
    d_k = d_model // n_heads  # each head's dim

    # Each head has its own W^Q, W^K, W^V: head h uses columns
    # h*d_k .. (h+1)*d_k of the full matrices, so one matmul covers all heads
    Q = X @ W_Q   # (batch, n, d_model)
    K = X @ W_K
    V = X @ W_V

    # Split into n_heads
    Q = Q.view(batch, n, n_heads, d_k).transpose(1, 2)  # (batch, n_heads, n, d_k)
    K = K.view(batch, n, n_heads, d_k).transpose(1, 2)
    V = V.view(batch, n, n_heads, d_k).transpose(1, 2)

    # Each head does attention independently
    out, _ = scaled_dot_product_attention(Q, K, V)      # (batch, n_heads, n, d_k)

    # Concatenate all heads' outputs
    out = out.transpose(1, 2).contiguous().view(batch, n, d_model)

    # Final linear layer to mix
    return out @ W_O

This is the core of nn.MultiheadAttention.
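A quick bit of arithmetic shows why splitting into heads costs nothing extra. This sketch takes d_model = 512 and n_heads = 8 as example sizes and assumes four square projection matrices with bias terms ignored.

```python
d_model, n_heads = 512, 8
d_k = d_model // n_heads  # dims per head

# Concatenating the heads restores the model width
concat_width = n_heads * d_k

# W_Q, W_K, W_V and W_O are each d_model x d_model (biases ignored here)
params_per_matrix = d_model * d_model
total_params = 4 * params_per_matrix

print(d_k, concat_width, total_params)  # 64 512 1048576
```

The parameter count does not depend on n_heads. Using more heads slices the same projections into narrower pieces rather than adding weights.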

Station 6: Self-Attention vs Cross-Attention

So far Q, K, V all came from the same input—called self-attention.

But Transformer uses different patterns in different places:

Location	Q from	K, V from
Encoder self-attn	encoder input	encoder input
Decoder self-attn (with mask)	decoder input	decoder input
Decoder cross-attn	decoder input	encoder output

The last—cross-attention—lets the decoder look back at encoder-encoded info during generation. This is the heart of Seq2Seq and translation models.
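The practical difference shows up in the shapes: with cross-attention, the number of queries and the number of keys/values no longer have to match. Below is a minimal plain-Python sketch with made-up numbers, where the single-head `attention` helper is a hypothetical stand-in for the PyTorch function from Station 4.

```python
import math


def attention(Q, K, V):
    """Single-head scaled dot-product attention on nested lists."""
    d_k = len(Q[0])
    out, all_weights = [], []
    for q in Q:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        all_weights.append(weights)
        out.append([sum(w * v[c] for w, v in zip(weights, V))
                    for c in range(len(V[0]))])
    return out, all_weights


# Made-up numbers: 2 decoder positions (queries), 3 encoder positions (keys/values)
decoder_Q = [[1.0, 0.0], [0.0, 1.0]]
encoder_K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
encoder_V = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]

out, weights = attention(decoder_Q, encoder_K, encoder_V)
print(len(weights), len(weights[0]))  # 2 3: decoder positions x encoder positions
print(len(out), len(out[0]))          # 2 2: one output vector per decoder position
```

The weight matrix is rectangular: each decoder position spreads its attention over the encoder positions, and the output has one row per decoder position.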

Station 7: Causal Mask

Autoregressive models like GPT have a constraint: when generating word i, you can’t look at words i+1 and later—otherwise it’s cheating.

Implementation: set scores at “future positions” to $-\infty$ before softmax:

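# Upper triangular mask (excluding diagonal): True marks a future position.
# Here n is the sequence length and scores is the (n, n) scaled score matrix.
# Note this is the opposite convention to Station 4's mask == 0 check.
mask = torch.triu(torch.ones(n, n), diagonal=1).bool()
scores.masked_fill_(mask, float('-inf'))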

After softmax, future positions have weight 0—the model only sees what’s been generated.

This is how GPT works—generate one token at a time, only see the past.
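To see the effect of the mask on the weights themselves, here is a small plain-Python sketch. It assumes all raw scores are equal (zero), so the only thing shaping the result is the mask.

```python
import math

n = 4
# Made-up scores: all equal, so only the mask shapes the weights
scores = [[0.0] * n for _ in range(n)]

weights = []
for i in range(n):
    # Position i may see positions 0..i; future positions get -inf
    masked = [scores[i][j] if j <= i else float("-inf") for j in range(n)]
    m = max(masked)
    exps = [math.exp(s - m) for s in masked]  # exp(-inf) evaluates to 0.0
    total = sum(exps)
    weights.append([e / total for e in exps])

for row in weights:
    print([round(w, 2) for w in row])
# [1.0, 0.0, 0.0, 0.0]
# [0.5, 0.5, 0.0, 0.0]
# [0.33, 0.33, 0.33, 0.0]
# [0.25, 0.25, 0.25, 0.25]
```

The weight matrix is lower triangular: the first token can only attend to itself, and each later token splits its attention over itself and what came before.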

Common Misconceptions

“Attention is just weighted sum”: not quite

It is a weighted sum, but the weights are not fixed parameters. They are recomputed from the input at every position, so each position effectively “shops for” the information it needs from the other positions.

“Higher Q-K similarity = higher weight”: right, but with a caveat

The similarity is a dot product, and a dot product only means something in the right “projection space”. That’s why Q and K each get their own learned projection matrix instead of comparing raw embeddings.

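“Aren’t multi-heads redundant?”: they can be

Pruning studies on BERT suggest that of its 12 heads, only about 4-6 do most of the real work, and the rest can often be removed with little loss. How much of this redundancy is actually needed is still an open question.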

One-Line Summary

Attention = directed information flow + softmax-weighted fusion.

Its revolution: any two positions can communicate directly—regardless of how many words apart.

RNN is a relay race. Attention is a stadium broadcast.

Want to “See” It

👀 Attention Live visualization — play 4 attention heads, see attention patterns under different tasks.

💡 What you've unlocked after reading

Read attention sections in BERT, GPT papers
Hand-write a Transformer block in PyTorch
Speak confidently in “attention pattern” discussions

Next: build a minimal Transformer end-to-end.
