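L3 · Part 8 · 🐥 Difficulty · 🕒 13 min read

The Complete Transformer Architecture: Putting All the Building Blocks Together

Attention is the core, but a Transformer also has positional encoding, an FFN, residual connections, and LayerNorm. This article assembles them into a complete, runnable model.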

阿莱

2026/7/3

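In L3-05 you learned attention, the core of the Transformer. But a Transformer is more than attention: it also has positional encoding, an FFN, residual connections, and LayerNorm.

This article puts all the building blocks together. By the end you will be able to hand-write a trainable Transformer in PyTorch.

Overall architecture

The architecture from the original paper (Attention Is All You Need, 2017):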

   ┌──────── Encoder ─────────┐    ┌──────── Decoder ─────────┐
   │                          │    │                          │
   │ ┌─── repeat N times ───┐ │    │ ┌─── repeat N times ───┐ │
   │ │ Multi-Head Self-     │ │    │ │ Masked Self-Attn     │ │
   │ │  Attention           │ │    │ │     + LayerNorm      │ │
   │ │     + LayerNorm      │ │    │ │                      │ │
   │ │                      │ │    │ │ Cross-Attention      │ │
   │ │ Feed Forward         │ │    │ │     + LayerNorm      │ │
   │ │     + LayerNorm      │ │    │ │                      │ │
   │ └──────────────────────┘ │    │ │ Feed Forward         │ │
   │                          │    │ │     + LayerNorm      │ │
   │ ↑                        │    │ └──────────────────────┘ │
   │ Embedding + Positional   │    │ ↑                        │
   │                          │    │ Embedding + Positional   │
   └────────┬─────────────────┘    └────────┬─────────────────┘
            │                               │
       Input tokens                   Output tokens (shifted)
                                            │
                                       Linear + Softmax
                                            ↓
                                   Probability distribution

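It looks complicated, but every building block is one we have already covered or are about to cover.

Stop 1: Tokenization + Embedding (the input side)

Before text enters the model, it goes through two steps:

1. Tokenization (covered in L4-02)

"Hello world" → ["Hello", " world"] → [15496, 1917] (token IDs)

2. Embedding lookup

Each token ID is looked up in a table to get a high-dimensional vector: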

import torch
import torch.nn as nn

# Vocab size 50000, embedding dim 768
embedding = nn.Embedding(num_embeddings=50000, embedding_dim=768)

token_ids = torch.tensor([15496, 1917])
embeddings = embedding(token_ids)
print(embeddings.shape)   # torch.Size([2, 768]): one 768-dim vector per token

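This is the real input a Transformer sees: a sequence of vectors.

Stop 2: Positional encoding

But there is a serious problem: attention is position-agnostic!

"cat chases dog"   has exactly the same pairwise attention weights as   "dog chases cat"

Attention only looks at the similarity of each token pair and ignores order, so shuffling the input merely shuffles the output in the same way. That is clearly wrong for language.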
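To make the claim concrete, here is a small check, offered as a sketch rather than a proof. It feeds a toy 3-token sequence and its reversed copy through the same `nn.MultiheadAttention` layer with no positional information. The sizes, the seed, and the random stand-in embeddings are all assumptions made up for the demo.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical toy sizes: 1 sentence, 3 tokens, 8-dim embeddings, 2 heads.
attn = nn.MultiheadAttention(embed_dim=8, num_heads=2, batch_first=True)
attn.eval()

x = torch.randn(1, 3, 8)          # stand-in for "cat chases dog"
perm = torch.tensor([2, 1, 0])    # reversed order: "dog chases cat"
x_perm = x[:, perm, :]

with torch.no_grad():
    out, weights = attn(x, x, x)
    out_perm, weights_perm = attn(x_perm, x_perm, x_perm)

# Each token gets the same output vector in both orders, just at its new index,
# and the attention weight between any two tokens is unchanged.
same_outputs = torch.allclose(out[:, perm, :], out_perm, atol=1e-5)
same_weights = torch.allclose(weights[:, perm][:, :, perm], weights_perm, atol=1e-5)
print(same_outputs, same_weights)
```

If both flags come out true, the layer cannot tell the two word orders apart, which is exactly the gap positional encoding has to fill.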

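The fix: give every position its own unique "position vector".

Original Transformer: sinusoidal positional encoding

PE(pos, 2i) = \sin(pos / 10000^{2i/d})

PE(pos, 2i+1) = \cos(pos / 10000^{2i/d})

Positions 0, 1, 2, ... each map to a distinct vector, which is added directly to the token embedding:

input = token_embedding + positional_encoding

Why sin/cos: for any fixed offset $k$, $PE_{pos+k}$ is a linear transformation (a rotation) of $PE_{pos}$, so relative position is encoded naturally.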
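The two formulas above translate into a few lines of PyTorch. This is a minimal sketch that assumes an even `d_model`; the function name `sinusoidal_positional_encoding` and the toy sizes are made up for illustration.

```python
import math
import torch

def sinusoidal_positional_encoding(max_len: int, d_model: int) -> torch.Tensor:
    """Return a (max_len, d_model) table; d_model is assumed to be even."""
    pos = torch.arange(max_len, dtype=torch.float32).unsqueeze(1)   # (max_len, 1)
    two_i = torch.arange(0, d_model, 2, dtype=torch.float32)        # 2i = 0, 2, 4, ...
    inv_freq = torch.exp(-math.log(10000.0) * two_i / d_model)      # 1 / 10000^(2i/d)
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(pos * inv_freq)   # even dimensions
    pe[:, 1::2] = torch.cos(pos * inv_freq)   # odd dimensions
    return pe

pe = sinusoidal_positional_encoding(max_len=50, d_model=16)
token_embedding = torch.randn(50, 16)         # stand-in for real token embeddings
model_input = token_embedding + pe            # input = token_embedding + positional_encoding
print(pe.shape)
```

The table has no learned parameters, so it can be precomputed once and reused for every batch.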

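Learned positional encoding (used by BERT)

Skip the formula and simply learn an nn.Embedding(max_len, d_model). Crude but simple, and it works about as well; the cost is that positions beyond max_len have no embedding.

RoPE (Rotary Position Embedding): what modern LLMs use

Each pair of adjacent (even, odd) dimensions of the Q and K vectors is rotated in its own 2D plane, with a rotation angle proportional to the position.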

# Simplified form
def apply_rope(q, k, pos):
    # Treat each pair (q[2i], q[2i+1]) as a complex number a + bi
    # and multiply it by e^(i·pos·θ) to rotate it; do the same for k
    ...

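Benefits:

No learned position table, so no predefined max_len (although quality beyond the training length usually still needs tricks such as frequency scaling)
Encodes relative position: the dot product q·k depends only on the offset between the two positions
Mathematically elegant

Llama, PaLM, Mistral, and many other modern LLMs use RoPE or one of its variants.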
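For readers who want something runnable, here is one way to write the rotation with real-valued cos/sin instead of complex numbers. It is a sketch under assumptions: the helper name `rope_rotate`, the base of 10000, and the interleaved (even, odd) pairing are illustrative choices, and real implementations differ in layout details.

```python
import torch

def rope_rotate(x: torch.Tensor, pos: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Rotate each (even, odd) pair of x by pos * theta_i. x: (seq, d) with d even; pos: (seq,)."""
    d = x.size(-1)
    theta = base ** (-torch.arange(0, d, 2, dtype=torch.float32) / d)   # (d/2,) per-pair frequency
    angles = pos.to(torch.float32).unsqueeze(-1) * theta                # (seq, d/2)
    cos, sin = angles.cos(), angles.sin()
    a, b = x[..., 0::2], x[..., 1::2]        # "real" and "imaginary" parts of each pair
    out = torch.empty_like(x)
    out[..., 0::2] = a * cos - b * sin       # (a + bi) * (cos + i sin), real part
    out[..., 1::2] = a * sin + b * cos       # imaginary part
    return out

torch.manual_seed(0)
q = torch.randn(1, 8)
k = torch.randn(1, 8)

# Same offset (3) at two very different absolute positions.
score_near = rope_rotate(q, torch.tensor([5])) @ rope_rotate(k, torch.tensor([2])).T
score_far = rope_rotate(q, torch.tensor([105])) @ rope_rotate(k, torch.tensor([102])).T
relative_only = torch.allclose(score_near, score_far, atol=1e-4)
print(relative_only)
```

The final comparison illustrates the "relative position" property: the attention score depends on the offset between the two positions, not on where the pair sits in the sequence.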

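ALiBi (Attention with Linear Biases)

Another modern option: add a distance penalty directly to the attention scores:

\text{score}(i, j) = q_i \cdot k_j - m \cdot |i - j|

Here $m$ is a fixed (not learned) slope that differs per head. Simple, and it extrapolates well to longer sequences.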
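The penalty term is just a precomputed matrix added to the scores. The sketch below assumes the geometric slope schedule from the ALiBi paper for head counts that are powers of two; the function name `alibi_bias` and the toy shapes are made up, and the usual $1/\sqrt{d}$ scaling is included even though the simplified formula above omits it.

```python
import torch

def alibi_bias(seq_len: int, n_heads: int) -> torch.Tensor:
    """Return a (n_heads, seq_len, seq_len) bias equal to -m * |i - j|."""
    # Assumed slope schedule: m_h = 2^(-8h / n_heads) for h = 1..n_heads.
    h = torch.arange(1, n_heads + 1, dtype=torch.float32)
    slopes = 2.0 ** (-8.0 * h / n_heads)
    idx = torch.arange(seq_len)
    distance = (idx[:, None] - idx[None, :]).abs().to(torch.float32)   # |i - j|
    return -slopes[:, None, None] * distance

bias = alibi_bias(seq_len=4, n_heads=2)

# Toy per-head queries and keys: (n_heads, seq_len, head_dim).
q = torch.randn(2, 4, 8)
k = torch.randn(2, 4, 8)
scores = q @ k.transpose(-2, -1) / 8 ** 0.5 + bias   # q·k minus the distance penalty
print(bias[0])
```

Because the bias depends only on distance, the same function works for any `seq_len` without retraining anything.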

Stop 3: Inside a Transformer block

A Transformer block looks like this:

import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    def __init__(self, d_model, n_heads):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),  # expand 4x
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),  # project back down
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):
        # 1. Self-attention + residual + norm
        attn_out, _ = self.attn(x, x, x)
        x = self.norm1(x + attn_out)

        # 2. FFN + residual + norm
        ffn_out = self.ffn(x)
        x = self.norm2(x + ffn_out)

        return x

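It has 4 key design elements:

1. Self-Attention (covered in detail in L3-05)

Each token looks at all the other tokens and aggregates information according to relevance.

2. FFN (feed-forward network)

Each token position independently passes through a 2-layer MLP:

Linear → GELU → Linear

The hidden dimension is usually $4 \times d_{model}$. The FFN is where most of a Transformer's parameters live: with this 4× expansion it holds $8d^2$ of the roughly $12d^2$ parameters per layer, so about two-thirds of GPT-3's parameters are in FFNs.

Intuition: attention handles "information flow", the FFN handles "information processing", further transforming whatever flowed in.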
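"Independently per position" can be checked directly: running the FFN on one position alone should give the same result as running it on the whole sequence and picking that position. The snippet is a small sketch with made-up toy sizes.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d_model = 16
ffn = nn.Sequential(
    nn.Linear(d_model, 4 * d_model),   # expand 4x
    nn.GELU(),
    nn.Linear(4 * d_model, d_model),   # project back down
)

x = torch.randn(2, 5, d_model)         # (batch, seq, d_model)
full = ffn(x)                          # all positions at once
one_position = ffn(x[:, 3, :])         # only position 3
position_wise = torch.allclose(full[:, 3, :], one_position, atol=1e-5)

# 8 * d_model^2 weights plus the two bias vectors.
n_params = sum(p.numel() for p in ffn.parameters())
print(position_wise, n_params)
```

No information moves between positions here; mixing across tokens is left entirely to attention.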

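3. Residual connections (skip connections)

Every sublayer computes $x + f(x)$ rather than $f(x)$, giving gradients a "fast lane" to flow backward.

Without residuals, deep networks are very hard to train (see the ResNet story in L3-06).

4. LayerNorm

Normalizes over the feature dimension of each token:

\text{LayerNorm}(x) = \gamma \odot \frac{x - \mu_x}{\sigma_x} + \beta

Here $\mu_x$ and $\sigma_x$ are the mean and standard deviation of that token's features, and $\gamma$, $\beta$ are learned. This stabilizes training. BatchNorm works poorly in NLP (sequence lengths vary within a batch), so Transformers use LayerNorm.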
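As a sanity check on the formula, the sketch below computes LayerNorm by hand and compares it with `nn.LayerNorm`. One detail the formula glosses over: PyTorch uses the biased variance and adds a small `eps` inside the square root for numerical safety. The tensor sizes are arbitrary.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.randn(2, 5, 16)                     # (batch, seq, d_model)
layer_norm = nn.LayerNorm(16)                 # gamma = 1, beta = 0 at initialization

# Statistics are taken over the feature dimension of each token separately.
mu = x.mean(dim=-1, keepdim=True)
var = x.var(dim=-1, keepdim=True, unbiased=False)
manual = (x - mu) / torch.sqrt(var + layer_norm.eps)
manual = layer_norm.weight * manual + layer_norm.bias   # gamma * x_hat + beta

matches = torch.allclose(manual, layer_norm(x), atol=1e-5)
print(matches)
```

Each token is normalized using only its own features, which is why sequence length and batch composition do not matter.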

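Pre-Norm vs Post-Norm: the original is Post-Norm (add the residual, then normalize, as in the block above), but almost all modern LLMs use Pre-Norm (normalize the sublayer input, then add the residual) because it trains more stably.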
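For comparison, a Pre-Norm version of the block might look like the sketch below. The class name `PreNormBlock` is made up for illustration; only the order of norm and residual differs from the Post-Norm form, and the layers are the same.

```python
import torch
import torch.nn as nn

class PreNormBlock(nn.Module):
    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Normalize the sublayer input; the residual path itself stays untouched.
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h)
        x = x + attn_out
        x = x + self.ffn(self.norm2(x))
        return x

block = PreNormBlock(d_model=32, n_heads=4)
y = block(torch.randn(2, 10, 32))
print(y.shape)
```

Because the residual stream is never normalized inside the block, Pre-Norm models usually apply one final LayerNorm after the last block.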

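Stop 4: Stacking + the output side

Stack N blocks (GPT-3 uses 96 layers), then add a final linear layer that maps each position's vector to the vocabulary size, and apply softmax to get probabilities: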

class TransformerModel(nn.Module):
    def __init__(self, vocab_size, d_model, n_heads, n_layers, max_len):
        super().__init__()
        self.token_embed = nn.Embedding(vocab_size, d_model)
        self.pos_embed = nn.Embedding(max_len, d_model)
        self.blocks = nn.ModuleList([
            TransformerBlock(d_model, n_heads) for _ in range(n_layers)
        ])
        self.final_norm = nn.LayerNorm(d_model)
        self.lm_head = nn.Linear(d_model, vocab_size, bias=False)

    def forward(self, input_ids):
        positions = torch.arange(input_ids.size(1), device=input_ids.device)
        x = self.token_embed(input_ids) + self.pos_embed(positions)
        for block in self.blocks:
            x = block(x)
        x = self.final_norm(x)
        return self.lm_head(x)   # (batch, seq, vocab_size)

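That is a complete Transformer! The forward pass returns logits; softmax is applied inside the loss or at sampling time. The core of GPT is this same stack, plus a causal mask in self-attention and many further optimizations and refinements.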

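Stop 5: Encoder vs Decoder vs Encoder-Decoder

Transformers come in three main variants, split by task type:

1. Encoder-Only (BERT family)

Full input sentence → Transformer Block × N → a semantic vector for each position

Use: understanding tasks (classification, extraction, sentence embeddings). Key trait: self-attention sees every position in both directions, hence the name BERT (Bidirectional Encoder Representations from Transformers).

Representatives: BERT, RoBERTa, ALBERT, ELECTRA.

2. Decoder-Only (GPT family)

"The weather today is" → Transformer Block (with causal mask) → predicts "nice"

Use: generation tasks (writing, dialogue, code). Key trait: self-attention uses a causal mask, so each position can only see itself and earlier positions, which mimics reading what came before while generating.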
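A causal mask is easy to try with `nn.MultiheadAttention`, which accepts a boolean `attn_mask` where `True` means "may not attend". The sketch below uses toy sizes and random inputs; it checks that the weights above the diagonal are zero and that changing the last token leaves earlier outputs untouched.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
seq_len, d_model = 4, 8
attn = nn.MultiheadAttention(d_model, num_heads=2, batch_first=True)

# True marks positions that must NOT be attended to (everything above the diagonal).
causal_mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)

x = torch.randn(1, seq_len, d_model)
x_changed = x.clone()
x_changed[0, -1] = torch.randn(d_model)      # alter only the last ("future") token

with torch.no_grad():
    out, weights = attn(x, x, x, attn_mask=causal_mask)
    out_changed, _ = attn(x_changed, x_changed, x_changed, attn_mask=causal_mask)

future_weights_zero = bool(torch.all(weights[0][causal_mask] == 0))
past_unaffected = torch.allclose(out[:, :-1], out_changed[:, :-1], atol=1e-5)
print(future_weights_zero, past_unaffected)
```

Without this mask, a language model trained to predict the next token could simply copy it from the input.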

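Representatives: GPT-2/3/4, Claude, Llama, Mistral. Almost all of today's LLMs are Decoder-Only.

3. Encoder-Decoder (original Transformer / T5)

Source sentence → Encoder → encoded
                              ↓
Target sentence → Decoder (cross-attention over encoded) → translation

Use: sequence-to-sequence tasks (translation, summarization). Key trait: the Decoder has an extra cross-attention layer that looks back at the encoder output.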
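Cross-attention is ordinary attention where the queries and the keys/values come from different sequences. A minimal sketch, with made-up lengths (7 source tokens, 5 target tokens) and random tensors standing in for real encoder and decoder states:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d_model = 16
cross_attn = nn.MultiheadAttention(d_model, num_heads=4, batch_first=True)

encoded = torch.randn(1, 7, d_model)         # encoder output: 7 source tokens
decoder_state = torch.randn(1, 5, d_model)   # decoder hidden states: 5 target tokens

# Queries come from the decoder; keys and values come from the encoder.
out, weights = cross_attn(query=decoder_state, key=encoded, value=encoded)
print(out.shape, weights.shape)
```

The output has one vector per target position, and each row of `weights` says how much that target position draws from each source token.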

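Representatives: T5, BART, the original Transformer, Whisper.

State of play in 2026: Decoder-Only is the clear mainstream. Encoder-Only still has its place (BERT-style encoders remain a common choice for high-quality sentence embeddings). Encoder-Decoder is mostly used for specialized tasks.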

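Stop 6: Counting parameters

Where a Transformer's parameters mainly come from:

Component	Parameter count
Token embedding	$V \times d$
Positional embedding (if learned)	$L \times d$
Attention, 4 matrices per layer (Q, K, V, output)	$4 d^2$
FFN, 2 matrices per layer	$8 d^2$ (4× hidden dimension)
LayerNorm	negligible

Total ≈ $V \times d + N \times 12 d^2$, where $V$ is the vocabulary size, $d$ the model dimension, $L$ the maximum sequence length, and $N$ the number of layers.

For GPT-3:

$V = 50000$, $d = 12288$, $N = 96$
$\approx 50000 \times 12288 + 96 \times 12 \times 12288^2 \approx 6 \times 10^8 + 1.7 \times 10^{11}$
About 175B parameters, which matches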
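The same arithmetic as a tiny helper, so other model sizes can be plugged in. It is a back-of-the-envelope sketch: the function name is made up, and biases, LayerNorm, and positional embeddings are ignored, as in the approximation above.

```python
def approx_transformer_params(vocab_size: int, d_model: int, n_layers: int) -> int:
    """Estimate V*d + N*12*d^2, ignoring biases, LayerNorm and positional embeddings."""
    embedding = vocab_size * d_model
    attention_per_layer = 4 * d_model ** 2   # Q, K, V, output projections
    ffn_per_layer = 8 * d_model ** 2         # two matrices with a 4x hidden dimension
    return embedding + n_layers * (attention_per_layer + ffn_per_layer)

gpt3 = approx_transformer_params(vocab_size=50_000, d_model=12_288, n_layers=96)
print(f"{gpt3 / 1e9:.1f}B")   # 174.6B
```

The embedding term is under 1% of the total at this scale; the $12 d^2$ per-layer term dominates.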

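One-sentence summary

Transformer = Embedding + positional encoding + N × (Attention + FFN + Norm + residual) + output layer.

A combination of 5 building blocks (attention, positional encoding, FFN, residual connections, LayerNorm). None of them is complicated on its own, but put together they underpin nearly every large model of 2026.

💡 Want to "see" it?

Transformer internals visualization (in development): unfold GPT-2 layer by layer and watch data flow through its 12 blocks. Pairs well with the live attention calculator.

Next up: L4-01 "How an LLM Is Made", which connects the architecture to the training pipeline.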
