L1 Chapter 2 🐣 🕒 10 min

Linear Algebra: Vectors and Matrices via Pictures and Positions

Under the hood, most of AI is matrix operations. This chapter skips determinants and eigenvalues and just lets you "see" what vectors are doing.

HelloAI Editors

6/11/2026

Open any AI paper and the first thing you’ll see is:

$$y = W x + b$$

Four symbols that sit at the heart of today’s AI: $x$ is the input, $W$ is a matrix of weights, $b$ is a bias, and $y$ is the output.

This is linear algebra’s core application. This article explains what those letters are doing, and why AI can’t live without them.

Vector: An Ordered List of Numbers

Simplest concept. A vector is just a list of numbers:

v = [3, 5]

Two numbers: 3 and 5.

But the key isn’t “two numbers”—it’s “what they represent”.

Vectors are powerful because the same [3, 5] can represent many different things.

On a 2D plane, it’s a point

$$[3, 5] \to \text{the location } (3, 5)$$

On a 2D plane, it’s also an arrow

From origin (0,0) pointing to (3, 5). Point and arrow are two interpretations of the same vector.

In higher dimensions, it represents more complex things

Vector	What it could mean
$[3, 5]$	Two measurements of one object, e.g. width 3 m and height 5 m
$[170, 60, 25]$	Someone’s height / weight / age
$[0, 1, 0, 0]$	One-hot encoding for “banana” from {apple, banana, cherry, orange}
$[0.2, -0.5, 0.8, ..., 0.1]$ (768 numbers)	A word’s BERT embedding
$[r, g, b]$	A pixel’s color

All of these are vectors—same math object, different meanings.
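To make one row of the table concrete, here is a minimal sketch of the one-hot encoding for “banana”. The helper name `one_hot` and the index order of the vocabulary are assumptions made for illustration; real tokenizers build their vocabularies differently.

```python
import numpy as np

# Vocabulary from the table above; the index order is an assumption.
vocab = ["apple", "banana", "cherry", "orange"]

def one_hot(word, vocab):
    """Return a vector of zeros with a single 1 at the word's index."""
    vec = np.zeros(len(vocab), dtype=int)
    vec[vocab.index(word)] = 1
    return vec

banana = one_hot("banana", vocab)
print(banana)  # [0 1 0 0]
```

Same list-of-numbers object as $[3, 5]$, but here each position means “is it this word?”.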

💡 A key intuition

Vectors are the universal currency of information in AI. Whether your data is text, image, or audio—it eventually becomes a list of numbers (a vector) for AI to process.

Dot Product: The “Intimacy” Between Two Vectors

Two vectors can be combined via the dot product:

$$[3, 5] \cdot [2, 4] = 3 \times 2 + 5 \times 4 = 6 + 20 = 26$$

Multiply corresponding entries, then add.
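The rule is short enough to write out by hand. The sketch below uses plain Python lists and a hypothetical helper named `dot`, just to show there is nothing more to it than multiply-and-add.

```python
def dot(a, b):
    """Multiply corresponding entries, then add the products."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum(x * y for x, y in zip(a, b))

print(dot([3, 5], [2, 4]))  # 26
```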

What’s the point?

The dot product has a very useful property:

For vectors of fixed length, the more two vectors “point in the same direction”, the larger their dot product.

Same direction → maximum (positive)
Perpendicular → 0
Opposite direction → minimum (negative)
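The three cases can be checked with unit-length vectors. This is a small sketch with hand-picked vectors; the last line shows why length matters, since a longer vector in the same direction gives a larger dot product.

```python
import numpy as np

a = np.array([1.0, 0.0])               # unit vector pointing right

same          = np.array([1.0, 0.0])   # same direction as a
perpendicular = np.array([0.0, 1.0])   # 90 degrees from a
opposite      = np.array([-1.0, 0.0])  # 180 degrees from a

print(a @ same)           # 1.0
print(a @ perpendicular)  # 0.0
print(a @ opposite)       # -1.0

# Same direction, but twice as long: the dot product doubles too.
print(a @ (2 * same))     # 2.0
```

Cosine similarity divides the dot product by both lengths, which removes that last effect and leaves only the direction.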

This makes the dot product AI’s standard tool for measuring “similarity”.

In L0-11’s glossary, we said “similar embeddings = similar meanings”. The actual measure is the dot product, or its length-normalized variant, cosine similarity.

Example. Assume “king”, “queen”, and “banana” have been mapped to vectors:

king   = [0.8, 0.2, 0.9, ...]   # 768 numbers
queen  = [0.7, 0.3, 0.85, ...]
banana = [-0.3, 0.6, -0.1, ...]

king · queen  = a big positive number  → they're similar
king · banana = near zero              → they're unrelated

This is the core of RAG retrieval, recommender systems, and search.
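Here is a toy version of that retrieval step. The 3-dim vectors are hand-made stand-ins (only the first three numbers of the illustrative vectors above, not real embeddings), and `cosine_similarity` is a helper defined here, not a library function.

```python
import numpy as np

def cosine_similarity(a, b):
    """Dot product divided by both lengths, so only direction counts."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 3-dim vectors for illustration only; real embeddings have hundreds of dims.
embeddings = {
    "king":   np.array([0.8, 0.2, 0.9]),
    "queen":  np.array([0.7, 0.3, 0.85]),
    "banana": np.array([-0.3, 0.6, -0.1]),
}

# "Retrieval": rank every other word by similarity to the query.
query = embeddings["king"]
ranked = sorted(
    (w for w in embeddings if w != "king"),
    key=lambda w: cosine_similarity(query, embeddings[w]),
    reverse=True,
)
print(ranked)  # ['queen', 'banana']
```

A search or recommender system does roughly the same thing at scale: embed the query, score it against stored vectors, and return the top matches.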

Matrix: Stacks of Vectors

If a vector is one row of numbers, a matrix is multiple rows:

$$W = \begin{bmatrix} 2 & 0 \\ 0 & 3 \end{bmatrix}$$

It can be read as two row vectors stacked: $[2, 0]$ and $[0, 3]$.

But matrices’ most powerful aspect isn’t “holding data”—it’s performing transformations.

Matrix × Vector: A Single “Transformation”

Back to $y = Wx$.

A super simple example:

$$\begin{bmatrix} 2 & 0 \\ 0 & 3 \end{bmatrix} \times \begin{bmatrix} 3 \\ 5 \end{bmatrix} = \begin{bmatrix} 6 \\ 15 \end{bmatrix}$$

The rule: Each row of the matrix dot-products with the vector → one output number.

Row 1: $[2, 0] \cdot [3, 5] = 2 \times 3 + 0 \times 5 = 6$
Row 2: $[0, 3] \cdot [3, 5] = 0 \times 3 + 3 \times 5 = 15$
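The row-by-row rule translates almost word for word into code. This is a sketch with plain lists and a hypothetical helper `matvec`; in practice you would use NumPy’s `@` operator instead.

```python
def matvec(W, x):
    """Dot each row of W with x; each dot product is one output number."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

W = [[2, 0],
     [0, 3]]
x = [3, 5]
print(matvec(W, x))  # [6, 15]
```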

What does this mean geometrically?

The original vector was $(3, 5)$. After multiplication, it became $(6, 15)$:

x direction stretched 2×
y direction stretched 3×

This matrix is a “stretcher”.

Different matrices do different transformations:

Matrix	Effect
$\begin{bmatrix} 2 & 0 \\ 0 & 2 \end{bmatrix}$	Scale by 2
$\begin{bmatrix} 1 & 0 \\ 0 & -1 \end{bmatrix}$	Flip vertically (mirror across the x-axis)
$\begin{bmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{bmatrix}$	Rotate counterclockwise by θ
$\begin{bmatrix} 1 & 0.5 \\ 0 & 1 \end{bmatrix}$	Shear (squares become parallelograms)
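You can try each matrix from the table on a single vector. The sketch below applies all four to $(1, 1)$; the angle of 90° is picked only for illustration.

```python
import numpy as np

v = np.array([1.0, 1.0])
theta = np.pi / 2  # 90 degrees, an arbitrary choice for this demo

scale  = np.array([[2.0, 0.0], [0.0, 2.0]])    # (1, 1) -> (2, 2)
flip   = np.array([[1.0, 0.0], [0.0, -1.0]])   # (1, 1) -> (1, -1)
rotate = np.array([[np.cos(theta), -np.sin(theta)],
                   [np.sin(theta),  np.cos(theta)]])  # (1, 1) -> approx (-1, 1)
shear  = np.array([[1.0, 0.5], [0.0, 1.0]])    # (1, 1) -> (1.5, 1)

for name, M in [("scale", scale), ("flip", flip),
                ("rotate", rotate), ("shear", shear)]:
    # Round so floating-point noise in cos/sin doesn't clutter the output.
    print(name, np.round(M @ v, 3))
```

Same input vector every time; only the matrix changes, and with it the “action”.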

Core insight:

A matrix isn’t a pile of numbers. A matrix is an “action”—it transforms vectors.

Why AI Can’t Live Without Matrices

Back to our starting formula:

$$y = W x + b$$

This is a single layer of a neural network. What’s it doing?

$x$ is the input vector (e.g., 768-dim BERT embedding)
$W$ is a learned matrix (e.g., 256 × 768: 256 rows, each dotted with the 768-dim input)
$b$ is a bias vector
$y$ is the output (256-dim)

In short, the layer transforms a 768-dim input into a 256-dim output.
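A quick shape check makes this concrete. The values below are random stand-ins rather than trained weights, and the seed is arbitrary. With $x$ as a column vector in $y = Wx + b$, $W$ needs 256 rows and 768 columns for the product to be defined.

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded only so the sketch is reproducible

x = rng.standard_normal(768)         # input vector (stand-in for an embedding)
W = rng.standard_normal((256, 768))  # weights: 256 rows, each dotted with x
b = np.zeros(256)                    # bias: one number per output entry

y = W @ x + b
print(x.shape, W.shape, y.shape)  # (768,) (256, 768) (256,)
```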

The specific numbers in $W$ decide the transformation. Training a neural network is essentially adjusting the numbers in $W$ (and $b$) so the network maps inputs to “outputs we want.”

GPT-3 has 175 billion parameters, and most of them are entries in weight matrices like $W$.

🔬 One-line summary

Neural network = a chain of matrix transformations + activation functions. Every “large model” you’ve heard of is fundamentally: vector → matrix → vector → matrix → … in relay.
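The relay can be sketched in a few lines. Everything specific here is an assumption for illustration: the layer sizes, the choice of ReLU as the activation, and the random (untrained) weights.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(z):
    """A common activation function: keep positives, zero out negatives."""
    return np.maximum(z, 0.0)

# Hypothetical layer sizes: 768 -> 256 -> 64 -> 10
sizes = [768, 256, 64, 10]
layers = [
    (rng.standard_normal((n_out, n_in)) * 0.01, np.zeros(n_out))
    for n_in, n_out in zip(sizes[:-1], sizes[1:])
]

def forward(x, layers):
    """vector -> matrix -> activation -> matrix -> ... in relay."""
    for i, (W, b) in enumerate(layers):
        x = W @ x + b
        if i < len(layers) - 1:  # no activation after the last layer
            x = relu(x)
    return x

out = forward(rng.standard_normal(768), layers)
print(out.shape)  # (10,)
```

Without the activation between layers, the whole chain would collapse into a single matrix multiplication, which is why the non-linear step is there.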

Hands-On With NumPy

Let’s actually run this in Python:

import numpy as np

# Vector
x = np.array([3, 5])
print(x)              # [3 5]
print(x.shape)        # (2,)

# Matrix
W = np.array([[2, 0],
              [0, 3]])
print(W.shape)        # (2, 2)

# Matrix times vector
y = W @ x             # @ is the matrix multiplication operator
print(y)              # [ 6 15] ← matches our hand calculation

# Dot product
v1 = np.array([3, 5])
v2 = np.array([2, 4])
print(v1 @ v2)        # 26
print(np.dot(v1, v2)) # 26 (another way)

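# Real scenario: cosine similarity of two embeddings
# (random stand-ins here, so expect a value near 0;
# real embeddings of related words would typically score higher)
emb_king  = np.random.randn(768)
emb_queen = np.random.randn(768)
similarity = emb_king @ emb_queen / (np.linalg.norm(emb_king) * np.linalg.norm(emb_queen))
print(f"Similarity: {similarity:.3f}")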

Strongly recommended: open Google Colab and run this code. Code you have typed yourself sticks best.

Real Scenario: Q, K, V in Transformer

We mentioned Q, K, V in L0’s glossary as the core of the Transformer. Now you can understand them:

import numpy as np

# X is the input sequence's embedding matrix
# Assume 10 words in a sentence, each 768-dim
X = np.random.randn(10, 768)  # (10, 768)

# Wq, Wk, Wv are three "learned" matrices (random stand-ins here)
Wq = np.random.randn(768, 64)  # (768, 64)
Wk = np.random.randn(768, 64)
Wv = np.random.randn(768, 64)

# Matrix multiplications → Q, K, V
Q = X @ Wq  # (10, 64)
K = X @ Wk  # (10, 64)
V = X @ Wv  # (10, 64)

# Attention scores = Q · K^T
scores = Q @ K.T  # (10, 10)

Every operation here is matrix multiplication. This is the “heartbeat” of the Transformer.
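The snippet above stops at the raw scores. The standard scaled dot-product attention goes two steps further: divide the scores by $\sqrt{d_k}$, apply a softmax to each row, then use the resulting weights to mix the rows of $V$. The sketch below adds those steps for a single head with no masking. The weights are random stand-ins, and the `0.02` scale factor and the `softmax` helper are choices made here for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z, axis=-1):
    """Turn scores into positive weights that sum to 1 along `axis`."""
    z = z - z.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

X  = rng.standard_normal((10, 768))         # 10 words, 768-dim each
Wq = rng.standard_normal((768, 64)) * 0.02  # random stand-ins for learned weights
Wk = rng.standard_normal((768, 64)) * 0.02
Wv = rng.standard_normal((768, 64)) * 0.02

Q, K, V = X @ Wq, X @ Wk, X @ Wv            # each (10, 64)

d_k = Q.shape[-1]
scores  = Q @ K.T / np.sqrt(d_k)            # (10, 10): every word vs. every word
weights = softmax(scores, axis=-1)          # each row sums to 1
output  = weights @ V                       # (10, 64): weighted mix of value rows

print(weights.shape, output.shape)  # (10, 10) (10, 64)
```

Apart from the softmax, it is still dot products and matrix multiplications all the way down.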

One-Line Summary

Vectors are AI’s data, matrices are AI’s operations.

Master both, and you’ll be able to read most of the formulas in AI papers.

Want to “See” It?

👀 Open the Embedding Space Walker visualization to see 50 words positioned in vector space and play with the king - man + woman = queen arithmetic.
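If you want to try the arithmetic before opening the visualization, here is a deliberately tiny sketch. The 2-dim vectors are hand-built so that one axis loosely means “royalty” and the other “gender”. Real embeddings are learned, have far more dimensions, and only approximately behave this way.

```python
import numpy as np

# Hand-built toy vectors: axis 0 = "royalty", axis 1 = "gender". Not real embeddings.
words = {
    "king":  np.array([1.0,  1.0]),
    "queen": np.array([1.0, -1.0]),
    "man":   np.array([0.0,  1.0]),
    "woman": np.array([0.0, -1.0]),
}

target = words["king"] - words["man"] + words["woman"]

def nearest(vec, words):
    """Return the word whose vector is closest to `vec` (Euclidean distance)."""
    return min(words, key=lambda w: np.linalg.norm(words[w] - vec))

print(nearest(target, words))  # queen
```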

💡 Next preview

With linear algebra down, we hit one question: where do the numbers in matrix $W$ come from? Answer: “training”. Training is built on derivatives and gradients—next article covers it.

Next: “Derivatives and Gradients: The Math Definition of ‘Learning’”