L2 Chapter 2 🐣 🕒 9 min

Linear Regression: The Simplest and Most Profound ML Model

Think linear regression is too simple? It typically forms the last layer of a neural network, serves as the starting point of most ML projects, and still beats deep learning in many scenarios.

HelloAI Editors

6/25/2026

Linear regression is among ML’s oldest algorithms: its core method, least squares, predates the term “machine learning” by about 150 years (Legendre published it in 1805, Gauss in 1809).

Don’t underestimate it. All your ML projects start here—either using it directly, as a baseline, or as a complex model’s “final layer.”

One-line Definition

Find a line (or hyperplane) to fit your data.

Simplest form:

$$y = wx + b$$

$x$ : input feature
$y$ : numerical prediction
$w, b$ : model parameters (what we train)

A Concrete Example

Task: predict house price from square footage.

Area (m²)	Price (k)
50	100
80	180
100	230
120	280
150	350

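Intuition: each m² adds somewhere between 2.3k and 2.7k. As a rough first guess, try $y = 2.3x - 5$ .

That guess means $w = 2.3, b = -5$ . Model training = finding the best $w, b$ ; for this table, the least-squares optimum turns out to be $w = 2.5, b = -22$ .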

How to “Train”

L1-03 hand-walked us through this: define a loss function, minimize with gradient descent.

Loss Function (MSE)

Mean squared error—average squared difference between prediction and truth:

$$L(w, b) = \frac{1}{N} \sum_{i=1}^{N} (wx_i + b - y_i)^2$$
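As a minimal sketch of this formula, the snippet below evaluates the MSE on the five houses from the table for two candidate lines: the hand guess $y = 2.3x - 5$ and a second candidate $y = 2.5x - 22$ picked for comparison. The function name `mse` is ours, not a library call.

```python
import numpy as np

# The five houses from the table above: area (m²) and price (k)
x = np.array([50, 80, 100, 120, 150], dtype=float)
y = np.array([100, 180, 230, 280, 350], dtype=float)

def mse(w, b, x, y):
    """Mean squared error of the line y = w*x + b on the data."""
    predictions = w * x + b
    return np.mean((predictions - y) ** 2)

loss_guess = mse(2.3, -5.0, x, y)    # the hand guess
loss_other = mse(2.5, -22.0, x, y)   # a second candidate line

print(f"guess: {loss_guess:.1f}")   # 61.4
print(f"other: {loss_other:.1f}")   # 6.0
```

A lower loss means a better fit, so the second line is the better of the two; training automates this comparison over all possible $w, b$ .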

Closed-Form Solution (no gradient descent needed)

Linear regression has a unique advantage: the optimum can be computed directly, no iteration:

$$\hat{w} = \frac{\text{Cov}(x, y)}{\text{Var}(x)}$$

$$\hat{b} = \bar{y} - \hat{w} \bar{x}$$

This is least squares, 19th-century math. Today’s scikit-learn LinearRegression solves the same least-squares problem directly, without gradient descent.
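The two formulas translate almost line for line into NumPy. Here is a minimal sketch on the five houses from the earlier table; the variable names `w_hat` and `b_hat` are ours.

```python
import numpy as np

x = np.array([50, 80, 100, 120, 150], dtype=float)
y = np.array([100, 180, 230, 280, 350], dtype=float)

# Slope: covariance of x and y divided by variance of x
# (both use the same 1/N normalisation, so that factor cancels)
x_mean, y_mean = x.mean(), y.mean()
w_hat = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)

# Intercept: the fitted line passes through the point of means
b_hat = y_mean - w_hat * x_mean

print(f"w = {w_hat:.2f}, b = {b_hat:.2f}")   # w = 2.50, b = -22.00
```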

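Trivia: with one feature, the closed form fits in about 5 lines of code. With more features a closed form still exists, $\hat{w} = (X^\top X)^{-1} X^\top y$ (with a column of ones in $X$ for the intercept), but it means solving a linear system whose cost grows quickly with the number of features. It is still computable, but on large problems (as a rough rule of thumb, past 100K samples with many features) gradient descent is typically faster.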
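For comparison, here is a minimal sketch of the iterative route on the same five houses. Standardising the feature, the learning rate of 0.1, and the 500 steps are our assumptions for illustration, not tuned values.

```python
import numpy as np

x = np.array([50, 80, 100, 120, 150], dtype=float)
y = np.array([100, 180, 230, 280, 350], dtype=float)

# Standardise the feature so one learning rate suits both parameters
x_mean, x_std = x.mean(), x.std()
xs = (x - x_mean) / x_std

w_s, b_s = 0.0, 0.0        # parameters in the standardised space
learning_rate = 0.1        # assumed value, not tuned
for _ in range(500):
    error = w_s * xs + b_s - y
    grad_w = 2 * np.mean(error * xs)   # dL/dw of the MSE
    grad_b = 2 * np.mean(error)        # dL/db of the MSE
    w_s -= learning_rate * grad_w
    b_s -= learning_rate * grad_b

# Map the parameters back to the original units of x
w = w_s / x_std
b = b_s - w * x_mean
print(f"w = {w:.2f}, b = {b:.2f}")   # w = 2.50, b = -22.00
```

Both routes land on the same line; the iterative one simply gets there in many small steps instead of one formula.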

Multivariate Linear Regression

Real data has many features. House price doesn’t just depend on area:

Number of bedrooms (rooms)
Location score (location)
Age (age)

The model becomes:

$$y = w_1 \cdot \text{area} + w_2 \cdot \text{rooms} + w_3 \cdot \text{location} + w_4 \cdot \text{age} + b$$

In matrix form:

$$y = w \cdot x + b$$

—this is the L1-02 formula you’ve seen. Linear regression is just the simplest case of $y = Wx + b$ .
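To make the matrix form concrete, here is a small sketch showing that the dot product and the spelled-out sum give the same prediction. The weights, bias, and the two houses are hypothetical numbers chosen only for illustration.

```python
import numpy as np

# Hypothetical weights for [area, rooms, location, age] and a bias
w = np.array([2.0, 12.0, 8.0, -5.0])
b = 10.0

# Two made-up houses, one row per house
X = np.array([
    [100.0, 3.0, 7.0, 10.0],
    [60.0, 2.0, 9.0, 30.0],
])

# Matrix form: one dot product per row
y_matrix = X @ w + b

# The same prediction written out term by term for the first house
area, rooms, location, age = X[0]
y_first = w[0] * area + w[1] * rooms + w[2] * location + w[3] * age + b

print(y_matrix)   # predictions 252 and 76
print(y_first)    # 252, matching the first entry above
```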

With sklearn

from sklearn.linear_model import LinearRegression
import numpy as np

# Data
X = np.array([[50], [80], [100], [120], [150]])
y = np.array([100, 180, 230, 280, 350])

# Train
model = LinearRegression()
model.fit(X, y)

# See results
print(f"w = {model.coef_[0]:.2f}")   # ≈ 2.50
print(f"b = {model.intercept_:.2f}") # ≈ -22.00

# Predict
print(model.predict([[90]]))   # ≈ 203
print(model.predict([[200]]))  # ≈ 478

3 lines to train, 1 line to predict. This is the essence of ML engineering—most complex work is wrapped in libraries like sklearn.

Multi-Feature Version

import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Assume we have richer data
df = pd.read_csv('housing.csv')
X = df[['area', 'rooms', 'location_score', 'age']]
y = df['price']

# Train/test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train
model = LinearRegression()
model.fit(X_train, y_train)

# Evaluate
score = model.score(X_test, y_test)
print(f"R² = {score:.3f}")

# Look at each feature's weight
for name, w in zip(X.columns, model.coef_):
    print(f"{name}: {w:+.3f}")

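Weights are interpretable, which is linear regression’s killer feature: you can read off statements like “each extra bedroom adds 12k” or “each year of age subtracts 5k”, each holding the other features fixed (and assuming the features are not strongly collinear).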
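Since `housing.csv` is not provided here, the sketch below builds a synthetic stand-in with known effects (+12k per room, -5k per year of age) and checks that the fitted weights recover them. All data and coefficients are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic stand-in for housing.csv; every number here is made up
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "area": rng.uniform(40, 200, n),
    "rooms": rng.integers(1, 6, n),
    "location_score": rng.uniform(0, 10, n),
    "age": rng.uniform(0, 50, n),
})
# Prices generated from known effects: +12k per room, -5k per year of age
df["price"] = (2.0 * df["area"] + 12.0 * df["rooms"]
               + 8.0 * df["location_score"] - 5.0 * df["age"] + 10.0)

features = ["area", "rooms", "location_score", "age"]
model = LinearRegression().fit(df[features], df["price"])

# Each weight is the price change per unit of that feature,
# holding the other features fixed
for name, weight in zip(features, model.coef_):
    print(f"{name}: {weight:+.3f}")
```

Because the synthetic prices contain no noise, the printed weights match the generating values; on real data they would only be estimates.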

Assumptions

Linear regression isn’t a silver bullet. It has core assumptions:

Relationship is linear— $y$ does roughly scale linearly with $x$
Features aren’t strongly collinear (no redundant features)
Residuals are normally distributed (needed for statistical inference)
Homoscedasticity (constant variance)

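The model will still fit when these assumptions are violated, but its quality drops: predictions become less accurate, and coefficient estimates and their confidence intervals become less trustworthy.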
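One quick, informal check on the linearity assumption is to look at the residuals (actual minus predicted). Here is a sketch on the five houses from the earlier table, using `np.polyfit` for the least-squares line.

```python
import numpy as np

x = np.array([50, 80, 100, 120, 150], dtype=float)
y = np.array([100, 180, 230, 280, 350], dtype=float)

# Least-squares line; np.polyfit returns [slope, intercept] for degree 1
w, b = np.polyfit(x, y, 1)
residuals = y - (w * x + b)

# With an intercept in the model, residuals always average to zero,
# so look at their pattern against x rather than their mean
for xi, ri in zip(x, residuals):
    print(f"area={xi:5.0f}  residual={ri:+.2f}")
```

Here the residuals are negative at both ends and positive in the middle, the kind of curved pattern that hints at mild nonlinearity. With only five points this is suggestive at best.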

What if it’s not linear

Many relationships aren’t straight lines. E.g., price vs area might be quadratic (luxury premium).

Trick: feature engineering—add $x^2$ as a new feature.

df['area_sq'] = df['area'] ** 2
X = df[['area', 'area_sq', 'rooms', ...]]   # Now fits curves

The model is still linear (in parameters), but fits nonlinear data—a powerful little trick.
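A self-contained sketch of the trick, on made-up data that is exactly quadratic in area. The helper `fit_and_mse` is ours; it is an ordinary least-squares fit, and only the feature columns change between the two calls.

```python
import numpy as np

# Made-up data with a genuinely quadratic relationship
area = np.array([50, 80, 100, 120, 150, 200], dtype=float)
price = 0.01 * area**2 + 1.0 * area + 20.0

def fit_and_mse(features, target):
    """Least-squares fit with an intercept; returns (coefficients, MSE)."""
    design = np.column_stack([features, np.ones(len(target))])
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    return coef, np.mean((design @ coef - target) ** 2)

# Straight line in area only
_, mse_line = fit_and_mse(area.reshape(-1, 1), price)

# Same linear solver, with area squared added as an extra column
coef_quad, mse_quad = fit_and_mse(np.column_stack([area, area**2]), price)

print(f"line MSE:      {mse_line:.3f}")
print(f"quadratic MSE: {mse_quad:.3f}")
print(coef_quad)   # approximately [1.0, 0.01, 20.0]
```

The coefficients come back in the order area, area squared, intercept, matching the values used to generate the data.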

Regularization (covered in L2-07)

Add a penalty to prevent overfitting:

Name	Formula	Effect
Ridge (L2)	$L + \lambda \|w\|_2^2$	All weights become smaller
Lasso (L1)	$L + \lambda \|w\|_1$	Some weights become exactly 0 (feature selection)
ElasticNet	$L$ + mix of L1 and L2 penalties	Balance between the two

from sklearn.linear_model import Ridge, Lasso

model = Ridge(alpha=1.0)
model.fit(X_train, y_train)
# Or
model = Lasso(alpha=0.1)
model.fit(X_train, y_train)
print(model.coef_)   # See which features got zeroed
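To see both effects without needing the housing data, here is a sketch on synthetic data where only the first two of five features matter. The data, the noise level, and the `alpha` values are our assumptions; `alpha` plays the role of $\lambda$ , up to how scikit-learn scales the loss.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge, Lasso

# Synthetic data: only the first two of five features matter
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 0.1 * rng.normal(size=200)

ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=10.0).fit(X, y)
lasso = Lasso(alpha=0.1).fit(X, y)

# Ridge shrinks the weight vector as a whole
print("OLS   norm:", np.linalg.norm(ols.coef_))
print("Ridge norm:", np.linalg.norm(ridge.coef_))

# Lasso sets the weights of the irrelevant features to exactly zero
print("Lasso coef:", lasso.coef_)
```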

Pros and Cons

✅ Pros:

Lightning-fast training (milliseconds for small data)
Extremely interpretable (every weight has clear meaning)
The first baseline of any ML project
Mature mathematics, established statistical inference

❌ Cons:

Limited expressiveness (can’t learn complex nonlinearities)
Sensitive to outliers
Many assumptions

Real Business Role

Many think “classical ML is obsolete”—wrong. Linear regression is still one of industry’s most-used algorithms:

A/B testing: estimating “X feature’s marginal contribution to KPI”—linear regression at the core
Insurance pricing: premium = base + risk factor × weight—pure linear regression
Economic models: estimating “policy impact on unemployment”—the foundation of econometrics
Project baseline: always run linear regression first; if your fancy model isn’t much better, you have a problem
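The A/B testing case can be sketched in a few lines: regress the KPI on a 0/1 treatment indicator, and the fitted slope is the estimated effect. The six KPI values below are made up for illustration.

```python
import numpy as np

# Made-up KPI values for six users; treated = 1 means the user saw the feature
treated = np.array([0, 0, 0, 1, 1, 1], dtype=float)
kpi = np.array([10.0, 12.0, 11.0, 14.0, 15.0, 13.0])

# Regress KPI on the treatment indicator: kpi ≈ w * treated + b
w, b = np.polyfit(treated, kpi, 1)

# With a single 0/1 feature, the fitted slope equals the difference in group means
diff_in_means = kpi[treated == 1].mean() - kpi[treated == 0].mean()

print(f"slope = {w:.2f}, difference in means = {diff_in_means:.2f}")   # both 3.00
```

The regression framing pays off once you add more columns (user segment, platform, and so on), because the treatment weight is then estimated with those held fixed.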

One truth: for most business problems, linear regression + good feature engineering gets you roughly 80% of the way. If you can’t do that 80%, no fancy model saves you.

💡 The depth of linear regression

The last layer (output layer) of a deep neural network is typically a linear map of the form $Wx + b$ , the same form as linear regression.

The logits of every LLM are too—a $W$ matrix maps hidden states to vocabulary-size vectors.
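A toy sketch of that final step, with random numbers standing in for a trained model. The sizes, the random `W`, and the zero bias are assumptions for illustration; real models are far larger, and the bias term is optional.

```python
import numpy as np

rng = np.random.default_rng(0)
hidden_size, vocab_size = 8, 20    # toy sizes; real models are far larger

h = rng.normal(size=hidden_size)                 # hidden state for one position
W = rng.normal(size=(vocab_size, hidden_size))   # output projection (random here)
b = np.zeros(vocab_size)

# The linear step: same form as y = Wx + b
logits = W @ h + b

# Softmax turns logits into a probability distribution over the vocabulary
probs = np.exp(logits - logits.max())
probs /= probs.sum()

print(logits.shape, probs.sum())
```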

Linear regression isn’t a “simple algorithm”—it’s the most important “lego brick” of ML.

Next: “Logistic Regression & Classification: From Regression to Decision” — same idea, extended from “predict numbers” to “predict categories”.