Linear Regression: The Simplest and Most Profound ML Model
Think linear regression is too simple? A linear map of the same form is typically the last layer of a neural network, it is the usual starting point of an ML project, and it still beats deep learning in many scenarios, especially on small tabular datasets.
Linear regression is one of ML’s oldest algorithms, predating the term “machine learning” by roughly 150 years (Legendre published the method of least squares in 1805, Gauss in 1809).
Don’t underestimate it. All your ML projects start here—either using it directly, as a baseline, or as a complex model’s “final layer.”
One-line Definition
Find a line (or hyperplane) to fit your data.
Simplest form:
$$\hat{y} = wx + b$$
- $x$: input feature
- $\hat{y}$: numerical prediction
- $w, b$: model parameters (what we train)
A Concrete Example
Task: predict house price from floor area.
| Area (m²) | Price (k) |
|---|---|
| 50 | 100 |
| 80 | 180 |
| 100 | 230 |
| 120 | 280 |
| 150 | 350 |
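Intuition: each m² adds about 2.3k. Try $\hat{y} = 2.3x$.
So $w = 2.3$ and $b = 0$ is a first guess. Model training = finding the best $w$ and $b$. (For this table the least-squares optimum works out to $w = 2.5$, $b = -22$.)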
How to “Train”
L1-03 walked through this by hand: define a loss function, then minimize it with gradient descent.
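As a rough sketch of that recipe on the five houses from the table above, the snippet below minimizes mean squared error (defined in the next section) with plain gradient descent. The learning rate, the step count, and the choice to standardize the area first are assumptions made here for illustration, not something L1-03 is known to prescribe.

```python
import numpy as np

# Toy data from the table above: area (m²) and price (k)
x = np.array([50.0, 80.0, 100.0, 120.0, 150.0])
y = np.array([100.0, 180.0, 230.0, 280.0, 350.0])

# Standardize the feature so one learning rate suits both parameters (assumption)
x_mean, x_std = x.mean(), x.std()
xs = (x - x_mean) / x_std

w_s, b_s = 0.0, 0.0   # parameters in standardized space
lr, steps = 0.1, 500  # illustrative hyperparameters

for _ in range(steps):
    error = (w_s * xs + b_s) - y        # prediction minus truth
    grad_w = 2.0 * np.mean(error * xs)  # d(MSE)/dw
    grad_b = 2.0 * np.mean(error)       # d(MSE)/db
    w_s -= lr * grad_w
    b_s -= lr * grad_b

# Convert the parameters back to the original units
w = w_s / x_std
b = b_s - w_s * x_mean / x_std
print(f"w = {w:.2f}, b = {b:.2f}")
```

On this table the loop settles at about w = 2.50 and b = -22.00. Without the standardization step, the raw areas (50 to 150) would force a much smaller learning rate and far more steps.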
Loss Function (MSE)
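Mean squared error is the average squared difference between prediction and truth:
$$\text{MSE}(w, b) = \frac{1}{n} \sum_{i=1}^{n} \left(\hat{y}_i - y_i\right)^2, \qquad \hat{y}_i = w x_i + b$$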
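To make the loss concrete, here is a small sketch that scores two candidate lines on the house table: the eyeballed guess of 2.3k per m² with no intercept, and the line w = 2.5, b = -22, which is the least-squares optimum for these five points. The helper name `mse` is just an illustrative choice.

```python
import numpy as np

# Toy data from the table above: area (m²) and price (k)
x = np.array([50.0, 80.0, 100.0, 120.0, 150.0])
y = np.array([100.0, 180.0, 230.0, 280.0, 350.0])

def mse(w, b):
    """Mean squared error of the line y_hat = w * x + b on the toy data."""
    y_hat = w * x + b
    return float(np.mean((y_hat - y) ** 2))

guess_loss = mse(2.3, 0.0)   # eyeballed guess, roughly 56.4
fit_loss = mse(2.5, -22.0)   # least-squares line, 6.0
print(guess_loss, fit_loss)
```

The guess is not bad, but the fitted line cuts the loss by roughly a factor of nine, which is exactly what "training" buys here.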
Closed-Form Solution (no gradient descent needed)
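Linear regression has a rare advantage: the optimum can be computed directly, with no iteration:
$$\mathbf{w} = (X^\top X)^{-1} X^\top \mathbf{y}$$
Here $X$ is the data matrix with one row per sample (plus a column of ones, so that $b$ is absorbed into $\mathbf{w}$) and $\mathbf{y}$ is the vector of true values.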
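This is least squares, 19th-century math. Today’s sklearn LinearRegression solves the same least-squares problem directly, using a numerically stable solver rather than an explicit matrix inverse.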
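A minimal NumPy sketch of the closed-form route on the house table, assuming the intercept is handled by prepending a column of ones to the data matrix. It solves the normal equations with `np.linalg.solve` and cross-checks against `np.linalg.lstsq`. It is meant as an illustration of the math, not as a description of sklearn's internals.

```python
import numpy as np

# Toy data from the table above: area (m²) and price (k)
x = np.array([50.0, 80.0, 100.0, 120.0, 150.0])
y = np.array([100.0, 180.0, 230.0, 280.0, 350.0])

# Design matrix with a leading column of ones so the intercept b is learned too
X = np.column_stack([np.ones_like(x), x])

# Normal equations: solve (X^T X) theta = X^T y instead of inverting explicitly
theta_normal = np.linalg.solve(X.T @ X, X.T @ y)

# Same problem through NumPy's least-squares routine
theta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

b, w = theta_normal
print(f"w = {w:.2f}, b = {b:.2f}")
```

Both routes agree on w = 2.50 and b = -22.00. Solving the linear system is generally preferred over forming the inverse, since it is cheaper and more stable numerically.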
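Trivia: with one feature, least squares has a simple closed form (about 5 lines of code). With more features, the solution involves inverting (or, in practice, solving a system with) a features-by-features matrix. That is still cheap for modest feature counts, but the cost grows roughly cubically with the number of features, so for very wide datasets or data that does not fit in memory, iterative methods such as gradient descent are usually preferred.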
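For the one-feature case, the closed form is the covariance of x and y divided by the variance of x, with the intercept chosen so the line passes through the means. A short sketch on the house table:

```python
import numpy as np

# Toy data from the table above: area (m²) and price (k)
x = np.array([50.0, 80.0, 100.0, 120.0, 150.0])
y = np.array([100.0, 180.0, 230.0, 280.0, 350.0])

x_mean, y_mean = x.mean(), y.mean()
# Slope: sum of co-deviations over sum of squared deviations of x
w = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
# Intercept: the fitted line passes through (x_mean, y_mean)
b = y_mean - w * x_mean
print(f"w = {w:.2f}, b = {b:.2f}")
```

This prints w = 2.50, b = -22.00 for the table above.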
Multivariate Linear Regression
Real data has many features. House price doesn’t just depend on area:
- Number of bedrooms (rooms)
- Location score (location)
- Age (age)
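The model becomes:
$$\hat{y} = w_1 x_{\text{area}} + w_2 x_{\text{rooms}} + w_3 x_{\text{location}} + w_4 x_{\text{age}} + b$$
In matrix form:
$$\hat{y} = \mathbf{w}^\top \mathbf{x} + b$$
This is the L1-02 formula you’ve seen. Linear regression is just the simplest case of $y = Wx + b$, where $W$ has a single row.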
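To see that the matrix form is only a compact way of writing the per-feature sum, here is a small sketch. The two houses, the weights, and the bias are all made-up numbers chosen for easy arithmetic, not fitted values.

```python
import numpy as np

# Hypothetical feature rows: [area, rooms, location_score, age]
X = np.array([
    [80.0, 2.0, 7.0, 10.0],
    [120.0, 3.0, 8.5, 3.0],
])
# Made-up weights and bias, only to show the shapes
w = np.array([2.0, 12.0, 5.0, -5.0])
b = 10.0

# Written out feature by feature
y_hat_loop = np.array([
    sum(w_j * x_j for w_j, x_j in zip(w, row)) + b
    for row in X
])

# The same computation as one matrix-vector product
y_hat_matrix = X @ w + b
print(y_hat_loop, y_hat_matrix)
```

Both versions give 179.0 and 313.5. The matrix version is what libraries run under the hood, because it handles every sample in a single vectorized operation.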
With sklearn
from sklearn.linear_model import LinearRegression
import numpy as np
# Data
X = np.array([[50], [80], [100], [120], [150]])
y = np.array([100, 180, 230, 280, 350])
# Train (construct the model, then fit)
model = LinearRegression()
model.fit(X, y)
# See results
print(f"w = {model.coef_[0]:.2f}") # 2.50
print(f"b = {model.intercept_:.2f}") # -22.00
# Predict
print(model.predict([[90]])) # ≈ 203
print(model.predict([[200]])) # ≈ 478
3 lines to train, 1 line to predict. This is the essence of ML engineering—most complex work is wrapped in libraries like sklearn.
Multi-Feature Version
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
# Assume we have richer data
df = pd.read_csv('housing.csv')
X = df[['area', 'rooms', 'location_score', 'age']]
y = df['price']
# Train/test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Train
model = LinearRegression()
model.fit(X_train, y_train)
# Evaluate
score = model.score(X_test, y_test)
print(f"R² = {score:.3f}")
# Look at each feature's weight
for name, w in zip(X.columns, model.coef_):
    print(f"{name}: {w:+.3f}")
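Weights are interpretable, and this is linear regression’s killer feature: you can read off statements like “each extra bedroom adds 12k” or “each year of age subtracts 5k”, holding the other features fixed. Keep in mind that each weight is expressed in its feature’s own units, so raw weights are not directly comparable across features.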
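One common way to compare weights across features is to standardize the inputs first, so each weight means "effect of one standard deviation". The sketch below assumes synthetic data with a made-up ground truth (2.5k per m², +12k per room, -5k per year of age); a `StandardScaler` plus `LinearRegression` pipeline is one reasonable setup, not the only one.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 500
area = rng.uniform(40, 200, n)   # m²
rooms = rng.integers(1, 6, n)    # count, 1 to 5
age = rng.uniform(0, 40, n)      # years
# Synthetic ground truth (assumption): 2.5k per m², +12k per room, -5k per year
price = 2.5 * area + 12 * rooms - 5 * age + rng.normal(0, 10, n)

X = np.column_stack([area, rooms, age])
names = ["area", "rooms", "age"]

# Weights in raw units vs. weights per standard deviation of each feature
raw = LinearRegression().fit(X, price)
scaled = make_pipeline(StandardScaler(), LinearRegression()).fit(X, price)
scaled_coef = scaled.named_steps["linearregression"].coef_

for name, w_raw, w_std in zip(names, raw.coef_, scaled_coef):
    print(f"{name}: per-unit {w_raw:+.2f}, per-std-dev {w_std:+.2f}")
```

The per-unit weights recover the ground truth approximately, while the per-std-dev weights show which feature moves the price most across its typical range.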
Assumptions
Linear regression isn’t a silver bullet. It has core assumptions:
- Relationship is linear: $y$ does roughly scale linearly with $x$
- Features aren’t strongly collinear (no redundant features)
- Residuals are normally distributed (needed for statistical inference)
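- Homoscedasticity (residuals have constant variance)
The model can still be fitted when these assumptions are violated, but predictions become less accurate and statistical inferences (confidence intervals, p-values) become less trustworthy.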
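A few of these assumptions can be eyeballed with quick, informal checks. The sketch below uses synthetic data in which `rooms` is deliberately correlated with `area`; the split-in-half variance comparison is a rough heuristic chosen for illustration, not a formal statistical test.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
n = 300
area = rng.uniform(40, 200, n)
# Deliberately correlated with area, to illustrate collinearity (assumption)
rooms = np.round(area / 40 + rng.normal(0, 0.5, n))
price = 2.5 * area + 12 * rooms + rng.normal(0, 10, n)

X = np.column_stack([area, rooms])
model = LinearRegression().fit(X, price)
fitted = model.predict(X)
residuals = price - fitted

# Collinearity: pairwise correlation between the two features
corr = np.corrcoef(X, rowvar=False)[0, 1]

# Homoscedasticity (rough check): residual spread for low vs. high fitted values
order = np.argsort(fitted)
low_std = residuals[order[: n // 2]].std()
high_std = residuals[order[n // 2:]].std()

print(f"feature correlation: {corr:.2f}")
print(f"residual std (low half / high half): {low_std:.1f} / {high_std:.1f}")
```

A feature correlation near 1 signals redundant features and unstable weights; very different residual spreads between the two halves hint at non-constant variance.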
What if it’s not linear
Many relationships aren’t straight lines. E.g., price vs area might be quadratic (luxury premium).
Trick: feature engineering. Add $x^2$ (area squared) as a new feature.
df['area_sq'] = df['area'] ** 2
X = df[['area', 'area_sq', 'rooms', ...]] # Now fits curves
The model is still linear (in parameters), but fits nonlinear data—a powerful little trick.
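A self-contained sketch of the same trick on synthetic data, assuming a made-up quadratic ground truth for the "luxury premium". It compares the in-sample R² of a straight-line fit against a fit that also sees the squared area.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(42)
area = rng.uniform(40, 300, 200)
# Synthetic quadratic ground truth (assumption): premium grows with area squared
price = 50 + 1.5 * area + 0.01 * area**2 + rng.normal(0, 15, 200)

X_linear = area.reshape(-1, 1)             # area only
X_quad = np.column_stack([area, area**2])  # area plus the engineered feature

r2_linear = LinearRegression().fit(X_linear, price).score(X_linear, price)
r2_quad = LinearRegression().fit(X_quad, price).score(X_quad, price)
print(f"R² linear: {r2_linear:.4f}, R² with area²: {r2_quad:.4f}")
```

The second model is still ordinary linear regression; only the feature matrix changed.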
Regularization (covered in L2-07)
Add a penalty to prevent overfitting:
| Name | Formula | Effect |
|---|---|---|
| Ridge (L2) | $\lambda \sum_j w_j^2$ | All weights become smaller |
| Lasso (L1) | $\lambda \sum_j \lvert w_j \rvert$ | Some weights become exactly 0 (feature selection) |
| ElasticNet | L1 + L2 mix | Balance |
from sklearn.linear_model import Ridge, Lasso
model = Ridge(alpha=1.0)
model.fit(X_train, y_train)
# Or
model = Lasso(alpha=0.1)
model.fit(X_train, y_train)
print(model.coef_) # See which features got zeroed
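To see the two effects side by side, here is a sketch on synthetic data where, by construction, only the first of five features matters. The `alpha` values are arbitrary illustrative choices rather than tuned settings.

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
# Synthetic target (assumption): only the first feature carries signal
y = 3.0 * X[:, 0] + rng.normal(0, 1.0, 200)

ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=10.0).fit(X, y)
lasso = Lasso(alpha=0.5).fit(X, y)

print("OLS:  ", np.round(ols.coef_, 3))
print("Ridge:", np.round(ridge.coef_, 3))  # every weight shrunk a little
print("Lasso:", np.round(lasso.coef_, 3))  # noise features driven to exactly 0
```

Ridge keeps all five weights but pulls them toward zero, while Lasso drops the four noise features entirely and keeps a shrunken weight on the real one.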
Pros and Cons
✅ Pros:
- Lightning-fast training (milliseconds for small data)
- Extremely interpretable (every weight has clear meaning)
- The first baseline of any ML project
- Mature mathematics, established statistical inference
❌ Cons:
- Limited expressiveness (can’t learn complex nonlinearities)
- Sensitive to outliers
- Many assumptions
Real Business Role
Many think “classical ML is obsolete”—wrong. Linear regression is still one of industry’s most-used algorithms:
- A/B testing: estimating “X feature’s marginal contribution to KPI”—linear regression at the core
- Insurance pricing: premium = base + risk factor × weight—pure linear regression
- Economic models: estimating “policy impact on unemployment”—the foundation of econometrics
- Project baseline: always run linear regression first; if your fancy model isn’t much better, you have a problem
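As an illustration of the A/B testing case, the sketch below regresses a KPI on a 0/1 treatment indicator. The data are synthetic, with an assumed baseline of 20 and a true lift of 1.5. With a single binary feature and an intercept, the fitted weight equals the difference between the two group means.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(7)
n = 1000
treated = rng.integers(0, 2, n)  # 1 = user saw the new feature
# Synthetic KPI (assumption): baseline 20, true lift 1.5, noisy
kpi = 20 + 1.5 * treated + rng.normal(0, 5, n)

model = LinearRegression().fit(treated.reshape(-1, 1), kpi)
lift = model.coef_[0]

# Classic A/B estimate: difference in group means
diff_in_means = kpi[treated == 1].mean() - kpi[treated == 0].mean()
print(f"regression lift: {lift:.3f}, difference in means: {diff_in_means:.3f}")
```

The regression framing pays off once you add covariates (platform, region, prior activity) as extra columns, which a plain difference in means cannot account for.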
A rule of thumb: for most business problems, linear regression plus good feature engineering gets you roughly 80% of the way. If you can’t get that 80%, a fancier model is unlikely to save you.
The last layer (output layer) of most deep neural networks, $y = Wx + b$, is a linear map of exactly this form, typically followed by a softmax when the task is classification.
The logits of an LLM are computed the same way: a matrix maps hidden states to vocabulary-size vectors.
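A toy sketch of that output layer, with tiny made-up sizes and random weights standing in for a trained model. Each row of `W` acts like one linear model scoring one vocabulary entry. The bias is set to zero here purely for simplicity.

```python
import numpy as np

rng = np.random.default_rng(0)
hidden_size, vocab_size = 8, 20  # tiny, made-up sizes

h = rng.normal(size=hidden_size)                # hidden state for one token
W = rng.normal(size=(vocab_size, hidden_size))  # output projection (random stand-in)
b = np.zeros(vocab_size)                        # bias, zero for simplicity

# One linear score per vocabulary entry: the same y = Wx + b form
logits = W @ h + b

# Softmax turns the scores into a probability distribution over the vocabulary
probs = np.exp(logits - logits.max())
probs /= probs.sum()
print(logits.shape, probs.sum())
```

Everything before this layer exists to produce a good `h`; the final step is still a weighted sum plus a bias.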
Linear regression isn’t a “simple algorithm”—it’s the most important “lego brick” of ML.
Next: “Logistic Regression & Classification: From Regression to Decision” — same idea, extended from “predict numbers” to “predict categories”.