L2 Chapter 5 🐣 🕒 11 min

Random Forest + Boosting: From Weak Learners to Superhuman

A single tree is mediocre. A crowd of trees voting is supernatural. The origin story of XGBoost—the Kaggle champion for a decade.

HelloAI Editors

6/28/2026

At the end of L2-04 I said “single trees are weak, ensembles are unbeatable.” This article unpacks that claim.

The core idea of ensemble learning:

Many “mediocre” weak models combined can far exceed a single strong model.

This isn’t unique to ML—human society works this way. Juries (multiple independent judgments) beat a single judge. Expert panels (consensus) beat single experts. The algorithm world has pushed this principle to extremes.

Two Schools

Ensemble learning has two camps with completely different “combination methods”:

School	Chinese	Core Idea	Representative
Bagging	装袋	Train N independent models, vote	Random Forest
Boosting	提升	Train N sequential models, each correcting the previous	XGBoost, LightGBM

Both use decision trees as “base models”—but combine fundamentally differently.

I. Bagging and Random Forest

Bagging Principle

Bagging = Bootstrap Aggregating.

Sample training data with replacement to construct N “slightly different” subsets
Train one tree independently on each subset
At prediction time, let N trees vote (classification) or average (regression)

Key: each tree trains on “slightly different” data, so the trees’ errors are only partly correlated, and the uncorrelated part of those errors tends to cancel out when voting.
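The three steps above can be sketched by hand. This is a minimal illustration, not how sklearn implements bagging internally; the tree count of 25 and the seeds are arbitrary choices made for this example.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

rng = np.random.default_rng(42)
n_trees = 25  # illustrative choice; odd, so a binary vote cannot tie
trees = []
for _ in range(n_trees):
    # Bootstrap: draw len(X_train) row indices WITH replacement
    idx = rng.integers(0, len(X_train), size=len(X_train))
    tree = DecisionTreeClassifier(random_state=int(rng.integers(1_000_000)))
    trees.append(tree.fit(X_train[idx], y_train[idx]))

# Majority vote: labels are 0/1, so a mean vote above 0.5 means class 1 wins
votes = np.stack([t.predict(X_test) for t in trees])  # shape (n_trees, n_test)
bagged_pred = (votes.mean(axis=0) > 0.5).astype(int)

bagged_acc = (bagged_pred == y_test).mean()
single_acc = np.mean([t.score(X_test, y_test) for t in trees])
print(f"Average single bootstrap tree: {single_acc:.3f}")
print(f"Majority vote of {n_trees} trees: {bagged_acc:.3f}")
```

Comparing the two printed numbers shows how much the vote gains over a typical individual tree on this split.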

Random Forest

Random Forest = Bagging + extra randomness:

When training each tree, each node split only considers a random subset of features (not all).

This trick makes each tree more “independent,” improving vote quality.

With sklearn

from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# 100-tree random forest (defaults are good)
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)

print(f"Accuracy: {rf.score(X_test, y_test):.3f}")  # typically 96-97%

L2-04’s single decision tree got 93%—change one line, jump to 97%. The power of ensembles.

Key Hyperparameters

Parameter	Recommended start
`n_estimators`	100-500 (more trees = more stable, but slower)
`max_depth`	None (let trees grow) or 10
`max_features`	'sqrt' (consider √p of the p features at each split)
`min_samples_leaf`	1-5

Random Forest’s Advantages

✅ Works well almost without tuning
✅ Hard to overfit (more trees = stabler votes)
✅ Provides feature importance (inherits decision tree’s strength)
✅ Trees train in parallel

❌ Slow (100 trees vs 1)
❌ Big model (stores 100 trees)
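Two of these points, feature importance and parallel training, are easy to try out. The sketch below reuses the breast cancer data from earlier; showing the top 5 features is an arbitrary choice. Note that `feature_importances_` is impurity-based and can favor features with many distinct values, so treat the ranking as a rough guide.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.2, random_state=42
)

# n_jobs=-1 asks sklearn to train trees on all available CPU cores
rf = RandomForestClassifier(n_estimators=100, random_state=42, n_jobs=-1)
rf.fit(X_train, y_train)

# Impurity-based importances, normalized so they sum to 1
importances = rf.feature_importances_
top5 = np.argsort(importances)[::-1][:5]
for i in top5:
    print(f"{data.feature_names[i]:<25s} {importances[i]:.3f}")
```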

II. Boosting and XGBoost

Boosting Principle

Completely opposite approach. Sequential training:

Train tree 1
See where it failed—boost weight of “wrong” samples
Train tree 2 (with adjusted weights)—it focuses on tree 1’s mistakes
…repeat N times
Final prediction = weighted sum of N trees

Key intuition: each new tree “fills in” the mistakes left by all previous trees.
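The reweighting scheme described above is what AdaBoost does. As a rough sketch, sklearn's `AdaBoostClassifier` can stand in for it; by default it boosts depth-1 trees ("stumps"), and the choice of 100 rounds here is only an illustrative assumption.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Default weak learner is a depth-1 decision tree (a "stump")
ada = AdaBoostClassifier(n_estimators=100, random_state=42)
ada.fit(X_train, y_train)

# staged_score yields test accuracy after 1, 2, ..., N weak learners,
# which shows how later learners patch the mistakes of earlier ones
staged = list(ada.staged_score(X_test, y_test))
print(f"After 1 stump: {staged[0]:.3f}")
print(f"After {len(staged)} stumps: {staged[-1]:.3f}")
```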

Gradient Boosting

An elegant version—express “correcting mistakes” as gradients:

Difference between current prediction and truth = residual
Train a new tree to fit the residual
Add to current prediction

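Sounds strange? Think of gradient descent from L1-03: Gradient Boosting does gradient descent in function space. For squared-error loss, the residual is exactly the negative gradient of the loss with respect to the current prediction, so each added tree is a step along the loss’s “negative gradient direction.”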
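The residual-fitting loop can be written in a few lines for regression with squared-error loss. This is a bare-bones sketch on synthetic data, not XGBoost's actual algorithm; the learning rate, tree depth, round count, and the `predict` helper are all assumptions made for illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic toy data: a noisy sine curve
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(300, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=300)

learning_rate = 0.1
n_rounds = 100

# Start from a constant prediction; the mean minimizes squared error
base = y.mean()
pred = np.full_like(y, base)
trees = []
mse_history = [np.mean((y - pred) ** 2)]

for _ in range(n_rounds):
    residual = y - pred  # negative gradient of 0.5 * (y - pred)**2 w.r.t. pred
    tree = DecisionTreeRegressor(max_depth=2).fit(X, residual)
    pred = pred + learning_rate * tree.predict(X)  # small step toward the truth
    trees.append(tree)
    mse_history.append(np.mean((y - pred) ** 2))

def predict(X_new):
    """Sum the constant start and every tree's scaled correction."""
    out = np.full(len(X_new), base)
    for tree in trees:
        out += learning_rate * tree.predict(X_new)
    return out

print(f"Training MSE before boosting: {mse_history[0]:.4f}")
print(f"Training MSE after {n_rounds} rounds: {mse_history[-1]:.4f}")
```

The `learning_rate` factor is the step size: smaller steps need more rounds, which is the same trade-off the XGBoost hyperparameter of that name controls.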

XGBoost / LightGBM / CatBoost

XGBoost is Gradient Boosting’s “engineering miracle”: Tianqi Chen open-sourced it in 2014, and it swept Kaggle.

Later Microsoft released LightGBM (faster), Yandex released CatBoost (good with categorical features).

Today’s industry trio: XGBoost, LightGBM, CatBoost—their algorithms are based on Gradient Boosting, differing in engineering and details.

Using XGBoost

# Need: pip install xgboost
import xgboost as xgb
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = xgb.XGBClassifier(
    n_estimators=200,
    learning_rate=0.05,
    max_depth=4,
    random_state=42
)
model.fit(X_train, y_train,
          eval_set=[(X_test, y_test)],
          verbose=False)

print(f"Accuracy: {model.score(X_test, y_test):.3f}")  # typically 97-98%

A few lines of configuration give results that are very hard to beat on tabular data like this.

Key Hyperparameters

Parameter	Role	Tuning direction
`n_estimators`	Number of trees	100-1000
`learning_rate`	Each tree’s contribution	0.01-0.1 (smaller needs more trees)
`max_depth`	Single tree depth	3-8 (not as deep as Random Forest)
`subsample`	% samples per tree	0.6-1.0
`colsample_bytree`	% features per tree	0.5-1.0
`reg_lambda`	L2 regularization	0-1

XGBoost tuning is an art, but defaults usually work—don’t start by tuning everything.

Bagging vs Boosting: How to Choose

Dimension	Random Forest	XGBoost
Training speed	Moderate (independent trees can train in parallel)	Slower per tree count (trees are built sequentially; XGBoost parallelizes only the work inside each tree)
Tuning sensitivity	Low	High
Overfit risk	Low	Medium (needs early stopping)
Performance ceiling	Medium-high	Highest
Recommended for	New project baseline	When tuned well, wins Kaggle
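The table mentions early stopping as the usual guard against boosting overfit. The sketch below shows the idea with sklearn's `GradientBoostingClassifier` rather than XGBoost, because its `n_iter_no_change` and `validation_fraction` parameters are stable across versions; the specific values are illustrative assumptions.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

gb = GradientBoostingClassifier(
    n_estimators=1000,        # upper bound, not a target
    learning_rate=0.05,
    max_depth=3,
    validation_fraction=0.2,  # held out from the training data only
    n_iter_no_change=20,      # stop after 20 rounds without improvement
    random_state=42,
)
gb.fit(X_train, y_train)

# n_estimators_ is the number of trees kept after early stopping
print(f"Trees actually built: {gb.n_estimators_} of {gb.n_estimators}")
print(f"Test accuracy: {gb.score(X_test, y_test):.3f}")
```

The validation split is carved out of the training set, so the test set stays untouched until the final score.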

My rules of thumb:

Tight on time → Random Forest (one line, defaults are usually good enough)
Maximizing performance → XGBoost (worth tuning)
Huge data → LightGBM (fastest)

Why Ensembles Are So Powerful

Mathematical intuition:

If N models’ errors are uncorrelated (independent), a majority vote is wrong only when more than half of the models are wrong at the same time, and that probability shrinks quickly as N grows.

Assume a single model is 70% accurate. With 11 independent models voting, the majority is correct about 92% of the time. With 21, about 97%.
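The majority-vote probability is a binomial tail sum, so it can be checked directly. A small sketch, assuming fully independent models and an odd number of voters (so ties cannot occur); `majority_vote_accuracy` is a helper name made up for this example.

```python
from math import comb

def majority_vote_accuracy(n_models: int, p: float) -> float:
    """P(more than half of n independent models are right), for odd n."""
    need = n_models // 2 + 1  # smallest number of correct votes that wins
    return sum(
        comb(n_models, k) * p**k * (1 - p) ** (n_models - k)
        for k in range(need, n_models + 1)
    )

# Each model is right 70% of the time, independently of the others
for n in (1, 11, 21, 101):
    print(f"{n:>3} models: {majority_vote_accuracy(n, 0.7):.3f}")
```

Setting `p` below 0.5 in the same function shows the flip side: a crowd of worse-than-random voters gets more wrong as it grows.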

This is why ensemble models almost always win Kaggle.

But there’s a prerequisite: models must be independent and better than random. If they all err the same way, voting can’t save them.

Reality: models aren’t fully independent, since they use the same data and similar algorithms. So actual gains are less dramatic than the theory suggests, but still significantly better than a single model.

Stacking: The Ultimate Combination

Even more aggressive: use a model to learn “how to combine other models”.

Train N different base models (decision tree, neural net, linear regression, SVM…)
Use a “meta-learner” to learn “how to weight these N outputs”

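# Need: pip install xgboost
import xgboost as xgb
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Base models feed cross-validated predictions to the meta-learner
stack = StackingClassifier(
    estimators=[
        ('rf', RandomForestClassifier(random_state=42)),
        ('xgb', xgb.XGBClassifier(random_state=42)),
        ('svc', SVC(probability=True, random_state=42))
    ],
    final_estimator=LogisticRegression()  # the meta-learner
)
stack.fit(X_train, y_train)

print(f"Accuracy: {stack.score(X_test, y_test):.3f}")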

Top Kaggle competitors use this a lot—but production deployment is hard (many models, slow inference).

Industry Reality

A large share of Kaggle winning solutions are based on the GBDT family (XGBoost/LightGBM/CatBoost), especially for tabular data.

Neural networks on structured tabular data usually can’t beat GBDT—unless data is huge (millions+) and features complex.

This is why ML engineer interviews always ask about XGBoost—it’s the real “production weapon.”

💡 An industrial story

A 2023 recommendation system at a major company went through:

Gen 1: handcrafted features + logistic regression
Gen 2: handcrafted features + GBDT → significant lift
Gen 3: handcrafted features + deep learning → small lift, ops cost exploded
Final choice: handcrafted features + GBDT as main + deep learning for local boosts

Conclusion: in many businesses, GBDT is still the more economic choice. Deep learning isn’t a panacea.

Next: “K-Means Clustering: The Classic Unsupervised Algorithm” — leaving supervised learning, entering the unlabeled world.