HelloAI
L2 Chapter 5

Random Forest + Boosting: From Weak Learners to Superhuman

A single tree is mediocre. A crowd of trees voting is supernatural. The origin story of XGBoost—the Kaggle champion for a decade.

HelloAI Editors
6/28/2026

At the end of L2-04 I said “single trees are weak, ensembles are unbeatable.” This article unpacks that claim.

The core idea of ensemble learning:

Many “mediocre” weak models combined can far exceed a single strong model.

This isn’t unique to ML—human society works this way. Juries (multiple independent judgments) beat a single judge. Expert panels (consensus) beat single experts. The algorithm world has pushed this principle to extremes.

Two Schools

Ensemble learning has two camps with completely different “combination methods”:

| School | Chinese term | Core idea | Representative |
| --- | --- | --- | --- |
| Bagging | 装袋 | Train N independent models, vote | Random Forest |
| Boosting | 提升 | Train N sequential models, each correcting the previous | XGBoost, LightGBM |

Both use decision trees as “base models”—but combine fundamentally differently.

I. Bagging and Random Forest

Bagging Principle

Bagging = Bootstrap Aggregating.

  1. Sample training data with replacement to construct N “slightly different” subsets
  2. Train one tree independently on each subset
  3. At prediction time, let N trees vote (classification) or average (regression)

Key: each tree trains on “slightly different” data, so the trees’ errors are only partly correlated, and the uncorrelated part tends to cancel out when voting.
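The three steps above can be sketched by hand before reaching for a library. This is a minimal illustration, not how scikit-learn implements bagging internally; the tree count of 25 and the variable names are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

rng = np.random.default_rng(0)
n_trees = 25  # illustrative; odd so a binary vote cannot tie
trees = []
for _ in range(n_trees):
    # Step 1: bootstrap sample, same size as the training set, drawn with replacement
    idx = rng.integers(0, len(X_train), size=len(X_train))
    # Step 2: train one tree independently on that sample
    tree = DecisionTreeClassifier(random_state=0).fit(X_train[idx], y_train[idx])
    trees.append(tree)

# Step 3: majority vote over the 0/1 predictions of all trees
all_preds = np.array([t.predict(X_test) for t in trees])  # shape (n_trees, n_test)
bagged_pred = (all_preds.mean(axis=0) > 0.5).astype(int)

single_acc = trees[0].score(X_test, y_test)
bagged_acc = (bagged_pred == y_test).mean()
print(f"one bootstrapped tree: {single_acc:.3f}, vote of {n_trees}: {bagged_acc:.3f}")
```

The vote is usually at least as accurate as a typical individual tree, though on a small test set like this the gap can vary from run to run.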

Random Forest

Random Forest = Bagging + extra randomness:

When training each tree, each node split only considers a random subset of features (not all).

This trick makes each tree more “independent,” improving vote quality.

With sklearn

from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# 100-tree random forest (defaults are good)
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)

print(f"Accuracy: {rf.score(X_test, y_test):.3f}")  # typically 96-97%

L2-04’s single decision tree got 93%—change one line, jump to 97%. The power of ensembles.

Key Hyperparameters

| Parameter | Recommended start |
| --- | --- |
| n_estimators | 100-500 (more trees = more stable, but slower) |
| max_depth | None (let trees grow) or 10 |
| max_features | 'sqrt' (consider √(number of features) features per split) |
| min_samples_leaf | 1-5 |

Random Forest’s Advantages

✅ Works well almost without tuning
✅ Hard to overfit (more trees = stabler votes)
✅ Provides feature importance (inherits decision tree’s strength)
✅ Trees train in parallel

❌ Slow (100 trees vs 1)
❌ Big model (stores 100 trees)
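Two of the advantages listed here, feature importance and parallel training, are directly visible in the scikit-learn API. The sketch below fits on the full dataset purely to rank features; `n_jobs=-1` asks scikit-learn to use all available cores, and showing the top 5 is an arbitrary choice.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

data = load_breast_cancer()

# n_jobs=-1 trains the independent trees in parallel on all cores
rf = RandomForestClassifier(n_estimators=100, n_jobs=-1, random_state=42)
rf.fit(data.data, data.target)

# feature_importances_ holds one impurity-based score per feature; scores sum to 1
ranked = sorted(
    zip(data.feature_names, rf.feature_importances_),
    key=lambda pair: pair[1],
    reverse=True,
)
for name, score in ranked[:5]:
    print(f"{name:25s} {score:.3f}")
```

Impurity-based importances are a quick first look rather than a final verdict; they can be biased toward features with many distinct values.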

II. Boosting and XGBoost

Boosting Principle

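Boosting takes the opposite approach: trees are trained sequentially, each one depending on the last. The classic reweighting version (AdaBoost) works like this: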

  1. Train tree 1
  2. See where it failed—boost weight of “wrong” samples
  3. Train tree 2 (with adjusted weights)—it focuses on tree 1’s mistakes
  4. …repeat N times
  5. Final prediction = weighted sum of N trees

Key intuition: each new tree “fills in” the mistakes left by all previous trees.
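The reweighting loop described above is essentially the classic AdaBoost recipe. Here is a hand-rolled sketch using depth-1 trees (“stumps”) as weak learners. The round count of 50, the -1/+1 relabeling, and the helper name `boosted_predict` are assumptions made for illustration; this is not XGBoost’s algorithm.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
y = 2 * y - 1  # relabel classes as -1 / +1, which the AdaBoost update expects
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

n_rounds = 50  # illustrative choice
w = np.full(len(X_train), 1.0 / len(X_train))  # start with equal sample weights
stumps, alphas = [], []

for _ in range(n_rounds):
    # Train a weak tree on the currently weighted samples
    stump = DecisionTreeClassifier(max_depth=1, random_state=0)
    stump.fit(X_train, y_train, sample_weight=w)
    pred = stump.predict(X_train)

    # Weighted error rate, clipped to avoid division by zero
    err = np.clip(w[pred != y_train].sum(), 1e-10, 1 - 1e-10)
    alpha = 0.5 * np.log((1 - err) / err)  # this tree's weight in the final sum

    # Misclassified samples (y * pred == -1) get heavier, correct ones lighter
    w *= np.exp(-alpha * y_train * pred)
    w /= w.sum()

    stumps.append(stump)
    alphas.append(alpha)

def boosted_predict(X_new):
    """Final prediction = sign of the weighted sum of all stumps."""
    scores = sum(a * s.predict(X_new) for a, s in zip(alphas, stumps))
    return np.where(scores >= 0, 1, -1)

acc = (boosted_predict(X_test) == y_test).mean()
print(f"AdaBoost with {n_rounds} stumps: {acc:.3f}")
```

Each stump alone is a very weak classifier; the accuracy comes from the weighted sum.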

Gradient Boosting

An elegant version—express “correcting mistakes” as gradients:

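  1. Difference between truth and current prediction = residual
  2. Train a new tree to fit the residual
  3. Add the new tree’s output, scaled by the learning rate, to the current prediction

Sounds strange? Think of L1-03 gradient descent: Gradient Boosting does gradient descent in function space. For squared-error loss, the residual is exactly the negative gradient of the loss with respect to the current prediction, so each added tree is a step along the loss’s “negative gradient direction.”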
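To make the residual-fitting loop concrete, here is a minimal gradient boosting sketch for regression with squared-error loss. The synthetic sine data, the learning rate of 0.1, and the tree depth of 2 are all illustrative assumptions; real libraries add regularization, subsampling, and much faster tree construction.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic 1-D regression problem (assumed for illustration)
rng = np.random.default_rng(0)
X = rng.uniform(0, 6, size=(300, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=300)

learning_rate = 0.1
n_trees = 100

pred = np.full_like(y, y.mean())  # initial prediction: a constant
trees = []
mse_history = [np.mean((y - pred) ** 2)]

for _ in range(n_trees):
    residual = y - pred  # negative gradient of 0.5 * (y - pred)^2 w.r.t. pred
    tree = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X, residual)
    pred += learning_rate * tree.predict(X)  # one small step in function space
    trees.append(tree)
    mse_history.append(np.mean((y - pred) ** 2))

print(f"training MSE: {mse_history[0]:.4f} -> {mse_history[-1]:.4f}")
```

Training error falls steadily as trees are added, which is also why boosting eventually overfits if it is never stopped.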

XGBoost / LightGBM / CatBoost

XGBoost is Gradient Boosting’s “engineering miracle”: Tianqi Chen open-sourced it in 2014, and it swept Kaggle.

Later Microsoft released LightGBM (faster), Yandex released CatBoost (good with categorical features).

Today’s industry trio: XGBoost, LightGBM, CatBoost—their algorithms are based on Gradient Boosting, differing in engineering and details.

Using XGBoost

# Need: pip install xgboost
import xgboost as xgb
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = xgb.XGBClassifier(
    n_estimators=200,
    learning_rate=0.05,
    max_depth=4,
    random_state=42
)
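# eval_set only records the evaluation metric here; it does not change training.
# For early stopping, pass a separate validation split instead of the test set.
model.fit(X_train, y_train,
          eval_set=[(X_test, y_test)],
          verbose=False)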

print(f"Accuracy: {model.score(X_test, y_test):.3f}")  # typically 97-98%

A few lines of config get deep-learning-level results.

Key Hyperparameters

| Parameter | Role | Tuning direction |
| --- | --- | --- |
| n_estimators | Number of trees | 100-1000 |
| learning_rate | Each tree’s contribution | 0.01-0.1 (smaller needs more trees) |
| max_depth | Single tree depth | 3-8 (not as deep as Random Forest) |
| subsample | Fraction of samples per tree | 0.6-1.0 |
| colsample_bytree | Fraction of features per tree | 0.5-1.0 |
| reg_lambda | L2 regularization | 0-1 |

XGBoost tuning is an art, but defaults usually work—don’t start by tuning everything.
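One practical way to avoid hand-tuning the number of trees is early stopping: set a generous upper bound and stop once a held-out score stops improving. To stay runnable without extra installs, the sketch below uses scikit-learn’s `GradientBoostingClassifier` as a stand-in, not XGBoost itself. XGBoost has its own early-stopping support, but how it is configured has changed between versions, so check the documentation for the version you have installed. The specific values here are illustrative.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

gb = GradientBoostingClassifier(
    n_estimators=1000,        # upper bound; early stopping usually ends sooner
    learning_rate=0.05,
    max_depth=3,
    subsample=0.8,            # each tree sees a random 80% of the training rows
    validation_fraction=0.2,  # held out from the training set, not the test set
    n_iter_no_change=20,      # stop after 20 rounds without validation improvement
    random_state=42,
)
gb.fit(X_train, y_train)

print(f"trees actually built: {gb.n_estimators_}")
print(f"test accuracy: {gb.score(X_test, y_test):.3f}")
```

Note that the validation data is carved out of the training set, so the test set stays untouched until the final score.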

Bagging vs Boosting: How to Choose

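| Dimension | Random Forest | XGBoost |
| --- | --- | --- |
| Training speed | Slow with many trees, but the trees are independent and train in parallel | Trees must be built sequentially; only the work inside each tree is parallelized |
| Tuning sensitivity | Low | High |
| Overfit risk | Low | Medium (needs early stopping) |
| Performance ceiling | Medium-high | Highest |
| Recommended for | New project baseline | Maximum accuracy once tuned well (wins Kaggle) |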

My rules of thumb:

  • Tight on time → Random Forest (one line, little to lose)
  • Maximizing performance → XGBoost (worth tuning)
  • Huge data → LightGBM (fastest)

Why Ensembles Are So Powerful

Mathematical intuition:

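If N models’ errors are uncorrelated (independent) and each model is right more often than not, the majority vote is wrong only when more than half of the models are wrong at the same time. That probability is a binomial tail, and it shrinks rapidly as N grows.

Assume a single model is 70% accurate. With 11 independent models voting, the probability that the majority is correct is ≈ 92%. With 21 models it is ≈ 97%!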
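The majority-vote probability is a plain binomial sum, so the quoted figures are easy to check. The helper name `majority_correct` is made up for this sketch, and it assumes an odd number of fully independent voters.

```python
from math import comb

def majority_correct(n_models: int, p: float) -> float:
    """Probability that more than half of n independent models are right.

    Assumes n_models is odd (no ties) and every model is correct
    independently with the same probability p.
    """
    k_min = n_models // 2 + 1  # smallest number of correct votes that wins
    return sum(
        comb(n_models, k) * p**k * (1 - p) ** (n_models - k)
        for k in range(k_min, n_models + 1)
    )

for n in (1, 11, 21, 101):
    print(f"{n:3d} models at 70% each -> majority correct: {majority_correct(n, 0.7):.3f}")
```

Setting `p` to exactly 0.5 gives 0.5 for any odd N, and anything below 0.5 makes the vote worse than a single model.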

This is why ensemble models almost always win Kaggle.

But there’s a prerequisite: models must be independent and better than random. If they all err the same way, voting can’t save them.

Reality: models aren’t fully independent. They use the same data and similar algorithms, so actual gains are less dramatic than the theory suggests, but still significantly better than a single model.
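A small simulation shows how quickly correlation eats into the gain. The correlation model here is a deliberately crude assumption: with probability `rho` a model copies one shared outcome, otherwise it is right or wrong on its own. Either way each model stays 70% accurate individually.

```python
import numpy as np

rng = np.random.default_rng(0)

def vote_accuracy(n_models, p, rho, n_trials=20000):
    """Simulated majority-vote accuracy for partially correlated models.

    rho=0 means fully independent models; rho=1 means all models
    make identical predictions. (Toy correlation model, assumed here.)
    """
    shared = rng.random(n_trials) < p            # one shared outcome per trial
    own = rng.random((n_trials, n_models)) < p   # independent outcome per model
    copies = rng.random((n_trials, n_models)) < rho
    correct = np.where(copies, shared[:, None], own)
    return (correct.sum(axis=1) > n_models / 2).mean()

for rho in (0.0, 0.3, 0.6, 1.0):
    print(f"rho={rho:.1f}: vote of 21 models -> {vote_accuracy(21, 0.7, rho):.3f}")
```

At `rho=1.0` the ensemble is no better than one model, which is the “they all err the same way” case.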

Stacking: The Ultimate Combination

Even more aggressive: use a model to learn “how to combine other models”.

  1. Train N different base models (decision tree, neural net, linear regression, SVM…)
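  2. Use a “meta-learner” to learn “how to weight these N outputs”

# Reuses X_train, y_train, X_test, y_test from the XGBoost example above
import xgboost as xgb
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

stack = StackingClassifier(
    estimators=[
        ('rf', RandomForestClassifier(random_state=42)),
        ('xgb', xgb.XGBClassifier(random_state=42)),
        ('svc', SVC(probability=True, random_state=42))
    ],
    # The meta-learner is trained on cross-validated predictions of the base models
    final_estimator=LogisticRegression()
)
stack.fit(X_train, y_train)
print(f"Accuracy: {stack.score(X_test, y_test):.3f}")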

Top Kaggle competitors use this a lot—but production deployment is hard (many models, slow inference).

Industry Reality

70%+ of Kaggle winning solutions are based on GBDT family (XGBoost/LightGBM/CatBoost)—especially for tabular data.

Neural networks on structured tabular data usually can’t beat GBDT—unless data is huge (millions+) and features complex.

This is why ML engineer interviews always ask about XGBoost—it’s the real “production weapon.”

💡 An industrial story

A 2023 recommendation system at a major company went through:

  • Gen 1: handcrafted features + logistic regression
  • Gen 2: handcrafted features + GBDT → significant lift
  • Gen 3: handcrafted features + deep learning → small lift, ops cost exploded
  • Final choice: handcrafted features + GBDT as main + deep learning for local boosts

Conclusion: in many businesses, GBDT is still the more economical choice. Deep learning isn’t a panacea.

Next: “K-Means Clustering: The Classic Unsupervised Algorithm” — leaving supervised learning, entering the unlabeled world.