Random Forest + Boosting: From Weak Learners to Superhuman
A single tree is mediocre. A crowd of trees voting is supernatural. The origin story of XGBoost—the Kaggle champion for a decade.
At L2-04’s end I said “single trees are weak, ensembles are unbeatable”—this article unpacks that.
The core idea of ensemble learning:
Many “mediocre” weak models combined can far exceed a single strong model.
This isn’t unique to ML—human society works this way. Juries (multiple independent judgments) beat a single judge. Expert panels (consensus) beat single experts. The algorithm world has pushed this principle to extremes.
Two Schools
Ensemble learning has two camps with completely different “combination methods”:
| School | Core Idea | Representative |
|---|---|---|
| Bagging | Train N independent models, vote | Random Forest |
| Boosting | Train N sequential models, each correcting the previous | XGBoost, LightGBM |
Both use decision trees as “base models”—but combine fundamentally differently.
I. Bagging and Random Forest
Bagging Principle
Bagging = Bootstrap Aggregating.
- Sample training data with replacement to construct N “slightly different” subsets
- Train one tree independently on each subset
- At prediction time, let N trees vote (classification) or average (regression)
Key: each tree trains on “slightly different” data, so the trees make partly different mistakes. Errors that are not shared tend to cancel out when voting.
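The three steps above can be sketched by hand. This is a minimal illustration, assuming the same breast cancer dataset used elsewhere in this article; the tree count of 25 and the variable names are illustrative choices, not part of any library recipe.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

rng = np.random.default_rng(42)
n_trees = 25  # illustrative; odd, so a 0/1 vote cannot tie
trees = []
for _ in range(n_trees):
    # Bootstrap: draw len(X_train) row indices with replacement
    idx = rng.integers(0, len(X_train), size=len(X_train))
    tree = DecisionTreeClassifier(random_state=int(rng.integers(1_000_000)))
    trees.append(tree.fit(X_train[idx], y_train[idx]))

# Majority vote: labels are 0/1, so the mean vote is the share voting 1
votes = np.stack([t.predict(X_test) for t in trees])  # (n_trees, n_test)
bagged_pred = (votes.mean(axis=0) > 0.5).astype(int)
bagged_acc = (bagged_pred == y_test).mean()
print(f"Bagged accuracy: {bagged_acc:.3f}")
```

In practice `sklearn.ensemble.BaggingClassifier` or `RandomForestClassifier` does this for you; the loop is only meant to show that there is no magic inside.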
Random Forest
Random Forest = Bagging + extra randomness:
When training each tree, each node split only considers a random subset of features (not all).
This trick makes each tree more “independent,” improving vote quality.
With sklearn
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# 100-tree random forest (defaults are good)
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)
print(f"Accuracy: {rf.score(X_test, y_test):.3f}") # typically 96-97%
L2-04’s single decision tree got 93%—change one line, jump to 97%. The power of ensembles.
Key Hyperparameters
| Parameter | Recommended start |
|---|---|
| n_estimators | 100-500 (more trees = more stable, but slower) |
| max_depth | None (let trees grow) or 10 |
| max_features | 'sqrt' (each split considers √(number of features) candidate features) |
| min_samples_leaf | 1-5 |
Random Forest’s Advantages
- ✅ Works well almost without tuning
- ✅ Hard to overfit (more trees = stabler votes)
- ✅ Provides feature importance (inherits decision tree’s strength)
- ✅ Trees train in parallel
- ❌ Slow (100 trees vs 1)
- ❌ Big model (stores 100 trees)
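To make the feature-importance point concrete, here is a small sketch that ranks features with the forest's `feature_importances_` attribute. It fits on the full dataset purely for illustration, and the choice to print the top five is arbitrary.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

data = load_breast_cancer()
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(data.data, data.target)

# feature_importances_ is impurity-based and sums to 1 across features
ranked = sorted(
    zip(data.feature_names, rf.feature_importances_),
    key=lambda pair: pair[1],
    reverse=True,
)
for name, score in ranked[:5]:
    print(f"{name:<25s} {score:.3f}")
```

Impurity-based importances can favor features with many distinct values, so treat the ranking as a first look rather than a verdict.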
II. Boosting and XGBoost
Boosting Principle
The opposite approach: train the trees sequentially. In the classic (AdaBoost-style) formulation:
- Train tree 1
- See where it failed—boost weight of “wrong” samples
- Train tree 2 (with adjusted weights)—it focuses on tree 1’s mistakes
- …repeat N times
- Final prediction = weighted sum of N trees
Key intuition: each new tree “fills in” the mistakes left by all previous trees.
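The reweighting loop described above is what scikit-learn's `AdaBoostClassifier` implements. A minimal sketch, assuming the same dataset and split as the Random Forest example; the 100-stump setting is an illustrative choice.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Default base learner is a depth-1 tree (a "stump"), a deliberately weak model
ada = AdaBoostClassifier(n_estimators=100, random_state=42)
ada.fit(X_train, y_train)

# staged_score yields test accuracy after 1, 2, ..., n stumps
staged = list(ada.staged_score(X_test, y_test))
print(f"1 stump:    {staged[0]:.3f}")
print(f"{len(staged)} stumps: {staged[-1]:.3f}")
```

Comparing the first and last staged scores shows how much the sequence of weak learners adds over a single stump.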
Gradient Boosting
An elegant version—express “correcting mistakes” as gradients:
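- Difference between truth and current prediction = residual (for squared-error loss, this is exactly the negative gradient)
- Train a new tree to fit the residual
- Add the new tree’s output, scaled by a learning rate, to the current prediction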
Sounds strange? Think of L1-03 gradient descent—Gradient Boosting does gradient descent in function space. Each added tree is a step along the loss’s “negative gradient direction.”
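The residual-fitting loop is short enough to write out. The sketch below is a toy version for regression with squared-error loss, on synthetic data; the learning rate, tree depth, and round count are illustrative assumptions, and real libraries add regularization and many optimizations on top.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(300, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=300)  # synthetic toy data

learning_rate = 0.1  # illustrative values
n_rounds = 100

pred = np.full_like(y, y.mean())  # round 0: predict the mean everywhere
mse_start = np.mean((y - pred) ** 2)
trees = []
for _ in range(n_rounds):
    residual = y - pred  # negative gradient of 0.5 * (y - pred)**2
    tree = DecisionTreeRegressor(max_depth=2).fit(X, residual)
    pred += learning_rate * tree.predict(X)  # small step in function space
    trees.append(tree)

mse_end = np.mean((y - pred) ** 2)
print(f"Training MSE: {mse_start:.4f} -> {mse_end:.4f}")
```

Each pass of the loop is one gradient descent step: the tree approximates the negative gradient, and `learning_rate` is the step size.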
XGBoost / LightGBM / CatBoost
XGBoost is Gradient Boosting’s “engineering miracle”: Tianqi Chen open-sourced it in 2014, and it swept Kaggle.
Later Microsoft released LightGBM (faster), Yandex released CatBoost (good with categorical features).
Today’s industry trio: XGBoost, LightGBM, CatBoost—their algorithms are based on Gradient Boosting, differing in engineering and details.
Using XGBoost
# Need: pip install xgboost
import xgboost as xgb
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = xgb.XGBClassifier(
    n_estimators=200,
    learning_rate=0.05,
    max_depth=4,
    random_state=42
)
model.fit(X_train, y_train,
          eval_set=[(X_test, y_test)],
          verbose=False)
print(f"Accuracy: {model.score(X_test, y_test):.3f}") # typically 97-98%
A few lines of config get results on par with much heavier models on tabular data like this.
Key Hyperparameters
| Parameter | Role | Tuning direction |
|---|---|---|
| n_estimators | Number of trees | 100-1000 |
| learning_rate | Each tree’s contribution | 0.01-0.1 (smaller needs more trees) |
| max_depth | Single tree depth | 3-8 (not as deep as Random Forest) |
| subsample | Fraction of rows sampled per tree | 0.6-1.0 |
| colsample_bytree | Fraction of features sampled per tree | 0.5-1.0 |
| reg_lambda | L2 regularization | 0-1 (default is 1) |
XGBoost tuning is an art, but defaults usually work—don’t start by tuning everything.
Bagging vs Boosting: How to Choose
| Dimension | Random Forest | XGBoost |
|---|---|---|
| Training speed | Moderate (many trees, but they are independent and train in parallel) | Trees are built one after another, though the work inside each tree is parallelized |
| Tuning sensitivity | Low | High |
| Overfit risk | Low | Medium (needs early stopping) |
| Performance ceiling | Medium-high | Highest |
| Recommended for | New project baseline | Squeezing out maximum accuracy (when tuned well, wins Kaggle) |
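The "needs early stopping" entry in the table deserves a concrete picture. The sketch below uses scikit-learn's `GradientBoostingClassifier` as a stand-in, because its early-stopping parameters are stable across versions; XGBoost exposes a comparable `early_stopping_rounds` option whose placement has changed between releases. All numeric settings here are illustrative assumptions.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

gb = GradientBoostingClassifier(
    n_estimators=1000,        # upper bound; early stopping usually ends sooner
    learning_rate=0.05,
    max_depth=3,
    validation_fraction=0.2,  # held out from the training split, not the test set
    n_iter_no_change=20,      # stop after 20 rounds without validation improvement
    random_state=42,
)
gb.fit(X_train, y_train)

print(f"Trees actually built: {gb.n_estimators_} of {gb.n_estimators}")
print(f"Test accuracy: {gb.score(X_test, y_test):.3f}")
```

The validation slice is carved out of the training data, so the test set stays untouched for the final score.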
My rules of thumb:
- Tight on time → Random Forest (one line of code, little to lose)
- Maximizing performance → XGBoost (worth tuning)
- Huge data → LightGBM (usually the fastest of the three)
Why Ensembles Are So Powerful
Mathematical intuition:
If N models’ errors are uncorrelated (independent), a majority vote is wrong only when more than half of the models err at the same time. The probability of that is a binomial tail, and it shrinks quickly as N grows.
Assume each model is 70% accurate. With 11 independent models voting, the majority is correct about 92% of the time! With 21, about 97%!
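The majority-vote probability can be checked directly with a binomial sum. This is a small sketch under the idealized assumption that every model is right independently with the same probability `p`; the function name `majority_correct` is made up for this example.

```python
from math import comb

def majority_correct(n_models: int, p: float) -> float:
    """P(more than half of n independent voters are right), for odd n."""
    need = n_models // 2 + 1  # smallest winning number of correct votes
    return sum(
        comb(n_models, k) * p**k * (1 - p) ** (n_models - k)
        for k in range(need, n_models + 1)
    )

for n in (1, 11, 21, 101):
    print(n, round(majority_correct(n, 0.7), 3))
```

Try setting `p` below 0.5: the same formula then shows the vote getting worse as models are added, which is the "better than random" prerequisite in action.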
This is why ensemble models almost always win Kaggle.
But there’s a prerequisite: models must be independent and better than random. If they all err the same way, voting can’t save them.
Reality: models aren’t fully independent, because they use the same data and similar algorithms. So actual gains aren’t as dramatic as the theory suggests, but an ensemble is still significantly better than a single model.
Stacking: The Ultimate Combination
Even more aggressive: use a model to learn “how to combine other models”.
- Train N different base models (decision tree, neural net, linear regression, SVM…)
- Use a “meta-learner” to learn “how to weight these N outputs”
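import xgboost as xgb  # pip install xgboost
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Same data and split as the earlier examples
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

stack = StackingClassifier(
    estimators=[
        ('rf', RandomForestClassifier()),
        ('xgb', xgb.XGBClassifier()),
        ('svc', SVC(probability=True))
    ],
    # Meta-learner: learns how to weight the base models' predictions
    final_estimator=LogisticRegression()
)
stack.fit(X_train, y_train)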
Top Kaggle competitors use this a lot—but production deployment is hard (many models, slow inference).
Industry Reality
70%+ of Kaggle winning solutions are based on GBDT family (XGBoost/LightGBM/CatBoost)—especially for tabular data.
Neural networks on structured tabular data usually can’t beat GBDT—unless data is huge (millions+) and features complex.
This is why ML engineer interviews always ask about XGBoost—it’s the real “production weapon.”
A 2023 recommendation system at a major company went through:
- Gen 1: handcrafted features + logistic regression
- Gen 2: handcrafted features + GBDT → significant lift
- Gen 3: handcrafted features + deep learning → small lift, ops cost exploded
- Final choice: handcrafted features + GBDT as main + deep learning for local boosts
Conclusion: in many businesses, GBDT is still the more economical choice. Deep learning isn’t a panacea.
Next: “K-Means Clustering: The Classic Unsupervised Algorithm” — leaving supervised learning, entering the unlabeled world.