Decision Trees: The Most Interpretable ML Algorithm
A chain of yes/no questions forms a tree of decisions. It's the foundation for XGBoost and LightGBM—the champions of Kaggle.
Logistic regression is strong, but has one limitation: it can only learn linear decision boundaries.
If your data is inherently nonlinear—e.g., “high risk when age < 18 or > 60, low in middle”—logistic regression struggles.
Decision trees solve such problems beautifully—they make decisions via a chain of “if-else” branches. The most human-like ML algorithm.
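To make the age example concrete, here is a minimal sketch on synthetic data. The data, the 18/60 cutoffs as a labeling rule, and the variable names are assumptions made up for illustration. Logistic regression can only place one threshold on age, while a depth-2 tree can carve out the middle band.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

# Synthetic, hypothetical data: one feature (age), label 1 = high risk
rng = np.random.default_rng(0)
age = rng.uniform(0, 90, size=2000).reshape(-1, 1)
risk = ((age[:, 0] < 18) | (age[:, 0] > 60)).astype(int)

# A linear model in age can only say "risk rises (or falls) with age"
logreg = LogisticRegression(max_iter=1000).fit(age, risk)

# Two levels of if-else are enough to isolate the low-risk middle band
age_tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(age, risk)

logreg_acc = logreg.score(age, risk)
tree_acc = age_tree.score(age, risk)
print(f"Logistic regression accuracy: {logreg_acc:.3f}")
print(f"Depth-2 tree accuracy:        {tree_acc:.3f}")
```

The scores are training accuracy on purpose: the point is what each model is able to represent, not how well it generalizes.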
An Example
Task: predict whether someone will buy a phone.
A human sales rep might think:
Budget > $700?
├── Yes: Age < 35?
│   ├── Yes → likely buyer (high-spending young person)
│   └── No: Has it been 2+ years since last upgrade?
│       ├── Yes → moderate chance
│       └── No → likely won't
└── No: Gender?
    ├── Male → check if gamer
    ├── Female → check if photography hobbyist
    └── ...
This is a decision tree—a tree of decisions from root to leaves.
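Written as code, the sales rep's reasoning is just nested if-else. The sketch below is a hand-written illustration, not a trained model. The function name and argument names are hypothetical, and the low-budget branch returns a placeholder because the example leaves those questions open.

```python
def phone_purchase_guess(budget_usd: float, age: int, years_since_upgrade: float) -> str:
    """Walk the hand-written tree from root to leaf and return the leaf's label."""
    if budget_usd > 700:                 # root node
        if age < 35:                     # internal node
            return "likely buyer"        # leaf
        if years_since_upgrade >= 2:     # internal node
            return "moderate chance"     # leaf
        return "likely won't"            # leaf
    # The low-budget branch continues with more questions that the example leaves open
    return "needs more questions"


print(phone_purchase_guess(budget_usd=900, age=28, years_since_upgrade=1))
print(phone_purchase_guess(budget_usd=900, age=50, years_since_upgrade=3))
```

A trained tree has the same shape. The difference is that the algorithm picks the questions and thresholds from data instead of a person writing them down.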
Tree Structure
| Element | Meaning |
|---|---|
| Root node | First decision |
| Internal nodes | Each intermediate decision |
| Leaf nodes | Final predictions |
| Branches | “Yes/No” paths |
Each internal node asks a question about one feature: “is this feature > some threshold?”
How to “Train” a Tree
Intuition: at each node, find the feature + threshold that best “splits” the two classes.
Two common metrics for “how well does it split”:
1. Gini Impurity
$$Gini = 1 - \sum_{i} p_i^2$$
$p_i$ is the proportion of class $i$ in that node.
- All same class (pure): $Gini = 0$
- Classes evenly mixed (most impure): $Gini$ is at its max ($0.5$ for two classes)
2. Entropy
We covered this in L1-05. Higher entropy = less pure.
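Both metrics take only a few lines to compute. This is a minimal sketch assuming the standard definitions, Gini $= 1 - \sum_i p_i^2$ and entropy $= -\sum_i p_i \log_2 p_i$. The base-2 logarithm and the helper names are assumptions for illustration.

```python
import numpy as np


def class_proportions(labels):
    """Proportion p_i of each class present in a node."""
    _, counts = np.unique(labels, return_counts=True)
    return counts / counts.sum()


def gini(labels):
    p = class_proportions(labels)
    return 1.0 - np.sum(p ** 2)


def entropy(labels):
    p = class_proportions(labels)  # only classes that occur, so log2 is always defined
    return -np.sum(p * np.log2(p))


pure_node = ["buy", "buy", "buy", "buy"]
mixed_node = ["buy", "buy", "skip", "skip"]

print("pure  node:", gini(pure_node), entropy(pure_node))
print("mixed node:", gini(mixed_node), entropy(mixed_node))
```

Both functions agree on the ordering: the pure node scores zero and the evenly mixed node scores the maximum for two classes.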
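During training: at each node, iterate over all features and possible thresholds, and pick the split that minimizes the weighted average impurity of the two child nodes (each child weighted by its share of the samples). Then repeat the same search inside each child.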
This is a greedy algorithm—doesn’t guarantee global optimum, but works well in practice.
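The search at a single node can be sketched as a brute-force double loop. This is a simplified illustration using Gini impurity and midpoints between distinct values as candidate thresholds. Real libraries use far more optimized code, and the function names and toy data here are made up.

```python
import numpy as np


def gini(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)


def best_split(X, y):
    """Return (feature_index, threshold, weighted_child_gini) for the best single split."""
    n_samples, n_features = X.shape
    best = (None, None, gini(y))  # a split must beat the parent's impurity
    for j in range(n_features):
        values = np.unique(X[:, j])  # sorted distinct values of feature j
        # Candidate thresholds: midpoints between consecutive distinct values
        for t in (values[:-1] + values[1:]) / 2:
            left = X[:, j] <= t
            right = ~left
            weighted = (left.sum() * gini(y[left]) + right.sum() * gini(y[right])) / n_samples
            if weighted < best[2]:
                best = (j, t, weighted)
    return best


# Toy node: feature 0 separates the classes cleanly, feature 1 does not
X_toy = np.array([[1.0, 5.0], [2.0, 1.0], [3.0, 4.0], [4.0, 2.0]])
y_toy = np.array([0, 0, 1, 1])

feature, threshold, impurity = best_split(X_toy, y_toy)
print(feature, threshold, impurity)  # → 0 2.5 0.0
```

Growing a full tree means calling this search recursively on each child until a stopping rule (depth, minimum samples, or purity) kicks in. The greediness is visible here: each call only looks one split ahead.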
Run with sklearn
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
# Classic iris dataset
X, y = load_iris(return_X_y=True)
# Fix random_state so the split (and the reported accuracy) is reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Train
tree = DecisionTreeClassifier(max_depth=3, random_state=42)
tree.fit(X_train, y_train)
# Evaluate
print(f"Accuracy: {tree.score(X_test, y_test):.3f}")
# Visualize the tree
plt.figure(figsize=(15, 8))
plot_tree(tree, filled=True,
          feature_names=['sepal_len', 'sepal_wid', 'petal_len', 'petal_wid'],
          class_names=['Setosa', 'Versicolor', 'Virginica'])
plt.savefig('tree.png')
The output is a visualizable tree: you can directly read how the model makes decisions, a level of interpretability few other algorithms offer.
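If an image is inconvenient (for example in a terminal or a log file), scikit-learn can also print the learned rules as indented text with `export_text`. A small sketch on the full iris data, with `max_depth=2` chosen only to keep the printout short:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
clf = DecisionTreeClassifier(max_depth=2, random_state=42).fit(iris.data, iris.target)

# One line per node: the tested feature and threshold, then the predicted class at each leaf
rules = export_text(clf, feature_names=list(iris.feature_names))
print(rules)
```

Each indentation level in the printout is one step deeper in the tree, so a prediction can be traced by hand.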
A Few Important Hyperparameters
| Parameter | Role |
|---|---|
| max_depth | Max tree depth (most important to prevent overfitting) |
| min_samples_split | Minimum samples a node needs to be split |
| min_samples_leaf | Minimum samples in a leaf |
| criterion | 'gini' or 'entropy' |
Rule of thumb: try max_depth=5 first; lower to 3 if overfitting, raise to 10 if underfitting.
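One way to apply this rule of thumb is to sweep a few depths and compare training accuracy with held-out accuracy. The sketch below uses the breast cancer dataset, and the depth values and split are arbitrary choices. A large gap between the two numbers suggests overfitting, while two low numbers suggest underfitting.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_val, y_tr, y_val = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

depth_scores = {}
for depth in (1, 3, 5, 10, None):  # None lets the tree grow until leaves are pure
    model = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_tr, y_tr)
    depth_scores[depth] = (model.score(X_tr, y_tr), model.score(X_val, y_val))
    train_acc, val_acc = depth_scores[depth]
    print(f"max_depth={str(depth):>4}  train={train_acc:.3f}  validation={val_acc:.3f}")
```

In practice the same sweep is usually done with cross-validation (for example `GridSearchCV`) rather than a single split.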
Decision Trees’ Killer Feature: Interpretability
tree.feature_importances_ gives you each feature’s importance (the total impurity reduction from all splits on that feature, normalized so the importances sum to 1):
import pandas as pd
feature_names = ['sepal_len','sepal_wid','petal_len','petal_wid']
importances = pd.Series(tree.feature_importances_, index=feature_names)
print(importances.sort_values(ascending=False))
# Example output (exact values depend on the train/test split):
# petal_len 0.91
# petal_wid 0.07
# sepal_len 0.02
# sepal_wid 0.00
The model tells you: petal length determines almost everything.
This interpretability is required in scenarios like:
- Bank credit models (regulators demand “why was the loan denied”)
- Medical diagnosis assistance (doctors must see reasoning)
- Judicial sentencing aids (must be legally explainable)
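Real fact: EU GDPR Article 22 gives individuals the right not to be subject to decisions based solely on automated processing when those decisions have legal or similarly significant effects, with limited exceptions, and the regulation's transparency provisions call for meaningful information about the logic involved. Rules like this keep demand for explainable models such as decision trees (and derivatives) strong in finance and medicine.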
What Trees Handle Natively
✅ Natural strengths:
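- Mixed numeric and categorical features, in principle (scikit-learn still needs categories encoded as numbers, though a simple ordinal encoding is often enough; LightGBM and CatBoost accept categorical columns directly)
- Different feature scales (no standardization needed)
- Nonlinear relationships
- Missing values (scikit-learn's trees accept NaN from version 1.3; older versions need imputation first)
- Multi-class classification (no one-vs-rest needed)
This makes decision trees especially light on preprocessing: in most cases you can fit almost directly.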
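The scale point follows from how splits work: a node only asks whether a value is above or below a threshold, so only the ordering of values within a feature matters. A small sketch checks this on iris, with arbitrary rescaling factors picked for illustration. A tree trained on rescaled features is expected to make the same predictions as one trained on the raw features.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Pretend each feature was recorded in a different unit (arbitrary factors)
X_rescaled = X * np.array([10.0, 0.1, 100.0, 2.0])

tree_raw = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
tree_rescaled = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_rescaled, y)

# Fraction of samples on which the two trees agree
agreement = np.mean(tree_raw.predict(X) == tree_rescaled.predict(X_rescaled))
print(f"Prediction agreement: {agreement:.3f}")
```

Distance-based or gradient-based models (k-NN, SVM, neural networks) generally do not share this property, which is why they usually need standardization.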
Decision Trees’ Fatal Weaknesses
❌ Overfitting: A deep single tree will “memorize” the training set, fitting every bit of noise in it.
❌ Instability: Change the training data slightly and the tree structure can come out completely different.
❌ Hard to learn diagonal decision boundaries: Each split tests one feature against a threshold, so the tree can only make “vertical” or “horizontal” cuts; approximating a diagonal takes a staircase of many splits.
❌ Limited single-tree performance: On standard benchmarks, a single tree often loses to logistic regression or an SVM.
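The diagonal-boundary weakness is easy to reproduce. The sketch below builds synthetic data (an assumption for illustration) where the true rule is simply "feature 0 is larger than feature 1", then compares a shallow tree, a deep tree, and logistic regression on held-out points.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data: the true boundary is the diagonal x0 = x1
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(2000, 2))
y = (X[:, 0] > X[:, 1]).astype(int)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

candidates = {
    "tree, max_depth=2": DecisionTreeClassifier(max_depth=2, random_state=0),
    "tree, max_depth=10": DecisionTreeClassifier(max_depth=10, random_state=0),
    "logistic regression": LogisticRegression(),
}

diag_scores = {}
for name, model in candidates.items():
    diag_scores[name] = model.fit(X_tr, y_tr).score(X_te, y_te)
    print(f"{name:20s} test accuracy: {diag_scores[name]:.3f}")
```

The deeper tree gets closer by stacking many small axis-aligned steps along the diagonal, while the linear model captures the boundary with a single line.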
But… Ensembles Are Invincible
The next story is great—
A single tree is weak, but letting many trees vote is overwhelmingly strong:
| Algorithm | Idea |
|---|---|
| Random Forest | Train N independent trees, vote |
| GBDT / XGBoost / LightGBM | Tree 1 learns part; tree 2 learns tree 1’s mistakes; tree 3 learns tree 1+2’s mistakes; sum |
These two paradigms (Bagging and Boosting) are explained in detail in L2-05. XGBoost / LightGBM have dominated Kaggle's tabular competitions for roughly the last decade, often beating even deep learning there.
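As a quick preview, the two paradigms can be tried without leaving scikit-learn. In this sketch `RandomForestClassifier` stands in for bagging and `GradientBoostingClassifier` stands in for the boosting family. It is scikit-learn's own implementation, not XGBoost or LightGBM, and is used here only to avoid an extra dependency.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

models = {
    "single tree": DecisionTreeClassifier(random_state=0),
    "random forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "gradient boosting": GradientBoostingClassifier(random_state=0),
}

ensemble_cv = {}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5)  # 5-fold cross-validated accuracy
    ensemble_cv[name] = scores.mean()
    print(f"{name:18s} CV accuracy: {scores.mean():.3f} ± {scores.std():.3f}")
```

All three models use default hyperparameters, so this is a rough comparison rather than a tuned benchmark.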
In many business cases, XGBoost still beats neural networks, especially on structured tabular data (Excel-style):
- User features + behavior
- Financial data
- Medical features
- Ranking in recommender systems
If your data is tabular + thousands to millions of samples, try XGBoost first, neural networks second.
A Complete Example
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import classification_report
# Real data: breast cancer classification
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Train
tree = DecisionTreeClassifier(max_depth=4, random_state=42)
tree.fit(X_train, y_train)
# Cross-validation (more stable evaluation)
cv_scores = cross_val_score(tree, X_train, y_train, cv=5)
print(f"CV accuracy: {cv_scores.mean():.3f} ± {cv_scores.std():.3f}")
# Test set report
y_pred = tree.predict(X_test)
print(classification_report(y_test, y_pred))
Typically reaches ~93% accuracy—single tree, 5 lines of code.
Next article: we push it to 97% with Random Forest.
Next: “Random Forest + Boosting: Turn Weak Learners Into Supermen”