L2 Chapter 4 🐣 🕒 10 min

Decision Trees: The Most Interpretable ML Algorithm

A chain of yes/no questions forms a tree of decisions. It's the foundation for XGBoost and LightGBM—the champions of Kaggle.

HelloAI Editors

6/27/2026

Logistic regression is strong, but it has one limitation: on the raw features it can only learn a linear decision boundary.

If your data is inherently nonlinear (e.g., “high risk when age < 18 or > 60, low in the middle”), logistic regression struggles.

Decision trees handle such problems naturally: they make decisions via a chain of “if-else” branches, which makes them arguably the most human-like ML algorithm.
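To make the age example concrete, here is a minimal sketch on synthetic data. The dataset, the variable names, and the exact age range are assumptions made up for illustration; only the "high risk below 18 or above 60" rule comes from the text.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

# Synthetic data (an assumption for illustration): one feature, age in years.
rng = np.random.default_rng(0)
age = rng.integers(1, 90, size=1000).reshape(-1, 1)

# Label is 1 ("high risk") when age < 18 or age > 60, otherwise 0.
high_risk = ((age < 18) | (age > 60)).ravel().astype(int)

# Logistic regression can only place a single cut along the age axis.
logreg = LogisticRegression().fit(age, high_risk)

# A depth-2 tree can place two cuts, one at each risk boundary.
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(age, high_risk)

logreg_acc = logreg.score(age, high_risk)
tree_acc = tree.score(age, high_risk)
print(f"Logistic regression accuracy: {logreg_acc:.3f}")
print(f"Decision tree accuracy:       {tree_acc:.3f}")
```

The tree needs only two thresholds to separate the three age bands, while a single linear cut always misclassifies one of the two high-risk groups.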

An Example

Task: predict whether someone will buy a phone.

A human sales rep might think:

Budget > $700?
├── Yes: Age < 35?
│       ├── Yes → likely buyer (high-spending young person)
│       └── No: Has it been 2+ years since last upgrade?
│               ├── Yes → moderate chance
│               └── No → likely won't
└── No: Gender?
        ├── Male → check if gamer
        ├── Female → check if photography hobbyist
        └── ...

This is a decision tree—a tree of decisions from root to leaves.
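Written as code, the sales rep's reasoning is just nested `if` statements. The sketch below covers only the "Budget > $700" branch; the function name, argument names, and return strings are hypothetical, and the lower-budget branch is left undecided because the example above leaves it open.

```python
def likely_to_buy(budget_usd: float, age: int, years_since_upgrade: float) -> str:
    """Hand-written version of the 'Budget > $700' branch of the example tree."""
    if budget_usd > 700:
        if age < 35:
            return "likely buyer"
        if years_since_upgrade >= 2:
            return "moderate chance"
        return "likely won't"
    # The example leaves the lower-budget branch open-ended, so it stays undecided here.
    return "undecided"


print(likely_to_buy(budget_usd=900, age=28, years_since_upgrade=1))
print(likely_to_buy(budget_usd=900, age=50, years_since_upgrade=3))
print(likely_to_buy(budget_usd=900, age=50, years_since_upgrade=1))
```

A trained decision tree has the same shape; the difference is that the questions and thresholds are chosen from data rather than written by hand.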

Tree Structure

Element	Meaning
Root node	First decision
Internal nodes	Each intermediate decision
Leaf nodes	Final predictions
Branches	“Yes/No” paths

Each internal node asks a question about one feature: “is this feature > some threshold?”

How to “Train” a Tree

Intuition: at each node, find the feature + threshold that best separates the classes.

Two common metrics for “how well does it split”:

1. Gini Impurity

$$G = 1 - \sum_k p_k^2$$

$p_k$ is the proportion of class $k$ in that node.

All same class (pure): $G = 0$
Classes evenly mixed (most impure): $G$ reaches its maximum, $1 - 1/K$ for $K$ classes (0.5 for two classes)
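A small sketch of the Gini formula in plain Python may help; the function name `gini` and the toy label lists are illustrative choices, not part of any library.

```python
from collections import Counter


def gini(labels):
    """Gini impurity G = 1 - sum_k p_k^2 for a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


pure = gini(["a", "a", "a", "a"])         # one class only
two_mixed = gini(["a", "a", "b", "b"])    # two classes, evenly mixed
three_mixed = gini(["a", "b", "c"])       # three classes, evenly mixed
print(pure, two_mixed, three_mixed)
```

The pure node scores 0, the evenly mixed two-class node scores 0.5, and the evenly mixed three-class node scores 2/3.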

2. Entropy

$$H = -\sum_k p_k \log p_k$$

We covered this in L1-05. Higher entropy = less pure.
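The entropy formula can be sketched the same way. The formula above leaves the base of the logarithm open; this sketch assumes base 2, so the result is in bits. The function name `entropy` is illustrative.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy H = -sum_k p_k log2(p_k), written as sum_k p_k log2(1 / p_k)."""
    n = len(labels)
    if n == 0:
        return 0.0
    proportions = [count / n for count in Counter(labels).values()]
    return sum(p * math.log2(1 / p) for p in proportions)


pure_h = entropy(["a", "a", "a", "a"])    # one class only
mixed_h = entropy(["a", "a", "b", "b"])   # two classes, evenly mixed
print(pure_h, mixed_h)
```

As with Gini, a pure node scores 0 and an evenly mixed two-class node hits the maximum, which is 1 bit in base 2.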

During training: at each node, iterate over all features and candidate thresholds, and pick the split that minimizes the size-weighted average impurity of the two child nodes.

This is a greedy algorithm: it doesn’t guarantee a globally optimal tree, but it works well in practice.
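The split search for a single node can be sketched as a brute-force loop. This is a simplified illustration, not sklearn's actual implementation: the helper names `gini` and `best_split`, the midpoint thresholds, and the four-row toy dataset are all assumptions.

```python
import numpy as np


def gini(y):
    """Gini impurity of a 1-D array of class labels."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)


def best_split(X, y):
    """Search every (feature, threshold) pair for the lowest weighted child Gini."""
    n_samples, n_features = X.shape
    best = (None, None, gini(y))  # (feature index, threshold, weighted impurity)
    for j in range(n_features):
        values = np.unique(X[:, j])
        # Candidate thresholds: midpoints between consecutive distinct values.
        for t in (values[:-1] + values[1:]) / 2:
            left = y[X[:, j] <= t]
            right = y[X[:, j] > t]
            score = (len(left) * gini(left) + len(right) * gini(right)) / n_samples
            if score < best[2]:
                best = (j, t, score)
    return best


# Toy data: feature 0 separates the classes, feature 1 does not.
X = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 10.0], [4.0, 20.0]])
y = np.array([0, 0, 1, 1])
feature, threshold, impurity = best_split(X, y)
print(feature, threshold, impurity)
```

Building a full tree means applying this search recursively to each child until a stopping rule such as a depth limit is hit, which is why the overall procedure is greedy.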

Run with sklearn

from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt

# Classic iris dataset
X, y = load_iris(return_X_y=True)
# Fix random_state so the split (and the reported numbers) are reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train
tree = DecisionTreeClassifier(max_depth=3, random_state=42)
tree.fit(X_train, y_train)

# Evaluate
print(f"Accuracy: {tree.score(X_test, y_test):.3f}")

# Visualize the tree
plt.figure(figsize=(15, 8))
plot_tree(tree, filled=True, feature_names=['sepal_len','sepal_wid','petal_len','petal_wid'],
          class_names=['Setosa', 'Versicolor', 'Virginica'])
plt.savefig('tree.png')

The output is a tree you can visualize and read directly to see “how the model makes decisions”, a degree of interpretability that few other algorithms offer.
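If a picture is inconvenient, the same rules can be printed as text with sklearn's `export_text`. This is a minimal sketch; the depth of 2 and fitting on the full iris dataset are choices made here to keep the printout short.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()

# A shallow tree keeps the printed rule list short.
clf = DecisionTreeClassifier(max_depth=2, random_state=42).fit(iris.data, iris.target)

# export_text renders the fitted tree as indented if/else-style rules.
rules = export_text(clf, feature_names=list(iris.feature_names))
print(rules)
```

Each indented line is one threshold test, and each `class:` line is a leaf, so the printout can be read top to bottom like the hand-drawn tree earlier.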

A Few Important Hyperparameters

Parameter	Role
`max_depth`	Max tree depth (most important to prevent overfitting)
`min_samples_split`	Minimum samples a node needs to be split
`min_samples_leaf`	Minimum samples in a leaf
`criterion`	Split-quality metric: `'gini'` or `'entropy'`

Rule of thumb: try max_depth=5 first; lower to 3 if overfitting, raise to 10 if underfitting.
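Rather than guessing, you can compare a few depths with cross-validation. The sketch below is one simple way to do that; the candidate depths and the use of the iris dataset are assumptions for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Mean 5-fold cross-validated accuracy for each candidate depth.
depth_scores = {}
for depth in (1, 2, 3, 5, 10):
    model = DecisionTreeClassifier(max_depth=depth, random_state=42)
    depth_scores[depth] = cross_val_score(model, X, y, cv=5).mean()
    print(f"max_depth={depth:2d}  CV accuracy={depth_scores[depth]:.3f}")

best_depth = max(depth_scores, key=depth_scores.get)
print(f"Best depth by CV accuracy: {best_depth}")
```

A depth-1 tree can only separate two of the three iris classes, so it underfits; beyond a certain depth, the score usually stops improving.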

Decision Trees’ Killer Feature: Interpretability

tree.feature_importances_ gives you each feature’s importance (the total impurity reduction from splits on that feature, normalized so that all importances sum to 1):

import pandas as pd

feature_names = ['sepal_len','sepal_wid','petal_len','petal_wid']
importances = pd.Series(tree.feature_importances_, index=feature_names)
print(importances.sort_values(ascending=False))

# Example output; exact values depend on the train/test split
# petal_len     0.91
# petal_wid     0.07
# sepal_len     0.02
# sepal_wid     0.00

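In this run, the model tells you that petal length drives almost every decision. Petal length and petal width are strongly correlated, though, so a different split can shift importance between them.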

This interpretability is required in scenarios like:

Bank credit models (regulators demand “why was the loan denied”)
Medical diagnosis assistance (doctors must see reasoning)
Judicial sentencing aids (must be legally explainable)

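Real fact: EU GDPR Article 22 gives individuals the right not to be subject to decisions based solely on automated processing that produce legal or similarly significant effects on them, and related GDPR provisions call for “meaningful information about the logic involved.” Requirements like these help sustain demand for decision trees (and their derivatives) in finance and medicine.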

What Trees Handle Natively

✅ Natural strengths:

Mixed numeric and categorical features, in principle (sklearn’s trees still need categories encoded as numbers; LightGBM and CatBoost accept them directly)
Different feature scales (no standardization needed)
Nonlinear relationships
Missing values (sklearn’s decision trees accept NaN from version 1.3 on)
Multi-class classification (no one-vs-rest needed)

This keeps preprocessing for decision trees light: often you can fit almost directly on the raw table.
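The "no standardization needed" point can be checked directly: rescaling features changes the thresholds but not the way the data is partitioned. The sketch below assumes arbitrary per-feature scale factors chosen only for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Rescale each feature by a different positive factor (arbitrary, for illustration).
X_rescaled = X * np.array([1000.0, 0.01, 1.0, 50.0])

tree_raw = DecisionTreeClassifier(max_depth=3, random_state=42).fit(X, y)
tree_rescaled = DecisionTreeClassifier(max_depth=3, random_state=42).fit(X_rescaled, y)

# Splits depend only on the ordering of values within a feature,
# so both trees assign the training samples to the same classes.
same_predictions = np.array_equal(tree_raw.predict(X), tree_rescaled.predict(X_rescaled))
print(f"Identical predictions after rescaling: {same_predictions}")
```

Models that rely on distances or gradient magnitudes, such as k-NN or logistic regression with regularization, do not share this property.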

Decision Trees’ Fatal Weaknesses

❌ Overfitting: A deep single tree will “memorize” the training data, fitting its noise along with the signal.

❌ Instability: A small change in the training data can produce a completely different tree structure.

❌ Hard to learn diagonal decision boundaries: Every split is axis-aligned (a “vertical” or “horizontal” cut), so a diagonal boundary must be approximated by a staircase of many splits.

❌ Limited single-tree performance: On standard benchmarks, a single tree often loses to logistic regression or an SVM.
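The overfitting point is easy to observe by comparing training and test accuracy for an unrestricted tree and a depth-limited one. This is a minimal sketch; the breast cancer dataset and the depth of 3 are choices made here for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# No depth limit: the tree keeps splitting until every leaf is pure.
deep = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
# Depth limit: the tree is forced to stop early.
shallow = DecisionTreeClassifier(max_depth=3, random_state=42).fit(X_train, y_train)

deep_train_acc = deep.score(X_train, y_train)
deep_test_acc = deep.score(X_test, y_test)
shallow_train_acc = shallow.score(X_train, y_train)
shallow_test_acc = shallow.score(X_test, y_test)

print(f"deep    (depth {deep.get_depth()}): train={deep_train_acc:.3f}  test={deep_test_acc:.3f}")
print(f"shallow (depth {shallow.get_depth()}): train={shallow_train_acc:.3f}  test={shallow_test_acc:.3f}")
```

The unrestricted tree fits the training set perfectly, and the gap between its training and test accuracy is the signature of memorization.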

But… Ensembles Are Invincible

The next story is great—

A single tree is weak, but letting many trees vote is overwhelmingly strong:

Algorithm	Idea
Random Forest	Train N trees independently, each on a bootstrap sample of the data, then vote
GBDT / XGBoost / LightGBM	Tree 1 learns part; tree 2 learns tree 1’s mistakes; tree 3 learns the remaining mistakes of trees 1+2; sum all the trees’ predictions

These two paradigms (Bagging and Boosting) are explained in detail in L2-05. XGBoost / LightGBM have dominated Kaggle’s tabular-data competitions for the last decade, often beating even deep learning there.
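As a quick preview, the sketch below compares a single tree with one bagging ensemble and one boosting ensemble under 5-fold cross-validation. It uses sklearn's own `GradientBoostingClassifier` as a stand-in for XGBoost / LightGBM (which are separate packages); the dataset and hyperparameters are assumptions for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

models = {
    "single tree": DecisionTreeClassifier(max_depth=4, random_state=42),
    "random forest (bagging)": RandomForestClassifier(n_estimators=100, random_state=42),
    # Stand-in for XGBoost / LightGBM, which follow the same boosting idea.
    "gradient boosting": GradientBoostingClassifier(random_state=42),
}

# Mean 5-fold cross-validated accuracy for each model.
results = {name: cross_val_score(model, X, y, cv=5).mean() for name, model in models.items()}
for name, score in results.items():
    print(f"{name:26s} {score:.3f}")
```

On tabular datasets like this one, the two ensembles usually score higher than the single tree, though the exact margin depends on the data and the settings.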

💡 An industrial truth

In many business cases, XGBoost still beats neural networks, especially on structured tabular data (Excel-style):

User features + behavior
Financial data
Medical features
Ranking in recommender systems

If your data is tabular + thousands to millions of samples, try XGBoost first, neural networks second.

A Complete Example

from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import classification_report

# Real data: breast cancer classification
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train
tree = DecisionTreeClassifier(max_depth=4, random_state=42)
tree.fit(X_train, y_train)

# Cross-validation (more stable evaluation)
cv_scores = cross_val_score(tree, X_train, y_train, cv=5)
print(f"CV accuracy: {cv_scores.mean():.3f} ± {cv_scores.std():.3f}")

# Test set report
y_pred = tree.predict(X_test)
print(classification_report(y_test, y_pred))

Typically reaches ~93% accuracy—single tree, 5 lines of code.

Next article: we push it to 97% with Random Forest.

Next: “Random Forest + Boosting: Turn Weak Learners Into Supermen”