HelloAI
L2 Chapter 4 🐣 🕒 11 min

Decision Trees: The Most Interpretable ML Algorithm

A chain of yes/no questions forms a tree of decisions. It's the foundation for XGBoost and LightGBM—the champions of Kaggle.

HelloAI Editors
6/27/2026

Logistic regression is strong, but has one limitation: it can only learn linear decision boundaries.

If your data is inherently nonlinear—e.g., “high risk when age < 18 or > 60, low in middle”—logistic regression struggles.

Decision trees handle such problems well: they make decisions through a chain of “if-else” branches, which makes them arguably the most human-like ML algorithm.
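To make the age example concrete, here is a minimal sketch on synthetic data. The data generation, sample size, and seeds are assumptions chosen for illustration, not part of the original article.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data (an assumption for illustration): risk is high at both ends of the age range
rng = np.random.default_rng(0)
age = rng.uniform(0, 90, size=2000)
high_risk = ((age < 18) | (age > 60)).astype(int)

X = age.reshape(-1, 1)
X_train, X_test, y_train, y_test = train_test_split(
    X, high_risk, test_size=0.25, random_state=0
)

# A linear model gets a single threshold on age, so it cannot isolate the middle band
logreg_acc = LogisticRegression().fit(X_train, y_train).score(X_test, y_test)

# Two levels of splits (around 18 and 60) are enough for a tree
tree_acc = (
    DecisionTreeClassifier(max_depth=2, random_state=0)
    .fit(X_train, y_train)
    .score(X_test, y_test)
)

print(f"Logistic regression: {logreg_acc:.3f}")
print(f"Decision tree:       {tree_acc:.3f}")
```

On data like this, the tree should score close to 1.0, while logistic regression is capped by whatever a single cut on age can achieve.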

An Example

Task: predict whether someone will buy a phone.

A human sales rep might think:

Budget > $700?
├── Yes: Age < 35?
│       ├── Yes → likely buyer (high-spending young person)
│       └── No: Has it been 2+ years since last upgrade?
│               ├── Yes → moderate chance
│               └── No → likely won't
└── No: Gender?
        ├── Male → check if gamer
        ├── Female → check if photography hobbyist
        └── ...

This is a decision tree—a tree of decisions from root to leaves.
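Written as code, the sales rep's reasoning is just nested if-else statements. The sketch below covers only the “Yes” side of the budget question; the function name, argument names, and the "undecided" placeholder are hypothetical, and the “No” side is left open because the example does not spell it out.

```python
def predict_buyer(budget, age, years_since_upgrade):
    """Hand-coded version of the 'Budget > $700: Yes' branch of the example tree."""
    if budget > 700:
        if age < 35:
            return "likely buyer"
        if years_since_upgrade >= 2:
            return "moderate chance"
        return "likely won't"
    # The 'No' branch continues with further questions that the example leaves open
    return "undecided"


print(predict_buyer(budget=900, age=28, years_since_upgrade=1))  # likely buyer
print(predict_buyer(budget=900, age=50, years_since_upgrade=3))  # moderate chance
print(predict_buyer(budget=900, age=50, years_since_upgrade=1))  # likely won't
```

A trained decision tree has exactly this shape; the difference is that the questions and thresholds are learned from data instead of written by hand.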

Tree Structure

| Element | Meaning |
|---|---|
| Root node | First decision |
| Internal nodes | Each intermediate decision |
| Leaf nodes | Final predictions |
| Branches | “Yes/No” paths |

Each internal node asks a question about one feature: “is this feature > some threshold?”

How to “Train” a Tree

Intuition: at each node, find the feature + threshold that best separates the classes.

Two common metrics for “how well does it split”:

1. Gini Impurity

$$G = 1 - \sum_k p_k^2$$

$p_k$ is the proportion of class $k$ in that node.

  • All same class (pure): $G = 0$
  • Classes evenly mixed (most impure): $G$ is at its maximum
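The formula is short enough to compute by hand. A minimal sketch (the helper name `gini` is ours, not a library function):

```python
from collections import Counter


def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


print(gini(["a", "a", "a", "a"]))  # 0.0 (pure node)
print(gini(["a", "a", "b", "b"]))  # 0.5 (two classes, evenly mixed)
```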

2. Entropy

$$H = -\sum_k p_k \log p_k$$

We covered this in L1-05. Higher entropy = less pure.
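The same kind of sketch works for entropy. The formula above does not fix the base of the logarithm; base 2 (so the result is in bits) is an assumption made here, and the helper name `entropy` is ours.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy of the class proportions, in bits (log base 2 assumed)."""
    n = len(labels)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(labels).values())


print(entropy(["a", "a", "a", "a"]))  # 0.0 (pure node)
print(entropy(["a", "a", "b", "b"]))  # 1.0 (two classes, evenly mixed)
```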

During training: at each node, iterate over all features and candidate thresholds, and pick the split that minimizes the weighted average impurity of the two child nodes (weighted by how many samples land in each).

This is a greedy algorithm—doesn’t guarantee global optimum, but works well in practice.
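The search at a single node can be sketched in a few lines of NumPy. This is a simplified illustration, not sklearn's actual implementation: it uses Gini impurity, tries midpoints between consecutive distinct values as thresholds, and the names `best_split` and `gini` are ours.

```python
import numpy as np


def gini(y):
    """Gini impurity of a label array."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)


def best_split(X, y):
    """Return (feature index, threshold, weighted child impurity) of the best single split."""
    n_samples, n_features = X.shape
    best = (None, None, np.inf)
    for j in range(n_features):
        values = np.unique(X[:, j])
        # Candidate thresholds: midpoints between consecutive distinct values
        for t in (values[:-1] + values[1:]) / 2:
            left = y[X[:, j] <= t]
            right = y[X[:, j] > t]
            score = (len(left) * gini(left) + len(right) * gini(right)) / n_samples
            if score < best[2]:
                best = (j, t, score)
    return best


# Toy data: feature 0 separates the classes cleanly, feature 1 does not
X = np.array([[1.0, 5.0], [2.0, 3.0], [3.0, 6.0], [4.0, 4.0]])
y = np.array([0, 0, 1, 1])

feature, threshold, impurity = best_split(X, y)
print(feature, threshold, impurity)  # feature 0, threshold 2.5, impurity 0.0
```

Building a full tree means applying `best_split` recursively to each child until a stopping rule (such as a maximum depth) is hit. Each node is optimized on its own, which is why the procedure is greedy.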

Run with sklearn

from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt

# Classic iris dataset
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)  # fixed seed for a reproducible split

# Train
tree = DecisionTreeClassifier(max_depth=3, random_state=42)
tree.fit(X_train, y_train)

# Evaluate
print(f"Accuracy: {tree.score(X_test, y_test):.3f}")

# Visualize the tree
plt.figure(figsize=(15, 8))
plot_tree(tree, filled=True, feature_names=['sepal_len','sepal_wid','petal_len','petal_wid'],
          class_names=['Setosa', 'Versicolor', 'Virginica'])
plt.savefig('tree.png')

The output is a tree you can visualize and read directly to see how the model makes decisions, a level of interpretability that few other algorithms offer.
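If you would rather read the rules as text than as an image, sklearn's `export_text` prints the same tree as indented if-else rules. A minimal sketch, refitting on the full iris data for brevity (the short feature names are the same informal labels used above):

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
tree = DecisionTreeClassifier(max_depth=3, random_state=42).fit(X, y)

# One line per node: the split condition, or the predicted class at a leaf
rules = export_text(
    tree, feature_names=['sepal_len', 'sepal_wid', 'petal_len', 'petal_wid']
)
print(rules)
```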

A Few Important Hyperparameters

| Parameter | Role |
|---|---|
| max_depth | Max tree depth (most important to prevent overfitting) |
| min_samples_split | Minimum samples a node needs to be split |
| min_samples_leaf | Minimum samples in a leaf |
| criterion | 'gini' or 'entropy' |

Rule of thumb: try max_depth=5 first; lower to 3 if overfitting, raise to 10 if underfitting.
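One way to apply this rule of thumb is to compare cross-validated accuracy at a few depths. A minimal sketch on iris; the depth values come from the rule above, while the dataset choice and `cv=5` are assumptions.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

cv_by_depth = {}
for depth in (3, 5, 10):
    model = DecisionTreeClassifier(max_depth=depth, random_state=42)
    # Mean accuracy over 5 folds for this depth
    cv_by_depth[depth] = cross_val_score(model, X, y, cv=5).mean()
    print(f"max_depth={depth}: CV accuracy {cv_by_depth[depth]:.3f}")
```

If accuracy stops improving (or drops) as depth grows, the shallower tree is usually the better choice.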

Decision Trees’ Killer Feature: Interpretability

tree.feature_importances_ gives you each feature’s importance: the total impurity reduction from the splits that use that feature, normalized so the values sum to 1:

import pandas as pd

feature_names = ['sepal_len','sepal_wid','petal_len','petal_wid']
importances = pd.Series(tree.feature_importances_, index=feature_names)
print(importances.sort_values(ascending=False))

# petal_len     0.91
# petal_wid     0.07
# sepal_len     0.02
# sepal_wid     0.00

The model tells you: petal length determines almost everything.

This interpretability is required in scenarios like:

  • Bank credit models (regulators demand “why was the loan denied”)
  • Medical diagnosis assistance (doctors must see reasoning)
  • Judicial sentencing aids (must be legally explainable)

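Real fact: EU GDPR Article 22 gives individuals the right not to be subject to a decision based solely on automated processing that has legal or similarly significant effects on them, and related provisions require “meaningful information about the logic involved.” Rules like this sustain demand for interpretable models such as decision trees (and their derivatives) in finance and medicine.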

What Trees Handle Natively

Natural strengths:

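  • Mixed numeric and categorical features (the algorithm needs no one-hot encoding in principle, though sklearn’s implementation expects numeric input, so categories still have to be encoded there)
  • Different feature scales (no standardization needed)
  • Nonlinear relationships
  • Missing values (sklearn’s tree implementation supports them from version 1.3)
  • Multi-class classification (no one-vs-rest needed)

This means decision trees need very little preprocessing: in most cases you can just fit.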
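The “no standardization needed” point is easy to check: a split threshold only depends on the ordering of feature values, so rescaling a feature should not change which samples go left or right. A minimal sketch comparing a tree trained on raw features with one trained on standardized features (dataset and seeds are assumptions):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Tree on raw features
raw_tree = DecisionTreeClassifier(max_depth=3, random_state=42).fit(X_train, y_train)

# Same tree settings on standardized features
scaler = StandardScaler().fit(X_train)
scaled_tree = DecisionTreeClassifier(max_depth=3, random_state=42).fit(
    scaler.transform(X_train), y_train
)

# Fraction of test samples on which the two trees predict the same class
agreement = np.mean(
    raw_tree.predict(X_test) == scaled_tree.predict(scaler.transform(X_test))
)
print(f"Prediction agreement: {agreement:.3f}")
```

The two trees should agree on essentially every test sample. Try the same experiment with logistic regression or an SVM and scaling starts to matter.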

Decision Trees’ Fatal Weaknesses

Overfitting: A deep single tree will “memorize” the training data, fitting its noise along with the signal.

Instability: A small change in the training data can produce a completely different tree structure.

Hard to learn diagonal decision boundaries: Each split is an axis-aligned (“vertical” or “horizontal”) cut, so a diagonal boundary has to be approximated by a staircase of many splits.

Limited single-tree performance: On standard benchmarks, a single tree often loses to logistic regression or an SVM.
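The overfitting problem shows up clearly on noisy data. The sketch below uses a synthetic dataset with deliberately randomized labels (`flip_y=0.2`); the dataset parameters and seeds are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with label noise: flip_y=0.2 assigns random labels to about 20% of samples
X, y = make_classification(
    n_samples=1000, n_features=20, n_informative=5, flip_y=0.2, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

results = {}
for depth in (3, None):  # None lets the tree grow until every leaf is pure
    model = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_train, y_train)
    results[depth] = (model.score(X_train, y_train), model.score(X_test, y_test))
    print(f"max_depth={depth}: train {results[depth][0]:.3f}, test {results[depth][1]:.3f}")
```

The unrestricted tree should fit the training set almost perfectly, noise included, and then lose much of that accuracy on the test set. That train/test gap is the signature of memorization.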

But… Ensembles Are Invincible

The story gets better from here.

A single tree is weak, but letting many trees vote is overwhelmingly strong:

| Algorithm | Idea |
|---|---|
| Random Forest | Train N independent trees, vote |
| GBDT / XGBoost / LightGBM | Tree 1 learns part; tree 2 learns tree 1’s mistakes; tree 3 learns tree 1+2’s mistakes; sum |

These two paradigms (Bagging and Boosting) are explained in detail in L2-05. XGBoost and LightGBM have dominated Kaggle’s tabular-data competitions for the last decade, often beating even deep learning there.
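As a quick preview, both paradigms are available in sklearn itself. The sketch below uses `RandomForestClassifier` for bagging and `GradientBoostingClassifier` as a stand-in for XGBoost / LightGBM (those are separate packages with their own APIs); the dataset and settings are assumptions.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

models = {
    "single tree": DecisionTreeClassifier(max_depth=4, random_state=42),
    "random forest (bagging)": RandomForestClassifier(n_estimators=100, random_state=42),
    "gradient boosting": GradientBoostingClassifier(random_state=42),
}

# 5-fold cross-validated accuracy for each model
scores = {name: cross_val_score(model, X, y, cv=5).mean() for name, model in models.items()}
for name, score in scores.items():
    print(f"{name}: {score:.3f}")
```

Exact numbers depend on the library version and seeds, so treat the output as a rough comparison, not a benchmark.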

💡 An industrial truth

In many business cases, XGBoost still beats neural networks, especially on structured tabular data (Excel-style):

  • User features + behavior
  • Financial data
  • Medical features
  • Ranking in recommender systems

If your data is tabular + thousands to millions of samples, try XGBoost first, neural networks second.

A Complete Example

from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import classification_report

# Real data: breast cancer classification
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train
tree = DecisionTreeClassifier(max_depth=4, random_state=42)
tree.fit(X_train, y_train)

# Cross-validation (more stable evaluation)
cv_scores = cross_val_score(tree, X_train, y_train, cv=5)
print(f"CV accuracy: {cv_scores.mean():.3f} ± {cv_scores.std():.3f}")

# Test set report
y_pred = tree.predict(X_test)
print(classification_report(y_test, y_pred))

Typically reaches ~93% accuracy—single tree, 5 lines of code.

Next article: we push it to 97% with Random Forest.

Next: “Random Forest + Boosting: Turn Weak Learners Into Supermen”