Decision Trees in Python with scikit-learn

A decision tree is a supervised machine learning algorithm that makes predictions by learning a hierarchy of if-then-else rules from training data. Each internal node tests a feature, each branch represents an outcome of that test, and each leaf node holds a prediction (a class label for classification, or a numeric value for regression).

This chapter covers:

How decision trees split data using impurity measures (Gini and entropy)
Building a classification tree and a regression tree in Python with scikit-learn
Controlling tree depth and preventing overfitting with hyperparameters
Visualizing and inspecting a trained tree
Advantages, limitations, and when to use decision trees

How a Decision Tree Splits Data

When training, the algorithm searches every feature and every possible threshold to find the split that most reduces impurity — a measure of how mixed the classes are in a node.

Two impurity measures are common in scikit-learn:

Gini Impurity

Gini impurity measures the probability of incorrectly classifying a randomly chosen sample if it were labeled according to the class distribution in the node.

Gini(node) = 1 - Σ pᵢ²

Here pᵢ is the fraction of samples in the node that belong to class i. A pure node (all samples belong to one class) has Gini = 0. For binary classification the maximum is 0.5, reached when the two classes are equally frequent.
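
As a quick check of the formula, the following minimal sketch computes Gini impurity for a few hand-made label lists. The helper name gini is illustrative and is not part of scikit-learn.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a sequence of class labels: 1 - sum(p_i ** 2)."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())


pure = gini(["a", "a", "a", "a"])     # one class only
mixed = gini(["a", "a", "b", "b"])    # two classes, equally frequent
three_way = gini(["a", "b", "c"])     # three classes, equally frequent

print(pure, mixed, round(three_way, 3))  # 0.0 0.5 0.667
```

With three equally frequent classes the impurity rises to about 0.667, so the 0.5 ceiling applies only to the two-class case.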

Entropy and Information Gain

Entropy comes from information theory. It is maximized when classes are equally distributed and zero when the node is pure.

Entropy(node) = -Σ pᵢ log₂(pᵢ)

Information gain is the drop in entropy produced by a split: the parent node's entropy minus the size-weighted average entropy of its children. The algorithm picks the split that yields the largest information gain. In scikit-learn, you choose between the two measures via the criterion parameter ("gini" is the default).
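
The sketch below works through entropy and information gain on a toy node with eight samples. The function names are illustrative, and the children are weighted by their share of the parent's samples, which is the usual definition.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (base 2) of a sequence of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return sum(-(c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(parent, left, right):
    """Parent entropy minus the size-weighted entropy of the two children."""
    n = len(parent)
    children = len(left) / n * entropy(left) + len(right) / n * entropy(right)
    return entropy(parent) - children


parent = [0, 0, 0, 0, 1, 1, 1, 1]

# A split that separates the classes perfectly
perfect_gain = information_gain(parent, [0, 0, 0, 0], [1, 1, 1, 1])
# A split that leaves both children as mixed as the parent
useless_gain = information_gain(parent, [0, 0, 1, 1], [0, 0, 1, 1])

print(entropy(parent), perfect_gain, useless_gain)  # 1.0 1.0 0.0
```

The perfect split recovers the full 1 bit of entropy in the parent, while the uninformative split gains nothing.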

Recursive Splitting

Splitting repeats recursively on each child node until a stopping condition is met: the node is pure, no feature improves impurity, or a depth/size limit is reached. This produces the binary tree structure.
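
To make the recursion concrete, here is a toy tree builder in plain Python. It is a sketch under simplifying assumptions (Gini impurity, candidate thresholds at midpoints between distinct values, a depth limit as the only size control) and is not how scikit-learn implements its optimized splitter. All function names are illustrative.

```python
from collections import Counter


def gini(labels):
    """Gini impurity: 1 - sum of squared class fractions."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def best_split(rows, labels):
    """Try every feature and every midpoint threshold; return the
    (feature, threshold, weighted_gini) with the lowest weighted impurity."""
    best = None
    n = len(rows)
    for f in range(len(rows[0])):
        values = sorted({row[f] for row in rows})
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2
            left = [y for row, y in zip(rows, labels) if row[f] <= t]
            right = [y for row, y in zip(rows, labels) if row[f] > t]
            score = len(left) / n * gini(left) + len(right) / n * gini(right)
            if best is None or score < best[2]:
                best = (f, t, score)
    return best


def build(rows, labels, depth=0, max_depth=3):
    """Return a leaf (a class label) or a dict describing an internal node."""
    majority = Counter(labels).most_common(1)[0][0]
    # Stop: the node is pure or the depth limit is reached
    if len(set(labels)) == 1 or depth == max_depth:
        return majority
    split = best_split(rows, labels)
    # Stop: no candidate split, or no split reduces impurity
    if split is None or split[2] >= gini(labels):
        return majority
    f, t, _ = split
    left = [(r, y) for r, y in zip(rows, labels) if r[f] <= t]
    right = [(r, y) for r, y in zip(rows, labels) if r[f] > t]
    return {
        "feature": f,
        "threshold": t,
        "left": build([r for r, _ in left], [y for _, y in left], depth + 1, max_depth),
        "right": build([r for r, _ in right], [y for _, y in right], depth + 1, max_depth),
    }


def predict(node, row):
    """Walk from the root to a leaf and return the leaf's label."""
    while isinstance(node, dict):
        node = node["left"] if row[node["feature"]] <= node["threshold"] else node["right"]
    return node


rows = [[1.0, 5.0], [2.0, 4.0], [3.0, 1.0], [4.0, 2.0], [5.0, 6.0], [6.0, 7.0]]
labels = ["a", "a", "b", "b", "c", "c"]

tree = build(rows, labels)
print(tree)
print(predict(tree, [3.5, 0.0]))  # b
```

On this toy data the root splits on feature 0 at 2.5, and the right child splits again at 4.5, giving three pure leaves.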

Classification Tree in Python

The Iris dataset has 150 samples and 4 numeric features. The goal is to predict one of three flower species.

from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report

# Load dataset
data = load_iris()
X, y = data.data, data.target

# Split: 80 % train, 20 % test
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Train — limit depth to 3 to keep the tree readable
clf = DecisionTreeClassifier(criterion="gini", max_depth=3, random_state=42)
clf.fit(X_train, y_train)

# Evaluate
y_pred = clf.predict(X_test)
print(f"Accuracy: {accuracy_score(y_test, y_pred):.2f}")
print(classification_report(y_test, y_pred, target_names=data.target_names))

Expected output:

Accuracy: 1.00
              precision    recall  f1-score   support

      setosa       1.00      1.00      1.00        10
  versicolor       1.00      1.00      1.00         9
   virginica       1.00      1.00      1.00        11

    accuracy                           1.00        30
   macro avg       1.00      1.00      1.00        30
weighted avg       1.00      1.00      1.00        30

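On this particular split, a depth-3 tree classifies all 30 test samples correctly. Iris is a small, clean dataset whose classes separate well on the petal measurements; real-world datasets will be messier, and a perfect score on a 30-sample test set should not be read as a guarantee.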

Predicting New Samples

After training, call predict() to classify new observations and predict_proba() to get class probabilities:

import numpy as np

# A new flower: sepal length 5.1, sepal width 3.5, petal length 1.4, petal width 0.2
new_sample = np.array([[5.1, 3.5, 1.4, 0.2]])

predicted_class = clf.predict(new_sample)
predicted_proba = clf.predict_proba(new_sample)

print("Predicted class:", data.target_names[predicted_class[0]])
print("Class probabilities:", predicted_proba)

Expected output:

Predicted class: setosa
Class probabilities: [[1. 0. 0.]]
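
The relationship between the two methods can be checked directly. The sketch below is self-contained and refits a tree on the full Iris dataset (rather than reusing the train/test split above) to confirm that each row of predict_proba() sums to 1 and that predict() returns the class with the highest probability.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

data = load_iris()
clf = DecisionTreeClassifier(max_depth=3, random_state=42)
clf.fit(data.data, data.target)

# One row per sample, one column per class (fraction of that class in the leaf)
proba = clf.predict_proba(data.data)

# Pick the most probable class for each sample
labels_from_proba = clf.classes_[np.argmax(proba, axis=1)]

rows_sum_to_one = np.allclose(proba.sum(axis=1), 1.0)
same_as_predict = np.array_equal(labels_from_proba, clf.predict(data.data))

print(rows_sum_to_one, same_as_predict)  # True True
```

A probability of exactly 1.0, as in the setosa example, means the sample landed in a leaf that contained only one class during training.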

Regression Tree in Python

Decision trees also handle continuous targets. Use DecisionTreeRegressor instead of DecisionTreeClassifier.

from sklearn.tree import DecisionTreeRegressor
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score
import numpy as np

# Synthetic regression dataset
X_reg, y_reg = make_regression(
    n_samples=300, n_features=5, noise=20, random_state=42
)

X_train_r, X_test_r, y_train_r, y_test_r = train_test_split(
    X_reg, y_reg, test_size=0.2, random_state=42
)

reg = DecisionTreeRegressor(max_depth=5, random_state=42)
reg.fit(X_train_r, y_train_r)

y_pred_r = reg.predict(X_test_r)

mse = mean_squared_error(y_test_r, y_pred_r)
r2 = r2_score(y_test_r, y_pred_r)
print(f"MSE : {mse:.2f}")
print(f"R²  : {r2:.2f}")

By default (criterion="squared_error"), a regression tree chooses each split to minimize the mean squared error (MSE) within the resulting child nodes, and it predicts the mean target value of all training samples that reach a leaf.
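
The leaf-mean behavior can be verified with the apply() method, which returns the index of the leaf each sample falls into. This is a minimal, self-contained sketch; it uses a shallower tree (max_depth=3) than the example above purely to keep the number of leaves small.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=300, n_features=5, noise=20, random_state=42)
reg = DecisionTreeRegressor(max_depth=3, random_state=42).fit(X, y)

leaf_ids = reg.apply(X)        # leaf index for every training sample
predictions = reg.predict(X)

# For each leaf, compare the prediction with the mean target of its samples
matches = []
for leaf in np.unique(leaf_ids):
    in_leaf = leaf_ids == leaf
    matches.append(np.allclose(predictions[in_leaf], y[in_leaf].mean()))

print("Leaves:", reg.get_n_leaves())
print("Every prediction equals its leaf mean:", all(matches))
```

Because every sample in a leaf receives the same value, a regression tree produces a piecewise-constant prediction with at most as many distinct outputs as it has leaves.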

Tuning Hyperparameters

Without limits, a decision tree will grow until every leaf is pure, perfectly memorizing the training set (overfitting). Hyperparameters control tree complexity:

Parameter	Default	Effect
`max_depth`	`None`	Maximum number of levels. Lower = simpler tree.
`min_samples_split`	`2`	Minimum samples required to split a node. Higher = fewer splits.
`min_samples_leaf`	`1`	Minimum samples required in a leaf. Higher = smoother boundaries.
`max_features`	`None`	Number of features to consider at each split. `None` means all features; smaller values add randomness and speed up training.
`criterion`	`"gini"`	Split quality measure. Classifiers: `"gini"` (default) or `"entropy"`; regressors: `"squared_error"` (default), among others.
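
To see how max_depth constrains the tree, the sketch below trains the same classifier at several depth limits and reports the resulting depth, leaf count, and train/test accuracy. It reuses the Iris split from earlier in the chapter; the results list is only there to collect the numbers.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

results = []
for depth in [1, 2, 3, None]:
    clf = DecisionTreeClassifier(max_depth=depth, random_state=42)
    clf.fit(X_train, y_train)
    results.append(
        (depth, clf.get_depth(), clf.get_n_leaves(),
         clf.score(X_train, y_train), clf.score(X_test, y_test))
    )

for depth, actual_depth, leaves, train_acc, test_acc in results:
    print(
        f"max_depth={str(depth):>4}  depth={actual_depth}  leaves={leaves}  "
        f"train={train_acc:.3f}  test={test_acc:.3f}"
    )
```

A growing gap between training and test accuracy as the limit is relaxed is the usual sign of overfitting; on a dataset as easy as Iris the gap may be small.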

Use cross-validation and grid search to find the best combination:

from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV

data = load_iris()
X, y = data.data, data.target

param_grid = {
    "max_depth": [2, 3, 4, 5, None],
    "min_samples_split": [2, 5, 10],
    "criterion": ["gini", "entropy"],
}

grid_search = GridSearchCV(
    DecisionTreeClassifier(random_state=42),
    param_grid,
    cv=5,
    scoring="accuracy",
)
grid_search.fit(X, y)

print("Best params :", grid_search.best_params_)
print(f"Best CV score: {grid_search.best_score_:.3f}")

Expected output (values may vary slightly across scikit-learn versions):

Best params : {'criterion': 'gini', 'max_depth': 3, 'min_samples_split': 2}
Best CV score: 0.973

Handling Categorical Features

scikit-learn decision trees require numeric input. Encode categorical columns before training:

Ordinal categories (e.g., size: small < medium < large): use OrdinalEncoder.
Nominal categories (e.g., color: red, green, blue): use OneHotEncoder to avoid implying an order.

from sklearn.preprocessing import OrdinalEncoder
import numpy as np

# Encode only the categorical column; keep the numeric column as-is
sizes = np.array([["small"], ["large"], ["medium"], ["large"]])
weights = np.array([1.2, 3.4, 2.1, 4.0])

# Explicit category order matching the ranking: small=0, medium=1, large=2
enc = OrdinalEncoder(categories=[["small", "medium", "large"]])
sizes_encoded = enc.fit_transform(sizes)

X_encoded = np.column_stack([sizes_encoded, weights])
print(X_encoded)

Expected output:

[[0.  1.2]
 [2.  3.4]
 [1.  2.1]
 [2.  4. ]]
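
For nominal categories, a comparable sketch with OneHotEncoder looks like this. The colors array is made-up illustrative data, and the explicit categories list is optional; it is passed here only to fix the column order as red, green, blue.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

colors = np.array([["red"], ["green"], ["blue"], ["green"]])

# One output column per category, in the order given
enc = OneHotEncoder(categories=[["red", "green", "blue"]])

# fit_transform returns a sparse matrix by default; convert it for printing
colors_encoded = enc.fit_transform(colors).toarray()
print(colors_encoded)
```

Expected output:

```
[[1. 0. 0.]
 [0. 1. 0.]
 [0. 0. 1.]
 [0. 1. 0.]]
```

Each row contains a single 1, so no ordering between colors is implied.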

See the Categorical Data chapter for a full walkthrough.

Visualizing a Decision Tree

Inspecting the tree structure reveals which features drive the most splits and makes the model auditable.

Text Representation

from sklearn.tree import DecisionTreeClassifier, export_text
from sklearn.datasets import load_iris

data = load_iris()
clf = DecisionTreeClassifier(max_depth=2, random_state=42)
clf.fit(data.data, data.target)

print(export_text(clf, feature_names=list(data.feature_names)))

Expected output:

|--- petal length (cm) <= 2.45
|   |--- class: 0
|--- petal length (cm) >  2.45
|   |--- petal width (cm) <= 1.75
|   |   |--- class: 1
|   |--- petal width (cm) >  1.75
|   |   |--- class: 2

Graphical Plot

import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.datasets import load_iris

data = load_iris()
clf = DecisionTreeClassifier(max_depth=2, random_state=42)
clf.fit(data.data, data.target)

plt.figure(figsize=(10, 5))
plot_tree(
    clf,
    # Plain lists avoid argument-validation errors seen in some scikit-learn releases
    feature_names=list(data.feature_names),
    class_names=list(data.target_names),
    filled=True,
    rounded=True,
)
plt.title("Iris Decision Tree (max_depth=2)")
plt.tight_layout()
plt.savefig("iris_tree.png", dpi=150)
plt.show()

filled=True colors each node by its majority class; darker shades mean higher class purity.

Feature Importance

After training, feature_importances_ gives each feature a score between 0 and 1, where higher means the feature contributed more to reducing impurity across all splits:

from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import load_iris
import numpy as np

data = load_iris()
clf = DecisionTreeClassifier(max_depth=3, random_state=42)
clf.fit(data.data, data.target)

importances = clf.feature_importances_
for name, imp in sorted(
    zip(data.feature_names, importances), key=lambda x: x[1], reverse=True
):
    print(f"{name:30s}: {imp:.4f}")

Expected output:

petal length (cm)             : 0.5856
petal width (cm)              : 0.4144
sepal length (cm)             : 0.0000
sepal width (cm)              : 0.0000

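Features with an importance of 0 were never used by any split in this tree. They are candidates for removal, but confirm with cross-validation first: a feature can score 0 simply because a correlated feature was chosen in its place.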
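
The link between importances and splits can be inspected through the fitted tree's tree_ attribute. The sketch below assumes the same Iris model as above; it checks that the importances sum to 1 and that every feature with a positive importance appears in at least one split.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

data = load_iris()
clf = DecisionTreeClassifier(max_depth=3, random_state=42)
clf.fit(data.data, data.target)

importances = clf.feature_importances_
total = importances.sum()

# tree_.feature stores the feature index tested at each node;
# leaf nodes carry a negative placeholder value
split_features = {int(f) for f in clf.tree_.feature if f >= 0}
positive = {i for i, imp in enumerate(importances) if imp > 0}

print(f"Sum of importances: {total:.4f}")
print("Features used in splits:", [data.feature_names[i] for i in sorted(split_features)])
```

Because the scores are normalized to sum to 1, they describe relative contribution within this one tree and are not comparable across models.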

Advantages and Limitations

When to use decision trees

You need an interpretable model — the rules can be printed as plain English.
Your dataset contains a mix of numeric and categorical features (after encoding).
You want a quick baseline before trying ensemble methods.
The relationship between features and target is non-linear or involves interactions.

Limitations

Limitation	Mitigation
Overfits easily without tuning	Constrain `max_depth`, `min_samples_leaf`; use cross-validation
High variance (small data changes → different tree)	Use ensemble methods: Random Forest / Bootstrap Aggregation
Biased toward features with more unique values	Use `max_features` or normalize split criteria
Poor at extrapolating beyond training data range	Prefer linear models for extrapolation tasks
Axis-aligned splits only	Oblique trees exist but are not in scikit-learn
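
The extrapolation limitation is easy to demonstrate. In this minimal sketch, a regression tree is trained on the exact line y = 2x for x from 0 to 10 (toy data chosen for illustration) and then asked to predict far outside that range.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

X_train = np.arange(0, 11, dtype=float).reshape(-1, 1)  # x = 0, 1, ..., 10
y_train = 2.0 * X_train.ravel()                         # y = 2x

reg = DecisionTreeRegressor(random_state=0).fit(X_train, y_train)

inside = reg.predict([[5.0]])[0]             # within the training range
outside = reg.predict([[20.0], [100.0]])     # far beyond the training range

print(inside)   # 10.0
print(outside)  # [20. 20.]
```

Both out-of-range inputs fall into the rightmost leaf, so the tree returns the largest target it saw during training (20) instead of continuing the trend to 40 and 200.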

Decision Trees vs. Related Algorithms

Algorithm	Key difference
Logistic Regression	Linear boundary; better for linearly separable data; does not handle interactions automatically
K-Nearest Neighbors	Instance-based; no explicit model; requires feature scaling
Decision Tree	Non-linear; no scaling needed; highly interpretable
Random Forest (see Bootstrap Aggregation)	Ensemble of many trees; much lower variance; less interpretable

Key Takeaways

Decision trees split data by maximizing information gain (or minimizing Gini impurity) at each node; the process repeats recursively.
DecisionTreeClassifier and DecisionTreeRegressor in scikit-learn share the same API and hyperparameter names.
Always set max_depth or min_samples_leaf to prevent overfitting; tune them with grid search and cross-validation.
feature_importances_ reveals which features the tree relies on most — useful for feature selection.
Single trees are a good interpretable baseline, but ensemble methods like Random Forest usually outperform them on real-world data because averaging many trees reduces variance.
