Python Confusion Matrix Explained with Examples

A confusion matrix is a table that summarizes the prediction results of a classification model by comparing actual labels against predicted labels. It is the starting point for computing precision, recall, F1-score, and most other classification metrics.

This chapter covers:

What the four cells (TP, FP, TN, FN) mean and why each matters
How to compute common metrics from those cells
Building and visualizing a confusion matrix in Python with scikit-learn and seaborn
Multi-class confusion matrices
Common pitfalls — particularly on imbalanced datasets

What Is a Confusion Matrix?

For a binary classifier (two possible classes: positive and negative), the confusion matrix is a 2×2 table:

	Predicted Positive	Predicted Negative
Actual Positive	True Positive (TP)	False Negative (FN)
Actual Negative	False Positive (FP)	True Negative (TN)

Each cell counts a specific type of outcome:

Term	Abbreviation	Meaning
True Positive	TP	Model says positive; reality is positive ✓
True Negative	TN	Model says negative; reality is negative ✓
False Positive	FP	Model says positive; reality is negative ✗ (Type I error)
False Negative	FN	Model says negative; reality is positive ✗ (Type II error)

A quick memory aid: the first word (True / False) tells you whether the prediction was correct; the second word (Positive / Negative) tells you what the model predicted.
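The four definitions above can be turned into a small counting routine. The following is a minimal sketch in plain Python; the helper name `count_cells` and its `positive` parameter are hypothetical and exist only for illustration, and the label lists are the same toy data this chapter uses in its scikit-learn examples.

```python
def count_cells(y_true, y_pred, positive=1):
    """Return (tp, fp, tn, fn) for a binary problem (illustrative helper)."""
    tp = fp = tn = fn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive:
            # The model said "positive": true if reality agrees, false otherwise
            if actual == positive:
                tp += 1
            else:
                fp += 1
        else:
            # The model said "negative": false if reality was positive
            if actual == positive:
                fn += 1
            else:
                tn += 1
    return tp, fp, tn, fn


y_true = [1, 1, 0, 0, 1, 0, 1, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0, 1, 0, 1, 1]

tp, fp, tn, fn = count_cells(y_true, y_pred)
print(f"TP={tp}, FP={fp}, TN={tn}, FN={fn}")
```

The outer branch mirrors the second word of each term (what the model predicted), and the inner branch mirrors the first word (whether that prediction was right).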

Metrics Derived from the Matrix

The four counts feed into every standard classification metric:

Metric	Formula	What it measures
Accuracy	`(TP + TN) / (TP + TN + FP + FN)`	Overall fraction of correct predictions
Precision	`TP / (TP + FP)`	Of all positive predictions, how many were right
Recall (Sensitivity)	`TP / (TP + FN)`	Of all actual positives, how many were found
Specificity	`TN / (TN + FP)`	Of all actual negatives, how many were correctly ruled out
F1-Score	`2 × Precision × Recall / (Precision + Recall)`	Harmonic mean of precision and recall

When to prioritize precision vs. recall

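Prioritize recall when missing a positive case is costly: medical screening, fraud detection, safety-critical fault detection.
Prioritize precision when false alarms are costly: surgery recommendations, legal document flagging, push notification systems, spam filters where sending a legitimate email to the junk folder is worse than letting one spam message through.
F1-score balances both and is a common single-number summary when precision and recall matter about equally; on imbalanced data it is more informative than accuracy.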
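In practice, the trade-off is often controlled by the decision threshold applied to a model's scores. The sketch below uses made-up scores and labels, and a hypothetical helper `precision_recall`, to illustrate how raising the threshold tends to favor precision while lowering it tends to favor recall. It is an illustration under these assumptions, not a recipe for choosing a threshold.

```python
def precision_recall(y_true, scores, threshold):
    """Precision and recall when scores >= threshold are called positive."""
    preds = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for t, p in zip(y_true, preds) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, preds) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, preds) if t == 1 and p == 0)
    # Guard against empty denominators (no positive predictions / no positives)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical model scores (higher = more likely positive) and true labels
scores = [0.95, 0.85, 0.70, 0.60, 0.45, 0.30, 0.20, 0.10]
y_true = [1, 1, 0, 1, 1, 0, 0, 0]

for threshold in (0.25, 0.50, 0.80):
    p, r = precision_recall(y_true, scores, threshold)
    print(f"threshold={threshold:.2f}  precision={p:.3f}  recall={r:.3f}")
```

On this toy data the strict threshold (0.80) yields perfect precision but finds only half of the positives, while the lenient threshold (0.25) finds every positive at the cost of two false alarms.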

Worked Numerical Example

Suppose a model screens 100 patients for a disease:

TP = 50 (sick patients correctly identified)
FP = 5 (healthy patients incorrectly flagged as sick)
FN = 10 (sick patients missed)
TN = 35 (healthy patients correctly cleared)

Step-by-step calculations:

Accuracy  = (50 + 35) / 100 = 0.85  (85 %)
Precision = 50 / (50 + 5)  ≈ 0.909 (90.9 %)
Recall    = 50 / (50 + 10) ≈ 0.833 (83.3 %)
F1-Score  = 2 × (50/55) × (50/60) / (50/55 + 50/60) ≈ 0.870 (87.0 %)

Notice that accuracy (85 %) looks decent, but recall is only 83 % — meaning 10 out of 60 sick patients were missed. In a medical context that gap matters far more than the accuracy headline.
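The arithmetic for the patient example can be checked with a few lines of Python. This is a sketch only: the function name `metrics_from_counts` is hypothetical, and it assumes every denominator is non-zero, which holds for these counts.

```python
def metrics_from_counts(tp, fp, fn, tn):
    """Compute the metrics from the table above (illustrative helper)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "precision": precision,
        "recall": recall,
        "specificity": tn / (tn + fp),
        "f1": 2 * precision * recall / (precision + recall),
    }


# Counts from the 100-patient screening example
m = metrics_from_counts(tp=50, fp=5, fn=10, tn=35)
for name, value in m.items():
    print(f"{name:<11}: {value:.3f}")
```

Specificity, which the step-by-step list omits, is 35 / 40 = 0.875. F1 comes out as 100 / 115 ≈ 0.870 because the function works with unrounded precision and recall.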

Building a Confusion Matrix in Python

Using scikit-learn

sklearn.metrics provides confusion_matrix() and a ready-made text report via classification_report().

from sklearn.metrics import confusion_matrix, classification_report

# Ground-truth labels and model predictions
y_true = [1, 1, 0, 0, 1, 0, 1, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0, 1, 0, 1, 1]

cm = confusion_matrix(y_true, y_pred)
print("Confusion Matrix:")
print(cm)
print()
print(classification_report(y_true, y_pred, target_names=["Negative", "Positive"]))

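confusion_matrix() returns a NumPy array in which rows are actual classes and columns are predicted classes. Classes appear in sorted label order, so with labels 0 and 1 the negative class comes first, which is the reverse of the positive-first table shown earlier. The array layout is: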

[[TN  FP]
 [FN  TP]]
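If you would rather have the array match the positive-first table from the start of the chapter, the `labels` argument of `confusion_matrix()` controls the class order. A small sketch using the same toy labels:

```python
from sklearn.metrics import confusion_matrix

y_true = [1, 1, 0, 0, 1, 0, 1, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0, 1, 0, 1, 1]

# Default: sorted label order (0, then 1) -> [[TN, FP], [FN, TP]]
cm_default = confusion_matrix(y_true, y_pred)

# Explicit order (1, then 0) -> [[TP, FN], [FP, TN]]
cm_positive_first = confusion_matrix(y_true, y_pred, labels=[1, 0])

print(cm_default)
print(cm_positive_first)
```

Note that unpacking `.ravel()` as `tn, fp, fn, tp` is only valid for the default order; with `labels=[1, 0]` the flattened order becomes `tp, fn, fp, tn`.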

Computing metrics manually

You can extract the four cells and compute metrics yourself to verify:

from sklearn.metrics import confusion_matrix

y_true = [1, 1, 0, 0, 1, 0, 1, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0, 1, 0, 1, 1]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

accuracy  = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall    = tp / (tp + fn)
f1        = 2 * precision * recall / (precision + recall)

print(f"TP={tp}, FP={fp}, FN={fn}, TN={tn}")

print(f"Accuracy : {accuracy:.3f}")
print(f"Precision: {precision:.3f}")
print(f"Recall   : {recall:.3f}")
print(f"F1-Score : {f1:.3f}")

Expected output:

TP=4, FP=2, FN=1, TN=3
Accuracy : 0.700
Precision: 0.667
Recall   : 0.800
F1-Score : 0.727

Visualizing with seaborn

A heatmap makes the confusion matrix easier to read, especially for multi-class problems:

import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import confusion_matrix

y_true = [1, 1, 0, 0, 1, 0, 1, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0, 1, 0, 1, 1]

cm = confusion_matrix(y_true, y_pred)

sns.heatmap(
    cm,
    annot=True,
    fmt="d",
    cmap="Blues",
    xticklabels=["Negative", "Positive"],
    yticklabels=["Negative", "Positive"],
)
plt.xlabel("Predicted label")
plt.ylabel("True label")
plt.title("Confusion Matrix")
plt.tight_layout()
plt.savefig("confusion_matrix.png", dpi=150)
plt.show()

annot=True prints the count inside each cell; fmt="d" formats them as integers.

Multi-Class Confusion Matrices

When there are more than two classes, the confusion matrix expands to an N×N grid. Each row still represents actual classes; each column represents predicted classes. The diagonal cells are correct predictions; off-diagonal cells are errors.

from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
import matplotlib.pyplot as plt

# Three classes: cat, dog, rabbit
y_true = ["cat", "dog", "rabbit", "cat", "dog", "rabbit",
          "cat", "dog", "cat", "rabbit"]
y_pred = ["cat", "dog", "rabbit", "dog", "dog", "cat",
          "cat", "rabbit", "cat", "rabbit"]

labels = ["cat", "dog", "rabbit"]
cm = confusion_matrix(y_true, y_pred, labels=labels)
print(cm)

disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=labels)
disp.plot(cmap="Blues")
plt.title("Multi-Class Confusion Matrix")
plt.tight_layout()
plt.savefig("cm_multiclass.png", dpi=150)
plt.show()

ConfusionMatrixDisplay (added in scikit-learn 0.22) is a convenient alternative to the seaborn heatmap. It requires no seaborn dependency, although it still relies on matplotlib.

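For multi-class problems, precision and recall are computed per class and then averaged. classification_report() prints the macro and weighted averages; for single-label problems it prints accuracy in place of the micro average, because the two are identical. The three averaging strategies are: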

macro — unweighted average across classes (treats all classes equally).
weighted — average weighted by support (number of true instances per class).
micro — aggregate TP/FP/FN across all classes before dividing (in single-label multi-class problems this equals overall accuracy, whether or not the classes are balanced).
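The three strategies can be compared directly through the `average` argument of `precision_score()`. The sketch below reuses the cat/dog/rabbit labels from the example above.

```python
from sklearn.metrics import accuracy_score, precision_score

y_true = ["cat", "dog", "rabbit", "cat", "dog", "rabbit",
          "cat", "dog", "cat", "rabbit"]
y_pred = ["cat", "dog", "rabbit", "dog", "dog", "cat",
          "cat", "rabbit", "cat", "rabbit"]
labels = ["cat", "dog", "rabbit"]

# average=None returns one precision value per class, in `labels` order
per_class = precision_score(y_true, y_pred, labels=labels, average=None)
macro = precision_score(y_true, y_pred, average="macro")
weighted = precision_score(y_true, y_pred, average="weighted")
micro = precision_score(y_true, y_pred, average="micro")
accuracy = accuracy_score(y_true, y_pred)

print("per class:", dict(zip(labels, per_class.round(3))))
print(f"macro={macro:.3f}  weighted={weighted:.3f}  micro={micro:.3f}")
print(f"accuracy={accuracy:.3f}")
```

Here the micro average coincides with accuracy (7 correct out of 10), while the macro average is slightly lower because the two smaller classes have lower precision than the largest one.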

Common Pitfalls

Accuracy is misleading on imbalanced datasets

Consider a dataset where 95 % of samples are negative. A model that always predicts negative achieves 95 % accuracy but has zero recall, because it never catches a positive case (its precision is undefined, since it makes no positive predictions at all). The confusion matrix reveals this immediately: every actual positive lands in the FN cell and the TP cell is zero.

Always pair accuracy with precision, recall, or F1-score on imbalanced data. See the Train/Test Split chapter for how to create a representative split, and the AUC-ROC Curve chapter for a threshold-independent evaluation metric.
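The always-negative model described above is easy to reproduce. This sketch builds a synthetic 95/5 split and passes `zero_division=0` to `precision_score()` so that the undefined 0/0 precision is reported as 0 rather than raising a warning.

```python
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    precision_score,
    recall_score,
)

# Synthetic data: 5 positives, 95 negatives; the "model" always says negative
y_true = [1] * 5 + [0] * 95
y_pred = [0] * 100

cm = confusion_matrix(y_true, y_pred)  # layout: [[TN, FP], [FN, TP]]
accuracy = accuracy_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)
precision = precision_score(y_true, y_pred, zero_division=0)

print(cm)
print(f"accuracy={accuracy:.2f}  recall={recall:.2f}  precision={precision:.2f}")
```

The second column of the matrix (predicted positive) is entirely zero, which is the visual signature of a model that never predicts the minority class.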

Choosing the wrong averaging strategy

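Macro averaging gives every class equal weight regardless of its support, so on highly imbalanced data a handful of rare-class samples can move the score as much as thousands of majority-class samples. Weighted averaging reflects the class distribution of the full dataset, but it can mask poor performance on rare classes. Use weighted when the score should track overall volume, and macro when every class matters equally.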

Forgetting normalization

Raw counts depend on dataset size and on how many samples each class has. When comparing models evaluated on datasets of different sizes, or classes with very different supports, normalize the matrix by dividing each row by its sum (pass normalize='true' to confusion_matrix()):

cm_normalized = confusion_matrix(y_true, y_pred, normalize="true")
print(cm_normalized.round(2))

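Each row now sums to 1.0. The diagonal gives the fraction of each actual class that was predicted correctly (the per-class recall), and the off-diagonal entries show where the remaining samples went.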
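As a sanity check on what row normalization means, the sketch below compares the built-in option with a manual division by row sums, using the cat/dog/rabbit labels from the multi-class example. It also confirms that the diagonal of the row-normalized matrix matches the per-class recall.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, recall_score

y_true = ["cat", "dog", "rabbit", "cat", "dog", "rabbit",
          "cat", "dog", "cat", "rabbit"]
y_pred = ["cat", "dog", "rabbit", "dog", "dog", "cat",
          "cat", "rabbit", "cat", "rabbit"]
labels = ["cat", "dog", "rabbit"]

cm = confusion_matrix(y_true, y_pred, labels=labels)

# Manual row normalization: divide each row by its own total
manual = cm / cm.sum(axis=1, keepdims=True)

# Built-in equivalent
builtin = confusion_matrix(y_true, y_pred, labels=labels, normalize="true")

# Per-class recall, in the same label order
per_class_recall = recall_score(y_true, y_pred, labels=labels, average=None)

print(builtin.round(2))
print("diagonal :", np.diag(builtin).round(2))
print("recall   :", per_class_recall.round(2))
```

The manual version assumes every class appears at least once in `y_true`; a class with no true samples would produce a division by zero.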

Confusion Matrix vs. Other Evaluation Tools

Tool	Best for
Confusion matrix	Understanding the specific types of errors a model makes
AUC-ROC Curve	Comparing classifiers across all decision thresholds
Cross-Validation	Estimating how reliably a model's metrics generalize to unseen data
Grid Search	Tuning hyperparameters using a chosen metric (e.g., F1-score)

Key Takeaways

A confusion matrix breaks predictions into TP, FP, TN, and FN — four counts that reveal which errors a model makes, not just how many.
Accuracy alone is insufficient; always check precision, recall, and F1-score, especially on imbalanced data.
Use sklearn.metrics.confusion_matrix() for computation and seaborn or ConfusionMatrixDisplay for visualization.
Multi-class matrices follow the same row = actual, column = predicted convention and scale to N×N.
Match the averaging strategy (macro, weighted, micro) to your dataset's class distribution.