Mean, Median, and Mode in Python — With Examples

Mean, median, and mode are the three fundamental measures of central tendency in statistics. They each describe the "center" of a dataset in a different way, and knowing which one to use — and when — is one of the first practical skills you need for machine learning data preparation.

This chapter covers:

What each measure means and how it is calculated
How to compute them in Python with numpy and the statistics module
When to prefer one measure over another
How to use them to fill in missing values (imputation)

What Are Mean, Median, and Mode?

All three measures summarize a dataset with a single representative value, but they capture different aspects of the distribution:

Measure	Definition	Best for
Mean	Sum of all values ÷ count	Symmetric, normally distributed data
Median	Middle value when sorted	Skewed data or data with outliers
Mode	Most frequently occurring value	Categorical data or discrete counts

The shape of your data's distribution (symmetric, skewed, or dominated by a few repeated values) determines which measure is most appropriate for your dataset.

Mean

The mean (arithmetic average) adds all values together and divides by the number of values.

Formula: mean = (x₁ + x₂ + … + xₙ) / n

Use numpy.mean() to calculate it in Python:

Find the mean of a list using numpy
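
The interactive example for this step is not preserved here, so the following is a minimal sketch of what such a call could look like. The sample values are made up for illustration, and the manual `sum / len` line is included only to show that `numpy.mean()` matches the formula above.

```python
import numpy as np

# Illustrative values (an assumption, not data from the original example).
values = [4, 8, 15, 16, 23, 42]

mean_np = np.mean(values)                # NumPy's arithmetic mean
mean_manual = sum(values) / len(values)  # the formula applied by hand: 108 / 6

print(mean_np)      # 18.0
print(mean_manual)  # 18.0
```

Both lines print the same number because `numpy.mean()` computes exactly the sum divided by the count.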


When to use the mean

The mean works well when your data has no extreme outliers and follows a roughly symmetric distribution. When outliers are present, they pull the mean toward them and make it a poor representative of the "typical" value.

Mean vs. median with an outlier

import numpy as np

salaries = [40000, 42000, 45000, 48000, 50000, 300000]
print(f"Mean:   {np.mean(salaries):.0f}")    # Output: 87500
print(f"Median: {np.median(salaries):.0f}")  # Output: 46500

Here the mean is 87 500 — much higher than five of the six salaries — because one extreme value (300 000) skews it upward. The median (46 500) better represents what a typical employee earns.

Median

The median is the middle value of a sorted dataset.

Odd number of values: the middle element.
Even number of values: the average of the two middle elements.
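
The odd/even rule can be written out by hand in a few lines. This is only a sketch to make the definition concrete; `median_by_hand` is a hypothetical helper name, and in practice a library function would normally be used instead.

```python
def median_by_hand(values):
    """Return the median using the odd/even rule (hypothetical helper)."""
    if not values:
        raise ValueError("median of an empty dataset is undefined")
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # Odd count: the single middle element.
        return ordered[mid]
    # Even count: average of the two middle elements.
    return (ordered[mid - 1] + ordered[mid]) / 2

odd_result = median_by_hand([7, 1, 5])      # sorted: [1, 5, 7]
even_result = median_by_hand([7, 1, 5, 3])  # sorted: [1, 3, 5, 7]

print(odd_result)   # 5
print(even_result)  # 4.0
```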

Use numpy.median():

Find the median of a list using numpy
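
The interactive example for this step is not preserved here. A minimal sketch, using made-up values, might look like the following; note that the input does not need to be sorted beforehand.

```python
import numpy as np

# Illustrative, unsorted values (an assumption, not data from the original example).
values = [23, 4, 42, 8, 16, 15, 9]

# Sorted order is [4, 8, 9, 15, 16, 23, 42]; the middle (4th) element is 15.
median_val = np.median(values)

print(median_val)  # 15.0
```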


Even-length dataset — median averages the two middle values

import numpy as np

data_even = [1, 3, 5, 7]
print(np.median(data_even))  # Output: 4.0  (average of 3 and 5)

When to use the median

The median is the preferred measure of central tendency whenever your data is skewed or contains outliers. It depends only on the middle of the sorted data, so making an extreme value even more extreme does not move it. Income, house prices, and age distributions are classic examples where the median is more informative than the mean.
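
One way to see this robustness is to make the outlier ten times larger and compare both statistics before and after. The sketch below reuses the salary list from the mean section; the second list is a hypothetical variation created only for this comparison.

```python
import numpy as np

salaries = [40000, 42000, 45000, 48000, 50000, 300000]
# Hypothetical variation: same data, but the outlier is ten times larger.
salaries_extreme = [40000, 42000, 45000, 48000, 50000, 3000000]

mean_before, mean_after = np.mean(salaries), np.mean(salaries_extreme)
median_before, median_after = np.median(salaries), np.median(salaries_extreme)

print(f"Mean:   {mean_before:.0f} -> {mean_after:.0f}")      # 87500 -> 537500
print(f"Median: {median_before:.0f} -> {median_after:.0f}")  # 46500 -> 46500
```

The mean grows more than sixfold, while the median does not change at all, because the two middle values (45000 and 48000) are the same in both lists.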

Mode

The mode is the value that appears most often in a dataset. A dataset can have:

No mode — if all values appear the same number of times.
One mode (unimodal) — the most common case.
Multiple modes (multimodal) — two or more values tie for highest frequency.
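
These three cases can be told apart by counting how often each value occurs. The following is a rough sketch using `collections.Counter`; `describe_modes` is a hypothetical helper, and it follows the convention above that a dataset in which every distinct value is equally frequent has no mode (library functions such as `statistics.multimode()` would instead return every value in that case).

```python
from collections import Counter

def describe_modes(values):
    """Classify a dataset as 'no mode', 'unimodal', or 'multimodal' (hypothetical helper)."""
    if not values:
        raise ValueError("mode of an empty dataset is undefined")
    counts = Counter(values)
    top = max(counts.values())
    # Values that reach the highest frequency, in first-seen order.
    modes = [value for value, count in counts.items() if count == top]
    if len(counts) > 1 and len(modes) == len(counts):
        return "no mode", []      # every distinct value is equally frequent
    if len(modes) == 1:
        return "unimodal", modes
    return "multimodal", modes

print(describe_modes([1, 2, 3]))        # ('no mode', [])
print(describe_modes([1, 2, 2, 3]))     # ('unimodal', [2])
print(describe_modes([1, 1, 2, 2, 3]))  # ('multimodal', [1, 2])
```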

Use statistics.mode() from the standard library:

Find the mode of a list using the statistics module
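
The interactive example for this step is not preserved here. A minimal sketch with made-up values, chosen so that there is a single clear winner, could look like this:

```python
import statistics

# Illustrative values (an assumption): 42 occurs three times, 40 twice.
shoe_sizes = [42, 40, 42, 43, 41, 42, 40]

mode_val = statistics.mode(shoe_sizes)

print(mode_val)  # 42
```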


Handling multimodal data

statistics.mode() raises a StatisticsError in Python 3.7 and earlier when there is a tie. In Python 3.8+ it returns the first mode encountered. To safely retrieve all modes, use statistics.multimode(), which was added in Python 3.8:

Find all modes when data has multiple peaks

import statistics

votes = [1, 1, 2, 2, 3]
print(statistics.multimode(votes))  # Output: [1, 2]

Mode for numeric data

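The mode is most natural for categorical or discrete integer data. It also works for numeric data as long as some values repeat; truly continuous measurements rarely repeat exactly, so the mode says little about them: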

import statistics

scores = [10, 20, 20, 30, 40]
print(statistics.mode(scores))  # Output: 20

When to use the mode

Use the mode when working with categorical features (colors, labels, product categories) or when you need to know the most popular item — for example, the most common defect type in a quality-control dataset.
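
As a sketch of the quality-control case, `statistics.mode()` accepts strings as well as numbers. The defect labels below are invented for illustration, and `collections.Counter` is added only to show the full frequency breakdown alongside the mode.

```python
import statistics
from collections import Counter

# Hypothetical defect log; the labels are made up for this example.
defects = ["scratch", "dent", "scratch", "crack", "scratch", "dent"]

most_common_defect = statistics.mode(defects)
defect_counts = Counter(defects).most_common()  # (label, count) pairs, most frequent first

print(most_common_defect)  # scratch
print(defect_counts)       # [('scratch', 3), ('dent', 2), ('crack', 1)]
```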

Comparing All Three Measures

The example below shows how mean, median, and mode diverge on a skewed dataset. A single much older employee pulls the mean up, while the median and mode stay close to where most of the data actually sits:

Compare mean, median, and mode on skewed data

import numpy as np
import statistics

ages = [22, 23, 24, 24, 25, 25, 25, 26, 60]

print(f"Mean:   {np.mean(ages):.1f}")         # Output: 28.2
print(f"Median: {np.median(ages):.1f}")       # Output: 25.0
print(f"Mode:   {statistics.mode(ages)}")     # Output: 25

The mean (28.2) is pulled up by the single 60-year-old. The median and mode (both 25) accurately represent the typical employee.

Using Mean and Median to Impute Missing Values

A common preprocessing step before training a model is to replace missing values (NaN) with a representative statistic. This is called imputation.

Mean imputation — replace NaN with the column average. Fast, but sensitive to outliers.
Median imputation — replace NaN with the median. Robust to outliers; preferred for skewed features.
Mode imputation — replace NaN with the most frequent value. Appropriate for categorical columns.

Mean imputation with numpy

import numpy as np

data = [10.0, 20.0, float('nan'), 40.0, 50.0]
mean_val = float(np.nanmean(data))   # ignores NaN: (10+20+40+50)/4 = 30.0; float() gives a plain Python number
imputed = [mean_val if np.isnan(x) else x for x in data]
print(imputed)
# Output: [10.0, 20.0, 30.0, 40.0, 50.0]
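
The other two strategies follow the same pattern. The sketch below assumes a skewed numeric column with one outlier and a categorical column where missing entries are stored as `None`; both datasets are invented for illustration.

```python
import math
import statistics
import numpy as np

# Median imputation for a skewed numeric column (400.0 is an outlier).
incomes = [30.0, 32.0, float("nan"), 35.0, 400.0]
median_val = float(np.nanmedian(incomes))  # ignores NaN: median of [30, 32, 35, 400]
incomes_imputed = [median_val if math.isnan(x) else x for x in incomes]
print(incomes_imputed)  # [30.0, 32.0, 33.5, 35.0, 400.0]

# Mode imputation for a categorical column (missing entries assumed to be None).
colors = ["red", None, "blue", "red", None]
mode_val = statistics.mode([c for c in colors if c is not None])
colors_imputed = [mode_val if c is None else c for c in colors]
print(colors_imputed)   # ['red', 'red', 'blue', 'red', 'red']
```

For comparison, mean imputation on the same income column would fill the gap with 124.25, which is larger than every value except the outlier.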

In production code you would typically use sklearn.impute.SimpleImputer, which integrates cleanly into scikit-learn pipelines and applies the same fitted statistics to both training and test sets.
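
A minimal sketch of that workflow, assuming scikit-learn is installed, is shown below. The tiny training and test arrays are made up; the point is that the median is learned once from the training data and then reused unchanged on the test data.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical single-feature datasets; SimpleImputer expects 2D input.
X_train = np.array([[10.0], [20.0], [np.nan], [40.0], [50.0]])
X_test = np.array([[np.nan], [100.0]])

imputer = SimpleImputer(strategy="median")
X_train_filled = imputer.fit_transform(X_train)  # learns the median (30.0) from training data
X_test_filled = imputer.transform(X_test)        # reuses that value, does not recompute it

print(imputer.statistics_)     # [30.]
print(X_test_filled.ravel())   # [ 30. 100.]
```

Other values of `strategy` include `"mean"` and `"most_frequent"`, which correspond to the mean and mode imputation described above.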

Quick Reference: Which Measure to Choose

Scenario	Recommended measure
Normally distributed numeric data	Mean
Skewed numeric data (income, prices)	Median
Data with extreme outliers	Median
Categorical data (labels, colors)	Mode
Imputing numeric columns with outliers	Median
Imputing categorical columns	Mode
Finding the most popular value	Mode

Related Topics

Data Distribution: understand normal, skewed, and uniform distributions before choosing a measure.
Standard Deviation: measure how spread out your data is around the mean.
Percentile: rank values relative to the rest of the dataset.
Scale: feature scaling techniques that build on these statistics.

Conclusion

Mean, median, and mode each capture a different aspect of your data's center. The mean is the most common default but is fragile in the presence of outliers. The median is robust and should be your first choice for skewed distributions. The mode is indispensable for categorical data and for quick "most common value" queries. In machine learning, all three appear regularly in exploratory data analysis and missing-value imputation, and choosing the right one for each column tends to give cleaner features and can improve model performance.
