Cross-Validation in Python with scikit-learn

Cross-validation is the standard way to estimate how well a machine learning model will perform on unseen data. Instead of relying on a single train/test split — which can produce an overly optimistic or pessimistic score depending on which samples happen to end up where — cross-validation trains and evaluates the model multiple times on different partitions of the data and averages the results.

This page covers:

Why a single train/test split is not enough
K-Fold cross-validation and how to choose k
Stratified K-Fold for imbalanced class distributions
Leave-One-Out (LOO) cross-validation for small datasets
Evaluating multiple metrics with cross_validate
Using a Pipeline inside cross-validation to prevent data leakage
Nested cross-validation for unbiased hyperparameter tuning

All examples use scikit-learn and the built-in Iris dataset, so you can run them immediately without downloading anything.

Why Cross-Validation Matters

A naive evaluation workflow splits the data once, trains on one part, and tests on the other. The score you get depends heavily on which samples landed in each part — a lucky split can make a weak model look good; an unlucky one can make a strong model look bad.

Cross-validation solves this by repeating the train/test process k times, each time using a different portion of the data as the test set. The final score is the average over all folds, which is far more stable than a single measurement.

Cross-validation also makes the most of limited data: every sample is used for both training and evaluation across the full experiment.

See the train/test split page for the simpler baseline technique that cross-validation improves upon.
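
To see how much a single split can swing, the sketch below scores the same model on ten different random train/test splits of Iris. The 80/20 ratio and the seed range are illustrative choices, not something the text prescribes.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

single_split_scores = []
for seed in range(10):
    # Same model, same data; only the random split changes
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=seed
    )
    model = LogisticRegression(max_iter=200).fit(X_train, y_train)
    single_split_scores.append(model.score(X_test, y_test))

print("Scores per split:", [round(s, 4) for s in single_split_scores])
print("Lowest: %.4f  Highest: %.4f"
      % (min(single_split_scores), max(single_split_scores)))
```

The gap between the lowest and highest score is the uncertainty you would be blind to if you reported only one of these splits.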

K-Fold Cross-Validation

K-Fold is the most widely used cross-validation strategy. The data is divided into k folds of (approximately) equal size. In each of the k iterations:

One fold is held out as the test set.
The remaining k - 1 folds form the training set.
The model is trained from scratch and scored on the test fold.

After k iterations you have k scores. Their mean is the cross-validated performance estimate; their standard deviation tells you how consistent that performance is across different data slices.

For k = 5 and a dataset of 150 samples, each fold contains 30 samples, so every iteration tests on 30 samples (20 %) and trains on the remaining 120 (80 %).
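
When the sample count is not a multiple of k, the folds cannot all be the same size. A small sketch, using a made-up array of 103 samples, shows how scikit-learn's KFold distributes the remainder: the first n % k folds receive one extra sample.

```python
import numpy as np
from sklearn.model_selection import KFold

# Hypothetical dataset: 103 samples, which is not divisible by 5
X_demo = np.arange(103).reshape(-1, 1)

kfold = KFold(n_splits=5)
fold_sizes = [len(test_idx) for _, test_idx in kfold.split(X_demo)]

print(fold_sizes)       # [21, 21, 21, 20, 20]
print(sum(fold_sizes))  # 103: every sample is tested exactly once
```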

Basic K-Fold Example

from sklearn.model_selection import KFold, cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_iris

# Load the built-in Iris dataset (150 samples, 4 features, 3 classes)
iris = load_iris()
X, y = iris.data, iris.target

model = LogisticRegression(max_iter=200)
kfold = KFold(n_splits=5, shuffle=True, random_state=42)

scores = cross_val_score(model, X, y, cv=kfold, scoring='accuracy')

print("Fold scores:", scores.round(4))
# Fold scores: [1.     1.     0.9333 0.9667 0.9667]

print("Mean accuracy: %.4f  Std: %.4f" % (scores.mean(), scores.std()))
# Mean accuracy: 0.9733  Std: 0.0249

The cross_val_score function handles all the iteration internally. Key parameters:

Parameter	Purpose
`estimator`	Any scikit-learn model (or Pipeline)
`X, y`	Feature matrix and target vector
`cv`	Cross-validator object or integer (e.g. `cv=5`)
`scoring`	Metric string, e.g. `'accuracy'`, `'f1_macro'`, `'roc_auc'` (binary targets; use `'roc_auc_ovr'` for multiclass data such as Iris)
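
As a small illustration of the cv and scoring parameters, the sketch below passes an integer for cv and switches the metric. Iris has three classes, so it uses 'roc_auc_ovr' (one-vs-rest AUC) rather than plain 'roc_auc', which expects a binary target.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=200)

# cv=5 is shorthand for a 5-fold splitter; for classifiers scikit-learn
# uses stratified folds (without shuffling) in this case
f1_scores = cross_val_score(model, X, y, cv=5, scoring='f1_macro')
auc_scores = cross_val_score(model, X, y, cv=5, scoring='roc_auc_ovr')

print("Macro F1 per fold:", f1_scores.round(4))
print("OvR AUC per fold: ", auc_scores.round(4))
```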

Choosing k

k = 5 or k = 10 is recommended for most datasets. These values give a good bias-variance tradeoff in the estimation.
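Larger k (e.g. 10) trains each model on more of the data, which lowers the bias of the estimate, but it can raise its variance and is more expensive to compute.
Smaller k (e.g. 3) is faster, but each model sees less training data, so the estimate tends to be pessimistic and more sensitive to how the data was divided.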
For very small datasets (fewer than ~100 samples) consider Leave-One-Out instead.
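
One way to get a feel for these tradeoffs is to run the same model with several values of k and compare the mean and spread. The values 3, 5, and 10 below are simply the ones mentioned above; the exact numbers will depend on the data and the seed.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=200)

results = {}
for k in (3, 5, 10):
    kfold = KFold(n_splits=k, shuffle=True, random_state=42)
    scores = cross_val_score(model, X, y, cv=kfold, scoring='accuracy')
    results[k] = scores
    # More folds means more model fits but a larger training set per fit
    print("k=%2d  train size=%3d  mean=%.4f  std=%.4f"
          % (k, len(X) - len(X) // k, scores.mean(), scores.std()))
```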

Inspecting the Folds Manually

You can iterate over the folds yourself when you need to inspect what goes into each split or when you want to perform custom logic per fold:

from sklearn.model_selection import KFold
from sklearn.datasets import load_iris

iris = load_iris()
X, y = iris.data, iris.target

kfold = KFold(n_splits=5, shuffle=True, random_state=42)

for fold, (train_idx, test_idx) in enumerate(kfold.split(X), start=1):
    X_train, X_test = X[train_idx], X[test_idx]
    y_train, y_test = y[train_idx], y[test_idx]
    print(f"Fold {fold}: train={len(train_idx)} samples, test={len(test_idx)} samples")

Output:

Fold 1: train=120 samples, test=30 samples
Fold 2: train=120 samples, test=30 samples
Fold 3: train=120 samples, test=30 samples
Fold 4: train=120 samples, test=30 samples
Fold 5: train=120 samples, test=30 samples
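
The manual loop becomes useful once you do something per fold. As a sketch, the code below fits and scores the model by hand in each fold and checks that the result agrees with cross_val_score for the same splitter; clone is used so that every fold starts from an unfitted copy of the model.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=200)
kfold = KFold(n_splits=5, shuffle=True, random_state=42)

manual_scores = []
for train_idx, test_idx in kfold.split(X):
    fold_model = clone(model)  # fresh, unfitted copy for this fold
    fold_model.fit(X[train_idx], y[train_idx])
    manual_scores.append(fold_model.score(X[test_idx], y[test_idx]))

auto_scores = cross_val_score(model, X, y, cv=kfold, scoring='accuracy')
print(np.allclose(manual_scores, auto_scores))  # True
```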

Stratified K-Fold Cross-Validation

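Without shuffling, plain K-Fold splits the data in index order, and even with shuffling it ignores the class labels. With imbalanced class distributions (or data stored sorted by class, as Iris is) this can leave some folds with very few or no examples of a class, making the score unreliable.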

Stratified K-Fold ensures that each fold contains approximately the same proportion of each class as the whole dataset. Use StratifiedKFold whenever your target is categorical:

from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_iris

iris = load_iris()
X, y = iris.data, iris.target

model = LogisticRegression(max_iter=200)
skfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

scores = cross_val_score(model, X, y, cv=skfold, scoring='accuracy')

print("Fold scores:", scores.round(4))
# Fold scores: [1.     0.9667 0.9333 1.     0.9333]

print("Mean accuracy: %.4f  Std: %.4f" % (scores.mean(), scores.std()))
# Mean accuracy: 0.9667  Std: 0.0298

When cv is left at its default or given as an integer, GridSearchCV, RandomizedSearchCV, and cross_val_score use StratifiedKFold (without shuffling) for classifiers with binary or multiclass targets, so you get stratification automatically in those contexts.
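
The effect of stratification is easy to see by counting the classes in each test fold. Iris is stored sorted by class (50 samples each), so the sketch below compares unshuffled KFold with StratifiedKFold; np.bincount is used here only as a convenient counter.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold, StratifiedKFold

X, y = load_iris(return_X_y=True)

def class_counts(splitter):
    """Return the per-class sample counts of every test fold."""
    return [np.bincount(y[test_idx], minlength=3).tolist()
            for _, test_idx in splitter.split(X, y)]

plain_counts = class_counts(KFold(n_splits=5))
strat_counts = class_counts(StratifiedKFold(n_splits=5))

print("KFold:", plain_counts)
# KFold: [[30, 0, 0], [20, 10, 0], [0, 30, 0], [0, 10, 20], [0, 0, 30]]
print("StratifiedKFold:", strat_counts)
# StratifiedKFold: [[10, 10, 10], [10, 10, 10], [10, 10, 10], [10, 10, 10], [10, 10, 10]]
```

With unshuffled KFold, several test folds contain classes the model barely saw during training, which is the failure mode stratification avoids.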

Leave-One-Out Cross-Validation

Leave-One-Out (LOO) cross-validation is the extreme case: k equals the number of samples. In each iteration one sample is the test set and all remaining samples form the training set. For a 150-sample dataset that means 150 training/evaluation cycles.

from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_iris

iris = load_iris()
X, y = iris.data, iris.target

model = LogisticRegression(max_iter=200)
loocv = LeaveOneOut()

scores = cross_val_score(model, X, y, cv=loocv, scoring='accuracy')

print(f"Number of folds: {len(scores)}")        # 150
print(f"Mean accuracy: {scores.mean():.4f}")     # 0.9667
print(f"Std deviation: {scores.std():.4f}")      # 0.1795

When to use LOO:

Your dataset has fewer than ~100 samples and you cannot afford to reserve any data for testing.
You want the lowest-bias estimate of model performance.

Drawbacks of LOO:

Very high computational cost — the model is retrained n times.
High variance in the estimate: with accuracy as the metric, each fold's test score is either 0 or 1 because it is computed on a single sample, so the standard deviation across folds says little about the model's stability.

For most datasets K-Fold with k=5 or k=10 is a better tradeoff.
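
Two quick checks make these points concrete. The sketch below first confirms on a toy array that LeaveOneOut produces the same splits as KFold with k equal to the number of samples, then reproduces the standard deviation reported above from nothing but the mean: a mean of 0.9667 over 150 folds implies 145 scores of 1 and 5 scores of 0.

```python
import numpy as np
from sklearn.model_selection import KFold, LeaveOneOut

# Toy data: 6 samples are enough to compare the two splitters
X_toy = np.arange(6).reshape(-1, 1)

loo_tests = [test.tolist() for _, test in LeaveOneOut().split(X_toy)]
kfold_tests = [test.tolist()
               for _, test in KFold(n_splits=len(X_toy)).split(X_toy)]
print(loo_tests == kfold_tests)  # True

# Per-fold accuracy under LOO is 0 or 1, so the spread follows from the mean
loo_scores = np.array([1.0] * 145 + [0.0] * 5)
p = loo_scores.mean()
print(round(p, 4))                     # 0.9667
print(round(loo_scores.std(), 4))      # 0.1795
print(round(np.sqrt(p * (1 - p)), 4))  # 0.1795
```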

Evaluating Multiple Metrics at Once

cross_val_score can only compute one metric per call. Use cross_validate to compute several metrics simultaneously and also retrieve training scores to check for overfitting:

from sklearn.model_selection import KFold, cross_validate
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_iris

iris = load_iris()
X, y = iris.data, iris.target

model = LogisticRegression(max_iter=200)
kfold = KFold(n_splits=5, shuffle=True, random_state=42)

cv_results = cross_validate(
    model, X, y,
    cv=kfold,
    scoring=['accuracy', 'f1_macro'],
    return_train_score=True,
)

print("Test accuracy: ", cv_results['test_accuracy'].round(4))
# Test accuracy:  [1.     1.     0.9333 0.9667 0.9667]

print("Train accuracy:", cv_results['train_accuracy'].round(4))
# Train accuracy: [0.975  0.9583 0.9833 0.975  0.9833]

print("Test F1-macro: ", cv_results['test_f1_macro'].round(4))
# Test F1-macro:  [1.     1.     0.9259 0.9691 0.971 ]

Comparing train and test scores across folds is a quick way to spot overfitting: if train accuracy is consistently much higher than test accuracy the model is memorising the training data. See the bias and variance discussion for more background.
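
cross_validate returns a plain dictionary of arrays, one entry per metric and split, plus fit and score timings. A minimal sketch of summarising it into mean train/test scores and their gap might look like this; the summary layout is just one possible choice.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_validate

X, y = load_iris(return_X_y=True)
metrics = ['accuracy', 'f1_macro']

cv_results = cross_validate(
    LogisticRegression(max_iter=200), X, y,
    cv=KFold(n_splits=5, shuffle=True, random_state=42),
    scoring=metrics,
    return_train_score=True,
)

# Keys: fit_time, score_time, and test_<metric> / train_<metric> per metric
print(sorted(cv_results))

gaps = {}
for name in metrics:
    train_mean = cv_results['train_' + name].mean()
    test_mean = cv_results['test_' + name].mean()
    gaps[name] = train_mean - test_mean  # large positive gap hints at overfitting
    print("%-9s train=%.4f  test=%.4f  gap=%+.4f"
          % (name, train_mean, test_mean, gaps[name]))
```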

Using a Pipeline Inside Cross-Validation

A common mistake is to fit preprocessing steps (like feature scaling or imputation) on the entire dataset before cross-validation. This leaks information from the test fold into the training process, leading to an overly optimistic score.

The correct pattern is to wrap preprocessing and the model together in a Pipeline and pass the pipeline to cross_val_score. scikit-learn refits the entire pipeline — scaler included — independently inside each fold:

from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.datasets import load_iris

iris = load_iris()
X, y = iris.data, iris.target

# Correct: preprocessing is fitted only on training folds
pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('clf', LogisticRegression(max_iter=200)),
])

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(pipe, X, y, cv=skf, scoring='accuracy')

print("Fold scores:", scores.round(4))
# Fold scores: [1.     0.9667 0.9    1.     0.9   ]

print("Mean accuracy: %.4f  Std: %.4f" % (scores.mean(), scores.std()))
# Mean accuracy: 0.9533  Std: 0.0452

Always use a Pipeline when your workflow includes any step that learns from the data (scaling, PCA, encoding, imputation).
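
One way to convince yourself that the scaler really is refitted per fold is to ask cross_validate to return the fitted pipelines (return_estimator=True) and look at the mean each scaler learned. This is a sketch for inspection only; the step names 'scaler' and 'clf' follow the example above.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('clf', LogisticRegression(max_iter=200)),
])
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

cv_results = cross_validate(pipe, X, y, cv=skf, return_estimator=True)

print("Full-data mean:", X.mean(axis=0).round(3))
fold_means = []
for i, fitted_pipe in enumerate(cv_results['estimator'], start=1):
    # mean_ was learned from this fold's 120 training samples only
    mean_ = fitted_pipe.named_steps['scaler'].mean_
    fold_means.append(mean_)
    print(f"Fold {i} scaler mean:", mean_.round(3))
```

Each fold's scaler statistics come from its own training samples, so nothing computed from the held-out fold reaches the model before scoring.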

Nested Cross-Validation

When you use cross-validation to both tune hyperparameters and evaluate model performance on the same data, you risk overfitting to the validation folds — the chosen hyperparameters are the ones that happened to score best on those particular partitions, so the reported score is optimistic.

Nested cross-validation separates the two concerns:

Inner loop: select hyperparameters via grid search on training folds.
Outer loop: evaluate the best model found by the inner loop on a held-out test fold.

from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_iris

iris = load_iris()
X, y = iris.data, iris.target

# Inner CV: hyperparameter selection
inner_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=2)
param_grid = {'C': [0.01, 0.1, 1, 10]}
gs = GridSearchCV(
    LogisticRegression(max_iter=300),
    param_grid,
    cv=inner_cv,
    scoring='accuracy',
)

# Outer CV: unbiased performance estimation
outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)
nested_scores = cross_val_score(gs, X, y, cv=outer_cv, scoring='accuracy')

print("Nested CV fold scores:", nested_scores.round(4))
# Nested CV fold scores: [0.9667 1.     0.9333 1.     0.9   ]

print("Mean accuracy: %.4f" % nested_scores.mean())
# Mean accuracy: 0.9600

The nested approach gives a nearly unbiased estimate of how the whole procedure, hyperparameter tuning included, will perform on new data. Use it whenever you are reporting results in a research context or comparing algorithms. For a plain deployment workflow where you will retrain on all available data anyway, a single (non-nested) GridSearchCV run is usually sufficient, as long as you treat its best score as somewhat optimistic. See the grid search page for a detailed walkthrough of hyperparameter tuning.
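
Because the inner search runs once per outer fold, different outer folds may settle on different hyperparameters. A sketch of how to inspect that, relying on cross_validate with return_estimator=True (the grid and seeds mirror the example above):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_validate

X, y = load_iris(return_X_y=True)

inner_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=2)
outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)
param_grid = {'C': [0.01, 0.1, 1, 10]}

gs = GridSearchCV(LogisticRegression(max_iter=300), param_grid,
                  cv=inner_cv, scoring='accuracy')

nested = cross_validate(gs, X, y, cv=outer_cv, scoring='accuracy',
                        return_estimator=True)

chosen_C = []
for i, fitted_search in enumerate(nested['estimator'], start=1):
    # Each outer fold holds its own fitted GridSearchCV
    chosen_C.append(fitted_search.best_params_['C'])
    print(f"Outer fold {i}: best C={chosen_C[-1]}, "
          f"outer test accuracy={nested['test_score'][i - 1]:.4f}")
```

If the selected values differ widely between outer folds, that is a hint that the tuning step itself is unstable on this amount of data.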

Common Pitfalls

Preprocessing outside the fold

Fitting a scaler on the full dataset before calling cross_val_score — rather than inside a Pipeline — leaks test-fold statistics into training. The solution is always to use a Pipeline.

Using `random_state` incorrectly

If you set shuffle=True without a random_state, each run produces a different split and your results are not reproducible. Always set random_state to a fixed integer when reporting numbers.
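
A short check of this behaviour on a toy array: two KFold objects built with the same integer random_state yield identical splits, which is what makes reported numbers repeatable. The helper name fold_test_indices is made up for this sketch.

```python
import numpy as np
from sklearn.model_selection import KFold

X_toy = np.arange(20).reshape(-1, 1)

def fold_test_indices(random_state):
    """Return the test indices of each fold for a shuffled 5-fold split."""
    kfold = KFold(n_splits=5, shuffle=True, random_state=random_state)
    return [test_idx.tolist() for _, test_idx in kfold.split(X_toy)]

same_seed = fold_test_indices(42) == fold_test_indices(42)
print(same_seed)  # True: a fixed seed reproduces the split exactly
```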

Interpreting standard deviation

A high standard deviation across folds is not always bad — it may reflect genuine variability in the dataset (e.g. some folds are easier than others). Look at individual fold scores before drawing conclusions.

Cross-validating on time-series data

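K-Fold ignores temporal order: with shuffle=True it mixes future observations into the training folds, and even without shuffling most test folds are followed by later data that the model trains on. Use TimeSeriesSplit from scikit-learn instead, which only ever trains on samples that precede the test fold.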
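
A minimal sketch of TimeSeriesSplit on a toy sequence of 12 time-ordered samples (the array is made up for illustration) shows the expanding training window: every training index precedes every test index.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Toy series: index position stands in for time
X_series = np.arange(12).reshape(-1, 1)

tscv = TimeSeriesSplit(n_splits=3)
splits = [(train.tolist(), test.tolist())
          for train, test in tscv.split(X_series)]

for train, test in splits:
    print("train:", train, "test:", test)
# train: [0, 1, 2] test: [3, 4, 5]
# train: [0, 1, 2, 3, 4, 5] test: [6, 7, 8]
# train: [0, 1, 2, 3, 4, 5, 6, 7, 8] test: [9, 10, 11]
```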

Quick Reference

Technique	When to use	scikit-learn class
K-Fold	Default choice for most regression/classification tasks	`KFold`
Stratified K-Fold	Classification with imbalanced classes	`StratifiedKFold`
Leave-One-Out	Very small datasets (< ~100 samples)	`LeaveOneOut`
Nested CV	Reporting unbiased scores with hyperparameter tuning	`GridSearchCV` inside `cross_val_score`
Time-series CV	Data with temporal ordering	`TimeSeriesSplit`

Related Pages

Train/Test Split: the simpler baseline that cross-validation improves upon
Grid Search: hyperparameter tuning, commonly paired with cross-validation
Logistic Regression: the classifier used in the examples above
Linear Regression: regression counterpart, also evaluated with cross-validation
Confusion Matrix: per-class performance breakdown to complement accuracy scores
AUC-ROC Curve: another evaluation metric you can pass to scoring in cross_val_score