Linear Regression in Python with scikit-learn

Linear regression is one of the most fundamental algorithms in machine learning. It models the relationship between a dependent variable (what you want to predict) and one or more independent variables (the inputs) by fitting a straight line — or a hyperplane — through the data.

This page covers:

How simple and multiple linear regression work mathematically
The ordinary least squares (OLS) method for fitting a line
Key assumptions you must check before trusting your model
A complete scikit-learn walkthrough: load data, train, evaluate, and interpret results
How to read model coefficients and spot common pitfalls

How Linear Regression Works

The Equation

Simple linear regression (one input feature) fits this line:

y = β₀ + β₁x + ε

y — the dependent variable (target)
x — the independent variable (feature)
β₀ — the intercept (value of y when x = 0)
β₁ — the slope (change in y for a one-unit increase in x)
ε — the error term (noise the model cannot explain)

Multiple linear regression extends this to n features:

y = β₀ + β₁x₁ + β₂x₂ + ... + βₙxₙ + ε

Each coefficient βᵢ tells you how much y changes when xᵢ increases by one unit, holding all other features constant.
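To make the "holding all other features constant" reading concrete, here is a minimal sketch that evaluates the equation by hand. The coefficient values and the `predict` helper are hypothetical, chosen only for illustration, and are not fitted to any data.

```python
import numpy as np

# Hypothetical coefficients for a three-feature model (not fitted to real data)
beta_0 = 1.5
beta = np.array([0.4, -0.2, 2.0])

def predict(x):
    """Return beta_0 + beta_1*x_1 + ... + beta_n*x_n for one observation."""
    return beta_0 + np.dot(beta, x)

x = np.array([10.0, 5.0, 1.0])
x_plus_one = x + np.array([0.0, 1.0, 0.0])   # raise only the second feature by one unit

change = predict(x_plus_one) - predict(x)
print(f"Prediction:            {predict(x):.2f}")
print(f"Change in prediction:  {change:.2f}")
print(f"Second coefficient:    {beta[1]:.2f}")
```

The change in the prediction equals the second coefficient, which is exactly what the coefficient interpretation states.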

Ordinary Least Squares (OLS)

The model learns the coefficients by minimizing the sum of squared residuals — the difference between each actual value yᵢ and the model's prediction ŷᵢ:

SSR = Σ(yᵢ - ŷᵢ)²

Squaring the residuals penalizes large errors more than small ones and ensures that positive and negative errors do not cancel out. This criterion has an exact closed-form solution, which is why linear regression trains almost instantly even on large datasets.
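As a rough sketch of that closed-form solution, the snippet below solves the normal equations (XᵀX)β = Xᵀy with NumPy and compares the result against NumPy's least-squares solver. The synthetic data, seed, and variable names are assumptions made for illustration.

```python
import numpy as np

# Synthetic data (illustrative): y = 3 + 2*x1 - 1*x2 + noise
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = 3.0 + 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

# Prepend a column of ones so the first coefficient is the intercept
X_design = np.column_stack([np.ones(len(X)), X])

# Normal equations: solve (X^T X) beta = X^T y
beta_normal = np.linalg.solve(X_design.T @ X_design, X_design.T @ y)

# Least-squares solver, which avoids forming X^T X explicitly
beta_lstsq, *_ = np.linalg.lstsq(X_design, y, rcond=None)

print("Normal equations:", np.round(beta_normal, 3))
print("lstsq:           ", np.round(beta_lstsq, 3))
```

Both routes should agree to numerical precision on well-conditioned data like this. Library implementations generally prefer a least-squares solver over the explicit normal equations because it behaves better when features are nearly collinear.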

Key Assumptions

Coefficient estimates, their standard errors, and predictions from linear regression are trustworthy only to the extent that these conditions hold:

Assumption	What to check
Linearity	The relationship between features and target is approximately linear
Independence	Observations are independent of each other
Homoscedasticity	The variance of residuals is roughly constant across all predictions
Normality of residuals	Residuals are approximately normally distributed
No multicollinearity	Independent variables are not highly correlated with each other

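The consequences differ by assumption. A non-linear relationship biases the coefficients and hurts predictions on unseen data. Non-constant variance, dependent observations, or non-normal residuals mainly make standard errors and confidence intervals unreliable. Multicollinearity leaves predictions largely intact but makes individual coefficients unstable.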
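One quick numerical check for the constant-variance assumption is to compare the spread of residuals at low and high predicted values. The sketch below does this on synthetic data whose noise deliberately grows with x; the data and the split at the median are illustrative choices, not a formal statistical test.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
x = rng.uniform(1, 10, 500)
# Noise scale grows with x, deliberately violating homoscedasticity
y = 2.0 + 0.5 * x + rng.normal(0, 0.2 * x)

model = LinearRegression().fit(x.reshape(-1, 1), y)
pred = model.predict(x.reshape(-1, 1))
resid = y - pred

# Split residuals by whether the prediction is below or above the median
low = resid[pred <= np.median(pred)]
high = resid[pred > np.median(pred)]
print(f"Residual std (low predictions):  {low.std():.3f}")
print(f"Residual std (high predictions): {high.std():.3f}")
```

A clearly larger spread in one half suggests heteroscedasticity and is a cue to look at a residual plot.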

Simple Linear Regression Example

Before moving to multiple features, let's see how the algorithm fits a line to a single feature. This makes the geometry easy to visualize.

import numpy as np
import matplotlib
matplotlib.use('Agg')  # non-interactive backend for scripts
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

# Simulate: house size (sq ft) vs price ($1000s)
rng = np.random.default_rng(42)
X_simple = rng.uniform(500, 3000, 50).reshape(-1, 1)
y_simple = 50 + 0.1 * X_simple.ravel() + rng.normal(0, 15, 50)

model = LinearRegression()
model.fit(X_simple, y_simple)

print(f"Intercept (β₀): {model.intercept_:.2f}")
print(f"Slope    (β₁): {model.coef_[0]:.4f}")
print(f"Interpretation: each extra sq ft adds ${model.coef_[0]*1000:.0f} to the predicted price")

Expected output:

Intercept (β₀): 46.17
Slope    (β₁): 0.1007
Interpretation: each extra sq ft adds $101 to the predicted price

OLS recovers values close to the true parameters used in the simulation (intercept 50, slope 0.1) without any manual algebra. The remaining gap comes from the random noise added to y.
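To see the geometry, the fitted line can be drawn over the simulated points. This is a minimal sketch that regenerates the same data so it runs on its own; the output filename simple_fit.png is an arbitrary choice.

```python
import numpy as np
import matplotlib
matplotlib.use('Agg')  # non-interactive backend for scripts
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

# Same simulated data as the example above
rng = np.random.default_rng(42)
X_simple = rng.uniform(500, 3000, 50).reshape(-1, 1)
y_simple = 50 + 0.1 * X_simple.ravel() + rng.normal(0, 15, 50)

model = LinearRegression().fit(X_simple, y_simple)

# Two endpoints are enough to draw a straight line
x_line = np.array([[X_simple.min()], [X_simple.max()]])
y_line = model.predict(x_line)

plt.figure(figsize=(7, 5))
plt.scatter(X_simple, y_simple, alpha=0.6, label='Simulated houses')
plt.plot(x_line.ravel(), y_line, 'r-', label='Fitted line')
plt.xlabel('House size (sq ft)')
plt.ylabel('Price ($1000s)')
plt.legend()
plt.tight_layout()
plt.savefig('simple_fit.png', dpi=120)  # arbitrary output filename
```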

Multiple Linear Regression with scikit-learn

Real datasets have many features. This section walks through a full pipeline on the California Housing dataset, which records housing statistics for census block groups in California from the 1990 census.

Step 1: Import Libraries

import pandas as pd
import numpy as np
from sklearn.datasets import fetch_california_housing
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score

Step 2: Load and Explore the Dataset

california = fetch_california_housing()
df = pd.DataFrame(california.data, columns=california.feature_names)
df['MedHouseVal'] = california.target  # median house value in $100,000s

print(df.shape)          # (20640, 9)
print(df.head())
print(df.describe())

The dataset has 20,640 rows and 8 input features:

Feature	Description
`MedInc`	Median income in the block (in tens of thousands of dollars)
`HouseAge`	Median age of houses in the block
`AveRooms`	Average number of rooms per household
`AveBedrms`	Average number of bedrooms per household
`Population`	Block population
`AveOccup`	Average household occupancy
`Latitude`	Block latitude
`Longitude`	Block longitude

The target MedHouseVal is the median house value in units of $100,000.

Step 3: Choose Features and Split the Data

For a straightforward demonstration, we use all 8 features. See Train/Test Split for a detailed explanation of why we split data.

X = df[california.feature_names]   # all 8 features
y = df['MedHouseVal']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(f"Training samples: {len(X_train)}")   # 16512
print(f"Test samples:     {len(X_test)}")    # 4128

Setting random_state=42 fixes the shuffle, so the same rows land in the training and test sets every time you run the script.
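The effect of the seed can be checked on a toy array. This sketch uses made-up data and an arbitrary second seed (7) purely for comparison.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Tiny made-up dataset: 10 rows, 2 features
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.arange(10)

split_a = train_test_split(X_toy, y_toy, test_size=0.2, random_state=42)
split_b = train_test_split(X_toy, y_toy, test_size=0.2, random_state=42)
split_c = train_test_split(X_toy, y_toy, test_size=0.2, random_state=7)

# Same seed: all four returned arrays match element for element
same_seed_equal = all(np.array_equal(a, b) for a, b in zip(split_a, split_b))
print("Same seed gives identical split:", same_seed_equal)

# Test-set targets under each seed, for comparison
print("y_test with seed 42:", split_a[3])
print("y_test with seed 7: ", split_c[3])
```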

Step 4: Train the Model

model = LinearRegression()
model.fit(X_train, y_train)

That is all it takes. The fit() method solves the least-squares problem directly with a linear-algebra solver; LinearRegression involves no iterative gradient descent and has no learning rate to tune.

Step 5: Inspect the Learned Coefficients

Understanding what the model learned is as important as its accuracy:

coef_df = pd.DataFrame({
    'Feature': california.feature_names,
    'Coefficient': model.coef_
}).sort_values('Coefficient', key=abs, ascending=False)

# Round to four decimals so the printed table is easy to scan
print(coef_df.to_string(index=False, float_format='{:.4f}'.format))
print(f"\nIntercept: {model.intercept_:.4f}")

Typical output:

   Feature  Coefficient
 AveBedrms     0.7831
    MedInc     0.4487
 Longitude    -0.4337
  Latitude    -0.4198
  AveRooms    -0.1233
  HouseAge     0.0097
  AveOccup    -0.0035
Population    -0.0000

Intercept: -37.0233

Reading the coefficients:

AveBedrms = 0.783: a one-unit increase in average bedrooms predicts a $78,300 increase in house value — but this is entangled with AveRooms (they are correlated). When correlated features are both present, individual coefficients can become large, unstable, or even counterintuitive. This is multicollinearity.
MedInc = 0.449: a one-unit increase in median income (roughly $10,000) predicts a $44,900 increase in house value, holding everything else constant.
Longitude = -0.434 and Latitude = -0.420: purely geographic controls; the model uses them to capture location effects even though it cannot model non-linear geography well.
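The dollar figures above come from multiplying each coefficient by the target's unit of $100,000. A small sketch of that arithmetic, with the coefficient values copied by hand from the typical output (your fitted values may differ slightly):

```python
# Coefficients copied from the typical output above; target is in $100,000s
coefficients = {'AveBedrms': 0.7831, 'MedInc': 0.4487, 'HouseAge': 0.0097}
TARGET_UNIT_DOLLARS = 100_000

for name, coef in coefficients.items():
    dollars = coef * TARGET_UNIT_DOLLARS
    print(f"{name}: +1 unit -> about ${dollars:,.0f} change in predicted value")
```

Keep the feature's own unit in mind when reading the result: one unit of MedInc is $10,000 of income, while one unit of HouseAge is a single year.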

Step 6: Evaluate the Model

y_pred = model.predict(X_test)

mse  = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
r2   = r2_score(y_test, y_pred)

print(f"RMSE: {rmse:.4f}  (in $100,000s, so about ${rmse*100_000:,.0f})")
print(f"R²:   {r2:.4f}")

Expected output:

RMSE: 0.7456  (in $100,000s, so about $74,558)
R²:   0.5758
R²:   0.5758

Interpreting the metrics:

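RMSE (Root Mean Squared Error): the square root of the mean squared error, in the same units as the target. It is a typical error size that weights large misses more heavily than a plain average of absolute errors would. Lower is better.
R² (coefficient of determination): the proportion of variance in y that the model explains. An R² of 0.58 means the model explains about 58% of the variance in house prices. Values closer to 1.0 are better; values near 0 mean the model is barely better than always predicting the mean, and on test data R² can even be negative if the model does worse than that.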

An R² of ~0.58 is typical for this dataset with linear regression. The relationship between house prices and these features is partly non-linear, which is why methods like polynomial regression or gradient boosting often score higher.
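Both metrics are simple enough to compute by hand, which helps clarify what they measure. The sketch below uses four made-up values and checks the manual results against scikit-learn.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Small made-up example values
y_true = np.array([3.0, 5.0, 2.5, 7.0])
y_hat = np.array([2.5, 5.0, 4.0, 8.0])

ss_res = np.sum((y_true - y_hat) ** 2)            # sum of squared residuals
ss_tot = np.sum((y_true - y_true.mean()) ** 2)    # total variation around the mean

rmse_manual = np.sqrt(ss_res / len(y_true))       # root of the mean squared error
r2_manual = 1 - ss_res / ss_tot                   # share of variation explained

rmse_sklearn = np.sqrt(mean_squared_error(y_true, y_hat))
r2_sklearn = r2_score(y_true, y_hat)

print(f"RMSE manual: {rmse_manual:.4f}  sklearn: {rmse_sklearn:.4f}")
print(f"R²   manual: {r2_manual:.4f}  sklearn: {r2_sklearn:.4f}")
```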

Step 7: Visualize Predicted vs Actual Values

The clearest diagnostic plot for a regression model is predicted vs actual — it works regardless of how many features you have:

import matplotlib
matplotlib.use('Agg')
import matplotlib.pyplot as plt

plt.figure(figsize=(7, 5))
plt.scatter(y_test, y_pred, alpha=0.3, s=10, color='steelblue')
plt.plot([y_test.min(), y_test.max()],
         [y_test.min(), y_test.max()],
         'r--', linewidth=1.5, label='Perfect prediction')
plt.xlabel('Actual Median House Value ($100,000s)')
plt.ylabel('Predicted Median House Value ($100,000s)')
plt.title('Linear Regression: Predicted vs Actual')
plt.legend()
plt.tight_layout()
plt.savefig('lr_predicted_vs_actual.png', dpi=120)
print("Plot saved.")

Points that fall on the red dashed line are perfect predictions. Scatter around the line shows error. A fan shape (wider scatter at higher values) signals heteroscedasticity — one of the key assumptions is violated.

Full Pipeline (All Steps Together)

import pandas as pd
import numpy as np
from sklearn.datasets import fetch_california_housing
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score

# Load data
california = fetch_california_housing()
df = pd.DataFrame(california.data, columns=california.feature_names)
df['MedHouseVal'] = california.target

# Split
X = df[california.feature_names]
y = df['MedHouseVal']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train
model = LinearRegression()
model.fit(X_train, y_train)

# Evaluate
y_pred = model.predict(X_test)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
r2   = r2_score(y_test, y_pred)

print(f"RMSE: {rmse:.4f}")
print(f"R²:   {r2:.4f}")

When to Use Linear Regression

Linear regression is a good first choice when:

The relationship between inputs and output is approximately linear
Interpretability matters — you need to explain predictions to stakeholders
Training speed is important, or the dataset is too small to support a more flexible model
You want a fast baseline before trying more complex models

Consider alternatives when:

Features and target have strong non-linear relationships → try polynomial regression or decision trees
You have many features that may be irrelevant → regularized variants (Ridge, Lasso) reduce overfitting by shrinking coefficients, and Lasso can shrink some of them all the way to zero
The target is a category, not a number → use logistic regression instead
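As an illustration of the regularization point, the sketch below fits plain OLS and Lasso to synthetic data in which only two of ten features influence the target. The data and the alpha value are hand-picked assumptions; in practice alpha would be tuned, for example by cross-validation.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
# Only the first two features matter; the other eight are pure noise
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(0, 1.0, 100)

ols = LinearRegression().fit(X, y)
lasso = Lasso(alpha=0.2).fit(X, y)   # alpha chosen by hand for illustration

n_zeroed = int(np.sum(lasso.coef_ == 0))
print("OLS coefficients:  ", np.round(ols.coef_, 2))
print("Lasso coefficients:", np.round(lasso.coef_, 2))
print("Features zeroed out by Lasso:", n_zeroed)
```

OLS assigns a small non-zero weight to every noise feature, while Lasso shrinks the relevant coefficients slightly and tends to set the irrelevant ones to exactly zero.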

Common Pitfalls

Forgetting to scale features. Linear regression coefficients reflect the units of each feature. If one feature is in thousands and another is in fractions, the raw coefficient sizes are not comparable. Standardize the features (for example with StandardScaler) before comparing coefficient magnitudes. Scaling does not change the predictions of plain OLS, only how the coefficients read. See Feature Scaling for details.
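A minimal sketch of the scaling effect, using two synthetic features on very different scales. The feature names, scales, and true coefficients are invented for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
income = rng.normal(50_000, 15_000, 300)      # large-scale feature (dollars)
rate = rng.normal(0.05, 0.01, 300)            # small-scale feature (fraction)
X = np.column_stack([income, rate])
y = 0.0001 * income + 100 * rate + rng.normal(0, 0.5, 300)

raw = LinearRegression().fit(X, y)
scaled = make_pipeline(StandardScaler(), LinearRegression()).fit(X, y)
std_coef = scaled.named_steps['linearregression'].coef_

print("Raw coefficients:         ", raw.coef_)
print("Standardized coefficients:", std_coef)
print("Same predictions:", np.allclose(raw.predict(X), scaled.predict(X)))
```

The raw coefficients suggest that rate dominates, but after standardization income has the larger effect per standard deviation. The predictions themselves are the same either way.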

Multicollinearity. Highly correlated features make individual coefficients unreliable — they can even flip sign. Check the correlation matrix with df.corr() and drop or combine correlated features.
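Pairwise correlations can be complemented with the variance inflation factor (VIF), which is 1 / (1 - R²) from regressing one feature on all the others. The sketch below computes both on synthetic data; the `vif` helper and the column names are hypothetical, written for this example rather than taken from a library.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
rooms = rng.normal(5, 1, 500)
bedrooms = 0.2 * rooms + rng.normal(0, 0.02, 500)   # nearly determined by rooms
age = rng.normal(30, 10, 500)                        # unrelated feature
df_demo = pd.DataFrame({'rooms': rooms, 'bedrooms': bedrooms, 'age': age})

print(df_demo.corr().round(3))

def vif(frame, column):
    """Variance inflation factor: 1 / (1 - R²) from regressing one feature on the rest."""
    others = frame.drop(columns=column)
    r2 = LinearRegression().fit(others, frame[column]).score(others, frame[column])
    return 1.0 / (1.0 - r2)

for col in df_demo.columns:
    print(f"VIF {col}: {vif(df_demo, col):.1f}")
```

A VIF near 1 means a feature is nearly independent of the others. Values above roughly 5 to 10 are a common rule-of-thumb warning sign, though the exact threshold is a matter of convention.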

Extrapolation. A linear model trained on data in a certain range can give wildly wrong predictions outside that range. Always check that new inputs fall within the training distribution.
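The sketch below illustrates the problem on synthetic data where the true relationship is curved: the line is a tolerable approximation inside the training range and badly wrong far outside it. The `in_training_range` guard is a hypothetical helper showing one simple way to flag such inputs.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
x_train = rng.uniform(0, 5, 200)
y_train = x_train ** 2 + rng.normal(0, 0.5, 200)   # true relationship is curved

model = LinearRegression().fit(x_train.reshape(-1, 1), y_train)

def in_training_range(x):
    """Return True if x lies within the range seen during training."""
    return x_train.min() <= x <= x_train.max()

for x_new in (2.5, 20.0):                    # inside vs far outside the training range
    pred = model.predict(np.array([[x_new]]))[0]
    print(f"x={x_new}: predicted {pred:.2f}, true {x_new ** 2:.2f}, "
          f"in range: {in_training_range(x_new)}")
```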

Ignoring residual plots. Always plot residuals after fitting. Patterns in residuals (curves, fans, outlier clusters) indicate that model assumptions are violated and predictions should not be trusted without further investigation.
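A residual plot takes only a few lines. This sketch fits a line to deliberately curved synthetic data so the pattern is easy to see; the data and the filename lr_residuals.png are illustrative assumptions.

```python
import numpy as np
import matplotlib
matplotlib.use('Agg')  # non-interactive backend for scripts
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.uniform(0, 5, 300).reshape(-1, 1)
y = X.ravel() ** 2 + rng.normal(0, 0.5, 300)   # curved relationship

model = LinearRegression().fit(X, y)
pred = model.predict(X)
residuals = y - pred                            # actual minus predicted

plt.figure(figsize=(7, 4))
plt.scatter(pred, residuals, alpha=0.4, s=10)
plt.axhline(0, color='red', linestyle='--')     # reference line at zero error
plt.xlabel('Predicted value')
plt.ylabel('Residual (actual - predicted)')
plt.title('Residuals vs predicted')
plt.tight_layout()
plt.savefig('lr_residuals.png', dpi=120)        # arbitrary output filename
```

With a well-specified model the points scatter evenly around the zero line. Here they trace a U shape, which indicates that a straight line is missing curvature in the data.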

Next Steps

Once you have a working linear regression baseline, explore these related topics:

Multiple Regression — deeper dive into using multiple features and interpreting each coefficient
Polynomial Regression — fit curves instead of lines by adding polynomial feature terms
Train/Test Split — understand why and how to properly evaluate model performance
Feature Scaling — standardize inputs so coefficients and gradient-based solvers behave correctly
Logistic Regression — predict categories (yes/no, spam/not-spam) instead of continuous values