Multiple Regression

In this article, we look at multiple linear regression, a widely used technique for predicting a continuous numerical value from several predictor variables. Using Python and scikit-learn, we will build, evaluate, and interpret a model that predicts a numerical outcome from multiple input features.

What is Multiple Linear Regression?

Linear regression is a technique used to model the relationship between a dependent variable and one or more independent variables. When there is only one independent variable, it is called simple linear regression. However, when there are multiple independent variables, it is called multiple linear regression.

In multiple linear regression, the goal is to find the best-fitting linear function of the independent variables. With one predictor this is a line; with several predictors it is a plane or, more generally, a hyperplane. The fit is determined by minimizing the sum of the squared differences between the observed values and the predicted values (the residual sum of squares). Each coefficient represents the expected change in the dependent variable for a one-unit increase in that independent variable while the others are held constant, and the intercept represents the expected value of the dependent variable when all independent variables are zero.
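
As a rough illustration of the least-squares idea described above, the sketch below fits a two-predictor model with NumPy. The data is synthetic and the names (`X_design`, `true_coefs`, and so on) are made up for this example; it is not the housing model built later in the article.

```python
import numpy as np

# Synthetic data (an assumption for illustration): y = 4 + 3*x1 - 2*x2 + noise
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
true_intercept = 4.0
true_coefs = np.array([3.0, -2.0])
y = true_intercept + X @ true_coefs + rng.normal(scale=0.1, size=200)

# Prepend a column of ones so the intercept is estimated with the slopes
X_design = np.column_stack([np.ones(len(X)), X])

# Least squares: find beta that minimizes the sum of squared residuals
beta, *_ = np.linalg.lstsq(X_design, y, rcond=None)
intercept, coefs = beta[0], beta[1:]

# Residual sum of squares at the fitted coefficients
rss = float(np.sum((y - X_design @ beta) ** 2))

print("intercept:", intercept)
print("coefficients:", coefs)
print("residual sum of squares:", rss)
```

With little noise and enough samples, the estimated intercept and coefficients should land close to the values used to generate the data.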

The Dataset

To illustrate multiple linear regression, we will use the California Housing dataset, which contains information about housing in California. The dataset has 20,640 samples and 8 features, with the median house value (in $100,000s) as the dependent variable.

python
from sklearn.datasets import fetch_california_housing

housing = fetch_california_housing()
X = housing.data
y = housing.target
feature_names = housing.feature_names

Data Preprocessing

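Before we can build the model, we need to preprocess the data. This involves splitting it into training and testing sets and standardizing the features so that each has zero mean and unit variance. For ordinary least squares, scaling does not change the model's predictions, but it puts the coefficients on a common scale so they can be compared with one another. The scaler is fitted on the training set only and then applied to the test set, so that no information from the test data leaks into training.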

python
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Scale features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
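
To make the scaling step less of a black box, here is a minimal NumPy sketch of the same standardization on a tiny made-up matrix. The arrays `X_train_demo` and `X_test_demo` are hypothetical; the point is only that the mean and standard deviation come from the training rows and are reused unchanged for the test rows.

```python
import numpy as np

# Made-up feature matrix (assumption): two columns on very different scales
X_train_demo = np.array([[1.0, 1000.0],
                         [2.0, 3000.0],
                         [3.0, 5000.0]])
X_test_demo = np.array([[2.0, 2000.0]])

# Statistics are learned from the training rows only
mean = X_train_demo.mean(axis=0)
std = X_train_demo.std(axis=0)  # population standard deviation (ddof=0)

# The same training statistics are applied to both sets
X_train_demo_scaled = (X_train_demo - mean) / std
X_test_demo_scaled = (X_test_demo - mean) / std

print(X_train_demo_scaled)
print(X_test_demo_scaled)
```

After scaling, each training column has mean 0 and standard deviation 1. The test rows generally will not, because they are shifted and scaled by the training statistics.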

Building the Model

With the data preprocessed, we can now build the multiple linear regression model. We will use the LinearRegression class from scikit-learn, which provides a simple and efficient way to implement the algorithm.

python
from sklearn.linear_model import LinearRegression

model = LinearRegression()
model.fit(X_train_scaled, y_train)

The .fit() method calculates the optimal coefficients that minimize the residual sum of squares.

Model Evaluation

To evaluate the performance of the model, we will use two metrics: mean squared error (MSE) and R-squared (R²).

python
from sklearn.metrics import mean_squared_error, r2_score

y_pred = model.predict(X_test_scaled)
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print(f"Mean Squared Error: {mse:.4f}")
print(f"R-squared: {r2:.4f}")

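MSE measures the average squared difference between predicted and actual values, so it is expressed in squared units of the target (here, squared $100,000s), and lower is better. R² indicates the proportion of variance in the dependent variable explained by the model: 1 is a perfect fit, 0 is no better than always predicting the mean, and on held-out data the value can be negative if the model does worse than that.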
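
The two metrics are simple enough to compute by hand. The sketch below does so in plain Python on four made-up values; the function names and the data are illustrative assumptions, not part of scikit-learn.

```python
def mean_squared_error_manual(y_true, y_pred):
    """Average of the squared differences between actual and predicted values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def r2_score_manual(y_true, y_pred):
    """1 minus (residual sum of squares / total sum of squares)."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Made-up actual and predicted values (assumption)
y_true_demo = [3.0, 5.0, 7.0, 9.0]
y_pred_demo = [2.5, 5.0, 7.5, 9.0]

mse_demo = mean_squared_error_manual(y_true_demo, y_pred_demo)
r2_demo = r2_score_manual(y_true_demo, y_pred_demo)

print(f"Mean Squared Error: {mse_demo:.4f}")  # 0.1250
print(f"R-squared: {r2_demo:.4f}")            # 0.9750
```

Here the squared errors sum to 0.5 over four samples, giving an MSE of 0.125, and the total sum of squares around the mean of 6 is 20, giving R² = 1 - 0.5/20 = 0.975.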

Interpretation of Model Coefficients

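The coefficients of the model describe the association between each independent variable and the dependent variable, holding the other variables constant. A positive coefficient means the predicted value rises as that variable increases, while a negative coefficient means it falls. Because the features were standardized, each coefficient is the expected change in the target (in $100,000s) for a one-standard-deviation increase in that feature. These are associations in the fitted model, not necessarily causal effects.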

python
for name, coef in zip(feature_names, model.coef_):
    print(f"{name}: {coef:.4f}")

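Because all features are on the same standardized scale, comparing the absolute values of the coefficients gives a rough indication of which features have the strongest influence on the prediction. Treat this ranking with some caution: when features are strongly correlated with each other, their individual coefficients can be unstable and harder to interpret.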
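
If you want coefficients in the original units of each feature rather than per standard deviation, one way is to divide each standardized coefficient by that feature's standard deviation. The NumPy sketch below checks this on synthetic data; `fit_ols` is a hypothetical helper written for this example. In the article's pipeline, the corresponding standard deviations would be the ones stored by the fitted scaler in `scaler.scale_`.

```python
import numpy as np


def fit_ols(X, y):
    """Return (intercept, coefficients) from ordinary least squares."""
    design = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    return beta[0], beta[1:]


# Synthetic features on different scales (assumption for illustration)
rng = np.random.default_rng(1)
X_raw = np.column_stack([rng.normal(0, 1, 300), rng.normal(50, 20, 300)])
y_demo = 1.0 + 2.0 * X_raw[:, 0] + 0.05 * X_raw[:, 1] + rng.normal(scale=0.1, size=300)

# Standardize, then fit on both the standardized and the raw features
mean, std = X_raw.mean(axis=0), X_raw.std(axis=0)
X_std = (X_raw - mean) / std
_, coef_std = fit_ols(X_std, y_demo)
_, coef_raw = fit_ols(X_raw, y_demo)

# Per-standard-deviation coefficient divided by std = per-unit coefficient
coef_back = coef_std / std

print("standardized coefficients:", coef_std)
print("converted to original units:", coef_back)
print("fitted directly on raw features:", coef_raw)
```

The converted coefficients should match the ones fitted directly on the unscaled features, which also shows that standardizing changes how the coefficients are expressed rather than what the model predicts.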

Conclusion

In this article, we have explored the concept of multiple linear regression and how it can be used to predict a continuous numerical outcome based on multiple predictor variables. We have used Python and the scikit-learn library to build and evaluate a multiple linear regression model using the California Housing dataset. The MSE and R² computed on the test set show how well the model predicts median house values on unseen data, and the coefficients provide insight into the relationship between the independent variables and the dependent variable.
