Linear Regression
Linear regression is a powerful statistical tool widely used in machine learning and predictive modeling. It finds the best-fit line between a dependent variable and one or more independent variables by minimizing the sum of squared errors between predicted and actual values.
In this guide, we will show how to implement linear regression in Python using scikit-learn. We will start with a brief introduction to the technique and its applications, then walk through a complete implementation.
Introduction to Linear Regression
Linear regression models the relationship between a dependent variable and one or more independent variables. It assumes a linear relationship, meaning changes in the dependent variable are proportional to changes in the independent variables. It is widely used in finance, economics, marketing, and engineering to predict trends and support decision-making.
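Written out, the linear relationship and the least-squares criterion take the following standard form. The notation is introduced here for illustration and does not come from the original text: $\beta_0$ is the intercept, $\beta_j$ the coefficient of feature $j$, $x_{ij}$ the value of feature $j$ in observation $i$, with $n$ observations and $p$ features.

```latex
\hat{y}_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \dots + \beta_p x_{ip},
\qquad
\min_{\beta_0, \dots, \beta_p} \; \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2
```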
There are two main types: simple linear regression (one independent variable) and multiple linear regression (two or more independent variables). This guide focuses on multiple linear regression.
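To make the distinction concrete, here is a minimal sketch that fits both variants by ordinary least squares with NumPy's np.linalg.lstsq. The data is synthetic and made up for illustration, and the variable names (x1, x2, beta_multi, beta_simple) are assumptions, not part of the guide's pipeline.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic, noise-free data (made up for illustration):
# y = 1.0 + 2.0 * x1 - 3.0 * x2
x1 = rng.normal(size=50)
x2 = rng.normal(size=50)
y = 1.0 + 2.0 * x1 - 3.0 * x2

# Multiple linear regression: intercept column plus two feature columns.
X_multi = np.column_stack([np.ones_like(x1), x1, x2])
beta_multi, *_ = np.linalg.lstsq(X_multi, y, rcond=None)

# Simple linear regression: intercept column plus x1 only.
X_simple = np.column_stack([np.ones_like(x1), x1])
beta_simple, *_ = np.linalg.lstsq(X_simple, y, rcond=None)

print("multiple:", beta_multi)   # close to [1, 2, -3]
print("simple:  ", beta_simple)  # x2's contribution is left unexplained
```

The multiple regression recovers all three parameters because both features are in the design matrix. The simple regression can only fit an intercept and one slope.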
Implementation of Linear Regression using Scikit-learn
Scikit-learn is a popular Python machine learning library that provides tools for data analysis and modeling. Its sklearn.linear_model module includes the LinearRegression class, which handles both fitting and prediction.
Step 1: Import the Required Libraries
Before implementing linear regression, import the required libraries: pandas and NumPy for data handling, plus the LinearRegression class, the train_test_split() function, and the two evaluation metrics from scikit-learn:
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score
Step 2: Load the Dataset
The next step is to load the dataset used to train the model. We will use the California Housing dataset, available through scikit-learn's fetch_california_housing() function, which downloads the data on first use.
Note: The Boston Housing dataset was deprecated in scikit-learn 1.0 and removed in version 1.2 due to ethical concerns. We use the California Housing dataset instead.
Load the dataset with fetch_california_housing() and convert it into a pandas DataFrame in Python:
from sklearn.datasets import fetch_california_housing
california = fetch_california_housing()
california_df = pd.DataFrame(california.data, columns=california.feature_names)
california_df['MedHouseVal'] = california.target
Step 3: Explore the Dataset
Before training the model, we need to explore the dataset to understand its structure and features. The data is already in a pandas DataFrame, so we can use the head() method to display the first few rows.
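Beyond the first few rows, a handful of other pandas calls are commonly used for a first look at a dataset. The sketch below runs them on a tiny hand-made DataFrame. The name df and all values in it are made up for illustration and stand in for california_df.

```python
import pandas as pd

# Tiny hand-made frame standing in for california_df (values are made up).
df = pd.DataFrame({
    "AveRooms": [6.9, 6.2, 8.3, 5.8],
    "AveBedrms": [1.0, 0.9, 1.1, 1.0],
    "MedHouseVal": [4.5, 3.6, 3.5, 3.4],
})

print(df.shape)         # (rows, columns)
print(df.dtypes)        # data type of each column
print(df.isna().sum())  # number of missing values per column
print(df.describe())    # count, mean, std, min, quartiles, max
```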
Display the first few rows of the DataFrame in Python:
california_df.head()
Step 4: Prepare the Data for Training
The next step is to prepare the data for training. We will use the AveRooms and AveBedrms features (average number of rooms and bedrooms per household) as our independent variables, and MedHouseVal (median house value, in units of $100,000) as our dependent variable.
Prepare the data for training our multiple linear regression model in Python:
X = california_df[['AveRooms', 'AveBedrms']]
y = california_df['MedHouseVal']
Step 5: Split the Data into Training and Testing Sets
To evaluate the model on data it has not seen during training, we split the data into training and testing sets using the train_test_split() function from scikit-learn. Setting test_size=0.2 holds out 20% of the rows for testing, and random_state=0 makes the split reproducible.
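As a quick check of how train_test_split() behaves, the following sketch splits a made-up array of 100 samples and confirms that the same random_state gives the same split. The names X_demo, y_demo, and the other variables are illustrative assumptions.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 100 samples, 2 features.
X_demo = np.arange(200).reshape(100, 2)
y_demo = np.arange(100)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)
print(X_tr.shape, X_te.shape)  # (80, 2) (20, 2)

# Repeating the call with the same random_state gives the same split.
_, X_te_again, _, _ = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)
print(np.array_equal(X_te, X_te_again))  # True
```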
Split the data into training and testing sets in Python:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
Step 6: Train the Linear Regression Model
The next step is to train the linear regression model using the training set. We will be using the fit() method from the LinearRegression class to train the model.
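To see what fit() learns, it can help to train on data whose true relationship is known. The sketch below uses synthetic, noise-free data (made up for illustration) and reads the fitted parameters from the intercept_ and coef_ attributes. The names X_demo, y_demo, and model are assumptions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Made-up data with a known relationship: y = 0.5 + 2.0*x1 - 1.0*x2 (no noise).
X_demo = rng.normal(size=(200, 2))
y_demo = 0.5 + 2.0 * X_demo[:, 0] - 1.0 * X_demo[:, 1]

model = LinearRegression()
model.fit(X_demo, y_demo)

print(model.intercept_)  # close to 0.5
print(model.coef_)       # close to [2.0, -1.0]
```

On real data such as the California Housing set, the same attributes hold the estimated intercept and one coefficient per feature column.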
Train the linear regression model using the training set in Python:
regressor = LinearRegression()
regressor.fit(X_train, y_train)
Step 7: Make Predictions on the Test Data
Once the model is trained, we can use it to make predictions on the test data. We will be using the predict() method from the LinearRegression class to make predictions.
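For a fitted LinearRegression model, predict() returns the intercept plus the dot product of the inputs with the coefficients. The sketch below checks this on made-up data. The names X_new, pred_api, and pred_manual are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)

# Made-up noisy data, for illustration only.
X_demo = rng.normal(size=(100, 2))
noise = rng.normal(scale=0.1, size=100)
y_demo = 1.0 + 3.0 * X_demo[:, 0] + 0.5 * X_demo[:, 1] + noise

model = LinearRegression().fit(X_demo, y_demo)

# Two new inputs to predict for.
X_new = np.array([[0.0, 0.0], [1.0, 2.0]])
pred_api = model.predict(X_new)
pred_manual = model.intercept_ + X_new @ model.coef_

print(np.allclose(pred_api, pred_manual))  # True
```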
Generate predictions for the test set in Python:
y_pred = regressor.predict(X_test)
Step 8: Evaluate the Performance of the Model
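To evaluate the performance of our linear regression model, we will use two metrics: mean squared error (MSE) and the coefficient of determination (R²). A lower MSE means predictions are closer to the actual values, and an R² closer to 1 means the model explains more of the variance in the target. We can calculate these metrics using the mean_squared_error() and r2_score() functions from scikit-learn.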
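As a rough illustration of what these two metrics compute, the sketch below calculates them by hand with NumPy and prints them next to scikit-learn's results. The arrays y_true and y_hat hold made-up values and are not output from the housing model.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Made-up values, for illustration only.
y_true = np.array([3.0, 1.5, 2.0, 4.5])
y_hat = np.array([2.5, 1.0, 2.5, 4.0])

# MSE: mean of the squared residuals.
mse_manual = np.mean((y_true - y_hat) ** 2)

# R^2: 1 - (residual sum of squares / total sum of squares).
ss_res = np.sum((y_true - y_hat) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print(mse_manual, mean_squared_error(y_true, y_hat))
print(r2_manual, r2_score(y_true, y_hat))
```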
Calculate MSE and R² in Python:
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print('Mean Squared Error:', mse)
print('Coefficient of Determination:', r2)
Step 9: Visualize the Results
Finally, we can visualize the results by plotting the actual and predicted values against one of the features, using the matplotlib library. Because the model uses two features, its predictions do not fall on a single straight line when plotted against AveRooms alone, so we draw them as points rather than as a line.
Plot the actual and predicted values against AveRooms in Python:
import matplotlib.pyplot as plt
# Actual test values in black, model predictions in blue.
plt.scatter(X_test['AveRooms'], y_test, color='black', s=5, label='Actual')
plt.scatter(X_test['AveRooms'], y_pred, color='blue', s=5, label='Predicted')
plt.title('Linear Regression')
plt.xlabel('Average Rooms per Household')
plt.ylabel('Median House Value (in $100,000s)')
plt.legend()
plt.show()
Conclusion
In this guide, we demonstrated how to implement linear regression in Python using scikit-learn. We covered importing libraries, loading and exploring a dataset, preparing features, splitting data, training the model, making predictions, evaluating performance, and visualizing results. This pipeline provides a solid foundation for building regression models in your own machine learning projects.