
Categorical Data

Data preprocessing is a crucial step in any machine learning project. It involves cleaning and transforming raw data into a format that machine learning algorithms can consume. Python's ecosystem, notably pandas and scikit-learn, provides many preprocessing tools that can improve data quality and model performance. This chapter covers foundational preprocessing steps, with a primary focus on preparing categorical variables.

Importing Data

The first step in any data preprocessing task is importing data. Python's pandas library provides a straightforward way to read data from various file formats. The read_csv() function can be used to read data from a CSV file.

python
import pandas as pd
df = pd.read_csv('data.csv')
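
As a minimal sketch of the step above, the example below reads a small in-memory CSV (a hypothetical stand-in for 'data.csv', with made-up column names) and lists the non-numeric columns, which are the usual candidates for categorical handling.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for 'data.csv'
csv_text = """color,size,price
red,small,10.5
blue,large,20.0
red,medium,15.25
"""

# read_csv accepts a file path or any file-like object
df = pd.read_csv(io.StringIO(csv_text))

# Non-numeric columns are the usual candidates for categorical encoding
categorical_cols = df.select_dtypes(exclude="number").columns.tolist()
print(df.dtypes)
print(categorical_cols)
```

Inspecting the column types right after loading makes it easier to decide which columns need encoding later on.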

Data Cleaning

Data cleaning is an essential aspect of data preprocessing. It involves identifying and handling missing data, outliers, and anomalies. The pandas library provides several methods for data cleaning, such as fillna(), dropna(), and replace().

python
# Fill missing values with a default
df['column'] = df['column'].fillna(0)

# Drop rows with any missing values
df = df.dropna()

# Replace specific values
df['column'] = df['column'].replace('old_value', 'new_value')
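
The following sketch applies the same three methods to a toy DataFrame with a categorical column. The column names, the 'unknown' placeholder label, and the data are assumptions made for illustration only.

```python
import numpy as np
import pandas as pd

# Toy data: an inconsistent label, a missing category, and a missing price
df = pd.DataFrame({
    "color": ["red", "Red", None, "blue"],
    "price": [10.0, np.nan, 7.5, 12.0],
})

# Harmonize inconsistent labels before encoding
df["color"] = df["color"].replace("Red", "red")

# Give missing categories an explicit placeholder label
df["color"] = df["color"].fillna("unknown")

# Drop rows that still contain missing values (here, the missing price)
df = df.dropna()
print(df)
```

For categorical columns, an explicit placeholder label is often a more sensible fill value than a number such as 0, because it survives encoding as its own category.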

Data Transformation

Data transformation is the process of converting raw data into a format suitable for analysis. Some of the commonly used data transformation techniques are scaling, encoding, and normalization.

Scaling

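Scaling brings the features of a dataset onto a similar scale, which is useful when the features have very different ranges of values. Two widely used scalers in scikit-learn are StandardScaler, which standardizes each feature to zero mean and unit variance, and MinMaxScaler, which rescales each feature to a fixed range ([0, 1] by default).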
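
A minimal sketch of both scalers, assuming scikit-learn is installed and using a made-up two-column array whose columns differ in range by a factor of 100:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Two features on very different scales
X = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]])

# Zero mean and unit variance per column
X_std = StandardScaler().fit_transform(X)

# Each column rescaled to the [0, 1] range
X_minmax = MinMaxScaler().fit_transform(X)

print(X_std)
print(X_minmax)
```

After either transformation, both columns carry comparable magnitudes, so neither dominates a model purely because of its units.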

Encoding

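Encoding is the process of converting categorical variables into numerical values. Two commonly used techniques are Ordinal Encoding, which maps each category to an integer and suits categories with a natural order, and One-Hot Encoding, which creates one binary column per category and suits unordered categories.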

python
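from sklearn.preprocessing import OrdinalEncoder, OneHotEncoder
import pandas as pd

# Ordinal Encoding (for ordered categories)
# Without an explicit `categories` argument, the integer codes follow sorted order
# Store the result in a new column so the original labels stay available
encoder = OrdinalEncoder()
df['category_ordinal'] = encoder.fit_transform(df[['category']]).ravel()

# One-Hot Encoding (for unordered categories), applied to the original column
# Note: `sparse_output` requires scikit-learn >= 1.2
ohe = OneHotEncoder(sparse_output=False, handle_unknown='ignore')
encoded_array = ohe.fit_transform(df[['category']])
# Convert the numpy array back to a DataFrame with proper column names and a matching index
encoded_df = pd.DataFrame(encoded_array, columns=ohe.get_feature_names_out(['category']), index=df.index)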

Note on data leakage: Always fit encoders on the training data only, then transform both training and test data. This prevents information from the test set from influencing the model.

Handling unseen categories: Use handle_unknown='ignore' in OneHotEncoder to avoid errors when the test set contains categories not seen during training. Such categories are then encoded as a row of all zeros.
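
The two notes above can be sketched together as follows. The train and test frames and the 'color' column are hypothetical; the point is that the encoder is fitted on the training data only and then reused on the test data, where 'green' never appeared during fitting.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical split: 'green' appears only in the test set
train = pd.DataFrame({"color": ["red", "blue", "red"]})
test = pd.DataFrame({"color": ["blue", "green"]})

# Fit on the training data only, then transform both sets
ohe = OneHotEncoder(handle_unknown="ignore")
train_encoded = ohe.fit_transform(train[["color"]]).toarray()
test_encoded = ohe.transform(test[["color"]]).toarray()

print(ohe.categories_[0])  # categories learned from the training data
print(test_encoded)        # the unseen 'green' row is all zeros
```

Calling fit_transform on the test set instead would learn a different set of columns, which both leaks test information and breaks the alignment between training and test features.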

Normalization

Normalization is the process of scaling numeric features to a fixed range, typically [0, 1]. Standardization, on the other hand, transforms features to have a mean of zero and a standard deviation of one. Both are useful when features have different units of measurement or when an algorithm is sensitive to feature scale, as distance-based and gradient-based methods are. Note that standardization shifts and rescales a feature but does not change the shape of its distribution.
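
To make the two definitions concrete, the sketch below computes both transformations by hand on a made-up pandas Series. The population standard deviation (ddof=0) is used here because that is what scikit-learn's StandardScaler uses.

```python
import pandas as pd

s = pd.Series([2.0, 4.0, 6.0, 8.0])

# Min-max normalization: (x - min) / (max - min), giving values in [0, 1]
normalized = (s - s.min()) / (s.max() - s.min())

# Standardization: (x - mean) / std, giving mean 0 and standard deviation 1
standardized = (s - s.mean()) / s.std(ddof=0)

print(normalized.tolist())
print(standardized.tolist())
```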

Feature Selection

Feature selection is the process of selecting the most relevant features for a machine learning model. It involves identifying the most significant predictors and removing the least important ones. A commonly used tool is scikit-learn's SelectKBest, which scores each feature with a univariate statistical test and keeps the k highest-scoring ones.

python
from sklearn.feature_selection import SelectKBest, f_classif

# Prepare features (X) and target (y) from the dataframe
# X must be numeric, so encode categorical columns first
X = df.drop('target', axis=1)
y = df['target']

# Select the top 3 features based on ANOVA F-value (for classification targets)
selector = SelectKBest(score_func=f_classif, k=3)
X_new = selector.fit_transform(X, y)
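
Because fit_transform returns a plain array, the names of the retained columns are lost. One way to recover them is the selector's get_support() mask, sketched below on a made-up dataset in which one feature separates the two classes and the other has identical class means.

```python
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_classif

# Toy data: 'informative' separates the classes, 'noise' does not
X = pd.DataFrame({
    "informative": [1.0, 1.2, 0.9, 5.0, 5.1, 4.8],
    "noise": [3.0, 1.0, 2.0, 2.0, 3.0, 1.0],
})
y = pd.Series([0, 0, 0, 1, 1, 1])

selector = SelectKBest(score_func=f_classif, k=1)
selector.fit(X, y)

# get_support() returns a boolean mask over the input columns
selected = X.columns[selector.get_support()].tolist()
print(selected)
```

As with encoders, the selector should be fitted on the training data only and then applied to the test data with transform().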

Conclusion

This chapter has outlined key preprocessing steps, including data cleaning, transformation, and feature selection. Applying these techniques can improve data quality and model performance, particularly for datasets that contain categorical variables. Handling categorical data with appropriate encoding and avoiding data leakage helps keep machine learning pipelines robust and reproducible.
