Data Preprocessing in Machine Learning (with Python Examples)

March 22, 2023

By Admin

Data preprocessing is an essential step in the machine learning pipeline, as it has a significant impact on the accuracy of the model.

The goal of data preprocessing is to clean, transform, and normalize the data, so that it can be used effectively in training a machine learning model. This article will explore the importance of data preprocessing and some of the most common techniques used to preprocess data.

Why is Data Preprocessing Important?

Data preprocessing is critical because many machine learning algorithms require the data to be in a specific format. If the data is not in the right format, the model may not work correctly. Additionally, data preprocessing can improve the performance of the machine learning model by reducing noise, handling missing values, and correcting errors in the data. This helps to increase the accuracy and reliability of the model.

Common Data Preprocessing Techniques

Handling Missing Values

One of the most common data preprocessing techniques is handling missing values. Missing values can occur when some of the data is not available. There are several ways to handle missing values, including:

Removing the rows with missing values
Replacing the missing values with the mean, median, or mode of the column
Using a machine learning algorithm to predict the missing values
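The three strategies above can be sketched on a small example. This is a minimal illustration, not a recommendation: the column names and values are made up, and scikit-learn's KNNImputer is only one possible choice for the model-based approach.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Hypothetical toy data: two columns with gaps, one complete column
df = pd.DataFrame({
    "age": [25.0, np.nan, 35.0, 40.0],
    "income": [50.0, 60.0, np.nan, 80.0],
    "years_experience": [1.0, 3.0, 8.0, 12.0],
})

# 1. Remove every row that contains at least one missing value
dropped = df.dropna()

# 2. Replace missing values with the median of each column
median_filled = df.fillna(df.median())

# 3. Estimate each missing value from the 2 most similar rows
imputer = KNNImputer(n_neighbors=2)
knn_filled = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)

print(dropped)
print(median_filled)
print(knn_filled)
```

Dropping rows is the simplest option but discards data, so it tends to suit datasets where only a small share of rows is incomplete.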

Data Normalization

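Another important data preprocessing technique is data normalization. Normalization is the process of rescaling numeric features to a common scale, such as the range 0 to 1. It keeps features with large numeric ranges from dominating distance-based or gradient-based models, which often improves the performance of the machine learning model. Note that min-max scaling is itself sensitive to outliers, because a single extreme value stretches the range. Common methods for normalizing data include:

Min-Max Scaling
Standardization (Z-score), as implemented by scikit-learn's StandardScaler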
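To make the arithmetic behind these methods concrete, here is a minimal sketch that applies min-max scaling and z-score standardization by hand with NumPy. The sample values are made up for illustration.

```python
import numpy as np

# Hypothetical sample values
values = np.array([10.0, 20.0, 30.0, 40.0, 50.0])

# Min-max scaling: (x - min) / (max - min), so the result lies in [0, 1]
min_max = (values - values.min()) / (values.max() - values.min())

# Z-score standardization: (x - mean) / std, so the result has mean 0 and std 1
# (NumPy's std() defaults to the population standard deviation)
z_scores = (values - values.mean()) / values.std()

print(min_max)   # 0, 0.25, 0.5, 0.75, 1
print(z_scores)
```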

Data Transformation

Data transformation is the process of converting data into a different format. This can be useful for converting categorical data into numerical data or transforming variables to better suit the distribution of the data. Some common data transformation techniques include:

One-hot encoding
Log Transformation
Square Root Transformation
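The transformations listed above can be sketched with Pandas and NumPy as follows. The DataFrame and its column names are hypothetical, and np.log1p is used here as one common way to apply a log transformation that tolerates zeros.

```python
import numpy as np
import pandas as pd

# Hypothetical data: one categorical column and one skewed numeric column
df = pd.DataFrame({
    "color": ["red", "green", "red"],
    "price": [10.0, 100.0, 1000.0],
})

# One-hot encoding: one indicator column per category
encoded = pd.get_dummies(df, columns=["color"])

# Log transformation: log1p computes log(1 + x), so zero values are safe
encoded["price_log"] = np.log1p(df["price"])

# Square root transformation: a milder compression than the log
encoded["price_sqrt"] = np.sqrt(df["price"])

print(encoded)
```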

Outlier Detection and Removal

Outliers are data points that are significantly different from the other data points. Outliers can cause problems for machine learning algorithms, as they can affect the accuracy of the model. Outlier detection and removal is the process of identifying and removing outliers from the data. Some common techniques for outlier detection include:

Z-score
IQR (Interquartile Range)
Mahalanobis Distance
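As an illustration of the first two techniques, the sketch below flags outliers in a small made-up sample. The z-score threshold of 2 and the 1.5 x IQR fences are conventional choices assumed here, not values taken from this article, and Mahalanobis distance (which applies to multivariate data) is not covered.

```python
import numpy as np

# Hypothetical sample with one obvious outlier
values = np.array([10.0, 12.0, 11.0, 13.0, 12.0, 11.0, 95.0])

# Z-score rule: flag points more than `threshold` standard deviations from the mean
# (a threshold of 3 is also common, but is too strict for a sample this small)
threshold = 2.0
z = (values - values.mean()) / values.std()
z_outliers = values[np.abs(z) > threshold]

# IQR rule: flag points more than 1.5 * IQR below Q1 or above Q3
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
iqr_outliers = values[(values < lower) | (values > upper)]

print(z_outliers)    # only 95.0 is flagged
print(iqr_outliers)  # only 95.0 is flagged
```

The IQR rule relies on quartiles rather than the mean and standard deviation, so it is less affected by the outliers it is trying to find.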

Python Code Examples

Handling missing values

We will learn how to handle missing values with Pandas.

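import numpy as np
import pandas as pd

# Read the CSV file into a DataFrame
df = pd.read_csv("data.csv")
# Mark "?" placeholders as missing values
df = df.replace("?", np.nan)
# Fill missing values in numeric columns with the column mean
df = df.fillna(df.mean(numeric_only=True))

This code imports the NumPy library with the alias ‘np’ and the Pandas library with the alias ‘pd’. Pandas is used for data manipulation and analysis, while NumPy is used for numerical operations.

It then reads a CSV file named “data.csv” and loads its contents into a DataFrame called ‘df’. The DataFrame is a table-like structure that holds data in rows and columns.

In the loaded DataFrame ‘df’, the replace call swaps all occurrences of the string “?” for NumPy’s NaN (Not a Number) value, which Pandas uses to represent missing or unknown data.

Lastly, it fills the NaN (missing) values in the numeric columns of ‘df’ with the mean (average) value of each column. df.mean(numeric_only=True) calculates the mean of every numeric column, and fillna then replaces the NaN values in each column with its corresponding mean. The numeric_only=True argument is needed because, in Pandas 2.0 and later, df.mean() raises an error when the DataFrame also contains text columns. A column that contained “?” is read as text, so it must first be converted to numbers, for example with pd.to_numeric, before a mean can be computed for it.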

Scaling data

The next example scales a column to the range 0 to 1 using sklearn.preprocessing.MinMaxScaler.

from sklearn.preprocessing import MinMaxScaler

# Rescale the column to the default range of 0 to 1
scaler = MinMaxScaler()
# Double brackets select a DataFrame, because the scaler expects 2-D input
df[['column_name']] = scaler.fit_transform(df[['column_name']])
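For a version that runs without an external file, the sketch below applies both MinMaxScaler and StandardScaler (the standardization approach mentioned earlier) to a small made-up column. The column name and values are hypothetical.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical data
df = pd.DataFrame({"height_cm": [150.0, 160.0, 170.0, 180.0, 190.0]})

# Min-max scaling maps the smallest value to 0 and the largest to 1
df["height_minmax"] = MinMaxScaler().fit_transform(df[["height_cm"]]).ravel()

# Standardization centers the column on 0 with unit standard deviation
df["height_standard"] = StandardScaler().fit_transform(df[["height_cm"]]).ravel()

print(df)
```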

Encoding categorical data

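Another data preprocessing step is to encode categorical data as numbers. We will do so using sklearn.preprocessing.LabelEncoder, which maps each category to an integer. Note that scikit-learn documents LabelEncoder as intended for target labels; for input features, OneHotEncoder or OrdinalEncoder is usually a better fit.

from sklearn.preprocessing import LabelEncoder

# Map each distinct category to an integer code
le = LabelEncoder()
df['column_name'] = le.fit_transform(df['column_name'])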
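To show what the encoder produces, here is a small self-contained sketch. The label values are made up for illustration.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical class labels
labels = ["cat", "dog", "cat", "bird"]

le = LabelEncoder()
encoded = le.fit_transform(labels)

# Categories are stored in sorted order: bird -> 0, cat -> 1, dog -> 2
print(le.classes_)
print(encoded)

# inverse_transform maps the integer codes back to the original labels
decoded = le.inverse_transform(encoded)
print(decoded)
```

Because the integer codes follow alphabetical order rather than any real ranking, a model may read a false ordering into them when they are used as input features.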

Splitting data into training and testing sets

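Strictly speaking, splitting is not a preprocessing step, but data is generally split into training and testing sets using scikit-learn's train_test_split.

from sklearn.model_selection import train_test_split

# Separate the features (X) from the target (y)
X = df.drop(['target_column'], axis=1)
y = df['target_column']

# Hold out 20% of the rows for testing; random_state makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)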
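The sketch below runs the same split on a small made-up DataFrame and then scales the features. Fitting the scaler on the training rows only is a common practice assumed here, not something this article prescribes; it keeps information from the test set out of the preprocessing step.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Hypothetical data: 10 rows, one feature, binary target
df = pd.DataFrame({
    "feature": np.arange(10, dtype=float),
    "target_column": [0, 1] * 5,
})

X = df.drop(["target_column"], axis=1)
y = df["target_column"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

scaler = MinMaxScaler()
# Learn the minimum and maximum from the training rows only
X_train_scaled = scaler.fit_transform(X_train)
# Reuse them for the test rows; test values may fall outside [0, 1]
X_test_scaled = scaler.transform(X_test)

print(X_train_scaled.shape, X_test_scaled.shape)
```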

Relevant Data Preprocessing Datasets

Iris dataset


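# Python example
import pandas as pd

url = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"
# The file has no header row, so column names are supplied explicitly
columns = ["sepal_length", "sepal_width", "petal_length", "petal_width", "class"]
dataset = pd.read_csv(url, header=None, names=columns)
# Show the first five rows
dataset.head()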

Pima Indians Diabetes dataset


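# Python example
import pandas as pd

url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv"
# The file has no header row; the last column is the diabetes outcome (0 or 1)
dataset = pd.read_csv(url, header=None)
# Show the first five rows
dataset.head()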

Breast Cancer Wisconsin (Diagnostic) dataset


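# Python example
import pandas as pd

url = "https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/wdbc.data"
# The file has no header row; column 0 is the ID and column 1 the diagnosis (M or B)
dataset = pd.read_csv(url, header=None)
# Show the first five rows
dataset.head()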

Relevant entities

Entities	Properties
Missing values	Represented as NaN, null or ?
Outliers	Values that lie outside the expected range
Duplicate values	Repeated values in a dataset
Noisy data	Incorrect or irrelevant data points
Scaling	Changing the range of values in a dataset
Encoding	Converting categorical data into numerical form

Conclusion

Data preprocessing is a crucial step in the data science pipeline as it helps to ensure that the data is in a suitable format for analysis and modeling.