Data preprocessing is an essential step in the machine learning pipeline, as it has a significant impact on the accuracy of the model.
The goal of data preprocessing is to clean, transform, and normalize the data, so that it can be used effectively in training a machine learning model. This article will explore the importance of data preprocessing and some of the most common techniques used to preprocess data.
Why is Data Preprocessing Important?
Data preprocessing is critical because many machine learning algorithms require the data to be in a specific format. For example, most scikit-learn estimators expect numeric input without missing values, and they raise an error or produce unreliable results when given anything else. Data preprocessing can also improve the performance of the model by reducing noise, handling missing values, and correcting errors in the data, which increases the accuracy and reliability of its predictions.
Common Data Preprocessing Techniques
Handling Missing Values
One of the most common data preprocessing techniques is handling missing values. Values go missing when a measurement was never recorded, a field was left blank, or records were combined from sources with different columns. There are several ways to handle missing values, including:
- Removing the rows with missing values
- Replacing the missing values with the mean, median, or mode of the column
- Using a machine learning algorithm to predict the missing values
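The first two options above can be sketched with pandas as follows. This is a minimal illustration, not a recommended recipe: the toy DataFrame and its column names (`age`, `city`) are made up for the example, and filling with the median and the mode is only one reasonable choice. Model-based imputation is left out to keep the sketch short.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the column names are illustrative only.
df = pd.DataFrame({
    "age": [25.0, np.nan, 40.0, 31.0],
    "city": ["Paris", "Lyon", None, "Paris"],
})

# Option 1: drop every row that contains at least one missing value.
dropped = df.dropna()

# Option 2: fill numeric gaps with the median and categorical gaps with the mode.
filled = df.copy()
filled["age"] = filled["age"].fillna(filled["age"].median())
filled["city"] = filled["city"].fillna(filled["city"].mode()[0])
```

Dropping rows is the simplest option but discards data, while imputation keeps every row at the cost of introducing estimated values.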
Data Normalization
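Another important data preprocessing technique is data normalization, also called feature scaling. Scaling puts features on a comparable range, such as 0 to 1, so that features with large numeric values do not dominate distance-based or gradient-based algorithms. Note that scaling does not remove outliers: min-max scaling in particular is sensitive to them, because a single extreme value determines the range. Common methods for scaling data include:
- Min-max scaling, which rescales values to a fixed range, typically 0 to 1
- Standardization (z-score scaling), which rescales values to zero mean and unit variance; scikit-learn implements it as StandardScaler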
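Both min-max scaling and z-score standardization reduce to a line of arithmetic each. The sketch below writes them out with NumPy on a made-up feature (the `values` array is an assumption for illustration) that deliberately contains one extreme value.

```python
import numpy as np

# Hypothetical feature with one extreme value at the end.
values = np.array([2.0, 4.0, 6.0, 8.0, 100.0])

# Min-max scaling: the smallest value maps to 0 and the largest to 1.
min_max = (values - values.min()) / (values.max() - values.min())

# Standardization (z-score): subtract the mean, divide by the standard deviation.
# np.std defaults to the population standard deviation (ddof=0).
z_scores = (values - values.mean()) / values.std()
```

In this example the single large value compresses the other four min-max results into a narrow band near 0, which shows why extreme values are worth inspecting before scaling.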
Data Transformation
Data transformation is the process of converting data into a different representation. It is used to convert categorical data into numerical data, and to reshape skewed numerical variables so that their distribution is easier for a model to work with. Some common data transformation techniques include:
- One-hot encoding
- Log Transformation
- Square Root Transformation
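The three transformations listed above could look like this in pandas and NumPy. The DataFrame and its columns (`color`, `income`) are hypothetical, and `np.log1p` is used here as one common way to apply a log transform to data that contains zeros.

```python
import numpy as np
import pandas as pd

# Hypothetical data: one categorical column and one skewed numeric column.
df = pd.DataFrame({
    "color": ["red", "green", "red"],
    "income": [0.0, 99.0, 9999.0],
})

# One-hot encoding: one indicator column per category, replacing "color".
encoded = pd.get_dummies(df, columns=["color"])

# Log transformation: log1p computes log(1 + x), so zeros are handled safely.
encoded["income_log"] = np.log1p(encoded["income"])

# Square root transformation: a milder compression, valid for non-negative values.
encoded["income_sqrt"] = np.sqrt(encoded["income"])
```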
Outlier Detection and Removal
Outliers are data points that differ significantly from the rest of the data. They can distort statistics such as the mean and degrade the accuracy of models that are sensitive to extreme values. Outlier detection identifies these points so that they can be removed, capped, or investigated; an outlier may be a data-entry error, but it may also be a genuine observation. Some common techniques for outlier detection include:
- Z-score
- IQR (Interquartile Range)
- Mahalanobis Distance
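As a rough sketch of the first two techniques, the snippet below flags outliers in a single made-up column. The sample data, the 1.5 × IQR fences, and the z-score cutoff of 3 are conventional choices assumed for this example rather than fixed rules. Mahalanobis distance, which applies to several features at once, is not shown.

```python
import pandas as pd

# Hypothetical sample: 19 values between 10 and 13, plus one extreme value.
s = pd.Series([10, 12, 11, 13, 12, 11] * 3 + [12, 95])

# IQR rule: flag points more than 1.5 * IQR outside the quartiles.
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
iqr_outliers = s[(s < lower) | (s > upper)]

# Z-score rule: flag points more than 3 standard deviations from the mean.
z = (s - s.mean()) / s.std(ddof=0)
z_outliers = s[z.abs() > 3]
```

On very small samples the z-score rule can miss even an obvious outlier, because the extreme value inflates the standard deviation it is measured against. The IQR rule is less affected by this.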
Python Code Examples
Handling missing values
We will learn how to handle missing values with Pandas.
import pandas as pd
# "data.csv" is a placeholder file name; na_values="?" reads "?" cells as NaN
df = pd.read_csv("data.csv", na_values="?")
# Fill missing values in each numeric column with that column's mean
df = df.fillna(df.mean(numeric_only=True))
This code imports the pandas library, which is used for data manipulation and analysis, with the alias ‘pd’.
It then reads a CSV file named “data.csv” into a DataFrame called ‘df’. A DataFrame is a table-like structure that holds data in rows and columns. The na_values="?" argument tells pandas to treat every “?” cell as NaN (Not a Number), the value pandas uses to represent missing data. Converting while reading lets otherwise numeric columns keep a numeric type.
Lastly, it fills the NaN values with the mean of each column. df.mean(numeric_only=True) calculates the mean of every numeric column, and fillna replaces the NaN values in each of those columns with the corresponding mean. Missing values in non-numeric columns are left untouched and need a separate strategy, such as filling with the mode.
Scaling data
The next example scales a column to the range 0 to 1 using sklearn.preprocessing.MinMaxScaler. Here 'column_name' stands for the column you want to scale.
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
df[['column_name']] = scaler.fit_transform(df[['column_name']])
Encoding categorical data
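Another data preprocessing step is to encode categorical data as numbers. The example below uses sklearn.preprocessing.LabelEncoder, which maps each category to an integer. scikit-learn intends LabelEncoder for target labels; for input features, OrdinalEncoder or OneHotEncoder is usually the better fit, because integer codes imply an ordering that the categories may not have.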
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
df['column_name'] = le.fit_transform(df['column_name'])
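To make the mapping concrete, here is a small self-contained sketch of what LabelEncoder produces. The list of labels is invented for the example.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical labels, used only to show the mapping.
labels = ["cat", "dog", "cat", "bird"]

le = LabelEncoder()
codes = le.fit_transform(labels)

# classes_ holds the sorted unique labels; each code is an index into it.
mapping = {label: int(code) for code, label in enumerate(le.classes_)}

# inverse_transform converts the integer codes back to the original labels.
restored = list(le.inverse_transform(codes))
```

The codes follow the sorted order of the labels, not any meaningful ranking, which is why this encoding should be used with care for input features.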
Splitting data into training and testing sets
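Splitting is not strictly a preprocessing step, but the data is usually divided into training and testing sets with train_test_split before a model is fitted. Here 20% of the rows are held out for testing, and random_state=0 makes the split reproducible.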
from sklearn.model_selection import train_test_split
X = df.drop(['target_column'], axis=1)
y = df['target_column']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
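One practical consequence of splitting is that preprocessing steps which learn statistics from the data, such as a scaler, are usually fitted on the training set only and then applied to the test set. The sketch below illustrates that order of operations on synthetic arrays; `X` and `y` here are made-up stand-ins, not the DataFrame used above.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Hypothetical data: 10 samples with 2 features, and a binary target.
X = np.arange(20, dtype=float).reshape(10, 2)
y = np.array([0, 1] * 5)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

scaler = MinMaxScaler()
# Learn the minimum and maximum from the training rows only.
X_train_scaled = scaler.fit_transform(X_train)
# Reuse those training statistics on the test rows.
X_test_scaled = scaler.transform(X_test)
```

Fitting the scaler on the full dataset before splitting would let information from the test rows influence the training data, which tends to make evaluation results look better than they are.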
Relevant Data Preprocessing Datasets
Iris dataset
# Python example
import pandas as pd
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"
dataset = pd.read_csv(url, header=None)
dataset.head()
Pima Indians Diabetes dataset
# Python example
import pandas as pd
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv"
dataset = pd.read_csv(url, header=None)
dataset.head()
Breast Cancer Wisconsin (Diagnostic) dataset
# Python example
import pandas as pd
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/wdbc.data"
dataset = pd.read_csv(url, header=None)
dataset.head()
Relevant entities
Entity | Description |
---|---|
Missing values | Represented as NaN, null or ? |
Outliers | Values that lie outside the expected range |
Duplicate values | Repeated values in a dataset |
Noisy data | Incorrect or irrelevant data points |
Scaling | Changing the range of values in a dataset |
Encoding | Converting categorical data into numerical form |
Conclusion
Data preprocessing is a crucial step in the data science pipeline as it helps to ensure that the data is in a suitable format for analysis and modeling.