Scikit-Learn’s preprocessing.OrdinalEncoder in Python (with Examples)

Welcome to this article where we dive into the realm of machine learning preprocessing using Scikit-Learn’s OrdinalEncoder. Preprocessing is a crucial step in any machine learning pipeline. The OrdinalEncoder is one of the Scikit-Learn Encoders used for handling ordinal categorical data.

Understanding Ordinal Categorical Data

Ordinal categorical data consists of non-numeric values that have a clear order or ranking, like education levels or customer satisfaction ratings.

The Role of OrdinalEncoder

The OrdinalEncoder is designed to transform ordinal categorical variables into numerical values while preserving the order information.

Handling Ordinal Variables

OrdinalEncoder addresses the challenge of encoding ordinal variables by mapping categories to ordered numerical values.

Working Principle

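For each feature, OrdinalEncoder maps the categories to the integers 0 to n_categories - 1. By default (categories='auto') it infers the categories from the data and sorts them, which for strings means alphabetical order. To preserve a meaningful ranking, pass the ordered list of categories explicitly through the categories parameter.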
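
The difference between the default and an explicit category order can be seen in a minimal sketch. The toy data and variable names below are illustrative assumptions, not part of any real dataset.

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

# Toy single-column feature (illustrative data only)
levels = np.array([["Low"], ["Medium"], ["High"]])

# Default: categories are inferred from the data and sorted
auto_encoder = OrdinalEncoder()
auto_codes = auto_encoder.fit_transform(levels)
print(auto_encoder.categories_[0])  # learned order: High, Low, Medium
print(auto_codes.ravel())           # Low -> 1, Medium -> 2, High -> 0

# Explicit: state the intended ranking, lowest category first
ranked_encoder = OrdinalEncoder(categories=[["Low", "Medium", "High"]])
ranked_codes = ranked_encoder.fit_transform(levels)
print(ranked_codes.ravel())         # Low -> 0, Medium -> 1, High -> 2
```

With the default settings the codes follow alphabetical order rather than the intended ranking, so the explicit list is usually what you want for truly ordinal features.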

Use Cases

  • Education levels
  • Socioeconomic status
  • Customer satisfaction ratings

Benefits of OrdinalEncoder

  • Preserves the ordinal relationship between categories
  • Enables numerical representation of ordinal data for machine learning models
  • Useful when applying algorithms that require numeric input

Challenges and Considerations

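OrdinalEncoder assumes a meaningful order in the categories, which might not always be accurate. The integer codes also suggest equal spacing between neighboring categories, and applying the encoder to nominal data introduces a ranking that does not exist. In addition, categories that were not seen during fitting raise an error at transform time by default.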
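
One way to deal with categories that were not present during fitting is the handle_unknown option. The sketch below assumes scikit-learn 0.24 or newer, where handle_unknown='use_encoded_value' and unknown_value are available; the data and the choice of -1 as the placeholder are illustrative.

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

train = np.array([["Low"], ["Medium"], ["High"]])
new_data = np.array([["Medium"], ["Very High"]])  # "Very High" was not seen in fit

# Default behaviour (handle_unknown='error'): unseen categories raise ValueError
strict_encoder = OrdinalEncoder(categories=[["Low", "Medium", "High"]])
strict_encoder.fit(train)
try:
    strict_encoder.transform(new_data)
    strict_failed = False
except ValueError as err:
    strict_failed = True
    print("Default behaviour raised:", err)

# Alternative: map unseen categories to a placeholder code (-1 is an arbitrary choice)
tolerant_encoder = OrdinalEncoder(
    categories=[["Low", "Medium", "High"]],
    handle_unknown="use_encoded_value",
    unknown_value=-1,
)
tolerant_encoder.fit(train)
tolerant_codes = tolerant_encoder.transform(new_data)
print(tolerant_codes.ravel())  # Medium -> 1, unseen category -> -1
```

Whether a placeholder code is appropriate depends on the downstream model; for some models it may be better to fail loudly and fix the data instead.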

Applying OrdinalEncoder

OrdinalEncoder is commonly used when dealing with ordinal categorical features, either as a standalone preprocessing step or as part of a more extensive data transformation process.
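
As a rough sketch of the second case, OrdinalEncoder can be combined with other transformers through ColumnTransformer. The table, column names, and category rankings below are hypothetical and only serve to illustrate the pattern.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OrdinalEncoder, StandardScaler

# Hypothetical customer table; column names and values are made up
df = pd.DataFrame({
    "education": ["High School", "Bachelor", "Master", "Bachelor"],
    "satisfaction": ["Low", "High", "Medium", "Low"],
    "age": [25, 32, 47, 51],
})

# One ordered category list per encoded column, in the same order as the columns
ordinal = OrdinalEncoder(categories=[
    ["High School", "Bachelor", "Master"],
    ["Low", "Medium", "High"],
])

# Encode the ordinal columns and scale the numeric column in one step
preprocessor = ColumnTransformer(transformers=[
    ("ordinal", ordinal, ["education", "satisfaction"]),
    ("numeric", StandardScaler(), ["age"]),
])

features = preprocessor.fit_transform(df)
print(features[:, :2])  # encoded education and satisfaction columns
```

The resulting preprocessor can be placed in front of an estimator in a Pipeline, so the same encoding is applied consistently at training and prediction time.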

Python Code Examples

Example 1: Using Scikit-Learn Preprocessing OrdinalEncoder


from sklearn.preprocessing import OrdinalEncoder
import numpy as np

# One feature column with three ordered categories
data = np.array([['Low'],
                 ['Medium'],
                 ['High'],
                 ['Medium'],
                 ['Low']])

# List the categories from lowest to highest: Low -> 0, Medium -> 1, High -> 2
encoder = OrdinalEncoder(categories=[['Low', 'Medium', 'High']])
encoded_data = encoder.fit_transform(data)

print("Original Data:")
print(data)
print("\nEncoded Data:")
print(encoded_data)
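
To check which integer each category received in an example like the one above, the fitted encoder exposes the categories_ attribute, and inverse_transform maps the codes back to the original labels. The following is a small self-contained sketch that repeats the same toy data.

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

data = np.array([["Low"], ["Medium"], ["High"], ["Medium"], ["Low"]])

encoder = OrdinalEncoder(categories=[["Low", "Medium", "High"]])
encoded = encoder.fit_transform(data)

# The position of a category in categories_[0] is its integer code
print(encoder.categories_[0])

# inverse_transform recovers the original labels from the codes
decoded = encoder.inverse_transform(encoded)
print(decoded.ravel())
```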

Visualize Scikit-Learn Preprocessing OrdinalEncoder with Python


import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.preprocessing import OrdinalEncoder

# Load the Iris dataset
iris = load_iris()
X = iris.data
species = iris.target_names[iris.target]  # species names as strings

# Apply OrdinalEncoder to species (it expects a 2-D array, hence the reshape);
# ravel() flattens the result so it can be used directly as a color array
encoder = OrdinalEncoder()
species_encoded = encoder.fit_transform(species.reshape(-1, 1)).ravel()

# Plot the same points, colored by the original labels and by the encoded labels
plt.figure(figsize=(10, 5))

plt.subplot(1, 2, 1)
plt.scatter(X[:, 0], X[:, 1], c=iris.target)
plt.title('Original Data')
plt.xlabel(iris.feature_names[0])
plt.ylabel(iris.feature_names[1])

plt.subplot(1, 2, 2)
plt.scatter(X[:, 0], X[:, 1], c=species_encoded)
plt.title('Encoded Data')
plt.xlabel(iris.feature_names[0])
plt.ylabel(iris.feature_names[1])

plt.tight_layout()
plt.show()

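This code uses the Matplotlib library to visualize the effect of the Scikit-Learn Preprocessing OrdinalEncoder on the Iris dataset. It loads the Iris dataset, applies the OrdinalEncoder to encode the species names, and then plots the first two features side by side, colored by the original integer labels and by the encoded labels. The two panels are colored identically: with default settings the encoder assigns 0, 1, and 2 to the species names in alphabetical order, which matches the order already used by iris.target. Note that species is a nominal variable, so it serves here only to illustrate the mechanics of the encoder.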

Sklearn Encoders

Scikit-Learn provides several encoders for handling categorical data. Three of the most commonly used are LabelEncoder, OneHotEncoder, and OrdinalEncoder.

  • LabelEncoder converts class labels into integers from 0 to n_classes - 1. It is intended for encoding the target variable in classification rather than input features.
  • OneHotEncoder transforms categorical features into a binary matrix with one column per category, representing the presence or absence of each category. Because no category receives a larger number than another, the model is not given an artificial order.
  • OrdinalEncoder encodes ordinal categorical features as integers that follow the category order, maintaining the ranking between categories.

These encoders transform different kinds of categorical data into formats compatible with various machine learning algorithms.

| Encoder | Advantages | Disadvantages | Best Use Case |
|---|---|---|---|
| LabelEncoder | Simple and efficient encoding. Designed for target variables. Doesn't create additional columns. | Integer codes follow the sorted label order, not a meaningful ranking. Not intended for input features. | Encoding the target variable in classification tasks. |
| OneHotEncoder | Avoids implying an order between categories. Useful for nominal categorical features. Compatible with various algorithms. | Creates high-dimensional data. Potential multicollinearity issues. | Nominal features for machine learning algorithms requiring numeric input. |
| OrdinalEncoder | Maintains ordinal relationships. Handles meaningful order. Useful for features with inherent hierarchy. | May introduce unintended relationships. Not suitable for nominal data. | Features with clear ordinal rankings, like education levels or ratings. |

Python Example


from sklearn.preprocessing import LabelEncoder, OneHotEncoder, OrdinalEncoder
import pandas as pd

# Create a sample dataset: two feature columns and a target column ('class')
data = pd.DataFrame({
    'color': ['Red', 'Blue', 'Green', 'Red', 'Blue'],
    'size': ['Small', 'Large', 'Medium', 'Medium', 'Small'],
    'class': ['A', 'B', 'C', 'A', 'C']
})

# Using LabelEncoder on the target column only (it expects a 1-D array of labels)
label_encoder = LabelEncoder()
data_label_encoded = label_encoder.fit_transform(data['class'])

# Using OneHotEncoder on the feature columns; the result is sparse, so convert it to dense
onehot_encoder = OneHotEncoder()
data_onehot_encoded = onehot_encoder.fit_transform(data[['color', 'size']]).toarray()

# Using OrdinalEncoder on the ordered feature, with the ranking stated explicitly
ordinal_encoder = OrdinalEncoder(categories=[['Small', 'Medium', 'Large']])
data_ordinal_encoded = ordinal_encoder.fit_transform(data[['size']])

print("Original Data:")
print(data)

print("\nLabel Encoded Data:")
print(data_label_encoded)

print("\nOneHot Encoded Data:")
print(data_onehot_encoded)

print("\nOrdinal Encoded Data:")
print(data_ordinal_encoded)

To learn more, read our blog post on Scikit-learn encoders.

Important Concepts in Scikit-Learn Preprocessing OrdinalEncoder

  • Categorical data and its types
  • Understanding ordinal categorical data
  • Order-preserving encoding techniques
  • Handling nominal data vs. ordinal data
  • Mapping categories to numerical values

What to Know Before You Learn Scikit-Learn Preprocessing OrdinalEncoder

  • Basics of categorical data and its significance in machine learning
  • Understanding of ordinal relationships in data
  • Familiarity with encoding techniques for categorical variables
  • Experience using Scikit-Learn for machine learning tasks
  • Appreciation of how different encoders handle categorical data

What’s Next?

  • Exploration of other Scikit-Learn preprocessing techniques
  • Introduction to feature scaling and normalization
  • Handling missing data in machine learning
  • Advanced encoding methods (Target Encoding, Frequency Encoding)
  • Application of preprocessing techniques in real-world datasets
  • Building complete machine learning pipelines

Relevant Entities

| Entities | Properties |
|---|---|
| Scikit-Learn OrdinalEncoder | Converts ordinal categorical variables into numeric values while preserving order. |
| Ordinal Categorical Data | Non-numeric values with meaningful order, like education levels. |
| Ordinal Variables | Categorical features with a distinct order or ranking. |
| Numerical Mapping | Assigning numerical values based on the order of categories. |
| Use Cases | Education levels, customer satisfaction ratings, socioeconomic status. |
| Preserved Order | Ensuring that ordinal relationships are maintained after encoding. |

Conclusion

The Scikit-Learn OrdinalEncoder is a valuable tool for converting ordinal categorical data into numerical values that retain the order information. By understanding how to use it effectively, data scientists can enhance the quality of their machine learning models when dealing with ordinal features.