Feature Engineering in Machine Learning (with Python Examples)

Feature engineering is the process of selecting, transforming, and extracting relevant features from raw data to train machine learning models.

Feature engineering is one of the most important steps in the machine learning workflow, and it can have a significant impact on the performance of the trained model.

In this article, we will explore the concept of feature engineering, its importance in machine learning, and some common techniques used for feature engineering.

What is Feature Engineering?

Feature engineering is the process of selecting, transforming and extracting features from raw data to create a dataset that is suitable for training a machine learning model. In other words, feature engineering is the process of creating features that best represent the underlying problem that we are trying to solve.

The process of feature engineering involves the following steps:

  • Selection of relevant features
  • Preprocessing of features (cleaning, normalization, transformation, etc.)
  • Extraction of new features (if required)

Importance of Feature Engineering

The quality of the features used to train a machine learning model is one of the most important factors that determine the performance of the model. A well-engineered set of features can make even a simple machine learning algorithm perform well, whereas a poorly designed set of features can make even the most advanced machine learning algorithm fail.

Some of the benefits of feature engineering include:

  • Improvement in the accuracy of the model
  • Reduction in overfitting of the model
  • Reduction in the dimensionality of the data
  • Improved interpretability of the model

Common Techniques for Feature Engineering

There are several techniques that can be used for feature engineering in machine learning. Some of the most common are listed below; a short code example of each follows the list:

  1. Imputation: Imputation is the process of replacing missing values in a dataset with some meaningful value. There are several techniques for imputation such as mean imputation, median imputation, mode imputation, etc.
  2. Scaling: Scaling is the process of transforming the range of the features to a common scale. This is useful when different features have different scales, and we want to give equal importance to all the features. Some common scaling techniques are Min-Max scaling and Z-score normalization.
  3. Encoding: Encoding is the process of converting categorical data into numerical data. Some common encoding techniques are one-hot encoding, label encoding, and target encoding.
  4. Feature Selection: Feature selection is the process of selecting a subset of the most relevant features for training the model. Some common feature selection techniques are correlation-based feature selection, mutual information-based feature selection, and model-based feature selection.
  5. Feature Extraction: Feature extraction is the process of creating new features from existing features. Some common feature extraction techniques are Principal Component Analysis (PCA), Linear Discriminant Analysis (LDA), and t-SNE.
  6. Feature Transformation: Feature transformation applies mathematical functions (such as log, square root, or power transforms) to the original features to create new ones that better represent the patterns and structure in the data.
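
To illustrate imputation, here is a minimal sketch using scikit-learn's SimpleImputer on a small made-up matrix (the values are illustrative, not from a real dataset):

import numpy as np
from sklearn.impute import SimpleImputer

# Toy feature matrix with missing values (illustrative data)
X = np.array([[1.0, 2.0],
              [np.nan, 3.0],
              [7.0, np.nan]])

# Mean imputation: replace each missing value with its column mean
imputer = SimpleImputer(strategy='mean')
X_imputed = imputer.fit_transform(X)
print(X_imputed)  # NaNs replaced by the column means 4.0 and 2.5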
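
To illustrate scaling, a minimal sketch contrasting Min-Max scaling with Z-score normalization on toy data:

import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Two features on very different scales (illustrative data)
X = np.array([[1.0, 200.0],
              [2.0, 300.0],
              [3.0, 400.0]])

# Min-Max scaling maps each feature to the [0, 1] range
print(MinMaxScaler().fit_transform(X))

# Z-score normalization gives each feature zero mean and unit variance
print(StandardScaler().fit_transform(X))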
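
To illustrate encoding, a short sketch showing one-hot encoding with pandas and label encoding with scikit-learn; the 'color' column is a made-up example:

import pandas as pd
from sklearn.preprocessing import LabelEncoder

# A single categorical column (illustrative data)
df = pd.DataFrame({'color': ['red', 'green', 'blue', 'green']})

# One-hot encoding: one binary column per category
print(pd.get_dummies(df, columns=['color']))

# Label encoding: map each category to an integer
print(LabelEncoder().fit_transform(df['color']))  # [2 1 0 1]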
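
To illustrate feature selection, a sketch that keeps the two features of the built-in Iris dataset with the highest mutual information with the target:

from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, mutual_info_classif

X, y = load_iris(return_X_y=True)

# Keep the two features most informative about the class label
selector = SelectKBest(score_func=mutual_info_classif, k=2)
X_selected = selector.fit_transform(X, y)
print(X_selected.shape)  # (150, 2)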
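
To illustrate feature extraction, a sketch that projects the four Iris features onto two principal components with PCA:

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)

# Create two new features that capture most of the variance in the original four
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)
print(X_pca.shape)                    # (150, 2)
print(pca.explained_variance_ratio_)  # share of variance per component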
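
To illustrate feature transformation, a sketch that applies a log transform via FunctionTransformer to a skewed, made-up feature:

import numpy as np
from sklearn.preprocessing import FunctionTransformer

# A strongly right-skewed, non-negative feature (illustrative data)
X = np.array([[1.0], [10.0], [100.0], [1000.0]])

# log1p compresses the long right tail into a more even spread
log_transform = FunctionTransformer(np.log1p)
print(log_transform.fit_transform(X))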

Python Code Examples

Feature Engineering with scikit-learn


from sklearn.datasets import fetch_california_housing
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Load the California Housing dataset
# (load_boston was removed in scikit-learn 1.2, so we use this dataset instead)
data = fetch_california_housing()

# Split the data into features and target
X, y = data.data, data.target

# Generate degree-2 polynomial and interaction features
poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(X)

# Standardize the expanded features to zero mean and unit variance
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_poly)
# X_scaled now has 44 columns: the 8 original features, their squares,
# and all pairwise products

Useful Python Libraries for Feature Engineering

  • NumPy: np.concatenate(), np.vstack(), np.hstack(), np.column_stack()
  • Pandas: pd.concat(), pd.get_dummies(), pd.cut(), pd.qcut(), pd.pivot_table(), pd.melt() (see the binning example after this list)
  • Scikit-learn: PolynomialFeatures(), StandardScaler(), MinMaxScaler(), MaxAbsScaler(), RobustScaler(), Binarizer(), FunctionTransformer(), KBinsDiscretizer()
  • Feature-engine: OrdinalEncoder(), RareLabelEncoder(), OneHotEncoder(), MeanMedianImputer(), CategoricalImputer(), MeanEncoder(), CountFrequencyEncoder()
  • Category Encoders: OrdinalEncoder(), OneHotEncoder(), TargetEncoder(), CountEncoder(), CatBoostEncoder()
  • TextBlob: TextBlob(), sentiment.polarity, sentiment.subjectivity
  • NLTK: word_tokenize(), stopwords.words(), PorterStemmer(), SnowballStemmer(), WordNetLemmatizer()
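
As a quick, illustrative sketch of two of the pandas helpers listed above, pd.cut and pd.qcut bin a numeric feature into categories (the ages are made up):

import pandas as pd

# A numeric feature to discretize (illustrative data)
ages = pd.Series([5, 23, 37, 45, 62, 81])

# pd.cut: bins of equal width; pd.qcut: bins with roughly equal counts
print(pd.cut(ages, bins=3))
print(pd.qcut(ages, q=3))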

Datasets Useful for Feature Engineering

Titanic


# Python example
import pandas as pd
url = 'https://web.stanford.edu/class/archive/cs/cs109/cs109.1166/stuff/titanic.csv'
df = pd.read_csv(url)
df.head()

California Housing


# Python example
# (load_boston was removed from scikit-learn; California Housing is the usual alternative)
import pandas as pd
from sklearn.datasets import fetch_california_housing
housing = fetch_california_housing()
df = pd.DataFrame(housing.data, columns=housing.feature_names)
df['target'] = housing.target
df.head()

Adult Census Income


# Python example
import pandas as pd
url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data'
columns = ['age', 'workclass', 'fnlwgt', 'education', 'education-num', 'marital-status', 'occupation', 'relationship', 'race', 'sex', 'capital-gain', 'capital-loss', 'hours-per-week', 'native-country', 'income']
df = pd.read_csv(url, header=None, names=columns, na_values='?', skipinitialspace=True)
df.head()

Relevant Entities

  • Feature: A measurable property or characteristic of a phenomenon being observed or studied.
  • Feature extraction: The process of automatically extracting relevant features from raw data.
  • Feature selection: The process of selecting a subset of relevant features to use in model training.
  • Dimensionality reduction: The process of reducing the number of features used in a model, often through techniques like principal component analysis (PCA) or t-SNE.
  • Encoding: The process of converting categorical data into a numerical format that can be used in machine learning models.
  • Transformations: Mathematical operations that can be applied to features to create new features or improve their usefulness in machine learning models.

Important Concepts in Feature Engineering

  • Data Preprocessing
  • Feature Extraction
  • Feature Scaling
  • Feature Selection
  • Dimensionality Reduction
  • Feature Construction
  • Handling Missing Data
  • Handling Categorical Data
  • Handling Text Data
  • Handling Time-Series Data
  • Handling Image Data
  • Handling Audio Data
  • Handling Spatial Data
  • Feature Crosses
  • Feature Interactions

Conclusion

Feature engineering is a critical step in the machine learning workflow that can have a significant impact on the performance of the trained model. It involves selecting, transforming, and extracting relevant features from the data to create a dataset that is suitable for training a machine learning model.

What is feature engineering?

Feature engineering is the process of selecting and transforming raw data into features that a machine learning model can use.

Why is feature engineering important?

Good feature engineering can significantly improve model accuracy and generalization.

What are some common feature engineering techniques?

Common techniques include feature extraction, feature selection, dimensionality reduction, encoding, and transformations.

What are the challenges of feature engineering?

Feature engineering can be time-consuming, requires domain knowledge, carries a risk of overfitting, and is hard to automate.