Feature Engineering in Machine Learning (with Python Examples)

Feature engineering is the process of selecting, transforming, and extracting relevant features from raw data to train machine learning models.

Feature engineering is one of the most important steps in the machine learning workflow, and it can have a significant impact on the performance of the trained model.

In this article, we will explore the concept of feature engineering, its importance in machine learning, and some common techniques used for feature engineering.

What is Feature Engineering?

Feature engineering is the process of selecting, transforming and extracting features from raw data to create a dataset that is suitable for training a machine learning model. In other words, feature engineering is the process of creating features that best represent the underlying problem that we are trying to solve.

The process of feature engineering involves the following steps:

  • Selection of relevant features
  • Preprocessing of features (cleaning, normalization, transformation, etc.)
  • Extraction of new features (if required)

Importance of Feature Engineering

The quality of the features used to train a machine learning model is one of the most important factors that determine the performance of the model. A well-engineered set of features can make even a simple machine learning algorithm perform well, whereas a poorly designed set of features can make even the most advanced machine learning algorithm fail.

Some of the benefits of feature engineering include:

  • Improvement in the accuracy of the model
  • Reduction in overfitting of the model
  • Reduction in the dimensionality of the data
  • Improved interpretability of the model

Common Techniques for Feature Engineering

There are several techniques that can be used for feature engineering in machine learning. Some of the most common are listed below; a short code example of each follows the list:

  1. Imputation: Imputation is the process of replacing missing values in a dataset with some meaningful value. There are several techniques for imputation such as mean imputation, median imputation, mode imputation, etc.
  2. Scaling: Scaling is the process of transforming the range of the features to a common scale. This is useful when different features have different scales, and we want to give equal importance to all the features. Some common scaling techniques are Min-Max scaling and Z-score normalization.
  3. Encoding: Encoding is the process of converting categorical data into numerical data. Some common encoding techniques are one-hot encoding, label encoding, and target encoding.
  4. Feature Selection: Feature selection is the process of selecting a subset of the most relevant features for training the model. Some common feature selection techniques are correlation-based feature selection, mutual information-based feature selection, and model-based feature selection.
  5. Feature Extraction: Feature extraction is the process of creating new features from existing features. Some common feature extraction techniques are Principal Component Analysis (PCA), Linear Discriminant Analysis (LDA), and t-SNE.
  6. Feature Transformation: Feature transformation applies mathematical functions (such as log, square root, or power transforms) to the original features to create new ones that better represent the patterns and structure in the data.
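
To illustrate imputation, here is a minimal sketch using scikit-learn's SimpleImputer on a small made-up matrix (the values are illustrative, not from a real dataset):

import numpy as np
from sklearn.impute import SimpleImputer

# Toy feature matrix with missing values (illustrative data)
X = np.array([[1.0, 2.0],
              [np.nan, 3.0],
              [7.0, np.nan]])

# Mean imputation: replace each missing value with its column mean
imputer = SimpleImputer(strategy='mean')
X_imputed = imputer.fit_transform(X)
print(X_imputed)  # NaNs replaced by the column means 4.0 and 2.5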
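
To illustrate scaling, a minimal sketch contrasting Min-Max scaling with Z-score normalization on toy data:

import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Two features on very different scales (illustrative data)
X = np.array([[1.0, 200.0],
              [2.0, 300.0],
              [3.0, 400.0]])

# Min-Max scaling maps each feature to the [0, 1] range
print(MinMaxScaler().fit_transform(X))

# Z-score normalization gives each feature zero mean and unit variance
print(StandardScaler().fit_transform(X))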
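
To illustrate encoding, a short sketch showing one-hot encoding with pandas and label encoding with scikit-learn; the 'color' column is a made-up example:

import pandas as pd
from sklearn.preprocessing import LabelEncoder

# A single categorical column (illustrative data)
df = pd.DataFrame({'color': ['red', 'green', 'blue', 'green']})

# One-hot encoding: one binary column per category
print(pd.get_dummies(df, columns=['color']))

# Label encoding: map each category to an integer
print(LabelEncoder().fit_transform(df['color']))  # [2 1 0 1]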
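
To illustrate feature selection, a sketch that keeps the two features of the built-in Iris dataset with the highest mutual information with the target:

from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, mutual_info_classif

X, y = load_iris(return_X_y=True)

# Keep the two features most informative about the class label
selector = SelectKBest(score_func=mutual_info_classif, k=2)
X_selected = selector.fit_transform(X, y)
print(X_selected.shape)  # (150, 2)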
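
To illustrate feature extraction, a sketch that projects the four Iris features onto two principal components with PCA:

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)

# Create two new features that capture most of the variance in the original four
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)
print(X_pca.shape)                    # (150, 2)
print(pca.explained_variance_ratio_)  # share of variance per component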
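
To illustrate feature transformation, a sketch that applies a log transform via FunctionTransformer to a skewed, made-up feature:

import numpy as np
from sklearn.preprocessing import FunctionTransformer

# A strongly right-skewed, non-negative feature (illustrative data)
X = np.array([[1.0], [10.0], [100.0], [1000.0]])

# log1p compresses the long right tail into a more even spread
log_transform = FunctionTransformer(np.log1p)
print(log_transform.fit_transform(X))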

Python Code Examples

Feature Engineering with scikit-learn


from sklearn.datasets import fetch_california_housing
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Load the California Housing dataset
# (load_boston was removed in scikit-learn 1.2, so we use this dataset instead)
data = fetch_california_housing()

# Split the data into features and target
X, y = data.data, data.target

# Generate degree-2 polynomial and interaction features
poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(X)

# Standardize the expanded features to zero mean and unit variance
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_poly)
# X_scaled now has 44 columns: the 8 original features, their squares,
# and all pairwise products

Useful Python Libraries for Feature Engineering

  • NumPy: np.concatenate(), np.vstack(), np.hstack(), np.column_stack()
  • Pandas: pd.concat(), pd.get_dummies(), pd.cut(), pd.qcut(), pd.pivot_table(), pd.melt() (see the binning example after this list)
  • Scikit-learn: PolynomialFeatures(), StandardScaler(), MinMaxScaler(), MaxAbsScaler(), RobustScaler(), Binarizer(), FunctionTransformer(), KBinsDiscretizer()
  • Feature-engine: OrdinalEncoder(), RareLabelEncoder(), OneHotEncoder(), MeanMedianImputer(), CategoricalImputer(), MeanEncoder(), CountFrequencyEncoder()
  • Category Encoders: OrdinalEncoder(), OneHotEncoder(), TargetEncoder(), CountEncoder(), CatBoostEncoder()
  • TextBlob: TextBlob(), sentiment.polarity, sentiment.subjectivity
  • NLTK: word_tokenize(), stopwords.words(), PorterStemmer(), SnowballStemmer(), WordNetLemmatizer()
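
As a quick, illustrative sketch of two of the pandas helpers listed above, pd.cut and pd.qcut bin a numeric feature into categories (the ages are made up):

import pandas as pd

# A numeric feature to discretize (illustrative data)
ages = pd.Series([5, 23, 37, 45, 62, 81])

# pd.cut: bins of equal width; pd.qcut: bins with roughly equal counts
print(pd.cut(ages, bins=3))
print(pd.qcut(ages, q=3))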

Datasets Useful for Feature Engineering

Titanic


# Python example
import pandas as pd
url = 'https://web.stanford.edu/class/archive/cs/cs109/cs109.1166/stuff/titanic.csv'
df = pd.read_csv(url)
df.head()

California Housing


# Python example
# (load_boston was removed from scikit-learn; California Housing is the usual alternative)
import pandas as pd
from sklearn.datasets import fetch_california_housing
housing = fetch_california_housing()
df = pd.DataFrame(housing.data, columns=housing.feature_names)
df['target'] = housing.target
df.head()

Adult Census Income


# Python example
import pandas as pd
url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data'
columns = ['age', 'workclass', 'fnlwgt', 'education', 'education-num', 'marital-status', 'occupation', 'relationship', 'race', 'sex', 'capital-gain', 'capital-loss', 'hours-per-week', 'native-country', 'income']
df = pd.read_csv(url, header=None, names=columns, na_values='?', skipinitialspace=True)
df.head()

Relevant Entities

  • Feature: A measurable property or characteristic of a phenomenon being observed or studied.
  • Feature extraction: The process of automatically extracting relevant features from raw data.
  • Feature selection: The process of selecting a subset of relevant features to use in model training.
  • Dimensionality reduction: The process of reducing the number of features used in a model, often through techniques like principal component analysis (PCA) or t-SNE.
  • Encoding: The process of converting categorical data into a numerical format that can be used in machine learning models.
  • Transformations: Mathematical operations that can be applied to features to create new features or improve their usefulness in machine learning models.

Important Concepts in Feature Engineering

  • Data Preprocessing
  • Feature Extraction
  • Feature Scaling
  • Feature Selection
  • Dimensionality Reduction
  • Feature Construction
  • Handling Missing Data
  • Handling Categorical Data
  • Handling Text Data
  • Handling Time-Series Data
  • Handling Image Data
  • Handling Audio Data
  • Handling Spatial Data
  • Feature Crosses
  • Feature Interactions

Conclusion

Feature engineering is a critical step in the machine learning workflow that can have a significant impact on the performance of the trained model. It involves selecting, transforming, and extracting relevant features from the data to create a dataset that is suitable for training a machine learning model.

What is feature engineering?

Feature engineering is the process of selecting and transforming raw data into features that a machine learning model can use.

Why is feature engineering important?

Good feature engineering can significantly improve model accuracy and generalization.

What are some common feature engineering techniques?

Common techniques include feature extraction, feature selection, dimensionality reduction, encoding, and transformations.

What are the challenges of feature engineering?

Feature engineering can be time-consuming, requires domain knowledge, carries a risk of overfitting, and is hard to automate.