Log Transformation in Machine Learning (with Python Examples)


In data analysis and machine learning, log transformation is a feature transformation technique used to modify the values of a numeric variable by taking the logarithm of each value. The logarithm function used in the transformation is typically the natural logarithm (base e) or the logarithm with base 10.

In machine learning, log transformation can be used to normalize data, reduce the impact of outliers, and make data more suitable for certain types of analyses. In this article, we’ll delve into the concept of log transformation and explore its various applications in machine learning.

In this tutorial, we will show how log transformation works in Python and apply it to the California Housing dataset.

What is Log Transformation?

Log transformation is a mathematical operation used to reduce the scale of data. It is based on logarithms, which are the inverses of exponential functions. Logarithms turn multiplication into addition and division into subtraction: log(ab) = log(a) + log(b) and log(a/b) = log(a) - log(b). This is useful in machine learning because it narrows the range of values and compresses the larger ones, making data that spans several orders of magnitude easier to analyze.
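As a minimal sketch of these properties, the snippet below checks the product and quotient rules numerically and shows how values spanning six orders of magnitude collapse into a narrow range. The numbers are arbitrary illustrative values, not taken from any dataset.

```python
import numpy as np

a, b = 200.0, 5000.0

# The log of a product equals the sum of the logs
log_of_product = np.log(a * b)
sum_of_logs = np.log(a) + np.log(b)

# The log of a quotient equals the difference of the logs
log_of_quotient = np.log(a / b)
diff_of_logs = np.log(a) - np.log(b)

# Values from 1 to 1,000,000 compress to the range 0 to 6 in base 10
values = np.array([1.0, 10.0, 1_000.0, 1_000_000.0])
log_values = np.log10(values)

print(np.isclose(log_of_product, sum_of_logs))
print(np.isclose(log_of_quotient, diff_of_logs))
print(log_values)
```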

Applications of Log Transformation in Machine Learning

Normalizing Data

Log transformation can make right-skewed, strictly positive data look closer to a normal distribution. Few machine learning algorithms strictly require normally distributed features, but linear models and other methods that are sensitive to scale and skew often perform better on roughly symmetric data. Taking the log pulls in the long right tail, which makes the data easier to analyze and model.
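One way to see this effect is to compare skewness before and after the transformation. The sketch below uses synthetic log-normal samples as a stand-in for a right-skewed feature (an assumption made for illustration) and measures skewness with scipy.stats.skew.

```python
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(seed=0)

# Hypothetical right-skewed feature: log-normal samples are strictly positive
skewed = rng.lognormal(mean=0.0, sigma=1.0, size=10_000)
logged = np.log(skewed)

# Skewness near 0 indicates a roughly symmetric distribution
skew_before = skew(skewed)
skew_after = skew(logged)

print(f"Skewness before: {skew_before:.2f}")
print(f"Skewness after:  {skew_after:.2f}")
```

Real features are rarely exactly log-normal, so the skewness after transformation will usually shrink rather than vanish.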

Reducing the Impact of Outliers

Outliers can have a significant impact on machine learning algorithms, leading to biased models. By applying a log transformation to the data, we can reduce the impact of outliers by compressing the larger values. This makes it easier to build a model that is not dominated by extreme values.
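A small sketch of this compression, using a made-up feature with one extreme value: on the raw scale the outlier drags the mean far from the median, while on the log scale the two stay close.

```python
import numpy as np

# Hypothetical feature with one extreme value
x = np.array([20.0, 25.0, 30.0, 35.0, 40.0, 5_000.0])
log_x = np.log(x)

# On the raw scale the outlier pulls the mean far above the median
print("Raw mean / median:", x.mean(), np.median(x))

# On the log scale the mean and median are much closer together
print("Log mean / median:", log_x.mean(), np.median(log_x))
```

The extreme value is still the largest point after the transformation. The log reduces its leverage but does not remove it.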

Making Data More Suitable for Certain Types of Analyses

In some cases, the relationship between variables in a dataset is not linear. Applying a log transformation to one or more variables can put the data into a form that suits linear methods. For example, if y grows exponentially with x (y = a * e^(b*x)), taking the log of y alone gives a linear relationship: log(y) = log(a) + b*x. If the relationship is a power law (y = a * x^b), taking the log of both variables gives log(y) = log(a) + b*log(x).
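The exponential case can be sketched with noise-free synthetic data: for y = a * exp(b * x), log(y) is a straight line in x, so an ordinary degree-1 fit recovers the parameters. The values of a_true and b_true are arbitrary choices for illustration.

```python
import numpy as np

# Hypothetical exponential relationship y = a * exp(b * x)
a_true, b_true = 2.0, 0.5
x = np.linspace(0.0, 10.0, 50)
y = a_true * np.exp(b_true * x)

# log(y) = log(a) + b * x, so fit a straight line to (x, log(y))
slope, intercept = np.polyfit(x, np.log(y), deg=1)

b_est = slope              # the slope estimates b
a_est = np.exp(intercept)  # the intercept estimates log(a)

print(a_est, b_est)
```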

How to Apply Log Transformation in Python

Log Transformation with NumPy

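In Python, NumPy's log function computes the natural logarithm of each element in an array (np.log10 and np.log2 cover bases 10 and 2). Inputs must be strictly positive: np.log returns -inf for 0 and nan for negative values. Here is a simple example: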


import numpy as np

# Create a sample dataset
x = np.array([1, 2, 3, 4, 5])

# Apply log transformation
log_x = np.log(x)

# Print results
print("Original dataset: ", x)
print("Log transformed dataset: ", log_x)

Output:

Original dataset:  [1 2 3 4 5]
Log transformed dataset:  [0.         0.69314718 1.09861229 1.38629436 1.60943791]
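Real features often contain zeros, which the plain logarithm cannot handle. A common workaround, shown here as a sketch on a made-up count feature, is np.log1p, which computes log(1 + x); np.expm1 reverses it.

```python
import numpy as np

# Hypothetical count feature that contains a zero
counts = np.array([0, 1, 9, 99, 999])

# log1p computes log(1 + x), so zero maps to zero instead of -inf
log_counts = np.log1p(counts)

# expm1 computes exp(x) - 1 and undoes the transformation
recovered = np.expm1(log_counts)

print(log_counts)
print(recovered)
```

This only helps with non-negative data. Values at or below -1 are still invalid for log1p.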

Log Transformation with Scikit-Learn


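The following snippet uses the FunctionTransformer class from the sklearn.preprocessing module in scikit-learn, a popular machine learning library in Python. FunctionTransformer wraps a function such as np.log so that it can be used like any other preprocessing step, for example inside a Pipeline.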

from sklearn.preprocessing import FunctionTransformer
import numpy as np

# Create a sample dataset
X = np.array([[1, 2], [3, 4]])

# Define a log transformer
log_transformer = FunctionTransformer(np.log)

# Apply log transformation
log_X = log_transformer.transform(X)

# Print results
print("Original dataset: \n", X)
print("Log transformed dataset: \n", log_X)

Output:

Original dataset: 
 [[1 2]
 [3 4]]
Log transformed dataset: 
 [[0.         0.69314718]
 [1.09861229 1.38629436]]
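FunctionTransformer also accepts an inverse_func argument, which makes it possible to map transformed values back to the original scale. The sketch below pairs np.log1p with np.expm1; that pairing is an illustrative choice (it tolerates zeros), not something the example above requires.

```python
import numpy as np
from sklearn.preprocessing import FunctionTransformer

# Sample data including a zero, which np.log alone could not handle
X = np.array([[0.0, 2.0], [3.0, 4.0]])

# func is applied by transform, inverse_func by inverse_transform
log1p_transformer = FunctionTransformer(func=np.log1p, inverse_func=np.expm1)

X_log = log1p_transformer.fit_transform(X)
X_back = log1p_transformer.inverse_transform(X_log)

print(X_log)
print(X_back)
```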

Visualize Log Transformation on the California Housing Dataset


import seaborn as sns
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_california_housing

# Load the California Housing dataset
california = fetch_california_housing()

# Convert the dataset into a Pandas DataFrame
df = pd.DataFrame(california.data, columns=california.feature_names)

# Apply log transformation to the 'AveRooms' column
df['AveRooms'] = np.log(df['AveRooms'])

# Use Seaborn's regplot function to create a scatterplot with a regression line
sns.regplot(x='AveRooms', y='AveBedrms', data=df)

# Show the plot
plt.show()

Useful Python Libraries for Log Transformation

  • NumPy: numpy.log()
  • Pandas: pandas.Series.apply()
  • SciPy: scipy.stats.boxcox()
  • Scikit-learn: sklearn.preprocessing.FunctionTransformer()
  • Math: math.log()
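As a rough comparison, the sketch below applies the natural log to the same three values with each of these tools. For SciPy it assumes the lambda parameter of Box-Cox is fixed at 0, which is the case where that transform reduces to the natural logarithm.

```python
import math

import numpy as np
import pandas as pd
from scipy import stats

values = [1.0, 10.0, 100.0]

# NumPy: element-wise on an array
with_numpy = np.log(np.array(values))

# Pandas: apply a function to a Series
with_pandas = pd.Series(values).apply(np.log)

# Math: one scalar at a time
with_math = [math.log(v) for v in values]

# SciPy: Box-Cox with lambda fixed at 0 is the natural log
with_boxcox = stats.boxcox(np.array(values), lmbda=0)

print(with_numpy)
```

When lmbda is left unset, scipy.stats.boxcox estimates it from the data and returns the fitted value alongside the transformed array, so the result is generally not a pure log transform.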

Datasets Useful for Log Transformation

California Housing dataset

import pandas as pd
from sklearn.datasets import fetch_california_housing

# Load the California Housing dataset
california = fetch_california_housing()

# Convert the dataset into a Pandas DataFrame
df = pd.DataFrame(california.data, columns=california.feature_names)

What to Know Before You Learn Log Transformation

  • Basic knowledge of algebra and calculus
  • Familiarity with Python programming language and its syntax
  • Understanding of data preprocessing and feature scaling
  • Basic knowledge of probability distributions
  • Understanding of linear regression and its assumptions

Important Concepts in Log Transformation

  • Skewed data
  • Feature scaling
  • Normalization
  • Distribution
  • Outliers
  • Interpretation of transformed data
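On the last point, interpretation, one detail is easy to miss: averaging on the log scale and then back-transforming gives the geometric mean, not the arithmetic mean. A minimal sketch with made-up values:

```python
import numpy as np

x = np.array([1.0, 10.0, 100.0])

# Average on the log scale, then return to the original scale
mean_of_logs = np.log(x).mean()
back_transformed = np.exp(mean_of_logs)

# The back-transformed value is the geometric mean (about 10),
# which differs from the arithmetic mean (37)
arithmetic_mean = x.mean()

print(back_transformed, arithmetic_mean)
```

The same caution applies to model predictions made on a log-transformed target: exponentiating them does not give the expected value on the original scale.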

What’s Next?

  • Data scaling
  • Feature engineering
  • Feature selection
  • Feature transformation
  • Regularization techniques
  • Principal component analysis (PCA)
  • Kernel methods
  • Decision trees and random forests

Relevant entities

Entity               Property
Logarithm            Mathematical function
Data transformation  Statistical technique
Skewness             Measure of asymmetry
Mean                 Central tendency
Median               Central tendency
Variance             Measure of spread
Standard deviation   Measure of spread

Conclusion

Log transformation is a powerful technique that can be used to normalize data, reduce the impact of outliers, and make data more suitable for certain types of analyses. It is a relatively simple operation that can have a significant impact on the performance of machine learning models. By understanding the concept of log transformation and its various applications, you can make more informed decisions when working with highly skewed data in machine learning.