Decision Trees and Random Forests (with Python Examples)

Decision trees and random forests are popular machine learning algorithms used for both regression and classification problems. They are relatively simple and easy to interpret, which makes them an ideal choice for beginners. Both are based on the idea of repeatedly splitting the data into smaller, more manageable subsets and making predictions based on the most informative feature at each split.

Decision Trees

A decision tree is a flowchart-like structure that represents a series of decisions and their possible consequences. In machine learning, the decision tree algorithm uses this structure to predict an unknown outcome by asking a sequence of questions about the input features. At each node, the algorithm selects the feature (and threshold) whose split yields the highest information gain and divides the data into subsets accordingly. This process is repeated recursively on each subset until a stopping criterion is met, for example when the subsets are pure or a maximum depth is reached.
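
To make information gain concrete, here is a minimal sketch of how it could be computed for one candidate split; the entropy and information_gain helpers and the toy label arrays below are purely illustrative, not part of any library.

import numpy as np

def entropy(labels):
    # Shannon entropy of an array of class labels
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(parent, left, right):
    # Reduction in entropy achieved by splitting the parent node into two children
    n = len(parent)
    children = len(left) / n * entropy(left) + len(right) / n * entropy(right)
    return entropy(parent) - children

# Toy example: a perfect split of six samples yields an information gain of 1.0
parent = np.array([0, 0, 0, 1, 1, 1])
print(information_gain(parent, np.array([0, 0, 0]), np.array([1, 1, 1])))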

Random Forests

Random forests are an extension of decision trees. Instead of constructing a single decision tree, the algorithm creates many decision trees and combines their predictions to produce the final result. Combining multiple models in this way is known as an ensemble method and is used to reduce variance and increase the stability of the model. In a random forest, each tree is trained on a random bootstrap sample of the data and considers a random subset of the features at each split, so different trees develop different strengths and weaknesses. The final prediction combines the predictions of all the trees, typically by majority vote for classification or by averaging for regression.
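
To make the bootstrap-and-vote idea concrete, here is a hand-rolled sketch using scikit-learn's DecisionTreeClassifier; in practice RandomForestClassifier does all of this internally, and the number of trees, dataset, and random seeds below are arbitrary choices for illustration.

import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
rng = np.random.default_rng(0)

# Train several trees, each on a bootstrap sample of the rows and with a
# random subset of features considered at every split (max_features="sqrt")
trees = []
for seed in range(5):
    idx = rng.integers(0, len(X), size=len(X))  # sample rows with replacement
    t = DecisionTreeClassifier(max_features="sqrt", random_state=seed)
    trees.append(t.fit(X[idx], y[idx]))

# Final prediction: majority vote across the individual trees
votes = np.stack([t.predict(X) for t in trees])  # shape (n_trees, n_samples)
majority = np.apply_along_axis(lambda v: np.bincount(v).argmax(), 0, votes)
print("Ensemble training accuracy:", (majority == y).mean())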

Advantages and Disadvantages of Decision Trees and Random Forests

Advantages:

  • Easy to understand and interpret: A single decision tree can be drawn as a flowchart, giving a clear visual representation of the decision-making process; random forests are harder to inspect as a whole but still expose per-feature importances.
  • Can handle missing data: Tree-based methods cope with missing values more gracefully than many algorithms, and some implementations (for example, CART with surrogate splits and recent scikit-learn versions) handle them natively without imputation.
  • Fast and efficient: Both decision trees and random forests train and predict quickly, and forests parallelize well, making them practical for large datasets.
  • Robust to outliers: Because splits depend only on the ordering of feature values, not their scale, both algorithms are robust to outliers and do not require normalization of the data.

Disadvantages:

  • Overfitting: Decision trees can easily overfit, fitting the training data almost perfectly while performing poorly on new data; limiting tree depth or pruning helps, as the sketch after this list illustrates.
  • Piecewise-constant predictions: Trees split continuous features at discrete thresholds, so their predictions are piecewise constant and approximate smooth relationships only coarsely.
  • Bias towards dominant classes: Both decision trees and random forests can be biased towards the dominant class on imbalanced data, leading to a higher misclassification rate for minority classes.
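
As a quick illustration of the overfitting point above, the sketch below compares an unconstrained tree with a depth-limited one; the breast cancer dataset and max_depth=3 are arbitrary choices, and the exact scores will vary.

# Compare an unconstrained tree with a shallow one (illustrative setup)
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

for depth in (None, 3):
    clf = DecisionTreeClassifier(max_depth=depth, random_state=42)
    clf.fit(X_train, y_train)
    print(f"max_depth={depth}: train={clf.score(X_train, y_train):.3f}, "
          f"test={clf.score(X_test, y_test):.3f}")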

Python Code Examples

Decision Trees


import pandas as pd
from sklearn import tree
from sklearn.model_selection import train_test_split
# Load dataset (assumes a CSV file with a "target" column)
data = pd.read_csv("data.csv")
X = data.drop("target", axis=1)
y = data["target"]

# Split dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Train decision tree model
clf = tree.DecisionTreeClassifier()
clf = clf.fit(X_train, y_train)

# Predict target values using the test set
y_pred = clf.predict(X_test)

# Evaluate model accuracy
accuracy = clf.score(X_test, y_test)
print("Accuracy:", accuracy)

Random Forest


import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
# Load dataset (assumes a CSV file with a "target" column)
data = pd.read_csv("data.csv")
X = data.drop("target", axis=1)
y = data["target"]

# Split dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Train random forest model
clf = RandomForestClassifier()
clf = clf.fit(X_train, y_train)

# Predict target values using the test set
y_pred = clf.predict(X_test)

# Evaluate model accuracy
accuracy = clf.score(X_test, y_test)
print("Accuracy:", accuracy)

Useful Python Libraries for Decision Trees and Random Forests

  • scikit-learn: DecisionTreeClassifier, DecisionTreeRegressor, RandomForestClassifier, RandomForestRegressor
  • xgboost: XGBClassifier, XGBRegressor
  • LightGBM: LGBMClassifier, LGBMRegressor
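
The gradient-boosting libraries above expose a scikit-learn-compatible API, so they can often be swapped in with minimal code changes. A minimal sketch using xgboost's XGBClassifier, assuming the xgboost package is installed; the dataset is just for illustration.

# Gradient-boosted trees with a scikit-learn style API (requires xgboost)
from xgboost import XGBClassifier
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = XGBClassifier(n_estimators=100, max_depth=3)
model.fit(X_train, y_train)
print("Accuracy:", model.score(X_test, y_test))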

Datasets Useful for Decision Trees and Random Forests

Iris Dataset


# Load Iris dataset from sklearn library
from sklearn.datasets import load_iris
iris = load_iris()
X, y = iris.data, iris.target

Breast Cancer Wisconsin (Diagnostic) Dataset


# Load Breast Cancer Wisconsin (Diagnostic) dataset from sklearn library
from sklearn.datasets import load_breast_cancer
bcw = load_breast_cancer()
X, y = bcw.data, bcw.target

Wine Quality Dataset


# Load Wine Quality dataset using pandas library
import pandas as pd
df = pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv", sep=";")
X = df.drop("quality", axis=1)
y = df["quality"]

Important Concepts in Decision Trees and Random Forests

  • Split Criterion
  • Information Gain
  • Gini Impurity
  • Bagging and Bootstrapping
  • Random Subspaces
  • Out-of-Bag Error
  • Variable Importance
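
The sketch below illustrates two of these concepts, out-of-bag error and variable importance, on the Iris data; the dataset and parameter choices are only for demonstration.

# Out-of-bag error and variable importance (illustrative parameters)
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
forest = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=0)
forest.fit(X, y)

# Accuracy measured on the samples each tree never saw during training
print("Out-of-bag accuracy:", forest.oob_score_)
# Mean impurity decrease attributable to each feature
print("Variable importance:", forest.feature_importances_)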

Relevant entities

  • Decision tree: A flowchart-like tree structure in which each internal node represents a feature (or attribute), each branch represents a decision rule, and each leaf node represents an outcome.
  • Random forest: An ensemble learning method for classification, regression, and other tasks that operates by constructing a multitude of decision trees at training time and outputting the mode of the classes (classification) or the mean prediction (regression) of the individual trees.
  • Root node: The first and topmost node in a tree structure; it has no parent node.
  • Leaf node: A node with no children; it represents the final prediction for the instances that reach it.
  • Split: The process of dividing a node into two or more child nodes based on a certain rule or criterion.

Frequently asked questions

What is the difference between decision trees and random forests?

Decision trees are a single tree that makes predictions, while random forests are an ensemble of decision trees.

Why are random forests better than decision trees?

Random forests reduce overfitting by combining predictions from multiple trees.

How do decision trees and random forests work?

Both split the data into smaller subsets based on feature values; a single tree follows one path from the root to a leaf to make a prediction, while a random forest combines the results of many such trees.

When to use decision trees vs random forests?

Use decision trees for interpretability and random forests for improved accuracy and reduced overfitting.

Conclusion

In conclusion, decision trees and random forests are simple, efficient, and interpretable machine learning algorithms that can handle missing data and outliers. However, they can also be prone to overfitting and bias towards dominant classes. When choosing between decision trees and random forests, it is important to consider the specific problem and data, as well as the desired trade-off between accuracy and interpretability.
