Comprehensive Guide on Random Forest

Last updated: Aug 11, 2023
Tags: Machine Learning, Python

Prerequisite

You must already understand how the decision tree model works. If not, please visit our guide.

What is a random forest?

Random forest is a machine learning model that builds multiple decision trees in a random manner to perform classification or regression. The advantage random forests have over standard decision trees is that they are much less prone to overfitting, because randomness is injected when building the model.

Motivating example

Consider the following dataset:

gender   group   gpa   is_pass
male     A       2.8   true
male     B       3.7   false
female   A       3.9   false
female   C       2.1   true

The first step of the random forest is to create a bootstrapped dataset of the same size as the original dataset. This means that we must sample with replacement from the original dataset. For instance, the following could be an example of a bootstrapped dataset:

gender   group   gpa   is_pass
female   A       3.9   false
female   A       3.9   false
male     A       2.8   true
male     B       3.7   false

Since we are sampling with replacement, notice how this new dataset contains two of the same records.
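
Below is a minimal sketch of how such a bootstrapped dataset could be drawn with pandas. This is for illustration only; the DataFrame df and the random seed are assumptions, not part of the original example:

import pandas as pd

# The original 4-row dataset from above
df = pd.DataFrame({
    "gender": ["male", "male", "female", "female"],
    "group": ["A", "B", "A", "C"],
    "gpa": [2.8, 3.7, 3.9, 2.1],
    "is_pass": [True, False, False, True],
})

# Sample with replacement to obtain a bootstrapped dataset of the same size;
# some rows may appear more than once while others are left out entirely
bootstrap = df.sample(n=len(df), replace=True, random_state=1)
print(bootstrap)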

The next step is to build a decision tree using this bootstrapped dataset. However, the catch here is that we must only use a random subset of features at each step. The number of features to consider is a hyper-parameter that you can freely choose. For instance, suppose we choose 2 features randomly at each step.
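
As a rough illustration (the feature list and seed here are just assumptions for this toy dataset), randomly picking 2 candidate features for a split might look like this:

import numpy as np

features = ["gender", "group", "gpa"]
# Randomly pick 2 of the 3 features (without replacement) as split candidates
rng = np.random.default_rng(seed=1)
candidates = rng.choice(features, size=2, replace=False)
print(candidates)   # the 2 candidate features for this split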

For the first step, 2 of the features are randomly selected as split candidates. Suppose that gender is among them:

gender   group   gpa   is_pass
female   A       3.9   false
female   A       3.9   false
male     A       2.8   true
male     B       3.7   false

Suppose for the sake of this example that the chosen split turned out to be female vs. male, so the root node of our decision tree splits on gender.

Now that we have selected gender for the split, we no longer consider this feature for subsequent splits:

gender   group   gpa   is_pass
female   A       3.9   false
female   A       3.9   false
male     A       2.8   true
male     B       3.7   false

We now only have 2 features left (group and gpa), and so these 2 features will be chosen as candidates to compute the next split.

The only differences between this process and building a standard decision tree are that:

  • we use a bootstrapped dataset to build each decision tree in the random forest

  • we only consider a random subset of features as split candidates at each step

The random forest repeats the above steps to build multiple (e.g. over 100) different decision trees. Each of these decision trees will very likely be different since we are bootstrapping and randomly selecting features at each step.
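
To make this concrete, here is a minimal sketch of how the two sources of randomness combine. This is a toy illustration rather than the library's internals; it uses scikit-learn's DecisionTreeClassifier for the individual trees and assumes X and y are NumPy arrays of features and labels:

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def grow_forest(X, y, n_trees=100, max_features=2, seed=0):
    rng = np.random.default_rng(seed)
    trees = []
    for _ in range(n_trees):
        # Bootstrap: sample row indices with replacement
        idx = rng.integers(0, len(X), size=len(X))
        # max_features restricts each split to a random subset of features
        tree = DecisionTreeClassifier(max_features=max_features)
        tree.fit(X[idx], y[idx])
        trees.append(tree)
    return trees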

Performing classification

After a random forest is built, the model contains hundreds of different decision trees. Each decision tree performs classification, and we keep a count of the results. For instance, suppose that out of 100 decision trees, 70 predicted a pass while 30 predicted a fail. We simply take the majority vote and conclude that the student will pass the exam.
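
Continuing the toy grow_forest sketch above (again an illustration, not the author's code), the majority vote for a single sample could be computed like so:

import numpy as np

def predict_majority(trees, x):
    # x is a single sample of shape (1, n_features); each tree casts one vote
    votes = np.array([tree.predict(x)[0] for tree in trees])
    labels, counts = np.unique(votes, return_counts=True)
    return labels[np.argmax(counts)]   # the class with the most votes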

NOTE

Bootstrapping the original dataset and then aggregating the results is called bagging (short for bootstrap aggregating).

Evaluation

Not all observations in the original dataset will end up in a given bootstrapped dataset. These observations, which are called out-of-bag samples, are therefore not used to construct that decision tree. We can use these out-of-bag samples to evaluate the performance of the decision trees. Note that an observation will likely be an out-of-bag sample for many bootstrapped datasets, that is, an observation will likely not be used in the construction of many decision trees.

For instance, suppose we have a random forest of 100 decision trees, and suppose an observation was not selected in the bootstrapped dataset of 30 of them. This means that 30 decision trees did not use this observation for their construction. We can therefore obtain a predicted label for this observation based on the majority vote of those 30 decision trees. Since out-of-bag samples have known true labels, we can tell whether the prediction is correct or not.

We then perform this process for all the other out-of-bag samples. For instance, suppose that out of 100 observations, 70 are out-of-bag samples. We would then end up with 70 evaluation results, that is, we know whether or not each of these 70 out-of-bag samples was correctly classified. We can easily obtain the classification accuracy by computing the proportion of correct predictions.
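
scikit-learn can compute this out-of-bag accuracy for us: passing oob_score=True to RandomForestClassifier exposes an oob_score_ attribute after fitting. A minimal sketch, assuming X and y hold the features and labels (as in the example below):

from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(n_estimators=100, oob_score=True, random_state=50)
model.fit(X, y)
print(model.oob_score_)   # out-of-bag classification accuracy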

Implementing Random Forest with Python's scikit-learn

Suppose we wanted to build a random forest to classify the type of iris given 4 features such as sepal length.

To begin, import the required modules:

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn import datasets
import pandas as pd
import numpy as np

We then load the dataset and convert it into a Pandas DataFrame:

bunch_iris = datasets.load_iris()
# Construct a DataFrame from the Bunch object
data = pd.DataFrame(data=np.c_[bunch_iris['data'], bunch_iris['target']],
                    columns=bunch_iris['feature_names'] + ['target'])
data.head()
   sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)  target
0                5.1               3.5                1.4               0.2     0.0
1                4.9               3.0                1.4               0.2     0.0
2                4.7               3.2                1.3               0.2     0.0
3                4.6               3.1                1.5               0.2     0.0
4                5.0               3.6                1.4               0.2     0.0

We then split the dataset into features and target:

# Break into X (features) and y (target)
X = data.iloc[:, :-1]   # all 4 feature columns
y = data.iloc[:, -1]    # the target column

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=50)
print("Number of rows of X_train:", X_train.shape[0])
print("Number of rows of y_train:", y_train.shape[0])
print("Number of rows of X_test:", X_test.shape[0])
print("Number of rows of y_test:", y_test.shape[0])
Number of rows of X_train: 120
Number of rows of y_train: 120
Number of rows of X_test: 30
Number of rows of y_test: 30

We then train our random forest and compute performance metrics using the testing set:

# n_estimators is the number of decision trees to build in the forest
model = RandomForestClassifier(n_estimators=50, random_state=50)
model.fit(X_train, y_train)
y_test_predicted = model.predict(X_test)
print(classification_report(y_test, y_test_predicted))
              precision    recall  f1-score   support

         0.0       1.00      1.00      1.00         9
         1.0       1.00      0.83      0.91        12
         2.0       0.82      1.00      0.90         9

    accuracy                           0.93        30
   macro avg       0.94      0.94      0.94        30
weighted avg       0.95      0.93      0.93        30

We see that the classification accuracy using the testing set is 0.93.
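
If you just want the overall accuracy without the full report, a quick sanity check (an optional extra, not part of the original walkthrough) is:

from sklearn.metrics import accuracy_score

print(accuracy_score(y_test, y_test_predicted))   # overall accuracy on the test set
print(model.score(X_test, y_test))                # equivalent shortcut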

Published by Isshin Inada