Random Forest Algorithm (Machine Learning) – Detailed Explanation

Table of Contents

  1. What is Random Forest?
  2. Role in Machine Learning
  3. What is Decision Tree?
  4. How Random Forest Works
  5. Bagging Concept
  6. Feature Randomness
  7. Training Process Step-by-Step
  8. Classification and Regression
  9. Mathematical Concept
  10. Hyperparameters
  11. Advantages
  12. Disadvantages
  13. Real Life Applications
  14. Python Implementation
  15. Random Forest vs Decision Tree
  16. Interview Questions
  17. Conclusion

1. What is Random Forest?

Random Forest is a popular Supervised Machine Learning Algorithm that makes predictions using multiple Decision Trees. It is essentially an Ensemble Learning Method.

Multiple Decision Trees work together and combine their results to produce the Final Output.

If it’s a Classification Problem, Majority Voting is used.

If it’s a Regression Problem, the Average of the Trees’ predictions is taken.


2. Role in Machine Learning

Random Forest is used for:

  • Classification
  • Regression
  • Feature Selection
  • Fraud Detection
  • Recommendation System
  • Medical Diagnosis
  • Stock Prediction
  • Spam Detection

By combining many Trees, it is effective at reducing the Overfitting that a single Decision Tree is prone to.


3. What is Decision Tree?

To understand Random Forest, you must first understand Decision Tree.

Decision Tree is a Tree Structure that contains:

  • Root Node
  • Branches
  • Leaf Nodes

Example

Suppose we need to predict whether a student will pass.

Decision Tree may ask:

  • Attendance > 75%?
  • Study Hours > 4?
  • Assignment Complete?

Based on these questions, the Final Decision is made.

But a single Decision Tree often Overfits.

Random Forest is used to solve this problem.
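
The questions in the example above can be written as nested conditions. This is only an illustrative sketch: the function name `predict_pass`, the order of the questions, and the result at each leaf are assumptions made for the example. A real Decision Tree learns its splits and thresholds from training data.

```python
def predict_pass(attendance_pct, study_hours, assignment_complete):
    """Hand-written tree for the student example (hypothetical rules)."""
    # Root node: attendance check
    if attendance_pct > 75:
        # Branch: study hours check
        if study_hours > 4:
            return "Pass"  # leaf node
        # Branch: assignment check
        return "Pass" if assignment_complete else "Fail"  # leaf nodes
    return "Fail"  # leaf node


print(predict_pass(80, 5, True))   # Pass
print(predict_pass(60, 6, True))   # Fail
```

Each `if` corresponds to an internal node, and each `return` corresponds to a Leaf Node.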


4. How Random Forest Works

Random Forest creates multiple Decision Trees.

Each Tree:

  • Uses a different Data Sample
  • Uses different Features
  • Trains independently

The Output of all Trees is combined to create the Final Result.

That is why it is usually more Accurate and Stable than a single Decision Tree.


5. Bagging Concept

Bagging stands for:

Bootstrap Aggregation

Here:

  1. Random Sampling with replacement is done from the Dataset
  2. Each Sample creates a separate Tree
  3. Predictions from all Trees are combined

Example

If the Dataset has 1000 Rows:

  • Tree-1 → 1000 Rows drawn at random with replacement (some Rows repeat, some are left out)
  • Tree-2 → A different Random Sample of 1000 Rows
  • Tree-3 → Another Random Sample of 1000 Rows

This way many Trees are created.
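
The sampling in the 1000-row example can be sketched with the standard library alone. The seed and the variable names are assumptions made for illustration. Library implementations do this step internally.

```python
import random

n_rows = 1000
rng = random.Random(42)  # fixed seed so the run is repeatable

# One bootstrap sample per tree: n_rows row indices drawn with replacement.
samples = [rng.choices(range(n_rows), k=n_rows) for _ in range(3)]

for i, sample in enumerate(samples, start=1):
    unique_rows = len(set(sample))
    print(f"Tree-{i}: {len(sample)} draws, {unique_rows} unique rows")
```

Because rows are drawn with replacement, each sample contains on average about 63% of the distinct rows (1 - 1/e), and the rest of the draws are repeats. The rows a tree never sees differ from tree to tree, which is one source of the diversity between trees.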


6. Feature Randomness

Random Forest doesn’t consider all Features at every Split.

At each Split, a random subset of Features is selected.

Example

Suppose the Dataset has 20 Features.

At one Split, maybe 5 Features are randomly selected.

As a result:

  • Trees are not identical
  • Diversity increases
  • Overfitting decreases
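
The 20-feature example can be sketched as follows. The helper name `draw_candidates` and the seed are assumptions for illustration. In a real forest the best split is then searched only among the drawn candidates.

```python
import random


def draw_candidates(rng, n_features, subset_size):
    """Pick the feature indices that one split is allowed to consider."""
    # sample() draws without replacement, so no feature appears twice.
    return sorted(rng.sample(range(n_features), subset_size))


rng = random.Random(0)
for split in range(1, 4):
    candidates = draw_candidates(rng, n_features=20, subset_size=5)
    print(f"Split {split}: candidate feature indices {candidates}")
```

A fresh subset is drawn for every split, so even two trees trained on the same rows would tend to grow differently.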

7. Training Process Step-by-Step

Step 1: Collect Dataset

Training Data is collected.

Step 2: Bootstrap Sampling

Random Sampling with replacement creates a different Sample for each Tree.

Step 3: Create Multiple Decision Trees

Each Sample trains a separate Tree.

Step 4: Random Feature Selection

Each Split uses some Random Features.

Step 5: Prediction

All Trees make Predictions.

Step 6: Final Output

Classification → Majority Voting

Regression → Average
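
The six steps can be followed end to end in a deliberately tiny sketch. To keep it short, each "tree" here is a one-split tree (a decision stump) rather than a full Decision Tree, and the function names, toy data, and parameter values are all made up for illustration. It is not how a library such as scikit-learn implements the algorithm.

```python
import random
from collections import Counter


def train_stump(rows, labels, feature_ids):
    """Fit a one-split tree using only the given candidate features."""
    best = None  # (errors, feature, threshold, left_label, right_label)
    for f in feature_ids:
        for t in sorted({row[f] for row in rows}):
            left = [y for row, y in zip(rows, labels) if row[f] <= t]
            right = [y for row, y in zip(rows, labels) if row[f] > t]
            if not left or not right:
                continue  # not a real split
            left_label = Counter(left).most_common(1)[0][0]
            right_label = Counter(right).most_common(1)[0][0]
            errors = sum(y != left_label for y in left)
            errors += sum(y != right_label for y in right)
            if best is None or errors < best[0]:
                best = (errors, f, t, left_label, right_label)
    if best is None:
        # No usable split: always predict the most common label.
        majority = Counter(labels).most_common(1)[0][0]
        return (feature_ids[0], float("inf"), majority, majority)
    return best[1:]


def predict_stump(stump, row):
    feature, threshold, left_label, right_label = stump
    return left_label if row[feature] <= threshold else right_label


def train_forest(rows, labels, n_trees=25, n_candidates=1, seed=0):
    rng = random.Random(seed)
    n_rows, n_features = len(rows), len(rows[0])
    forest = []
    for _ in range(n_trees):
        # Step 2: bootstrap sample (row indices drawn with replacement).
        idx = rng.choices(range(n_rows), k=n_rows)
        # Step 4: random subset of features offered to this split.
        feature_ids = rng.sample(range(n_features), n_candidates)
        # Step 3: train one tree on its own sample.
        forest.append(train_stump([rows[i] for i in idx],
                                  [labels[i] for i in idx],
                                  feature_ids))
    return forest


def predict_forest(forest, row):
    # Steps 5 and 6: every tree votes, the majority wins.
    votes = Counter(predict_stump(stump, row) for stump in forest)
    return votes.most_common(1)[0][0]


# Step 1: toy data, (attendance %, study hours) -> result.
rows = [(90, 6), (85, 5), (80, 5), (78, 4.5),
        (70, 3), (65, 2.5), (60, 2), (55, 1)]
labels = ["Pass", "Pass", "Pass", "Pass", "Fail", "Fail", "Fail", "Fail"]

forest = train_forest(rows, labels)
print(predict_forest(forest, (95, 7)))
print(predict_forest(forest, (50, 0.5)))
```

For a Regression Problem, the only change to the final step would be to average the numeric outputs of the trees instead of counting votes.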


8. Classification and Regression

Classification

When Output is a Category:

Examples:

  • Spam / Not Spam
  • Disease / No Disease
  • Cat / Dog

Then Majority Vote is taken.

Regression

When Output is Numeric:

Examples:

  • House Price
  • Temperature
  • Sales Prediction

Then Average of all Trees is taken.
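
For the Regression case, the averaging can be checked directly with scikit-learn's `RandomForestRegressor`. This is a sketch on synthetic data: the dataset from `make_regression` is only a stand-in for something like house prices, and the parameter values are arbitrary choices for the example.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Synthetic data with a numeric target (stand-in for e.g. house prices).
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

model = RandomForestRegressor(n_estimators=50, random_state=0)
model.fit(X_train, y_train)

forest_pred = model.predict(X_test)

# Average the individual trees by hand and compare with the forest output.
tree_preds = np.array([tree.predict(X_test) for tree in model.estimators_])
manual_average = tree_preds.mean(axis=0)

print(np.allclose(forest_pred, manual_average))
```

The fitted trees are available in `model.estimators_`, which makes it possible to inspect how much the individual predictions spread around the average.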


9. Mathematical Concept

Suppose:

  • Total Trees = N
  • Each Tree Prediction = T_1, T_2, T_3, …, T_N

Classification

Final Prediction:

Majority Vote

Regression

Final Prediction Formula:

$$\text{Average} = \frac{T_1 + T_2 + T_3 + \dots + T_N}{N} = \frac{1}{N}\sum_{i=1}^{N} T_i$$

Here the Average Output of all Trees is taken.
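
Both aggregation rules fit in a few lines of standard-library Python. The function names and the sample predictions below are made up for illustration.

```python
from collections import Counter
from statistics import mean


def majority_vote(predictions):
    """Classification: return the label predicted by the most trees."""
    # On a tie, the label that appears first in the list wins.
    return Counter(predictions).most_common(1)[0][0]


def average(predictions):
    """Regression: (T_1 + T_2 + ... + T_N) / N."""
    return mean(predictions)


class_votes = ["Spam", "Not Spam", "Spam", "Spam", "Not Spam"]  # N = 5 trees
price_preds = [250.0, 260.0, 240.0, 255.0, 245.0]               # N = 5 trees

print(majority_vote(class_votes))  # Spam
print(average(price_preds))        # 250.0
```

Here three of the five trees vote "Spam", and the five numeric predictions sum to 1250, so the average is 1250 / 5 = 250.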


10. Hyperparameters

1. n_estimators

How many trees to create.

n_estimators = 100

2. max_depth

Maximum depth each tree is allowed to grow to.

3. min_samples_split

Minimum samples a node must contain before it can be split.

4. min_samples_leaf

Minimum samples that must remain in a Leaf Node.

5. max_features

How many features to randomly consider at each split.

6. bootstrap

Whether to use Bootstrap Sampling.
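
The six hyperparameters map directly onto arguments of scikit-learn's `RandomForestClassifier`. Apart from `n_estimators = 100`, the values below are arbitrary choices for the sketch, not recommendations. Suitable values depend on the dataset and are usually found by validation.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)

model = RandomForestClassifier(
    n_estimators=100,      # number of trees
    max_depth=5,           # maximum depth of each tree
    min_samples_split=4,   # a node needs at least 4 samples to be split
    min_samples_leaf=2,    # every leaf keeps at least 2 samples
    max_features="sqrt",   # features considered at each split
    bootstrap=True,        # train each tree on a bootstrap sample
    random_state=0,        # fixed seed for repeatable results
)
model.fit(X, y)

deepest = max(tree.get_depth() for tree in model.estimators_)
print("Trees:", len(model.estimators_), "| deepest tree:", deepest)
```

Lower `max_depth` and higher `min_samples_leaf` make each tree simpler, while more `n_estimators` costs training time and memory.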


11. Advantages

1. High Accuracy

Combining many trees usually gives higher accuracy than a single Decision Tree.

2. Less Overfitting

Using multiple trees reduces overfitting.

3. Handles Noise

Works well even with noisy data.

4. Handles Missing Values

Some implementations can work with missing values directly; others need the gaps filled in (imputed) first.

5. Provides Feature Importance

Can determine which features are important.

6. Works Well with Large Dataset

Can handle large datasets.
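
The Feature Importance advantage listed above can be seen in scikit-learn through the `feature_importances_` attribute of a fitted model. This is a small sketch on the Iris dataset. The variable names are illustrative, and the exact scores depend on the seed.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

data = load_iris()
model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(data.data, data.target)

# feature_importances_ holds one score per feature; the scores sum to 1.
ranked = sorted(
    zip(data.feature_names, model.feature_importances_),
    key=lambda pair: pair[1],
    reverse=True,
)
for name, score in ranked:
    print(f"{name}: {score:.3f}")
```

Features with scores close to zero contribute little to the splits and are candidates for removal, which is how Random Forest is used for Feature Selection.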


12. Disadvantages

1. Slow Training

Training takes more time due to many trees.

2. Uses More Memory

Storing multiple trees requires more memory.

3. Hard to Interpret

A single Decision Tree is easy to read, but a Forest of many Trees is much harder to interpret.

4. May Be Heavy for Real-time Systems

Large Forests can be slow at prediction time because every Tree must be evaluated for each input.


13. Real Life Applications

Medical Diagnosis

Detecting diseases.

Fraud Detection

Detecting bank fraud.

Recommendation System

Recommending movies or products.

Stock Market Analysis

Predicting market trends.

Agriculture

Crop prediction.

Cyber Security

Malware detection.


14. Python Implementation

Import Libraries

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

Load Dataset

data = load_iris()
X = data.data
y = data.target

Train Test Split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

Create Model

# random_state fixes the seed so the result is repeatable between runs
model = RandomForestClassifier(n_estimators=100, random_state=42)

Training

model.fit(X_train, y_train)

Prediction

y_pred = model.predict(X_test)

Accuracy

accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

15. Random Forest vs Decision Tree

Topic         Decision Tree   Random Forest
Accuracy      Lower           Higher
Overfitting   More            Less
Speed         Fast            Relatively Slower
Complexity    Simple          Complex
Stability     Lower           Higher
Trees         1               Many

16. Interview Questions

Question 1: What is Random Forest?

It is an Ensemble Learning Algorithm that uses many Decision Trees to make the Final Prediction.

Question 2: Why is Overfitting less in Random Forest?

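Because each Tree is trained on a different Random Sample and considers Random Features at each Split, the Trees make different errors, and combining many Trees averages those errors out.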

Question 3: What is Bagging?

Bagging (Bootstrap Aggregation) is the method of training Multiple Models on Bootstrap Samples of the data and combining their Predictions.

Question 4: How does Random Forest work in Classification?

The Final Class is determined by a Majority Vote across all Trees.


17. Conclusion

Random Forest is one of the most widely used Machine Learning Algorithms.

It is:

  • Accurate
  • Stable
  • Robust
  • Overfitting Resistant

That’s why it is widely used in Data Science, AI, Cyber Security, Medical Field, Finance, and many other fields.

If you want to learn Machine Learning, Random Forest is definitely worth learning well.