Ensemble Learning
Definition:
Ensemble learning is a machine learning technique where multiple models are combined to produce a better overall prediction than any single model alone.
Advantages
- Improved Predictive Accuracy:
Combining several models reduces errors and improves overall performance.
Disadvantages
- Low Interpretability:
Ensembles (like Random Forest, Gradient Boosting, Stacking) are often difficult to understand because they merge many models.
Why Ensembles Help
- Statistical Stability:
When the dataset is limited, many different hypotheses (possible models) may fit the data equally well.
- A traditional algorithm picks only one hypothesis, which may:
- perform well on training data
- but perform poorly on unseen data
- Ensemble methods reduce this risk by combining many hypotheses (see the sketch below).
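A quick way to see the statistical effect: if many unbiased but noisy hypotheses are averaged, the variance of the combined prediction shrinks. The following is a minimal simulation of this, assuming NumPy is available and (the idealized case) that the hypotheses' errors are independent:

```python
import numpy as np

rng = np.random.default_rng(0)
true_value = 5.0

# 1000 trials; in each, 100 unbiased but noisy hypotheses estimate the target.
estimates = true_value + rng.normal(0.0, 2.0, size=(1000, 100))

one_model = estimates[:, 0]          # a single hypothesis per trial
ensemble = estimates.mean(axis=1)    # the ensemble averages all 100

print("single-model variance:", one_model.var())  # close to 2.0**2 = 4.0
print("ensemble variance    :", ensemble.var())   # close to 4.0 / 100 = 0.04
```

With 100 independent hypotheses, the variance of the average drops by a factor of about 100; real ensembles gain less because their members' errors are correlated.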
Key Terms
- Hypothesis:
Any possible model or outcome generated from the data.
- Hypothesis Space:
The set of all possible hypotheses a learning algorithm can choose from.
- Limitation of Algorithms:
Due to computational constraints, algorithms cannot guarantee finding the absolute best hypothesis in the entire hypothesis space.
Bagging (Bootstrap Aggregating)
Bagging is a homogeneous weak-learner ensemble method: multiple models of the same type are trained independently in parallel, and their outputs are combined to produce a final prediction.
Example: Random Forest
Benefits of Bagging
- Reduces overfitting
- Improves accuracy
- Handles unstable models (like decision trees)
- Because each model sees a different random sample, repeated training runs on the same data produce different models, increasing diversity
Steps of Bagging
- Create multiple subsets from the original dataset
- Done randomly with replacement (bootstrap sampling)
- Subsets are roughly equal in size and follow a feature distribution similar to the original data
- Select observations with replacement for each subset
- Train multiple models in parallel, each on a different subset
- Collect predictions from all models
- Combine the predictions (see the sketch after this list)
- Classification: Use majority voting
- Regression: Take the average of all model outputs
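The following is a minimal from-scratch sketch of these steps, using scikit-learn decision trees as the base learners (scikit-learn and NumPy are assumed installed; the helper name `bagging_predict` is illustrative, not a library function):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=42)
rng = np.random.default_rng(42)

# Steps 1-3: draw bootstrap subsets (with replacement), train one tree per subset.
trees = []
for _ in range(25):
    idx = rng.integers(0, len(X), size=len(X))   # sample N points with replacement
    trees.append(DecisionTreeClassifier().fit(X[idx], y[idx]))

# Steps 4-5: collect each model's prediction and combine by majority vote.
def bagging_predict(X_new):
    votes = np.stack([t.predict(X_new) for t in trees])   # (n_trees, n_samples)
    return np.apply_along_axis(lambda v: np.bincount(v).argmax(), 0, votes)

print("ensemble accuracy:", (bagging_predict(X) == y).mean())
```

In practice scikit-learn's BaggingClassifier wraps these same steps; the manual loop is shown only to mirror the list above. For regression, the majority vote would be replaced by a mean over the trees' outputs.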

Boosting
Boosting is an ensemble method in which models are arranged in sequence to create a strong classifier. Models are built sequentially, each one aiming to correct the errors made by its predecessor.
Algorithm of Boosting
1. Initialize the dataset
- Assign equal weights to all data points.
2. Train the first model
- Identify the wrongly classified data points.
3. Update weights
- Increase weights of wrongly classified points
- Decrease weights of correctly classified points
4. Check accuracy
- If the desired accuracy is reached → go to Step 5
- Else → repeat from Step 2
5. Stop
- The boosting process ends (a minimal sketch follows).
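The weight-update loop above is the core of AdaBoost. Below is a minimal AdaBoost-style sketch (assuming scikit-learn and NumPy; depth-1 decision stumps stand in for the weak models, and a fixed 20 rounds replaces the accuracy check in Step 4):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)
y_signed = np.where(y == 1, 1, -1)           # AdaBoost works with +/-1 labels

w = np.full(len(X), 1 / len(X))              # Step 1: equal weights
models, alphas = [], []

for _ in range(20):                          # Steps 2-4, repeated
    stump = DecisionTreeClassifier(max_depth=1).fit(X, y_signed, sample_weight=w)
    pred = stump.predict(X)
    err = w[pred != y_signed].sum()          # weighted error of this model
    alpha = 0.5 * np.log((1 - err) / err)    # model's say in the final vote
    # Step 3: raise weights on mistakes, lower them on correct points
    w *= np.exp(-alpha * y_signed * pred)
    w /= w.sum()
    models.append(stump)
    alphas.append(alpha)

# Final strong classifier: weighted vote of all weak models (Step 5)
F = sum(a * m.predict(X) for a, m in zip(alphas, models))
print("training accuracy:", (np.sign(F) == y_signed).mean())
```

scikit-learn's AdaBoostClassifier implements the same scheme and handles the edge cases this sketch ignores (for example, a weak model whose weighted error reaches zero).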
Random Forest
Random Forest is a supervised learning algorithm used for both classification and regression. It builds a forest of randomized decision trees, trained in parallel using the bagging technique.
The final prediction is based on:
- Majority Vote → for classification
- Average of Predictions → for regression
Algorithm (Random Forest)
- Bootstrap Sampling
If the training set has N examples, randomly select N data points with replacement from the original dataset.
This sample becomes the training set for one decision tree.
- Random Feature Selection
If there are M input features, choose m features (m << M) at random at each node.
The value of m remains fixed throughout tree construction.
- Prediction of New Data (see the sketch after this list)
- Pass the new input through each decision tree.
- Each tree gives its own classification or regression output.
- Combine the outputs using:
- Majority vote (classification)
- Average (regression)
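A minimal usage sketch with scikit-learn's RandomForestClassifier (assumed installed); n_estimators sets the number of trees, and max_features plays the role of m above:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=7)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=7)

# n_estimators = number of trees; max_features = m features tried at each node.
forest = RandomForestClassifier(n_estimators=100, max_features="sqrt",
                                random_state=7).fit(X_train, y_train)

print("test accuracy   :", forest.score(X_test, y_test))
print("top feature idx :", forest.feature_importances_.argmax())
```

The feature_importances_ attribute is the source of the "feature importance estimates" advantage listed below.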
Advantages of Random Forest
- Efficient on large datasets
- Handles a large number of input features without feature deletion
- Provides feature importance estimates
- Deals well with missing data
- Models (forests) can be saved and reused
- Works for both classification and regression
- Helps detect variable interactions
Disadvantages of Random Forest
- For regression, it cannot predict values outside the range of targets seen in the training data
- Large model size (many trees) → more memory usage and slower predictions
- Acts like a black box → difficult to interpret

