Bagging algorithms in Python
We can build a machine learning model using either a single algorithm or a combination of multiple algorithms. Using multiple algorithms is known as ensemble learning. <!--more--> Ensemble learning usually gives better prediction results than a single algorithm. The most common ensemble learning techniques are bagging and boosting.
In Bagging, multiple homogeneous algorithms are trained independently, and their predictions are then combined, usually by averaging or voting, to produce the final model.
Boosting is an ensemble technique where we train multiple homogeneous algorithms sequentially. Each new algorithm is influenced by the performance of the previously built algorithm and tries to correct its errors; the individual algorithms are then combined into a final model with the best results.
This tutorial will use the two approaches in building a machine learning model.
In the first approach, we will use a single algorithm known as the Decision Tree Classifier to build and find the accuracy of a model.
In the second approach, we will use the Bagging Classifier and the Random Forest Classifier to build the same model and find its accuracy.
Table of contents
- Prerequisites
- Bagging vs Boosting
- How Bagging works
- Bootstrapping
- Parallel training
- Aggregation
- Benefits of using Bagging algorithms
- Dataset used
- Loading the dataset
- Dataset scaling
- Splitting the dataset
- Model building using Decision Tree Classifier
- Getting the mean accuracy score
- Implementing the bagging algorithms
- Building the model using Bagging Classifier
- Fitting the model
- Accuracy score
- Building the model using Random Forest Classifier
- Conclusion
- References
Prerequisites
To follow along with this tutorial, the reader should know the following:
- Python programming.
- Machine learning workflows.
- Machine learning in Python using Scikit-learn.
- Data analysis using Pandas.
NOTE: In this tutorial, we will use a Google Colab notebook.
Bagging vs Boosting
As mentioned above, in Bagging, multiple homogeneous algorithms are trained independently in parallel, while in Boosting, multiple homogeneous algorithms are trained sequentially. The image below shows the difference between Bagging and Boosting.
Image Source: Pluralsight
In this tutorial, we will only be focusing on implementing Bagging algorithms. To implement Boosting algorithms, read this article.
How Bagging works
The Bagging technique is also known as Bootstrap Aggregation and can be used to solve both classification and regression problems. In addition, Bagging algorithms improve a model's accuracy score.
These algorithms prevent model overfitting and reduce variance.
Overfitting occurs when a model performs well on the training dataset but poorly on the testing dataset, meaning it cannot make accurate predictions on new data.
Variance describes how much a model's predictions change when it is trained on different splits of the data.
Bagging algorithms are used to produce a model with low variance. To understand variance in machine learning, read this article.
Bagging comprises three processes: bootstrapping, parallel training, and aggregation.
Bootstrapping
Bootstrapping is a data sampling technique used to create samples from the training dataset.
Bootstrapping randomly samples the rows (and sometimes the columns) of the training dataset with replacement. Because sampling is done with replacement, the same data point can be selected multiple times.
The bootstrapping process generates multiple subsets from the original dataset. Each subset has the same number of tuples (rows) and is used as a training dataset for a base model.
The image below shows the bootstrapping process:
Image Source: Data Aspirant
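To make the idea concrete, here is a minimal sketch (not part of the original tutorial) of drawing one bootstrap sample with pandas; the toy DataFrame and its column names are purely illustrative:
import pandas as pd
# Toy training set with five rows; in practice this would be the full dataset.
train = pd.DataFrame({"Age": [25, 31, 47, 52, 60],
                      "BloodPressure": [70, 66, 82, 90, 74]})
# One bootstrap sample: the same number of rows, drawn with replacement,
# so some rows may appear more than once and others not at all.
subset = train.sample(frac=1.0, replace=True, random_state=0)
print(subset)
Repeating this sampling step several times produces the multiple subsets used in the next stage.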
Parallel training
Bootstrapping generates multiple subsets, and a machine learning algorithm is fitted on each subset independently.
Training the same algorithm on different subsets produces different models, called weak learners or base models.
By this stage, we have multiple base models trained in parallel.
The image below shows the parallel training process:
Image Source: Data Aspirant
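As an illustration of this stage (a simplified sketch with made-up toy data, not the tutorial's dataset), each bootstrap subset can be used to fit its own decision tree. The loop below trains the base models one after another, but since the fits are independent they could just as well run in parallel:
import numpy as np
from sklearn.tree import DecisionTreeClassifier
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(100, 4))      # 100 samples, 4 features (illustrative only)
y_toy = rng.integers(0, 2, size=100)   # binary target
base_models = []
for i in range(5):                     # five bootstrap subsets -> five base models
    idx = rng.integers(0, len(X_toy), size=len(X_toy))  # sample row indices with replacement
    model = DecisionTreeClassifier(random_state=i)
    model.fit(X_toy[idx], y_toy[idx])  # each weak learner sees a different subset
    base_models.append(model)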
Aggregation
Aggregation is the last stage in Bagging. The multiple predictions made by the base models are combined to produce a single final model. The final model will have low variance and a high accuracy score.
How the final prediction is produced depends on the voting technique used. There are two commonly used voting techniques: hard voting and soft voting.
Hard Voting
Hard voting is also known as majority voting and is mainly used for classification problems.
The prediction made by each base model is treated as a vote, and the class that receives the most votes becomes the final prediction.
The image below shows the hard voting process:
Image Source: Data Aspirant
Soft Voting
Soft voting is mainly used for regression problems (and for classification when the base models output probabilities). In soft voting, we average the predictions made by the base models, and the average is taken as the final prediction.
The image below shows the hard vs soft voting side-by-side:
Image Source: Medium
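As a small illustration (the prediction arrays below are made up, not from the tutorial's model), hard and soft voting can be expressed in a few lines of NumPy:
import numpy as np
# Hypothetical class predictions from three base models for four samples.
votes = np.array([[1, 0, 1, 1],
                  [1, 1, 0, 1],
                  [0, 0, 1, 1]])
# Hard voting: take the majority class for each sample (column-wise).
hard = (votes.sum(axis=0) > votes.shape[0] / 2).astype(int)
print(hard)   # [1 0 1 1]
# Soft voting / averaging: average the predictions (or predicted probabilities).
avg = votes.mean(axis=0)
print(avg)    # approximately [0.67 0.33 0.67 1.0]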
The image below summarizes the whole Bagging process, from bootstrapping to aggregation.
Benefits of using Bagging algorithms
- Bagging algorithms improve the model's accuracy score.
- Bagging algorithms can handle overfitting.
- Bagging algorithms reduce variance errors.
- Bagging is easy to implement and produces more robust models.
Now that we have discussed the theory, let us implement the Bagging algorithm using Python.
Dataset used
We will use the diabetes dataset to predict whether or not a person has diabetes. The dataset has features such as Age and BloodPressure, which the model uses to make this prediction.
To download the diabetes dataset, click here.
Loading the Dataset
We will use the Pandas library to load the dataset we downloaded using the link above.
Let us import Pandas.
import pandas as pd
To load the dataset, use this code:
df = pd.read_csv("/content/diabetes.csv")
Let's now see how our dataset is structured using the following code:
df.head()
The dataset structure is shown in the image below:
From the image above, our dataset has columns such as Age and BloodPressure. These features will be used as input for the model.
The last column, labeled Outcome, is the output column. The Outcome column is labeled either 0 (non-diabetic) or 1 (diabetic).
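As a quick, optional check (my own addition, not one of the original steps), we can see how many samples fall into each class:
# Count how many rows are labeled 0 (non-diabetic) and 1 (diabetic).
df.Outcome.value_counts()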
Let us check for missing values in this dataset.
Checking for missing values
Missing values make the dataset incomplete. A dataset with missing values leads to inconsistent results and poor model performance.
To check for missing values, use this code:
df.isnull().sum()
The output is shown below:
From the image above, there are no missing values. Therefore, our dataset is complete and ready for use.
Adding x and y variables
We need to specify the x and y variables. The x variable will hold all the input columns, while the y variable will hold the output column.
In our case, the output column is the Outcome column, and the remaining columns will be used as model inputs.
We add the x and y variables using the following code:
X = df.drop("Outcome",axis="columns")
y = df.Outcome
The next step is to scale our dataset.
Dataset scaling
Dataset scaling transforms a dataset so that its values fit within a specific range, for example 0 to 1, -1 to 1, or 0 to 100.
Scaling ensures that features with large value ranges do not dominate features with smaller ranges during model training.
To understand the concept of dataset scaling better, read this article.
There are different scalers we can use for dataset scaling. In our case, we use the StandardScaler to scale our dataset.
To further understand how the StandardScaler performs scaling, click here.
Let us import the StandardScaler:
from sklearn.preprocessing import StandardScaler
To scale our input columns (the X variable), use this code:
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
Running this code scales the entire input dataset. To view some of the scaled data points, use this code:
X_scaled[:3]
The output shows the first three rows of the scaled dataset:
After scaling the dataset, we can split the scaled dataset.
Splitting the Dataset
We will split the scaled dataset into training and testing sets. To split the dataset, we will use the train_test_split method:
from sklearn.model_selection import train_test_split
To perform the splitting, use this code:
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, stratify=y, random_state=10)
The code above uses the default splitting ratio: 75% of the data becomes the training set and 25% the testing set.
To check the number of data samples in the training set, use this code:
X_train.shape
The output is shown below:
(576, 8)
We can also see the size of the testing dataset using the following code:
X_test.shape
The output is shown below:
(192, 8)
Model building using Decision Tree Classifier
The decision tree classifier is a Scikit-learn algorithm used for classification. To import this algorithm, use this code:
from sklearn.tree import DecisionTreeClassifier
We will use K-fold cross-validation to evaluate our decision tree classifier. K-fold cross-validation splits the dataset into K subsets (folds).
The model is trained and evaluated K times, producing an accuracy score after each iteration, and the mean accuracy score is then calculated. K refers to the number of folds into which we split the dataset.
For a detailed understanding of K-fold cross-validation, read this guide.
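To make the idea clearer, here is a simplified sketch of roughly what cross_val_score does under the hood, assuming the X and y variables defined earlier. This manual version is only for illustration; we will use the built-in method below:
import numpy as np
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeClassifier
kf = KFold(n_splits=5, shuffle=True, random_state=0)
fold_scores = []
for train_idx, test_idx in kf.split(X):
    model = DecisionTreeClassifier()
    model.fit(X.iloc[train_idx], y.iloc[train_idx])                      # train on four folds
    fold_scores.append(model.score(X.iloc[test_idx], y.iloc[test_idx]))  # evaluate on the held-out fold
print(np.mean(fold_scores))   # mean accuracy across the five folds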
Let us import the method that will help to perform K-fold cross-validation.
from sklearn.model_selection import cross_val_score
To use the cross_val_score
method, run this code:
scores = cross_val_score(DecisionTreeClassifier(), X, y, cv=5)
scores
In the code above, cv=5 splits the dataset into five folds. The method produces a model accuracy score for each iteration, as shown in the output below:
array([0.69480519, 0.68181818, 0.71428571, 0.77777778, 0.73856209])
Getting the Mean Accuracy Score
The mean score is obtained using the following code:
scores.mean()
The output is shown below:
0.7214497920380273
Using cross-validation, we get a mean accuracy score of 0.7214497920380273. We can now build the same model using bagging algorithms and compare the accuracy scores.
Implementing the Bagging algorithms
Let us first build the model using the BaggingClassifier.
Building the model using Bagging Classifier
The BaggingClassifier follows all the bagging steps and builds an optimized model. It fits the weak/base learners on the randomly sampled subsets.
Next, it uses a voting technique to produce an aggregated final model. We will use the DecisionTreeClassifier algorithm as our weak/base learner.
To import the BaggingClassifier, use this code:
from sklearn.ensemble import BaggingClassifier
To initialize the model, use this code:
bag_model = BaggingClassifier(
base_estimator=DecisionTreeClassifier(),
n_estimators=100,
max_samples=0.8,
bootstrap=True,
oob_score=True,
random_state=0
)
In the code snippet above, we have used the following parameters:
- base_estimator - The algorithm used as the base/weak learner. We use the DecisionTreeClassifier algorithm as our weak/base learner.
- n_estimators - The number of weak learners used. We use 100 decision trees to build the bagging model.
- max_samples - The maximum fraction of data sampled from the training set for each base learner. We use 80% of the training dataset for resampling.
- bootstrap - Allows resampling of the training dataset with replacement.
- oob_score - Used to compute the model's accuracy score on the out-of-bag samples after training.
- random_state - Allows us to reproduce the same dataset samples, ensuring the same subsets are produced on every run.
The next step is to fit the initialized model into our training set.
Fitting the Model
Fitting enables the model to learn from the training dataset and gain useful insights from it.
To fit the model, use this code:
bag_model.fit(X_train, y_train)
Finally, let us calculate the model accuracy score.
Accuracy Score
To get the accuracy score, run this code:
bag_model.oob_score_
The accuracy score is shown below:
0.7534722222222222
The bagging model improves the accuracy score: it increased from 0.7214497920380273 to 0.7534722222222222.
We can also check the accuracy score using the testing dataset to determine if our model is overfitting. To get the accuracy score, use this code:
bag_model.score(X_test, y_test)
The output is shown below:
0.7760416666666666
The accuracy score shows that our model is not overfitting. Overfitting would show up as a much lower accuracy on the testing dataset than on the training dataset.
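As an extra sanity check (my own addition, not a step from the original tutorial), we can print the training and testing accuracies side by side; a large gap between them would indicate overfitting:
print(bag_model.score(X_train, y_train))   # accuracy on the training set
print(bag_model.score(X_test, y_test))     # accuracy on the testing set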
Let us now use the Random Forest Classifier.
Building the model using Random Forest Classifier
The Random Forest Classifier trains several decision trees on different subsets of the data and is a typical example of a bagging algorithm.
Random Forests use bagging underneath to randomly sample the dataset with replacement. They sample not only the data rows but also the columns (features). The algorithm then follows the bagging steps to produce an aggregated final model.
Let us import the Random Forest Classifier.
from sklearn.ensemble import RandomForestClassifier
To use the RandomForestClassifier algorithm, run this code:
scores = cross_val_score(RandomForestClassifier(n_estimators=50), X, y, cv=5)
We have again used K-fold cross-validation to train and evaluate the model. To get the mean accuracy score, use this code snippet:
scores.mean()
The output is shown below:
0.7618029029793736
The accuracy score is higher than that of the single decision tree, showing that bagging algorithms increase the model's accuracy score and help prevent overfitting.
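If we also want to evaluate the random forest on the same hold-out split used for the bagging classifier, a quick sketch (my own addition, reusing the X_train/X_test split created earlier) would look like this:
rf_model = RandomForestClassifier(n_estimators=50, random_state=0)
rf_model.fit(X_train, y_train)          # train on the training set
print(rf_model.score(X_test, y_test))   # accuracy on the testing set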
Conclusion
This tutorial guided the reader through implementing bagging algorithms in Python. We discussed the difference between Bagging and Boosting and went through all the steps involved in Bagging, illustrating how each step works.
Furthermore, we built a diabetes classification model using both a single algorithm and bagging algorithms. The bagging algorithms produced a model with a higher accuracy score, showing that they are well suited for building better models.
To get the Python code, click here.
References
- Bootstrap aggregating.
- How bagging works.
- Ensemble Learning: Bagging & Boosting.
- Ensemble methods.
- Cross-validation.
- Boosting Algorithms in Python.
- Scikit-learn documentation.
Peer Review Contributions by: Jerim Kaura