Building an Ensemble Learning Based Regression Model using Python
Machine learning models are always evaluated based on their performance using specific metrics such as accuracy, precision, Mean Squared Error (MSE), and so on. Each type of machine learning problem has its own evaluation metrics. <!--more--> Building a high-performance model (one with low error), therefore, depends on how good its evaluation metric score is. In this tutorial, we will build a performance-driven regression model using ensemble learning.
Prerequisites
To follow along with this tutorial, you'll need:
- To know the basics of Python.
- To have a Kaggle account.
- To know the basics of Machine Learning.
Introduction
Linear regression is a statistical method of modeling the relationship between independent variables (x) and dependent variables (y). It uses independent variables (features) to predict dependent variables (target).
Ensemble learning is a machine learning technique that seeks to achieve a better predictive model performance by combining decisions from different models.
For our model's evaluation, we will be using RMSE (Root Mean Squared Error).
NB: Regression problems cannot be evaluated using the accuracy metric, since the goal is to measure how close the predicted values are to the expected values, not whether a prediction is exactly correct. Hence, we use error metrics to evaluate regression models.
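Before touching the competition data, here is a minimal, self-contained sketch (using toy data, not the Kaggle dataset) of how RMSE is computed with scikit-learn's `mean_squared_error` by passing `squared=False`, the same call we will use later on:

```python
# A minimal sketch, using made-up toy data rather than the tutorial's dataset,
# showing how RMSE is computed for a regression model.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.RandomState(42)
X = rng.rand(100, 3)                                                   # independent variables (features)
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=100)  # dependent variable (target)

model = LinearRegression().fit(X, y)
preds = model.predict(X)

# squared=False turns MSE into RMSE (the root of the mean squared error)
rmse = mean_squared_error(y, preds, squared=False)
print(f"RMSE: {rmse:.4f}")
```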
Setting up your environment
Before building our model, we will first go to Kaggle, create a new notebook, and rename it to Create_Folds.
After that, download the data from Kaggle, add it to your environment using the Add Data button, and upload the downloaded data as a dataset named Dataset.
HINT: To flawlessly upload your data to Kaggle, compress the datasets.
Creating k-folds
Once done with setting up the environment, we will move on to creating k-folds for our dataset.
Cross-validation is a validation technique used to evaluate machine learning models on a finite dataset. It is quite popular as it is easier to understand and results in less biased predictions than other methods like train/test split.
Whenever you start on a machine learning problem, it is also best to create the folds first and use them throughout the modeling process.
Importing the necessary libraries
Before proceeding, we need to import the following necessary libraries:
import numpy as np
import pandas as pd
from sklearn import model_selection
Read data
We will now load our dataset into the notebook. Since the files are CSVs, we will use the pandas library's read_csv() function to read them.
The code is as shown below:
train_data = pd.read_csv('/kaggle/input/Dataset/train.csv')
test_data = pd.read_csv('/kaggle/input/Dataset/test.csv')
submission = pd.read_csv('/kaggle/input/Dataset/sample_submission.csv')
Creating the folds
As shown below, let's create a new column named kfold as the last column and initialize it to -1.
train_data['kfold'] = -1
We will then proceed to create 5 folds using the following code block:
kf = model_selection.KFold(n_splits=5, shuffle=True, random_state=42)

for fold, (train_indices, valid_indices) in enumerate(kf.split(X=train_data)):
    train_data.loc[valid_indices, "kfold"] = fold
After running the cell above, we will output the new CSV file (train_kfolds.csv) with the folds by running the code block below:
train_data.to_csv('train_kfolds.csv', index=False)
Here's the Kaggle notebook, which you can copy and edit.
Building a regression model
After creating the folds, we will download train_kfolds.csv from the output data of our Create_Folds notebook.
We'll then follow the same steps as in setting up your environment to create a new notebook called RegressionModel and upload both the original dataset and the train_kfolds.csv file (added here as datasets named Dataset2 and trainfolds, respectively).
After we're done with the environment setup, we'll proceed to build our model.
Importing necessary libraries
To build our regression model, we need to import the following libraries:
import pandas as pd
import numpy as np
from sklearn.preprocessing import OrdinalEncoder
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from xgboost import XGBRegressor
Once done, we will then proceed to read our data.
Read data
We will read our newly uploaded data, Dataset2 and trainfolds, using the code block below:
data = pd.read_csv('/kaggle/input/trainfolds/train_kfolds.csv')
test_data = pd.read_csv('/kaggle/input/Dataset2/test.csv')
submission = pd.read_csv('/kaggle/input/Dataset2/sample_submission.csv')
Feature selection
We will select the useful features from our dataset and drop the columns that are not useful or impactful. In this dataset, the columns that are not useful as features are id, target, and kfold.
To select the useful features, run the following block of code:
useful_features = [i for i in data.columns if i not in ("id", "target", "kfold")]
object_cols = [col for col in useful_features if "cat" in col]
test_data = test_data[useful_features]
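As an optional sanity check, you can print the selected columns; the exact names below assume the 30 Days of ML dataset, where categorical columns contain "cat" in their names:

```python
# Optional sanity check: inspect the selected feature columns.
# Assumes the 30 Days of ML dataset, where categorical column names contain "cat".
print(f"{len(useful_features)} useful features")
print("Categorical columns:", object_cols)
```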
Modeling
To build our model, we will run the following block of code:
final_predictions = []

for fold in range(5):
    # Train on all folds except the current one; validate on the current fold
    xtrain = data[data.kfold != fold].reset_index(drop=True)
    xvalid = data[data.kfold == fold].reset_index(drop=True)
    xtest = test_data.copy()

    ytrain = xtrain.target
    yvalid = xvalid.target

    xtrain = xtrain[useful_features]
    xvalid = xvalid[useful_features]

    # Data Encoding
    oe = OrdinalEncoder()
    xtrain[object_cols] = oe.fit_transform(xtrain[object_cols])
    xvalid[object_cols] = oe.transform(xvalid[object_cols])
    xtest[object_cols] = oe.transform(xtest[object_cols])

    # Model Training
    model = XGBRegressor(random_state=fold, n_jobs=5)
    model.fit(xtrain, ytrain)
    preds_valid = model.predict(xvalid)
    preds_test = model.predict(xtest)
    final_predictions.append(preds_test)

    # RMSE for the current fold (squared=False returns the root of the MSE)
    print(fold, mean_squared_error(yvalid, preds_valid, squared=False))
For each fold, we encode the categorical features and then train the model using XGBoost (Extreme Gradient Boosting), an ensemble learning technique, to boost the performance of our model.
XGBoost is a regularized boosting technique that provides high predictive power and is faster than other boosting techniques. We will then evaluate each fold individually and print out the results of the model.
Model evaluation
After individually evaluating each fold, we will now evaluate our model's performance by getting the mean predictions on our test data.
To do this, use the following code block:
preds = np.mean(np.column_stack(final_predictions), axis=1)
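To see what this line does, here is a tiny, self-contained illustration (with made-up arrays, not the real fold predictions) of how np.column_stack and np.mean(axis=1) average the per-fold test predictions:

```python
import numpy as np

# Pretend we have test predictions from three folds for four test rows
fold_preds = [
    np.array([7.0, 8.0, 9.0, 10.0]),
    np.array([7.2, 7.8, 9.4, 9.6]),
    np.array([6.8, 8.2, 8.6, 10.4]),
]

# Stack them as columns (shape: n_test_rows x n_folds), then average row-wise
blended = np.mean(np.column_stack(fold_preds), axis=1)
print(blended)  # -> [ 7.  8.  9. 10.]
```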
To see how our model performed, we will output the results of our model's prediction using the following code:
submission.target = preds
submission.to_csv("submission1.csv", index=False)
To see an output of our submission file, run the following code:
sub = pd.read_csv('/kaggle/working/submission1.csv')
sub
Bonus: You can make a late submission to the 30 Days of ML Kaggle challenge and see how your model performs, provided you had signed up for the challenge earlier.
Hyperparameter optimization
In this process, we'll fine-tune and optimize our model's algorithm parameters until we achieve the desired result.
A few common XGBoost parameters with a large effect on model performance include n_jobs, max_depth, learning_rate, n_estimators, colsample_bytree, and subsample.
To fine-tune our model, add the following changes to the XGBoost regressor:
model = XGBRegressor(random_state=fold, n_jobs=5, learning_rate=0.1, subsample=0.8,
                     max_depth=5, min_child_weight=1, gamma=0, scale_pos_weight=1)
Once you run the above code, you'll see our model's results improve slightly compared to the first run. You can keep changing the parameters until the model meets your desired goal, for example targeting an RMSE value like 0.7100 to measure your model's success.
You can also look at scikit-learn's GridSearchCV or Optuna, which make it easier to fine-tune your model.
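As a rough sketch (the parameter grid below is illustrative, not the tutorial's tuned values), GridSearchCV can search over a few XGBoost parameters using cross-validation with RMSE as the scoring metric:

```python
from sklearn.model_selection import GridSearchCV
from xgboost import XGBRegressor

# Illustrative parameter grid; these values are assumptions, not tuned results
param_grid = {
    "max_depth": [3, 5, 7],
    "learning_rate": [0.05, 0.1],
    "subsample": [0.8, 1.0],
}

search = GridSearchCV(
    estimator=XGBRegressor(random_state=42, n_jobs=5),
    param_grid=param_grid,
    scoring="neg_root_mean_squared_error",  # higher (less negative) is better
    cv=5,
)

# xtrain/ytrain are the encoded training features and target from the loop above
search.fit(xtrain, ytrain)
print(search.best_params_, -search.best_score_)
```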
Read more in this detailed hyperparameter tuning article, which goes beyond the scope of this tutorial.
Here's the Kaggle notebook for our regression model.
Conclusion
Building a performance-driven model is not an easy task; it involves refining the model again and again until we get the desired outcome.
Even so, mastering the art of modeling can be very rewarding, whether in a machine learning project, a data science project, or a competition.
Happy coding!
Peer Review Contributions by: Willies Ogola