
Machine Learning Model Explanation with Dalex

Machine learning engineers often build models whose implementation is hidden from end-users. The end-users apply the model without knowing how it works behind the scenes: they do not see the functions, methods, and algorithms that train it. End-users need an open and transparent model so they can understand how it works, which algorithms it uses, and how it makes its predictions. Together, these details help users understand the factors contributing to the model's decisions.

Dalex is an open-source library that explains and analyzes machine learning models. It makes the model transparent and open to the end-users.

Using Dalex, end-users can gain insight into the model. In this tutorial, we will build a bank marketing model and then use Dalex to explain it.


Prerequisites

To follow along with this tutorial, the reader should be familiar with Python and machine learning, and have access to Google Colab. Google Colab is a platform we can use to quickly build and run machine learning models with a fast CPU and GPU.

Dataset preparation

This section will guide the reader in preparing the bank marketing dataset that will train the model. The model will predict whether a customer will subscribe to a monthly saving plan or not. You can get the dataset from here.

Dataset exploration

We need to explore the dataset to know its variables. So let us first import Pandas to load the dataset.

import pandas as pd

Use this code to load the dataset:

df = pd.read_csv("/content/bank_marketing_dataset.csv")

To see the DataFrame, run the code snippet below:

df.head()

DataFrame output:

Data Frame

The output shows the first few rows of the dataset and its columns. To see all the columns in the dataset, use the code snippet below:

df.columns

The columns output:

Columns output

From the output above, the dataset has 21 columns. First, Dalex will show us how each column contributed to the model's final results. Then, we will select the input variables and the target variable from the dataset.
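Before selecting the variables, it is also worth checking how balanced the target column y is, since class balance affects how we interpret the accuracy score later:

df['y'].value_counts()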

Selecting the input and target variables

The input variables are the columns the model learns from to solve the classification problem. The target variable is the model's output: it is what the model predicts. In our case, the model will determine whether a customer will join a monthly saving plan or not, so this is our target.

Xfeatures = df[['age', 'job', 'marital', 'education', 'default', 'housing', 'loan',
       'contact', 'month', 'day_of_week', 'duration', 'campaign', 'pdays',
       'previous', 'poutcome', 'emp.var.rate', 'cons.price.idx',
       'cons.conf.idx', 'euribor3m', 'nr.employed',]]
ylabels = df['y']

The first 20 columns are the input variables/features, and the last column is the target variable.

Splitting the dataset

We need to split the bank marketing dataset into two sets: the training set and the testing set. Use this code to split the dataset:

from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test = train_test_split(Xfeatures,ylabels,test_size=0.3,random_state=7)
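As an optional sanity check, we can confirm the sizes of the two sets:

print(x_train.shape, x_test.shape)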

We have prepared and split the dataset. We will now start building the model.

Building the model

Machine learning pipeline

We will build the model using a machine learning pipeline. A machine learning pipeline is a way of automating and simplifying the machine learning workflow.

Various libraries support building a machine learning pipeline, but in this article, we will use the Scikit-learn Pipeline class.

To import the Pipeline class, use this code:

from sklearn.pipeline import Pipeline

To implement a machine learning pipeline, we need to initialize all the stages of the machine learning workflow.

The pipeline stages are as follows:

Data to data stage

This stage applies data transformation methods that convert the input data into the format the model requires. The training algorithm then uses the transformed data.

Let us import the transformer methods.

from sklearn.preprocessing import StandardScaler

We use the StandardScaler to initialize the data-to-data stage. StandardScaler standardizes each feature to zero mean and unit variance, which keeps all the features on a consistent scale.
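As a minimal sketch of what StandardScaler does, applied to a small made-up array rather than our dataset:

import numpy as np
from sklearn.preprocessing import StandardScaler

# Three samples of a single feature
sample = np.array([[1.0], [2.0], [3.0]])
scaled = StandardScaler().fit_transform(sample)
print(scaled.mean(), scaled.std())  # approximately 0.0 and 1.0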

Data to model stage

This stage uses an estimator algorithm to train the model on the transformed data. Here, we use one of the Scikit-learn classification algorithms.

Let us import the Scikit-learn algorithm.

from sklearn.linear_model import LogisticRegression

We will use LogisticRegression to train the bank marketing model. Let us combine the two stages to create the pipeline.

Combining the stages

Combining the stages allows the Pipeline class to run all the pipeline stages in sequence as a single estimator.

pipe_lr = Pipeline(steps=[('std_scaler',StandardScaler()),('lr',LogisticRegression())])

We added the StandardScaler and LogisticRegression to the Pipeline class. We will use the fit function to train the initialized pipeline stages.

Using the fit function

The fit function fits the Pipeline to the training set. The pipeline will then learn from the dataset. The output of this process is the final trained bank marketing model.

pipe_lr.fit(x_train,y_train)

The output of the final model:

Final model

The output shows all the initialized steps and the algorithm that trains the model. Let us get the accuracy score of this model:

Accuracy Score

Use this code below to get the accuracy score of the model:

print("LR:",pipe_lr.score(x_test,y_test))

The accuracy score:

LR: 0.9105770008901837

This might seem like a high accuracy score (about 91.06%). However, we need to know how each variable contributes to it. We will use Dalex to explain the model and gain a deeper understanding.

Getting started with Dalex

To install Dalex, execute this command in a terminal:

pip install dalex

Import the dalex library into your Python code:

import dalex as dx

Dalex explains the model in two ways: an overall model explanation and a single prediction explanation.

Overall model explanation

This method explains the overall structure of the trained model. It shows the functions and algorithms that build the model, the dataset that trained it, and how each variable in the dataset contributed to the general model performance.

Single prediction explanation

This method focuses on a single prediction of the model. It shows the variables that contributed to that specific prediction, helping the user understand why the model produced it.

Let us start with the overall model explanation.

Implementing the overall model explanation

We implement the overall model explanation as follows:

Using the Explainer function

We use the Explainer function to wrap the trained pipeline together with the training data and display the structure of the model.

exp = dx.Explainer(pipe_lr,x_train,y_train)

Output:

Overall model explanation

From the output, the dataset has 28831 rows and 20 columns. We also see the target and input variables used, the algorithm that trained the model, and other model information. Using this output, the user gains a better understanding of the model.

Checking the model performance

Next, we can also display the performance scores using this code:

exp.model_performance()

This code will display all the performance scores:

Overall model explanation

The accuracy score here is 0.91034, slightly different from the one we got earlier: Dalex computes it on the training data we passed to the Explainer, while the earlier score used the test set. It is still a high accuracy score.
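If you prefer these metrics as a table, the object returned by model_performance() also exposes a result DataFrame, following the usual Dalex convention:

performance = exp.model_performance()
performance.result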

Variable contribution to the model performance

Let us see how each variable in the dataset contributed to the model performance.

exp.model_parts()

Variable contribution output:

Variable contribution

The output shows how each variable in the dataset contributed to the model performance, ordered from the smallest to the largest contribution: the day_of_week variable contributes the least, while the emp.var.rate variable contributes the most.
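Under the hood, model_parts() computes permutation-based variable importance: it shuffles each variable and measures how much the model's loss increases. To inspect the raw numbers, we can read the result DataFrame, here sorted by the dropout_loss column (a sketch using the standard Dalex result attribute):

importance = exp.model_parts()
importance.result.sort_values('dropout_loss')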

We can also show the variable contribution using a plot diagram.

Variable contribution using a plot diagram

Dalex uses Plotly for its plots, so we install it first:

pip install plotly

To plot, use this code:

exp.model_parts().plot()

Plot diagram output:

Plot diagram

Let us now implement the single prediction explanation.

Implementing a single prediction explanation

To implement the single prediction explanation, we first need to use the model to make a single prediction. We will then explain the output.

Using the model to make a single prediction

Select a data sample for the model to predict:

data_sample = x_test.iloc[7]

To see the selected data sample, use this code snippet:

data_sample

The data sample output:

Data sample

Let us check the expected prediction.

y_test.iloc[7]

The expected prediction is the actual classification of the data sample.

1

The expected prediction is 1. Let us now use the model to classify this data sample.

Classifying the data sample

We use the predict function and pass in the data sample:

print("LR:",pipe_lr.predict([data_sample]))

The output:

LR: [1]
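Note that newer scikit-learn versions may warn about missing feature names when a bare list is passed to predict. Wrapping the sample in a DataFrame is an equivalent alternative that avoids the warning:

print("LR:", pipe_lr.predict(pd.DataFrame([data_sample])))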

The prediction is correct, but we still need to know how each variable in the dataset contributed to it.

Variable contribution to the single prediction

We first need to convert the data sample to a DataFrame that Dalex can understand.

data_frame = pd.DataFrame(data_sample).T

To see the DataFrame, use this code:

data_frame

The output:

Sample Data Frame

Let us now see the variable contribution to the single prediction:

explanation = exp.predict_parts(data_frame)
explanation.result

Variable contribution output:

Variable contribution

The output shows how the variables contributed to the single prediction, ordered from the largest to the smallest contribution: the emp.var.rate variable contributes the most, while the day_of_week variable contributes the least.
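Dalex also supports other attribution methods for single predictions. For example, passing type='shap' to predict_parts computes SHAP-based contributions for the same observation (a sketch with the same Explainer):

shap_explanation = exp.predict_parts(data_frame, type='shap')
shap_explanation.result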

We can also show the same using a plot diagram.

Using a plot diagram

To plot, use this code:

explanation.plot()

Plot

The plot breaks down all the variables in the dataset. The variables with the light-green color have positive contributions to the prediction. Conversely, the variables with the red color have negative contributions.

Using this plot, we can see how each variable contributed to the prediction, both positively and negatively. We have now built a machine learning model and explained its results to make it transparent.

Conclusion

We have gone over model explanation with Dalex. We discussed the importance of model explanation and how it builds transparent models. We prepared the dataset for the model to use. We also implemented a machine learning pipeline that simplifies the machine learning process.

After training the model, we implemented Dalex for the model explanation. Using Dalex, we did an overall model explanation and a single prediction explanation.

To get the Google Colab notebook for this tutorial, click here.

Happy coding!

Peer Review Contributions by: Jerim Kaura

Published on: Jun 9, 2022
Updated on: Jul 12, 2024