Building a Glass-box model using InterpretML
Glass box models are models that are transparent to the user: all the features and model parameters are known, and so are the criteria the model uses to reach its predictions and conclusions. <!--more--> This gives full transparency; one can see how the model works behind the scenes. Glass box models are also robust and easily adaptable in a production environment.
Traditionally, many machine learning engineers have built black-box models. In a black-box model, the user does not know the internal workings of the model, and so must accept its output blindly.
Black-box models therefore hide the actual risk of using them. A glass box model is the better alternative because it gives you full visibility into the model.
In this tutorial, we will build both a black-box model and a glass box model. We will build the glass box model using InterpretML, which has methods and attributes that explain every decision the model makes.
Table of contents
- Prerequisites
- Advantages of glass box models
- Install InterpretML
- Exploring the dataset
- Black box model
- Labels and features
- Split the dataset
- Model accuracy score
- Glass box model
- Making a prediction
- Interpreting the model results
- Global explanation
- Local explanation
- Conclusion
- References
Prerequisites
To follow along with this tutorial, you need the following:
- Have Python programming skills.
- Know how to build machine learning models.
- Know how to work with the Scikit-learn library.
- Know how to run Google Colab notebooks.
Advantages of glass box models
Glass box models have the following advantages:
- They help data scientists and machine learning engineers understand the behavior of models.
- They can be easily debugged.
- Their prediction results can be inspected, so they are easier to validate and trust.
- They are simple to train and use.
- They reduce risk when adopted in a business, because business stakeholders can clearly understand how the model works.
Install InterpretML
To use InterpretML, we first need to install it. Let us install it using the following command:
```bash
!pip install interpret
```
Now that we have installed InterpretML, let us start working with our dataset.
Exploring the dataset
We need to import the exploratory data analysis packages to work with our dataset. These packages are used for data analysis and manipulation.
```python
import pandas as pd
import numpy as np
```
We will use a dataset collected from different banks. The dataset contains information about bank customers, and we will use it to predict whether a customer will subscribe to a term deposit or not.
A snippet of the dataset is shown below.
To get the full dataset used in this tutorial, click here, then use the snippet below to load it:
```python
df = pd.read_csv("/content/bank-full.csv", sep=';')
```
The code snippet below enables us to see the structure of our dataset.
```python
df.head()
```
The structure is shown below.
We also need to format our dataset: all columns should have the same data type for uniformity during model training. Run this snippet to show the current types:
```python
df.dtypes
```
The output is shown below.
From the image above, the dataset is not uniform. For example, we have both `int64` and `object` columns, so we need to convert all the data types into `int64`. `int64` values are easily readable by the model because they are numbers, while `object` values are categories or groups.
The process of converting `object` to `int64` is known as categorical encoding. For a detailed explanation of categorical encoding, click here.
To convert our data types, run this code:
```python
df1 = pd.DataFrame({col: df[col].astype('category').cat.codes for col in df}, index=df.index)
```
In the code above, `cat.codes` assigns a numeric code to each category or group in our dataset. To see the results, run this code:
```python
df1.head()
```
The output is shown below.
From the image above, we can see that our dataset has been converted into numeric values. Now that our dataset is appropriately formatted, we can start model building.
Black box model
We will begin by building a black-box model. First, let us import the machine learning packages:
```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, cross_val_score
```
Let us see what we have imported.
- `LogisticRegression`: the algorithm used to train our black-box classification model.
- `train_test_split`: splits the dataset into a train set and a test set.
- `cross_val_score`: calculates the accuracy score of a model using cross-validation (a usage sketch appears in the model accuracy section below).
Labels and features
Features are the `X` variable in our dataset. They represent all the input columns that will be used during model training.
```python
Xfeatures = df1[['age', 'job', 'marital', 'education', 'default', 'balance', 'housing', 'loan', 'contact', 'day', 'month', 'duration', 'campaign', 'pdays', 'previous', 'poutcome']]
```
Labels are the `y` variable in our dataset. They represent the output column during the prediction phase.
```python
ylabels = df1['y']
```
Split the dataset
To split the dataset, run this command.
```python
x_train, x_test, y_train, y_test = train_test_split(Xfeatures, ylabels, test_size=0.3, random_state=7)
```
The dataset will be split using `test_size=0.3`: 70% of the dataset will be used for training and 30% for testing.
We will build the model using the `LogisticRegression` algorithm.
```python
lr_model = LogisticRegression()
lr_model.fit(x_train, y_train)
```
We initialize the algorithm as `LogisticRegression()` and then fit the model to our dataset. The `fit` method enables the model to learn from `x_train` and `y_train`, gaining enough knowledge to make predictions.
Model accuracy score
We calculate the accuracy score using the following code.
```python
lr_model.score(x_test, y_test)
```
The results are shown below.
```
0.8905927455028015
```
The accuracy score is approximately 89.06%.
This is a black-box model: it does not explain to the user how it reached that accuracy score, nor does it show which features contributed to it. For these reasons, we need a glass box model.
Glass box model
As mentioned earlier, we will use InterpretML to build a glass box model. Use the following code snippet to import it:
```python
import interpret
```
We will use the `ExplainableBoostingClassifier` algorithm to build the glass box model. This algorithm is built into InterpretML and produces a transparent model.
`ExplainableBoostingClassifier` combines several techniques, fitting a generalized additive model using gradient boosting, which makes the resulting model both efficient and accurate.
Use the snippet below to import `ExplainableBoostingClassifier`:
```python
from interpret.glassbox import ExplainableBoostingClassifier
```
We can now initialize the algorithm and fit it to our training set:
```python
ebm = ExplainableBoostingClassifier()
ebm.fit(x_train, y_train)
```
As shown below, this process will train and build our glass box model.
We can now calculate the accuracy score of our model.
```python
ebm.score(x_test, y_test)
```
The accuracy score is shown below.
```
0.9081391919787674
```
The accuracy score is approximately 90.81%, which is higher than the black-box model's score.
Making a prediction
We will use the following data sample to make a prediction with the model:
```python
ex1 = x_test.iloc[8]
```
This code selects the 9th row in our testing set.
```python
print(ebm.predict([ex1]))
print(ebm.predict_proba([ex1]))
```
The `ebm.predict` method prints the prediction result, while `ebm.predict_proba` prints the prediction probabilities, as shown below:
```
[0]
[[0.93202639 0.06797361]]
```
The prediction is `0`, showing that the customer will not subscribe to the term deposit. The probability of this class is `0.93202639`, so the model is very confident in this prediction.
Since this is a glass box model, we can use InterpretML to explain why the model made this prediction. It also shows us which features contributed to the accuracy score.
Interpreting the model results
We can now explore the methods and attributes used for interpretation.
```python
dir(interpret)
```
This will show us all the methods and attributes as shown below.
From the image above, InterpretML has different methods and attributes. The one we are most interested in is the `show` method. This method gives us a clear view of all the factors contributing to any decision made by our model.
Let us import the `show` method:
```python
from interpret import show
```
We have two techniques for explaining the model results: global explanation and local explanation. A global explanation describes the overall structure and behavior of the model. A local explanation explains an individual prediction or classification made by the model, so it is more specific than a global explanation.
Global explanation
To perform a global explanation, run this code.
```python
ebm_global = ebm.explain_global()
```
The `explain_global()` method is used to explain the model as a whole. To see the explanation in a user interface, run this code:
```python
show(ebm_global)
```
The `show` method will display a user interface, as shown below.
The image above summarizes the overall importance of each feature, arranged by level of importance. The `duration` feature has the highest importance: it contributed the most to the model's accuracy score and prediction results, while the `education` feature contributed the least.
Local explanation
To perform a local explanation, run this code.
```python
ebm_local = ebm.explain_local(x_test, y_test)
```
The `ebm.explain_local` method explains the individual predictions. To display a user interface, run this code:
```python
show(ebm_local)
```
The interface is shown below.
The features are grouped into those that contributed towards the prediction result and those that counted against it. Features supporting the prediction are colored orange, while those against it are colored blue. Using the UI, users can see the role played by each feature, which leads to more transparent models that people can easily understand.
Conclusion
In this tutorial, we have learned how to build both a black-box model and a glass box model. We also discussed the advantages of glass box models and why they are preferred over black-box models.
After building both models, we compared their results to see which was better. Finally, using InterpretML, we explained our glass box model, which gives users a detailed understanding of how the model works.
Glass box models are often the better alternative, and users can more easily trust them. This reduces risk when using the model in production. To get the full notebook for this tutorial, click this link.
References
- Google Colab notebook
- Black box vs Glass box
- Advantages of glass box models
- Scikit-learn documentation
- InterpretML documentation
Peer Review Contributions by: Mercy Meave