How to Analyze Machine Learning Models using SHAP
Explainable AI describes the general structure of a machine learning model and analyzes how the model's features and attributes impact its results. <!--more--> Model analysis determines the logical reasoning the model followed when making a prediction and explains the decisions it made. The concept of analyzing these prediction results is known as explainable AI. Explainable AI enables us to understand the prediction results and builds trust and confidence when using the model.
In production, we must have a model that we can trust. If a model fails in production, it can seriously disrupt business operations and lead to losses.
In this tutorial, we will start by building a diabetes prediction model. We will then use SHAP to analyze and explain the prediction results made by this model.
Table of contents
- Prerequisites
- Importance of Explainable AI
- Modeling dataset
- Importing exploratory data analysis packages
- Checking for missing values
- Checking for our data types
- Diabetic vs non-diabetic
- Adding features and labels
- Dataset scaling
- Splitting our dataset
- Building the model
- Calculate the accuracy score
- Testing our model
- Making a single prediction
- Getting started with SHAP
- Initialize KernelExplainer
- Creating shapley values
- Force plot diagram
- Analyzing the force plot
- Plotting a summary plot
- Conclusion
- References
Prerequisites
To understand this tutorial, a reader should:
- Install Python on their machine.
- Understand Python programming.
- Know how to build machine learning models using Scikit-learn.
- Know how to use Google Colab notebooks for projects.
- Know how to use Pandas and Numpy for machine learning.
Importance of Explainable AI
Explainable AI is important for the following reasons.
- Understand the model's internal functionality and decision-making process: Explainable AI gives us a deeper understanding of how a model arrived at its decisions and lets the user see the criteria it used to make them.
- Reduce model bias: When analyzing the machine learning model, we select the best features to build the model. This reduces the bias made by the model during prediction.
- Improve the model performance: It improves model performance by selecting the best features to make predictions.
- Helps in model debugging: Analyzing the model makes it easier to find and remove bugs and errors in it.
- Identify wrong predictions: Explainable AI helps pinpoint the wrong predictions made by a model. A model that makes wrong predictions should not be deployed to production.
Let's start working with our dataset.
Modeling dataset
The dataset collected contains information for both diabetic and non-diabetic patients. This dataset will be used to train our model.
Our model will learn from this dataset, identify patterns, and improve with experience. It will eventually use this information to make predictions.
Let's have a look at this dataset.
From the image above, our dataset does not have column names or headers. To add column names or headers, we import the exploratory data analysis (EDA) packages.
To download this dataset, click here.
Importing exploratory data analysis packages
We will import two packages: Pandas and Numpy.
We use Pandas to import our dataset and to add the column names to it. Numpy will be used for mathematical operations.
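The rest of the code assumes the conventional aliases `pd` and `np`, so the imports look like this:

```python
# Exploratory data analysis (EDA) packages.
import pandas as pd   # loading the CSV file and handling data frames
import numpy as np    # mathematical operations on arrays
```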
Let's initialize our column names.
Initializing column names
We initialize the column names as follows.
names = ["Num_of_Preg","Glucose_Conc","BP","Skin_Thickness","TwoHour_Insulin","BMI","DM_Pedigree","Age","Class"]
The names shown above correspond to the information collected from patients. This information is used to determine whether a person is diabetic or non-diabetic.
Let's add these names to our dataset.
df = pd.read_csv("diabetes-prediction.csv",names=names)
In the code above, we use Pandas to import our dataset and add the initialized column names. Let's confirm that the column names have been added to our dataset.
df.head()
Let's check for missing values in our dataset.
Checking for missing values
To check for missing values, run this command.
df.isna().sum()
Let's see the output.
From the image above, we do not have any missing values. This shows that all of the data points are available. From here, we can check the data types of our dataset columns.
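Before we do that, note that if your own dataset did contain missing values, a minimal (hypothetical) way to handle them would be to drop or fill the incomplete rows:

```python
# Only needed when df.isna().sum() reports missing values.
df = df.dropna()                 # option 1: drop rows with missing values
# df = df.fillna(df.median())    # option 2: fill gaps with each column's median
```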
Checking for our data types
To check the data types, use this code.
df.dtypes
The output is shown below.
From the image above, we have two data types: `int64` and `float64`. Both are numbers, which shows that our data types are uniform.
NOTE: All the data types must be numeric. Numbers are more machine-readable compared to other data types. If you find that your data types are not numbers, you need to convert them to either `int64` or `float64`.
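If a column does come in as text, a hedged sketch of the conversion (using a hypothetical column name) looks like this:

```python
# Hypothetical example: 'some_column' is a placeholder, not a column in this dataset.
df["some_column"] = pd.to_numeric(df["some_column"], errors="coerce")
```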
Our dataset contains information on both diabetic and non-diabetic patients. Let's check how they are distributed in our dataset.
Diabetic vs non-diabetic
To check the class distribution, run this command.
df.groupby('Class').size()
The output is as shown below.
Class
0 500
1 268
dtype: int64
From the output, we have two classes: `0` and `1`. `0` represents non-diabetic patients while `1` represents diabetic patients.
From the numbers, we can see that we have `500` non-diabetic patients and `268` diabetic patients.
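To see the same distribution as proportions rather than raw counts, we can also run:

```python
# Fraction of each class in the dataset.
df['Class'].value_counts(normalize=True)
```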
The dataset needs information on both diabetic and non-diabetic patients. This allows the model to learn from both classes and helps to reduce model bias.
Adding features and labels
Features are the unique attributes and variables in the dataset. They are used as input for the model. Features will train our model during the training phase.
Labels are the target or the output variable. Labels are what the model is trying to predict during the prediction phase.
From our dataset, we have `8` features. These are: `Num_of_Preg`, `Glucose_Conc`, `BP`, `Skin_Thickness`, `TwoHour_Insulin`, `BMI`, `DM_Pedigree` and `Age`.
Let's add these features.
df.iloc[:,0:8]
We use `iloc` to select the columns from index `0` up to, but not including, index `8` of our dataset.
The output is as shown below.
The image above shows all the `8` features added to our dataset. We now save the features into a variable called `Xfeatures`.
Xfeatures = df.iloc[:,0:8]
Let's now add our label.
Our label is the `Class` column. The `Class` column contains either `1` or `0`: `0` represents non-diabetic patients while `1` represents diabetic patients.
Ylabels = df['Class']
Let's now scale our dataset.
Dataset scaling
Dataset scaling brings all the input features onto a common scale so that no single feature dominates the model because of its range. It also converts the input values from `int64` to `float64`.
Let's import the Python package that will be used to scale our dataset.
from sklearn.preprocessing import MinMaxScaler as Scaler
The `MinMaxScaler` will be used to scale our dataset.
For further reading on how the `MinMaxScaler` works, read this article.
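Conceptually, the `MinMaxScaler` maps every feature column to the `0` to `1` range (with its default settings). A minimal sketch of the formula it applies, shown on a toy column, looks like this:

```python
import numpy as np

x = np.array([2.0, 5.0, 8.0])                    # toy feature column
x_scaled = (x - x.min()) / (x.max() - x.min())   # MinMaxScaler's default formula
print(x_scaled)                                  # [0.  0.5 1. ]
```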
Since we are scaling the input variables, we will scale our `Xfeatures`. Let's run this code to scale our `Xfeatures`.
scaler = Scaler()
X = scaler.fit_transform(Xfeatures)
In the code above, we initialize our scaler as `Scaler()`. We then call the `fit_transform` method, which learns the minimum and maximum of each column in `Xfeatures` and transforms the values onto the new scale.
We then need to convert our scaled dataset back into a data frame. The data frame structures our dataset into rows and columns, as shown below.
X = pd.DataFrame(X,columns=names[0:8])
To see our scaled data frame, run this command.
X.head()
The output is shown below.
Let's now split our dataset.
Splitting our dataset
We split the dataset into two sets: the training set and the test set. The training set is used during the training phase; the model learns from this dataset.
The test set is used to evaluate the model performance. It also measures the model's accuracy score.
Let's import the package required to split our dataset into two.
from sklearn.model_selection import train_test_split
We use `train_test_split` to split our dataset.
X_train,X_test,y_train,y_test = train_test_split(X,Ylabels,test_size=0.2,random_state=42)
In the code above, we used `test_size=0.2`. This implies that `80%` of the data will be used as the training set and the remaining `20%` as the test set.
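To confirm the split sizes, you can print the shapes of the resulting sets:

```python
# Roughly 80% of the rows go to training and 20% to testing.
print(X_train.shape, X_test.shape)
```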
Let's start building our model.
Building the model
Let's start by importing the machine learning packages.
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
As shown above, we have imported the following.
- `LogisticRegression`: This is the algorithm used to train our model.
- `accuracy_score`: This method is used to calculate the accuracy score of the model when making predictions.
Let's now build the model using the `LogisticRegression` algorithm.
logit = LogisticRegression()
logit.fit(X_train,y_train)
In the code above, we have initialized our algorithm as `LogisticRegression()`. We then use the `fit()` method to fit our model to the training set.
The model learns from the training set, identifies patterns, and improves with experience. It eventually uses this information to make predictions.
Let's calculate the accuracy score of this model.
Calculate the accuracy score
We calculate the accuracy score using the `accuracy_score` method we imported earlier.
print("Accuracy Score of Logistic::", accuracy_score(y_test, logit.predict(X_test)))
The accuracy score is as shown below.
Accuracy Score of Logistic:: 0.7727272727272727
When converted into a percentage, this is `77.27%`. With more data and further tuning, the accuracy score can improve.
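Scikit-learn classifiers also provide a `score()` method that returns the same mean accuracy on the test data, so an equivalent check is:

```python
# score() returns the mean accuracy of predictions on X_test against y_test.
print("Accuracy Score of Logistic::", logit.score(X_test, y_test))
```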
Testing our model
During this phase, we use our model to make predictions. Our model should predict whether a given input sample belongs to a diabetic or non-diabetic patient.
Let's extract a sample input from the test set. The sample input is the first data point of the test set, represented by index `0` of the array.
X_test.values[0]
The values of the sample input are shown below.
array([0.35294118, 0.3483871 , 0.34693878, 0.33333333, 0.22458629,
0.32310838, 0.15029889, 0.36666667])
Let's now make a single prediction using these values.
Making a single prediction
To make a prediction, we use the `predict()` method as shown below.
logit.predict(np.array(X_test.values[0]).reshape(1,-1))
We use `np.array` to wrap the values of the first data point, then `reshape(1,-1)` to turn them into a single row with one column per feature, which is the input shape the `predict()` method expects.
Let's see the prediction results.
array([0])
The prediction result is `0`. This shows that the input sample belongs to a non-diabetic person.
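To make the output easier to read, here is a small, hypothetical helper (the `label_map` name is ours, not part of the tutorial) that maps the numeric class to a label and also prints the predicted probabilities that SHAP will explain in the next section:

```python
# Hypothetical helper: map the numeric class to a readable label.
label_map = {0: "non-diabetic", 1: "diabetic"}

sample = np.array(X_test.values[0]).reshape(1, -1)
pred = logit.predict(sample)[0]          # predicted class: 0 or 1
proba = logit.predict_proba(sample)[0]   # probabilities for [class 0, class 1]

print("Prediction:", label_map[pred])
print("Probabilities (non-diabetic, diabetic):", proba)
```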
We now need to explain this prediction result. This enables us to see why and how the model reached this conclusion.
Let's get started with SHAP.
Getting started with SHAP
SHapley Additive exPlanations (SHAP) is a game-theoretic technique used to analyze and explain the prediction results of a machine learning model. It relies on Shapley values.
Shapley values are weights assigned to the model features. They show how each feature contributed to the prediction results and quantify the impact each feature had on them.
Let's install SHAP. We can install SHAP using the following command.
!pip install shap
Let's import SHAP.
import shap
We also need to initialize JavaScript, which SHAP uses to render its visualization diagrams. SHAP uses different plotting techniques to explain the prediction results.
Initializing JavaScript
shap.initjs()
Let's now start explaining the prediction results. We use the `KernelExplainer` function to explain the prediction results made above. `KernelExplainer` is a SHAP function that works well with the `LogisticRegression` algorithm.
Let's initialize the `KernelExplainer` function.
Initialize KernelExplainer
We need to pass our training set to `KernelExplainer` so that it can learn from the dataset. This makes it easier for the function to explain the prediction results.
explainer = shap.KernelExplainer(logit.predict_proba, X_train)
In the code above, we pass in `X_train`, which enables the `KernelExplainer` to understand the patterns in the data. We also pass `logit.predict_proba`, where `logit` represents our trained `LogisticRegression` model.
From here, we can now create Shapley values. Shapley values will assign weights to the features available in the data set.
Creating Shapley values
This code creates the Shapley values for the features of our input sample. We are using the same input sample from the previous section, and we will explain the prediction made for this instance.
shap_values = explainer.shap_values(X_test.iloc[0,:])
To see the generated Shapley values, run this command.
shap_values
The generated Shapley values are as shown below.
We now use these generated values to plot a force plot diagram.
Force plot diagram
Force plots are diagrams that give a visual representation of how each feature contributed to the prediction results. They give an intuitive understanding of how the weights impacted the prediction results.
To plot a force plot diagram run this command.
shap.force_plot(explainer.expected_value[0], shap_values[0], X_test.iloc[0,:])
The plotting function has the following parameters.
- `explainer.expected_value[0]`
This is the base value for the non-diabetic class (class `0`): the average model output over the training data. SHAP explains how the features moved this prediction away from the base value.
- `shap_values[0]`
We select index `0` to pick the Shapley values that contribute to the non-diabetic prediction.
- `X_test.iloc[0,:]`
The `iloc` method selects the first data point of `X_test`, the same input sample we used to make the prediction in the previous section. SHAP will explain why this input sample was predicted as a non-diabetic person.
The force plot output is as shown below.
Let's analyze this force plot.
Analyzing the force plot
The force plots have the following attributes.
Base value
This is the average model output (the mean predicted probability for the non-diabetic class) over the training data. In our case, the base value is `0.67`. We compare the output value against it to see which way the features pushed this prediction.
If the output value falls below `0.67`, the features pushed the prediction away from the non-diabetic class. If the output value is equal to or greater than `0.67`, the features pushed the prediction towards the non-diabetic class, supporting the model's prediction.
Output value
This is the actual prediction value for this input sample. In our case, the output value is `0.70`.
Assigned feature weights and values
These contribute to the prediction score of `0.70`, which is shown in bold.
Red color block
This represents features that influence the prediction results positively. They push the prediction results from the base value of `0.67` to the output value of `0.70`. From the image above, these include features such as `TwoHour_Insulin` and `Glucose_Conc`.
Blue color block
This represents features that influence the prediction results negatively. They try to drag the prediction value below the base value of `0.67`.
Size of the color block
The size of a color block represents the importance of the feature. The larger the block, the greater the impact the feature had on the prediction results.
For example, in the red color block, `Glucose_Conc` has the largest size. This indicates that it had the greatest impact on the prediction results.
In the blue color block, `Age` has the greatest impact.
From the above force plot, we can make the following conclusion.
- The features support the model's prediction: the output value of `0.70` is above the base value of `0.67`, which means the features pushed the prediction towards the non-diabetic class.
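If you want to view the same instance from the diabetic-class perspective, you can plot the complementary force plot by selecting index `1` instead of `0` (same function, different class):

```python
# Force plot for the diabetic class (class 1) of the same input sample.
shap.force_plot(explainer.expected_value[1], shap_values[1], X_test.iloc[0,:])
```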
We can also use a summary plot. Summary plots are used to show how all the features contributed to the prediction results.
Plotting a summary plot
To plot a summary plot, use this code.
shap.summary_plot(shap_values,X_test)
The output of the summary plot is shown below.
From the summary plot, we can see that `Glucose_Conc` has the highest impact, while `Skin_Thickness` has the least impact.
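If you prefer a simple ranking of overall feature importance, `shap.summary_plot` also accepts a `plot_type="bar"` argument:

```python
# Bar-chart version of the summary plot: mean absolute SHAP value per feature.
shap.summary_plot(shap_values, X_test, plot_type="bar")
```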
We have now successfully analyzed our prediction results.
To get the Google Colab link for this tutorial, click here.
Conclusion
In this tutorial, we learned how to analyze machine learning models using SHAP. We started by learning the importance of analyzing machine learning prediction results, which helps build trust and confidence when using models.
We then moved on to dataset pre-processing. This involved checking the data for missing values and scaling the dataset so that all features are on a common scale. After this, we built a machine learning model that predicts whether a patient is diabetic or not.
Finally, we used SHAP to explain the prediction results of our model. The analysis showed that the output value was higher than the base value, meaning the features supported the model's prediction.
Happy coding!
References
- Google colab link
- Basics of explainable AI
- SHAP for machine learning
- Scikit-learn documentation
- How to create Shapely values
- Force plot basics
- Summary plot basics
Peer Review Contributions by: Collins Ayuya