Diagnosis of Diabetes using Support Vector Machines
In this guide, we will learn how to use machine learning to diagnose whether a patient has diabetes, based on their medical records. We will use the Support Vector Machine algorithm (from scikit-learn) to build our model. The GitHub repo for this project is here.
Prerequisites
- A PC with Jupyter Notebook installed.
- Basic Python knowledge.
- Basic knowledge of Support Vector Machines.
- The diabetes dataset from Kaggle.
Outline
- Exploratory Data Analysis with Pandas-Profiling
- Feature Extraction
- Split Dataset into Training and Test Set
- Creating the SVM Model
- Diagnosing a New Patient
- Assess Model Performance
Exploratory data analysis with pandas-profiling
The pandas-profiling library helps us do quick exploratory data analysis with minimal effort.
To install pandas-profiling, run the code below:
pip install pandas-profiling
If you are using Anaconda, then you can run the following code in Anaconda Prompt:
conda install -c conda-forge pandas-profiling
Now we can import pandas-profiling and generate a report for our dataset. Before we load our dataset, let us import the libraries we will be using.
# importing libraries
import numpy as np
import pandas as pd
import pandas_profiling
from sklearn import svm
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report
We import our dataset using the read_csv function from pandas. We pass the dataset filename as an argument.
# Getting Data
dataset = pd.read_csv("diabetes.csv")
# generate report with pandas-profiling
profile = dataset.profile_report(title='Diabetes Profiling Report')
profile
Pandas-profiling gives us the dataset statistics.
Overview of the Dataset
From the report above, we see that the dataset has nine variables and 768 rows, with no missing data and no duplicate rows.
Pandas-profiling also examines each of the nine variables, giving descriptive statistics and a histogram of each variable's distribution.
Histogram - Pregnancy
Histogram - Glucose & Blood Pressure
Histogram - Skin Thickness & Insulin
Histogram - BMI & Pedigree
Histogram - Age & Outcome
We can see the mean, minimum, and maximum values of each variable. We can also observe the interaction plot between any pair of variables.
Interaction of features
We can see the Pearson, Spearman, Kendall, and Phik correlation matrix heat map.
Correlations of features
The report also visualizes missing values, showing exactly where they occur in the dataset. None of the variables contains any missing values, so we can proceed to build our model.
Visualization of missing values
Finally, we can view the first ten rows and last ten rows of the dataset.
First rows of the dataset
Last rows of the dataset
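We can also save the full report as a standalone HTML file to view or share outside the notebook, using pandas-profiling's to_file method (the filename here is our own choice):
# Save the report as a standalone HTML file
profile.to_file("diabetes_report.html")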
Feature extraction
We separate the features and the target variable. We have eight features.
# Extract Features
X = dataset.iloc[:, :8]
X.head()
Dataset features
Our target variable is the Outcome column. A value of 1 represents patients with diabetes, while 0 represents patients without diabetes.
# Extract Class Labels
y = dataset["Outcome"]
y.head()
Class labels
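Before splitting the data, it is worth checking how many patients fall into each class; this will matter later when we evaluate the model. A quick check with pandas' value_counts:
# Count the samples in each class
print(y.value_counts())
There are noticeably more patients without diabetes (0) than with diabetes (1).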
Split the dataset into training and test sets
We split our dataset into training and test sets. We use 75% of the dataset to train the model and the remaining 25% to test the model after training.
# Split Dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.75, random_state=0)
print(X_train.shape)
print(y_train.shape)
print(X_test.shape)
print(y_test.shape)
We can see the amount of data that we will use for training and testing.
X_train.head()
Training Set
We need to normalize the features in our training set. Normalizing here means standardizing: StandardScaler adjusts each column to have a mean of 0 and a standard deviation of 1. This keeps the features on comparable scales and makes the training process faster.
# Normalize Features
scaler = StandardScaler()
scaler.fit(X_train)
X_train = scaler.transform(X_train)
Let's view the first five rows of our training set after normalization.
# View first 5 rows
X_train[:5, :]
Normalized Training Set
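As a quick sanity check, each column of the normalized training set should now have a mean close to 0 and a standard deviation close to 1:
# Verify the standardization: mean ~0 and standard deviation ~1 per column
print(X_train.mean(axis=0).round(2))
print(X_train.std(axis=0).round(2))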
Creating the SVM model
The scikit-learn SVC class provides four commonly used kernels: linear, poly, rbf, and sigmoid. We do not know in advance which of these kernels will give us the best decision boundary.
So we iterate through the kernels to see which one gives the best decision boundary for this dataset. The decision boundary is the hyperplane or curve that separates the positive class from the negative class. It can be linear or non-linear.
Image Source: Logistic Regression and Decision Boundary
The polynomial and RBF kernels are suitable when the classes are not linearly separable.
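To build some intuition, here is a minimal sketch (assuming matplotlib is installed) that trains an RBF SVM on just two of our standardized features, Glucose and BMI, and plots the resulting non-linear decision boundary:
# Plot an RBF decision boundary using only two features (Glucose and BMI)
import matplotlib.pyplot as plt

X2 = X_train[:, [1, 5]]  # columns 1 and 5 hold Glucose and BMI
clf = svm.SVC(kernel='rbf').fit(X2, y_train)

# Evaluate the classifier on a grid covering the 2D feature space
xx, yy = np.meshgrid(np.linspace(X2[:, 0].min() - 1, X2[:, 0].max() + 1, 200),
                     np.linspace(X2[:, 1].min() - 1, X2[:, 1].max() + 1, 200))
Z = clf.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)

plt.contourf(xx, yy, Z, alpha=0.3)  # shaded regions show the predicted class
plt.scatter(X2[:, 0], X2[:, 1], c=y_train, edgecolors='k', s=20)
plt.xlabel('Glucose (standardized)')
plt.ylabel('BMI (standardized)')
plt.show()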
We fit the SVM model for each kernel to our training set. We make predictions on our training set to see which kernel will give us the highest accuracy score.
We call this hyperparameter optimization: the kernel is a hyperparameter that we are tuning.
# SVM Kernels
for k in ('linear', 'poly', 'rbf', 'sigmoid'):
    model = svm.SVC(kernel=k)
    model.fit(X_train, y_train)
    y_pred = model.predict(X_train)
    print(k)
    print(accuracy_score(y_train, y_pred))
Accuracy of the SVM kernels
The RBF (radial basis function) kernel gives us the highest accuracy score, so for this dataset it offers the best decision boundary: it classifies 82.4% of the training patients correctly. Now let us create our model using the RBF kernel.
# Using the best model
model = svm.SVC(kernel='rbf')
model.fit(X_train, y_train)
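Note that we compared kernels by their accuracy on the training data, which can be optimistic. As a sanity check, we could instead score each kernel with cross-validation; a minimal sketch using scikit-learn's cross_val_score:
# Optional sanity check: 5-fold cross-validation accuracy per kernel
from sklearn.model_selection import cross_val_score

for k in ('linear', 'poly', 'rbf', 'sigmoid'):
    scores = cross_val_score(svm.SVC(kernel=k), X_train, y_train, cv=5)
    print(k, scores.mean().round(3))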
Diagnosing a new patient
We use our model to make predictions on a new patient.
# Making a Single Prediction
# 'pregnancies', 'glucose', 'bloodPressure', 'skinThickness'
# 'insulin', 'bmi', 'pedigree', 'age'
patient = np.array([[ 1., 150., 70., 45., 0., 40., 1.5, 25]])
# Normalize the data with the values used in the training set
patient = scaler.transform(patient)
model.predict(patient)
We create a NumPy array containing the new patient's record, then normalize the data before passing it to the model for prediction. Note that we use the transform method this time instead of the fit_transform method.
That applies the same mean and standard deviation that normalized the training set. The result is 1, so the model predicts that this patient has diabetes.
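The training-set statistics are stored on the scaler itself, so we can inspect exactly what transform applies. mean_ and scale_ are standard StandardScaler attributes:
# The scaler stores the training-set statistics it applies in transform()
print(scaler.mean_)   # per-feature means learned from X_train
print(scaler.scale_)  # per-feature standard deviations learned from X_train
Next, let us see what the model predicts if we change the glucose level from 150 to 50.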
patient = np.array([[ 1., 50., 70., 45., 0., 40., 1.5, 25]])
# Normalize the data
patient = scaler.transform(patient)
model.predict(patient)
We get 0, which means this patient does not have diabetes. Let us view our test set.
# Viewing Test Set
X_test
Now let's try to diagnose the third patient in the test set (row label 113 in the original dataset). Remember, the positional index of the third patient is 2, since we start counting from 0.
# Checking the third patient in the test set with index 2
X_test.iloc[2]
# Convert dataframe to a numpy array
t_patient = np.array([ X_test.iloc[2]])
# Predicting on third patient in Test Set
t_patient = scaler.transform(t_patient)
print("Model's Prediction:", model.predict(t_patient))
print("Actual Prediction:", y_test.iloc[2])
We can see that our model's prediction is 0, and the actual label is also 0. This means our model made the correct prediction for this patient: the third patient does not have diabetes.
Assess model performance
Let us see the accuracy of the entire test set.
# Accuracy on Testing Set
X_test = scaler.transform(X_test)
y_pred = model.predict(X_test)
print("Accuracy Score:", accuracy_score(y_test, y_pred))
We normalize the test set before making predictions. We get an accuracy of 77.6%, which is lower than the accuracy on the training set. This is expected because the test set contains data our model has not seen before.
What if our model had predicted that no one had diabetes? What would be our accuracy in that case? Let us find out what the accuracy would be if our model had predicted 0 (no diabetes) for all the patients.
To do this, we create an array of zeros with the same shape as our test set classes.
# Comparison to All-Zero Prediction
y_zero = np.zeros(y_test.shape)
print(accuracy_score(y_test, y_zero))
When we compare the test-set labels with an all-zero array, we get an accuracy score of 67.7%. In other words, a model that predicts that no one in the test set has diabetes would still be 67.7% accurate.
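scikit-learn has a built-in way to compute such baselines: the DummyClassifier. Here is a short sketch that reproduces our all-zero comparison (the classifier ignores the features and always predicts the majority class):
# Baseline classifier that always predicts the majority class (0)
from sklearn.dummy import DummyClassifier
dummy = DummyClassifier(strategy='most_frequent')
dummy.fit(X_train, y_train)
print(dummy.score(X_test, y_test))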
This means our dataset is imbalanced: there are more samples of the class without diabetes. So the accuracy score alone will not help us evaluate our model. Instead, we can measure the performance of our model using precision and recall.
Precision tells us what fraction of the patients our model predicted to have diabetes actually have diabetes. Recall tells us what fraction of all the diabetic patients our model correctly detected.
A model with high precision helps us avoid treating people without diabetes, but we may end up not treating some patients who do have it. A model with high recall helps us treat most patients with diabetes, but we may end up treating patients who do not have it.
What we need is a good trade-off between precision and recall. That is where the f1-score comes in: it combines precision and recall into a single number that balances the two.
Precision is given as: True Positives/(True Positives + False Positives)
Recall is given as: True Positives/(True Positives + False Negatives)
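The f1-score is given as: 2 × (Precision × Recall)/(Precision + Recall)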
- True positives are patients who have diabetes and whom our model predicted to have diabetes.
- False positives are patients without diabetes whom our model predicted to have diabetes.
- True negatives are patients without diabetes whom our model predicted as not having diabetes.
- False negatives are patients who have diabetes but whom our model predicted as not having diabetes.
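We can read all four of these counts at once with scikit-learn's confusion_matrix:
# Rows are actual classes (0, 1); columns are predicted classes (0, 1)
from sklearn.metrics import confusion_matrix
print(confusion_matrix(y_test, y_pred))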
Let us calculate the precision, recall, and f1-score.
# Compute precision, recall and f1 score
from sklearn.metrics import recall_score, precision_score, f1_score
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
print("Precision is", precision)
print("Recall is", recall)
print("F1 score is", f1)
Precision, Recall, & F1-Score
We can also generate a classification report.
# Generate classification report
print(classification_report(y_test, y_pred))
Classification Report
Our precision, recall, and f1-score are approximately 0.71, 0.52, and 0.60 respectively. The model's performance is modest: with a recall of 0.52, it misses nearly half of the patients who actually have diabetes. For a healthcare problem, that means we could end up misdiagnosing diabetic patients, which is why we pay more attention to the recall score. We can improve our results by collecting more data.
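Besides collecting more data, one simple option worth experimenting with is SVC's class_weight parameter, which makes mistakes on the under-represented diabetic class cost more during training. This is a sketch, not a guaranteed improvement; it often trades some precision for higher recall:
# Re-weight the classes so errors on the minority class cost more
weighted_model = svm.SVC(kernel='rbf', class_weight='balanced')
weighted_model.fit(X_train, y_train)
print(classification_report(y_test, weighted_model.predict(X_test)))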
Conclusion
In this guide, we learned how to use the four SVM kernels from scikit-learn to build a machine learning model. Different kernels work better on different datasets. You can use pandas-profiling for quick exploratory data analysis.
The accuracy score is not a good metric for evaluating a model on a dataset with skewed classes, that is, imbalanced classes where there are more samples of one class than the other.
We can use precision, recall, and the f1-score to evaluate such a model, and we can improve its performance by collecting more data.
Happy coding!
Peer Review Contributions by: Saiharsha Balasubramaniam