Diagnosis of Diabetes using Support Vector Machines
In this guide, we will learn how to use machine learning to diagnose whether a patient has diabetes, based on their medical records. We will use the Support Vector Machine algorithm (from scikit-learn) to build our model. The GitHub repo for this project is here.
Prerequisites
- A PC with Jupyter Notebook installed.
- Basic Python knowledge.
- Basic knowledge of Support Vector Machines.
- The diabetes dataset from Kaggle.
Outline
- Exploratory Data Analysis with Pandas-Profiling
- Feature Extraction
- Split Dataset into Training and Test Set
- Creating the SVM Model
- Diagnosing a New Patient
- Assess Model Performance
Exploratory data analysis with pandas-profiling
The pandas-profiling library helps us do quick exploratory data analysis with minimal effort.
To install pandas-profiling, run the code below:
pip install pandas-profiling
If you are using Anaconda, then you can run the following code in Anaconda Prompt:
conda install -c conda-forge pandas-profiling
Now we can import pandas-profiling and generate a report for our dataset. Before we load our dataset, let us import the libraries we will be using.
# importing libraries
import numpy as np
import pandas as pd
import pandas_profiling
from sklearn import svm
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report
We import our dataset using the read_csv function from pandas. We pass the dataset filename as an argument.
# Getting Data
dataset = pd.read_csv("diabetes.csv")
# generate report with pandas-profiling
profile = dataset.profile_report(title='Diabetes Profiling Report')
profile
Pandas-profiling gives us the dataset statistics.
Overview of the Dataset
From the report above, we see that the dataset has nine variables and 768 rows, with no missing data and no duplicate rows.
Pandas-profiling also examines each of the nine variables, giving descriptive statistics and a histogram of each variable's distribution.
Histogram - Pregnancy
Histogram - Glucose & Blood Pressure
Histogram - Skin Thickness & Insulin
Histogram - BMI & Pedigree
Histogram - Age & Outcome
We can see the mean, minimum, and maximum values of each variable. We can also observe the interaction plot between any pair of variables.
Interaction of features
We can see the Pearson, Spearman, Kendall, and Phik correlation matrix heat map.
Correlations of features
The report also visualizes missing values, showing exactly where they occur in the dataset. None of the variables contains any missing values, so we can proceed to build our model.
Visualization of missing values
Finally, we can view the first ten rows and last ten rows of the dataset.
First rows of the dataset
Last rows of the dataset
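We can also save the full report as a standalone HTML file to view or share outside the notebook, using pandas-profiling's to_file method (the filename here is our own choice):
# Save the report as a standalone HTML file
profile.to_file("diabetes_report.html")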
Feature extraction
We separate the features and the target variable. We have eight features.
# Extract Features
X = dataset.iloc[:, :8]
X.head()
Dataset features
Our target variable is the Outcome column. A value of 1 represents patients with diabetes, while 0 represents patients without diabetes.
# Extract Class Labels
y = dataset["Outcome"]
y.head()
Class labels
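Before splitting the data, it is worth checking how many patients fall into each class; this will matter later when we evaluate the model. A quick check with pandas' value_counts:
# Count the samples in each class
print(y.value_counts())
There are noticeably more patients without diabetes (0) than with diabetes (1).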
Split the dataset into training and test sets
We split our dataset into training and test sets. We use 75% of the dataset to train the model and the remaining 25% to test the model after training.
# Split Dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.75, random_state=0)
print(X_train.shape)
print(y_train.shape)
print(X_test.shape)
print(y_test.shape)
We can see the amount of data that we will use for training and testing.
X_train.head()
Training Set
We need to normalize the features in our training set. Normalizing here means standardizing: StandardScaler adjusts each column to have a mean of 0 and a standard deviation of 1. This keeps the features on comparable scales and makes the training process faster.
# Normalize Features
scaler = StandardScaler()
scaler.fit(X_train)
X_train = scaler.transform(X_train)
Let's view the first five rows of our training set after normalization.
# View first 5 rows
X_train[:5, :]
Normalized Training Set
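As a quick sanity check, each column of the normalized training set should now have a mean close to 0 and a standard deviation close to 1:
# Verify the standardization: mean ~0 and standard deviation ~1 per column
print(X_train.mean(axis=0).round(2))
print(X_train.std(axis=0).round(2))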
Creating the SVM model
The scikit-learn SVC class provides four commonly used kernels: linear, poly, rbf, and sigmoid. We do not know in advance which of these kernels will give us the best decision boundary.
So we iterate through the kernels to see which one gives the best decision boundary for this dataset. The decision boundary is the hyperplane or curve that separates the positive class from the negative class. It can be linear or non-linear.
Image Source: Logistic Regression and Decision Boundary
The polynomial and RBF kernels are suitable when the classes are not linearly separable.
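To build some intuition, here is a minimal sketch (assuming matplotlib is installed) that trains an RBF SVM on just two of our standardized features, Glucose and BMI, and plots the resulting non-linear decision boundary:
# Plot an RBF decision boundary using only two features (Glucose and BMI)
import matplotlib.pyplot as plt

X2 = X_train[:, [1, 5]]  # columns 1 and 5 hold Glucose and BMI
clf = svm.SVC(kernel='rbf').fit(X2, y_train)

# Evaluate the classifier on a grid covering the 2D feature space
xx, yy = np.meshgrid(np.linspace(X2[:, 0].min() - 1, X2[:, 0].max() + 1, 200),
                     np.linspace(X2[:, 1].min() - 1, X2[:, 1].max() + 1, 200))
Z = clf.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)

plt.contourf(xx, yy, Z, alpha=0.3)  # shaded regions show the predicted class
plt.scatter(X2[:, 0], X2[:, 1], c=y_train, edgecolors='k', s=20)
plt.xlabel('Glucose (standardized)')
plt.ylabel('BMI (standardized)')
plt.show()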
We fit the SVM model for each kernel to our training set. We make predictions on our training set to see which kernel will give us the highest accuracy score.
We call this hyperparameter optimization: the kernel is a hyperparameter that we are tuning.
# SVM Kernels
for k in ('linear', 'poly', 'rbf', 'sigmoid'):
    model = svm.SVC(kernel=k)
    model.fit(X_train, y_train)
    y_pred = model.predict(X_train)
    print(k)
    print(accuracy_score(y_train, y_pred))
Accuracy of the SVM kernels
The RBF (radial basis function) kernel gives us the highest accuracy score, so for this dataset it offers the best decision boundary: it classifies 82.4% of the training patients correctly. Now let us create our model using the RBF kernel.
# Using the best model
model = svm.SVC(kernel='rbf')
model.fit(X_train, y_train)
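Note that we compared kernels by their accuracy on the training data, which can be optimistic. As a sanity check, we could instead score each kernel with cross-validation; a minimal sketch using scikit-learn's cross_val_score:
# Optional sanity check: 5-fold cross-validation accuracy per kernel
from sklearn.model_selection import cross_val_score

for k in ('linear', 'poly', 'rbf', 'sigmoid'):
    scores = cross_val_score(svm.SVC(kernel=k), X_train, y_train, cv=5)
    print(k, scores.mean().round(3))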
Diagnosing a new patient
We use our model to make predictions on a new patient.
# Making a Single Prediction
# 'pregnancies', 'glucose', 'bloodPressure', 'skinThickness'
# 'insulin', 'bmi', 'pedigree', 'age'
patient = np.array([[ 1., 150., 70., 45., 0., 40., 1.5, 25]])
# Normalize the data with the values used in the training set
patient = scaler.transform(patient)
model.predict(patient)
We create a NumPy array containing the new patient's record, then normalize the data before passing it to the model for prediction. Note that we use the transform method this time instead of the fit_transform method.
That applies the same mean and standard deviation that normalized the training set. The result is 1, so the model predicts that this patient has diabetes.
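The training-set statistics are stored on the scaler itself, so we can inspect exactly what transform applies. mean_ and scale_ are standard StandardScaler attributes:
# The scaler stores the training-set statistics it applies in transform()
print(scaler.mean_)   # per-feature means learned from X_train
print(scaler.scale_)  # per-feature standard deviations learned from X_train
Next, let us see what the model predicts if we change the glucose level from 150 to 50.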
patient = np.array([[ 1., 50., 70., 45., 0., 40., 1.5, 25]])
# Normalize the data
patient = scaler.transform(patient)
model.predict(patient)
We get 0, which means this patient does not have diabetes. Let us view our test set.
# Viewing Test Set
X_test
Now let's try to diagnose the third patient in the test set (row label 113 in the original dataset). Remember, the positional index of the third patient is 2, since we start counting from 0.
# Checking the third patient in the test set with index 2
X_test.iloc[2]
# Convert dataframe to a numpy array
t_patient = np.array([ X_test.iloc[2]])
# Predicting on third patient in Test Set
t_patient = scaler.transform(t_patient)
print("Model's Prediction:", model.predict(t_patient))
print("Actual Prediction:", y_test.iloc[2])
We can see that our model's prediction is 0, and the actual label is also 0. This means our model made the correct prediction for this patient: the third patient does not have diabetes.
Assess model performance
Let us see the accuracy of the entire test set.
# Accuracy on Testing Set
X_test = scaler.transform(X_test)
y_pred = model.predict(X_test)
print("Accuracy Score:", accuracy_score(y_test, y_pred))
We normalize the test set before making predictions. We get an accuracy of 77.6%, which is lower than the accuracy on the training set. This is expected because the test set contains data our model has not seen before.
What if our model had predicted that no one had diabetes? What would be our accuracy in that case? Let us find out what the accuracy would be if our model had predicted 0 (no diabetes) for all the patients.
To do this, we create an array of zeros with the same shape as our test set classes.
# Comparison to All-Zero Prediction
y_zero = np.zeros(y_test.shape)
print(accuracy_score(y_test, y_zero))
When we compare the test-set labels with an all-zero array, we get an accuracy score of 67.7%. In other words, a model that predicts that no one in the test set has diabetes would still be 67.7% accurate.
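scikit-learn has a built-in way to compute such baselines: the DummyClassifier. Here is a short sketch that reproduces our all-zero comparison (the classifier ignores the features and always predicts the majority class):
# Baseline classifier that always predicts the majority class (0)
from sklearn.dummy import DummyClassifier
dummy = DummyClassifier(strategy='most_frequent')
dummy.fit(X_train, y_train)
print(dummy.score(X_test, y_test))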
This means our dataset is imbalanced: there are more samples of the class without diabetes. So the accuracy score alone will not help us evaluate our model. Instead, we can measure the performance of our model using precision and recall.
Precision tells us what fraction of the patients our model predicted to have diabetes actually have diabetes. Recall tells us what fraction of all the diabetic patients our model correctly detected.
A model with high precision helps us avoid treating people without diabetes, but we may end up not treating some patients who do have it. A model with high recall helps us treat most patients with diabetes, but we may end up treating patients who do not have it.
What we need is a good trade-off between precision and recall. That is where the f1-score comes in: it combines precision and recall into a single number that balances the two.
Precision is given as: True Positives/(True Positives + False Positives)
Recall is given as: True Positives/(True Positives + False Negatives)
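The f1-score is given as: 2 × (Precision × Recall)/(Precision + Recall)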
- True positives are patients who have diabetes and whom our model predicted to have diabetes.
- False positives are patients without diabetes whom our model predicted to have diabetes.
- True negatives are patients without diabetes whom our model predicted as not having diabetes.
- False negatives are patients who have diabetes but whom our model predicted as not having diabetes.
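We can read all four of these counts at once with scikit-learn's confusion_matrix:
# Rows are actual classes (0, 1); columns are predicted classes (0, 1)
from sklearn.metrics import confusion_matrix
print(confusion_matrix(y_test, y_pred))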
Let us calculate the precision, recall, and f1-score.
# Compute precision, recall and f1 score
from sklearn.metrics import recall_score, precision_score, f1_score
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
print("Precision is", precision)
print("Recall is", recall)
print("F1 score is", f1)
Precision, Recall, & F1-Score
We can also generate a classification report.
# Generate classification report
print(classification_report(y_test, y_pred))
Classification Report
Our precision, recall, and f1-score are approximately 0.71, 0.52, and 0.60 respectively. The model's performance is modest: with a recall of 0.52, it misses nearly half of the patients who actually have diabetes. For a healthcare problem, that means we could end up misdiagnosing diabetic patients, which is why we pay more attention to the recall score. We can improve our results by collecting more data.
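Besides collecting more data, one simple option worth experimenting with is SVC's class_weight parameter, which makes mistakes on the under-represented diabetic class cost more during training. This is a sketch, not a guaranteed improvement; it often trades some precision for higher recall:
# Re-weight the classes so errors on the minority class cost more
weighted_model = svm.SVC(kernel='rbf', class_weight='balanced')
weighted_model.fit(X_train, y_train)
print(classification_report(y_test, weighted_model.predict(X_test)))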
Conclusion
In this guide, we learned how to use the four SVM kernels from scikit-learn to build a machine learning model. Different kernels work better on different datasets. You can use pandas-profiling for quick exploratory data analysis.
The accuracy score is not a good metric for evaluating a model on a dataset with skewed classes, that is, imbalanced classes where there are more samples of one class than the other.
We can use precision, recall, and the f1-score to evaluate such a model, and we can improve its performance by collecting more data.
Happy coding!
Peer Review Contributions by: Saiharsha Balasubramaniam