Building a Machine Learning Classification Model with PyCaret
PyCaret is a machine learning (ML) library that is written in Python. It allows developers to train and deploy ML models. When compared with other open-source ML libraries such as scikit-learn, it is a good alternative low-code library that can be used to perform complex machine learning tasks with only a few lines of code. <!--more-->
We will be using PyCaret with the Default of Credit Card Clients dataset from Kaggle to predict whether a customer will default on payment or not. This prediction will be based on several features that we'll explore in this tutorial.
Prerequisites
A reader needs to:
- Use Jupyter Notebook or Google Colab. In this tutorial, I use Google Colab.
- Be familiar with the Python programming language.
- Install Python 3.x
- Install the latest version of PyCaret. Currently, PyCaret 2.3 is the latest version. Its release notes are available here.
Outline
- What is PyCaret?
- Why use PyCaret?
- Functionalities of PyCaret
- Getting started
- Loading custom dataset from Kaggle using Pandas
- Training and evaluating our ML classification model
- Testing our model
- Saving our model
- Wrapping up
What is PyCaret?
PyCaret is a machine learning library that is written in Python. It allows developers to train and deploy ML models in an easy and fast way.
PyCaret has thorough documentation on its website that explains its features and how to get started. In this tutorial, we use it only for classification, but you can also use the library for clustering, regression, anomaly detection, and natural language processing tasks, as the sketch below shows.
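For reference, each of these tasks lives in its own PyCaret module; a minimal sketch of the available imports (module names per PyCaret 2.3), though in practice you would import just the one for your task:

from pycaret.classification import *  # classification (used in this tutorial)
from pycaret.regression import *      # regression
from pycaret.clustering import *      # clustering
from pycaret.anomaly import *         # anomaly detection
from pycaret.nlp import *             # natural language processing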
Why use PyCaret?
- It is an open-source library. It is available to anyone interested in using it.
- It is built using Python. Most developers are familiar with this programming language.
- It is fast. Within only a few minutes, developers can deploy complex models.
- It is a low-code ML library. Since you spend less time coding, it makes developers more productive.
- It is a Python wrapper built on top of existing libraries such as scikit-learn. Thus, it doesn't require a separate learning curve.
- It integrates seamlessly with other Python environments such as PyCharm. Developers can integrate PyCaret into their existing ML workflows with ease.
- It is ideal for both students and experienced developers.
Functionalities of PyCaret
- Data preparation.
- Model training.
- Hyperparameter tuning.
- Analysis and interpretability.
- Model selection.
- Experiment logging.
Let's now get started with PyCaret. The first step involves installing and importing dependencies.
Getting started
There are three main dependencies that we are going to import: PyCaret, Pandas, and Shap.
- PyCaret
PyCaret will be our main dependency. It allows us to leverage the ML pipeline to build our models.
- Pandas
We are using pandas to load our CSV data into a DataFrame. We use this library to read, clean, and manipulate our dataset before building a custom machine learning model.
- Shap
Shap helps us interpret machine learning model results. PyCaret uses it under the hood (for example, in its interpret_model() function), so we install it even though we won't import it directly.
Let's now install these dependencies.
Installing dependencies
Installing dependencies is relatively straightforward using the pip install command. In Google Colab, pip is already available, so just type in the following code:

!pip install pycaret pandas shap

If you're installing these dependencies from a local terminal rather than a notebook, there's no need to put the exclamation mark (!) before the pip command.
Importing dependencies
Let's now import these dependencies into our Google Colab:
import pandas as pd                   # data loading and manipulation
from pycaret.classification import *  # PyCaret's classification module
Loading custom dataset from Kaggle using Pandas
Let's go ahead and download the Default of Credit Card Clients Dataset dataset from Kaggle. Grab the downloaded dataset from the Downloads folder on your computer and copy it into the Google Colab folder that you're working on.
We can then load this dataset in our Colab using the pandas library:
df = pd.read_csv('UCI_Credit_Card.csv')
To view it, let's type in the following:
df.head()
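Before modeling, it's worth a quick sanity check on what we loaded; a minimal sketch (the target column name below is as it appears in the Kaggle CSV; verify it against your download):

df.shape   # (rows, columns) of the dataset
df.info()  # column names, data types, and non-null counts
df['default.payment.next.month'].value_counts()  # class balance of the target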
Alternatively, you can use PyCaret's built-in data repository. Using the get_data() function, you can load the dataset directly into your Colab. Note that this option requires an internet connection:
from pycaret.datasets import get_data
credit_dataset = get_data('credit')
![Loading the Default of Credit Card Clients Dataset](https://sparkling-desk-070a6c243e.media.strapiapp.com/1232_loaded_dataset_27970aa25c.png)
Training and evaluating our ML classification model
To train and evaluate our ML model, we need to use the setup() function. This function creates our ML transformation pipeline and initializes the PyCaret environment. PyCaret requires setup() to be called before any other function.
The setup() function takes two required parameters: data and target. You can also pass the optional categorical_features and numeric_features parameters to override the data types that PyCaret infers, e.g., to force a numerically coded column to be treated as categorical. We won't use those parameters today (a hypothetical example is sketched below); I'll introduce them properly in a follow-up article.
exp_name = setup(data = credit_dataset, target='default', session_id=5041)
As shown above, running the code prints information about the pre-processing pipeline that is constructed when setup() is executed. For example, PyCaret has inferred 14 numeric features and 9 categorical features in our data.

In our experiment, we've also passed session_id=5041. This fixes the random seed so that our results are reproducible. It is not compulsory; if you omit it, a random seed is generated for you.
With our experiment set up, all that's left to do now is to go on and train the model.
best_model = compare_models()
The code above trains our model. The compare_models() function trains every model in PyCaret's model library and scores each one using the commonly used classification metrics: Accuracy, AUC, Recall, Precision, F1, and Kappa. The output is a table of models ranked by performance.
In our case, the Ridge Classifier is the best-performing model. The table lists many learning algorithms, but we are only interested in the one that performs best, so we drop the rest.
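If you'd rather keep several candidates instead of just one, compare_models() accepts an n_select parameter in PyCaret 2.x; a quick sketch:

top3 = compare_models(n_select = 3)      # returns the three best models as a list
best_auc = compare_models(sort = 'AUC')  # rank models by AUC instead of Accuracy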
Testing our model
predict_model(best_model)
The accuracy recorded after testing our model is 0.8159. This is close to the accuracy of 0.8228 recorded earlier during training; the small drop could be due to mild overfitting or other factors that may need investigation.

Employing techniques such as early stopping and dropout (for neural network models) can help prevent overfitting and significantly reduce the difference between training and validation accuracy.
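Within PyCaret itself, a natural next step is hyperparameter tuning with the tune_model() function, which performs a randomized search over a model's hyperparameters using cross-validation; a minimal sketch:

tuned_model = tune_model(best_model)  # tunes the best model with 10-fold cross-validation by default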
Prediction on our dataset
To perform prediction on our credit_dataset
dataset, type in the following code:
prediction = predict_model(best_model, data = credit_dataset)
prediction.tail()
Please note that a Label column has now been added at the end of our dataset. This column holds our predictions: a value of 1 predicts true (the customer will default), while a 0 predicts false (the customer won't default). The head() and tail() functions display the predictions for the first five and last five rows respectively; predict_model() itself scores every row. You can play around with it. To view predictions for all rows, remove the prediction.tail() call.
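Because prediction is a regular pandas DataFrame, you can also summarize the Label column directly, for example:

prediction['Label'].value_counts()  # count of customers predicted to default (1) vs. not (0)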
Saving our model
The last thing that we have to do is go ahead and save this model. We save our model using the following code:
save_model(best_model, model_name='ridge-model')
Loading our saved model
To load our saved model, type in the following code:
model = load_model('ridge-model')
With a few lines of code, our transformation pipeline and model have been successfully loaded!
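The loaded pipeline behaves exactly like the original model; for instance, to score fresh data (new_df below is a hypothetical DataFrame with the same columns as the training data):

new_predictions = predict_model(model, data = new_df)  # applies the saved pipeline and model to unseen rows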
Please find the full code here.
Wrapping up
That, in a nutshell, is how to get started with PyCaret. PyCaret is a strong alternative to scikit-learn, and I have little doubt it will become one of the most widely used libraries, alongside the likes of TensorFlow and pandas. Feel free to try building your own ML classification model using a custom dataset.
That wraps it up! Happy coding!
Peer Review Contributions by: Collins Ayuya