Machine Learning using Pandas Profiling and Scikit-learn Pipeline
Pandas Profiling is a Python library that performs automated Exploratory Data Analysis. It automatically generates a dataset profile report that gives valuable insights. For example, the profile report can show which variables to use and which ones to drop. <!--more--> A machine learning pipeline is used to automate the machine learning development stages. These stages are dataset ingestion, dataset preprocessing, feature engineering, model training, model evaluation, making predictions, and model deployment.
A machine learning pipeline is made up of multiple steps that automate the machine learning development stages. The steps run in sequential order, so that the output of one step becomes the input of the next. Therefore, the pipeline steps need to be well organized for faster model implementation.
Many libraries support the implementation of a machine learning pipeline. We will focus on the Scikit-Learn library. The library provides a Pipeline class that automates machine learning. We will build a customer churn model using Pandas Profiling and Scikit-learn Pipeline.
Table of contents
- Prerequisites
- How the Scikit-learn Pipeline works
- Transformers
- Estimators
- Benefits of using Scikit-learn Pipeline
- Dataset used
- Automated Exploratory Data Analysis with Pandas Profiling
- Overview
- Variables
- Interactions
- Correlations
- Missing values
- Sample
- Dataset splitting
- Importing transformer methods and classes
- Drop columns transformer
- Numeric Transformers
- Categorical transformer
- Combining the initialized transformers
- Applying the transformers
- Adding the final estimator
- Fitting the pipeline
- Getting accuracy score on the training set
- Getting accuracy score on the testing set
- Conclusion
- References
Prerequisites
To follow along with this article, a reader should:
- Know how to build a machine learning model.
- Know how to implement Scikit-learn algorithms.
- Understand machine learning workflows.
- Understand steps in dataset preprocessing.
How the Scikit-learn Pipeline works
Scikit-learn Pipeline is a powerful tool that automates the machine learning development stages. It has a sequence of transformation methods followed by a model estimator function, assembled and executed as a single process to produce a final model.
The Scikit-learn Pipeline steps are in two categories:
- Transformers.
- Estimators.
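Before looking at each category in detail, here is a minimal, generic sketch of the idea (the synthetic dataset and step names below are made up purely for illustration): a transformer step is followed by a final estimator, and both are executed as a single object.
from sklearn.datasets import make_classification
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# Synthetic data standing in for a real dataset
X_demo, y_demo = make_classification(n_samples=100, n_features=4, random_state=0)

# A transformer (StandardScaler) followed by a final estimator (LogisticRegression)
pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('model', LogisticRegression())
])

pipe.fit(X_demo, y_demo)           # runs the transformer, then trains the estimator
print(pipe.score(X_demo, y_demo))  # scores the whole pipeline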
Transformers
This step contains all the Scikit-Learn methods and classes that perform data transformation. Data transformation is an important stage in machine learning.
It converts the raw dataset into a format that the model can understand and easily use. In addition, data transformation performs feature engineering and dataset preprocessing.
Feature engineering extracts relevant and unique attributes, called features, from the dataset. The model then uses these features as input during training. Dataset preprocessing involves cleaning, formatting, and removing noise from the dataset.
Some of the most common activities involved in dataset preprocessing are as follows:
- Removing outliers: Outliers are data points that deviate from the other observations in the dataset. Removing them ensures we have data points that conform to the expected behaviour of the dataset.
- Imputing missing values: Dataset imputation replaces missing values in a dataset with some generated values. It ensures that we have a complete dataset before feeding it to the model.
- Dataset standardization: Dataset standardization transforms a dataset to a common scale. It ensures that our dataset values have a mean of 0 and a unit variance of 1. For a better understanding of dataset standardization, you could read this article.
- Handling categorical variables: In handling categorical values, we convert categorical data into integer values. One-Hot encoding is one of the methods that performs this process (a short sketch of these operations follows this list).
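Below is a minimal sketch of how these preprocessing operations look in Scikit-learn. The tiny DataFrame and its column names are made up purely for illustration.
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder

# A tiny illustrative dataset with a missing value and a categorical column
toy = pd.DataFrame({
    'tenure': [1.0, 24.0, None, 60.0],
    'contract': ['Monthly', 'Yearly', 'Monthly', 'Two year']
})

# Impute the missing numeric value with the column mean
tenure_imputed = SimpleImputer(strategy='mean').fit_transform(toy[['tenure']])

# Standardize the numeric column to a mean of 0 and a unit variance of 1
tenure_scaled = StandardScaler().fit_transform(tenure_imputed)

# One-Hot encode the categorical column into 0/1 columns
contract_encoded = OneHotEncoder().fit_transform(toy[['contract']]).toarray()

print(tenure_scaled.ravel())
print(contract_encoded)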
Estimators
Estimators take the processed dataset as input and fit a model to it. The trained model is then used to make predictions.
Estimators are the Scikit-learn algorithms that perform classification, regression, and clustering. Common estimators are Logistic Regression, Decision Tree Classifier, the K-Nearest Neighbors algorithm, Naive Bayes, and Random Forest Classifier.
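As a short illustration (using a synthetic dataset rather than the churn data), an estimator exposes fit and predict methods:
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic data standing in for a processed dataset
X_demo, y_demo = make_classification(n_samples=100, n_features=4, random_state=0)

model = LogisticRegression()        # the estimator
model.fit(X_demo, y_demo)           # train the model on the processed data
print(model.predict(X_demo[:5]))    # make predictions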
Benefits of using Scikit-learn Pipeline
- Faster model implementation through automation.
- It applies the same preprocessing steps consistently during training and testing, which helps produce models with good accuracy scores.
- Easier model debugging, since errors during model training can be traced to a specific step.
- Produces a more robust and scalable model.
Dataset used
We will use a telecommunication dataset that contains information about a telecom company's customers. This dataset will train a customer churn model. To download the dataset, use this link.
After downloading the dataset, we load the dataset using Pandas. To import Pandas, use this code:
import pandas as pd
We can use Pandas to load the dataset.
df=pd.read_csv("customer-churn-dataset")
We will view the loaded dataset using this command:
df
The output:
Let us now start automated exploratory data analysis using Pandas Profiling.
Automated Exploratory Data Analysis with Pandas Profiling
To install the Pandas Profiling library, use this command:
!pip install -U pandas-profiling
We will use Pandas Profiling to generate a profile report. The report will give the dataset overview and dataset variables.
We generate the profile report using this code:
from pandas_profiling import ProfileReport

profile = ProfileReport(df, title='Churn Data Report', explorative=True)
profile.to_notebook_iframe()
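Optionally, if you are not working inside a notebook, the same report can be saved as an HTML file (the file name below is just an example):
profile.to_file('churn_report.html')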
The title of the generated report will be Churn Data Report. The profile report will have the following sections:
Overview
The overview section produces the following output:
From the generated report, the dataset has 21 variables and 7043 observations/data points. The dataset has no missing values and no duplicate rows. The image also shows the variable types, which are categorical (13), boolean (6), and numerical (2).
Variables
This section shows all the dataset variables. In addition, it provides useful characteristics and information about the variables.
The outputs below show some of the important variables:
customerID and gender
SeniorCitizen and partner
Dependents and tenure
InternetService and OnlineSecurity
MonthlyCharges and TotalCharges
Interactions
The interaction section has the following output:
The interaction section shows the relationship between two variables using a scatter plot. For example, the image above shows the relationship between tenure and monthly charges.
Correlations
The correlation section shows the relationship between the dataset variables using Seaborn’s heatmap. Pandas Profiling allows toggling between the four main correlation plots.
These plots are the Phik (φk), Kendall’s τ, Spearman’s ρ, and Pearson’s r.
The correlations section produces the following output:
The image above shows the Phik (φk) correlation plot. We can easily toggle between the four main correlation plots. By clicking the Toggle correlations descriptions button, we can view a detailed description of each correlation plot.
Missing values
This section shows if there are missing values in the dataset.
The image shows the number of data points in each variable. All the variables have the same number of data points (7043). It shows there are no missing values in the dataset.
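As an optional cross-check outside the profile report, the same information can be obtained with plain Pandas:
# Count missing values in each column; every count should be 0
df.isnull().sum()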
Sample
This section displays the first 10 rows and the last 10 rows of our dataset.
First 10 rows
Last 10 rows
This marks the end of automated Exploratory Data Analysis using Pandas Profiling. The library provides a descriptive analysis of our dataset and helps us better understand the churn dataset. Let us now specify the X and y variables of our dataset.
X and y variables
The X variables represent all the independent variables in the dataset, which are the model inputs. The y variable is the dependent variable, which is the model output.
To add the X and y variables, use this code:
X = df.drop(columns=['Churn'])
y = df['Churn']
From the code above, the Churn variable is the y variable, and the remaining variables are the X variables.
Dataset splitting
Let us import the method used for dataset splitting.
from sklearn.model_selection import train_test_split
We will split the dataset into two sets using the following code:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=124)
We use test_size=0.30 in the code above, which is the splitting ratio: 70% of the dataset will be for model training and 30% for model testing.
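Optionally, you can confirm the 70/30 split by printing the shapes of the resulting sets:
# Roughly 70% of the rows go to training and 30% to testing
print(X_train.shape, X_test.shape)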
Variable types
Using Pandas Profiling, we were able to see that the dataset has three variable types. The variable types are: categorical (13), boolean (6) and numerical (2).
We need to specify the columns that belong to these variable types. We use the code below:
numeric_features = ['tenure', 'TotalCharges']
categorical_features = ['SeniorCitizen', 'Partner', 'Dependents', 'PhoneService', 'InternetService','OnlineSecurity','OnlineBackup','DeviceProtection','TechSupport','StreamingTV','StreamingMovies','Contract']
The code selects the columns that have categorical and numerical values.
Selecting the unused columns
We select all the unused columns using the following code:
drop_feat= ['customerID','gender','PhoneService','MultipleLines', 'PaperlessBilling','PaymentMethod']
We will drop the selected columns from our dataset. To drop these columns, we will use one of the Scikit-learn Pipeline transformer methods. Let us first import all the transformer methods and classes.
Importing transformer methods and classes
As mentioned earlier, the Scikit-learn Pipeline steps fall into two categories: transformers and estimators. The pipeline will have a sequence of transformers followed by a final estimator.
We create transformers using various Scikit-learn methods and classes which perform data transformation. Let us import all the transformer methods and classes we will use in this tutorial.
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
From the code above, we have imported the following transformer methods:
- ColumnTransformer: It is a Scikit-learn class that applies the transformers to our columns. It also combines various transformers into a single transformer.
- SimpleImputer: It is a Scikit-learn class that imputes missing values. It will replace missing values in a dataset with some generated values.
- StandardScaler: It performs dataset standardization. It will ensure that our dataset values have a unit variance of 1 and a mean of 0.
For a better understanding of the dataset standardization process, read this article.
- OneHotEncoder: It performs categorical encoding. The method converts categorical data into integer values using a one-hot scheme.
For a better understanding of OneHotEncoder, read this article.
Let us now create our first transformer using these methods.
Drop columns transformer
The first transformer will drop the unused columns.
drop_transformer = ColumnTransformer(transformers=[('drop_columns', 'drop', drop_feat)], remainder='passthrough')
The unused columns are in the drop_feat variable. The remainder='passthrough' argument will enable the model to use the remaining columns in the dataset.
We will then add the drop_transformer to the Pipeline class. However, first, let us import the Pipeline class from Scikit-learn.
Import the 'Pipeline' class
We import the Pipeline class as follows:
from sklearn.pipeline import Pipeline
The Pipeline class assembles all the initialized transformers and the final estimator. It then executes them as a single process to produce a final model.
To add the drop_transformer, use this code:
pipeline = Pipeline([('drop_column', drop_transformer)])
Next, we fit the pipeline.
pipeline.fit(X_train)
It fits the pipeline to the training set. It produces the following output:
The output shows the drop_transformer added to the Pipeline class. The next step is to use the transform method to drop the unused columns.
transformed_train=pipeline.transform(X_train)
To see the transformed dataset, use this code:
transformed_train
The output is shown below:
Numeric Transformers
The numeric transformers will perform data imputation and standardization.
To initialize these transformers, use this code:
numeric_transformer = Pipeline(steps=[
('meanimputer', SimpleImputer(strategy='mean')),
('stdscaler', StandardScaler())
])
From the code above, SimpleImputer will perform data imputation. The strategy='mean' argument replaces missing values with the mean of each column. The StandardScaler() method performs data standardization.
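As an optional illustration (using a made-up column rather than the churn data), you can apply this transformer on its own to see the imputation followed by scaling:
import numpy as np

# A toy numeric column with one missing value
toy_numeric = pd.DataFrame({'tenure': [1.0, 24.0, np.nan, 60.0]})
print(numeric_transformer.fit_transform(toy_numeric))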
Categorical transformer
It will handle the categorical values in the dataset. We will use the OneHotEncoder method to convert the categorical data into integer values.
categorical_transformer = Pipeline(steps=[
('onehotenc', OneHotEncoder(handle_unknown='ignore'))
])
We then combine these initialized transformers.
Combining the initialized transformers
We use the following code:
col_transformer = ColumnTransformer(transformers=[('drop_columns', 'drop', drop_feat),
('numeric_processing',numeric_transformer, numeric_features),
('categorical_processing', categorical_transformer, categorical_features)
], remainder='drop')
We have used ColumnTransformer to combine all the initialized transformers. numeric_processing transforms the numeric_features, while categorical_processing transforms the categorical_features. We save the final transformer in the col_transformer variable.
Let us add it to the Pipeline class.
Adding the 'col_transformer' transformer
To add the col_transformer to the Pipeline class, use this code:
pipeline = Pipeline([('transform_column', col_transformer)])
Next, we fit the pipeline to the train set.
pipeline.fit(X_train)
It produces the following output:
The image above shows all the added transformers. The next step is to use the transform method to apply the transformers to the columns.
Applying the transformers
transformed_train=pipeline.transform(X_train)
To see the transformed dataset, use this code:
pd.DataFrame(transformed_train)
Applying the transformers to the test dataset
Use this code:
transformed_test = pipeline.transform(X_test)
To see the transformed test dataset, use this code:
pd.DataFrame(transformed_test)
Adding the final estimator
The last step in the Scikit-learn Pipeline is to add an estimator. An estimator is an algorithm that trains the machine learning model. We will use LogisticRegression as the estimator.
from sklearn.linear_model import LogisticRegression
To add the estimator to the Pipeline class, use this code:
pipeline = Pipeline([
('transform_column', col_transformer),
('logistics', LogisticRegression())
])
From the code above, the Pipeline class has all the transformers (col_transformer) and the final estimator (LogisticRegression).
Let us fit the pipeline to the dataset.
Fitting the pipeline
To fit the pipeline, use this code:
pipeline.fit(X_train, y_train)
The pipeline will identify patterns in the training set.
Getting accuracy score on the training set
To get the accuracy score, use the following code:
pipeline.score(X_train, y_train)
The output is shown below:
0.7953346855983773
It is a good accuracy score and shows that the model correctly classifies about 79.53% of the training examples.
Let us evaluate the model using the testing set.
Getting accuracy score on the testing set
We get the accuracy score using the following code:
pipeline.score(X_test, y_test)
The output is shown below:
0.8078561287269286
When we compare the two accuracy scores, the accuracy score on the testing set (about 80.79%) is slightly higher. It shows that the model still performs well on the testing set, which is new to the model.
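Beyond scoring, the fitted pipeline can also make predictions directly on raw rows, since it applies all the transformers before the estimator. As a short, optional example:
# Predict churn for the first five customers in the test set
print(pipeline.predict(X_test[:5]))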
Conclusion
In this tutorial, we learned how to build a machine learning model using Pandas Profiling and Scikit-learn Pipeline. The tutorial explained how the Scikit-learn Pipeline works and the key pipeline steps. Pandas Profiling generated a profile report that shows the dataset overview.
After this process, we implemented our transformers using Scikit-learn transformer methods and classes. We then combined these transformers and added a final estimator. Finally, we trained the customer churn model using the telecommunication dataset.
The model had a good accuracy score using the training and the testing dataset. To access the complete Google Colab notebook for this tutorial, click here.
References
- Scikit-learn Pipeline documentation
- Pipelines and composite estimators
- Scikit-learn ColumnTransformer
- Categorical encoding
- Scikit-learn OneHotEncoder
- Dataset standardization
- Introduction to Pandas Profiling
Peer Review Contributions by: Jerim Kaura