Multi-Output Classification with Machine Learning
Multi-output classification is a type of machine learning that predicts multiple outputs simultaneously. For every input, the model returns two or more predictions, whereas other types of classification models usually predict only a single output.
<!--more-->
An example of a multi-output classification model is one that predicts both the type and the color of a fruit simultaneously. The type of fruit can be orange, mango, or pineapple. The color can be red, green, yellow, or orange. A multi-output classification model solves this problem and gives two prediction results.
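To make the idea concrete, below is a minimal sketch of the fruit example using Scikit-learn's `MultiOutputClassifier`, which we will use later in this tutorial. The features and labels here are made up purely for illustration.

```python
from sklearn.multioutput import MultiOutputClassifier
from sklearn.tree import DecisionTreeClassifier

# Hypothetical features for each fruit: [weight in grams, diameter in cm]
X = [[300, 10], [150, 7], [1200, 18], [320, 11]]

# Two labels per sample: [fruit type, fruit color]
y = [["orange", "orange"], ["mango", "yellow"],
     ["pineapple", "green"], ["orange", "orange"]]

# Fit one classifier per output, then predict both labels for a new fruit
clf = MultiOutputClassifier(DecisionTreeClassifier()).fit(X, y)
print(clf.predict([[310, 10]]))  # two outputs, e.g. [['orange' 'orange']]
```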
In this tutorial, we will build a multi-output text classification model using the Netflix dataset. The model will classify the input text as either `TV Show` or `Movie`; this will be the first output. The model will also classify the rating as `TV-MA`, `TV-14`, `TV-PG`, `R`, `PG-13`, or `TV-Y`; the rating will be the second output. We will use Scikit-learn's `MultiOutputClassifier` to build this model.
Table of contents
- Prerequisites
- Netflix Dataset
- Loading the dataset
- Output columns distribution
- Text cleaning
- Importing important packages
- Adding features and labels
- Dataset splitting
- Importing pipeline
- Making predictions
- Prediction probabilities
- Conclusion
- References
Prerequisites
To understand the concepts in this tutorial, a reader should:
- Understand Python programming
- Be able to build machine learning models
- Understand natural language processing
- Know how to train models using Scikit-learn
- Use a Google Colab notebook to run the Python code in this tutorial
Netflix dataset
We will use the Netflix dataset to build our model. The image below shows how our dataset is structured.
From the image above, our dataset has four columns: `title`, `description`, `type`, and `rating`. The `title` column will be the input column, while `type` and `rating` will be the output columns. We now need to load this dataset on our machine.
To download this Netflix dataset, click here.
Loading the dataset
There are various exploratory data analysis (EDA) packages that we can use to load and explore our dataset. Let's import them using the following code:
```python
import pandas as pd
import numpy as np
```
We will use Pandas to load the dataset and NumPy to perform numerical operations on it; NumPy also works well with arrays.
Let's now load the Netflix dataset that you have downloaded from the link above.
```python
df = pd.read_csv("netflix_titles_dataset.csv")
```
To check if our dataset is loaded successfully, run the code:
```python
df.head()
```
This command outputs the first few rows of our dataset, showing all of its columns. It should have the same structure as the dataset you downloaded. The output is shown below:
Now that we have loaded our dataset successfully, let's check the distribution of our target/output columns.
Output columns distribution
From our dataset, we have two output columns: `type` and `rating`. A column's distribution is the value count of each unique value in that column across the entire dataset. We will start with the `type` column.
The `type` column

```python
df['type'].value_counts()
```
The output is shown below:
```
Movie      4788
TV Show    2143
Name: type, dtype: int64
```
In the output above, we have 4788 `Movie` data samples and 2143 `TV Show` data samples.
The `rating` column

To get the value count of the `rating` column, use the following code:
```python
df['rating'].value_counts()
```
The output is shown below:
```
TV-MA    2863
TV-14    1931
TV-PG     806
R         665
PG-13     386
TV-Y      280
Name: rating, dtype: int64
```
The output above shows the distribution of all the ratings in our dataset. We have six ratings: `TV-MA`, `TV-14`, `TV-PG`, `R`, `PG-13`, and `TV-Y`.
Before building our model, we also need to clean our dataset. Dataset cleaning involves formatting the text so that it is ready for training.
Text cleaning
For text cleaning, we will convert all our text data into lower case and remove stop words. We will use the NeatText Python package to perform this process. We will install NeatText using the following code:

```bash
!pip install neattext
```
Let's import the Neattext functions that we will use for text cleaning.
```python
import neattext.functions as nfx
```
To convert the text data into lower case, run this command:

```python
df['title'] = df['title'].str.lower()
```
Let's remove stopwords from our text data. Stopwords are the most common words in a language, and they carry little weight during model training. Removing them allows the model to focus on the words that have a greater impact during training.
To remove the stopwords, run this code:
```python
df['title'] = df['title'].apply(nfx.remove_stopwords)
```
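As a quick sanity check, here is a small sketch of what these two cleaning steps do to a single hypothetical title, assuming NeatText's default English stopword list:

```python
import neattext.functions as nfx

title = "The Midnight Sky"
lowered = title.lower()

print(lowered)                        # the midnight sky
print(nfx.remove_stopwords(lowered))  # midnight sky
```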
Now that we have removed stop words and correctly formatted our dataset, let's import all the packages we will use to build the model.
Importing important packages
To import all the important packages, run this code:
```python
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.multioutput import MultiOutputClassifier
```
Let's explain these packages that we have imported.
`LogisticRegression`: This is the algorithm used to train the model.
`CountVectorizer`: Since we are dealing with text, we need to convert the input text into vectors of numbers, because machine learning models do not understand raw text. The resulting vectors of numbers are a representation of the original text, and `CountVectorizer` is the most common Python utility used to perform this process. For further reading on `CountVectorizer` and how it converts raw text into vectors of numbers, click here.
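As a quick illustration, here is a minimal sketch of what `CountVectorizer` produces for two short, hypothetical titles (the `get_feature_names_out` method assumes a recent Scikit-learn version):

```python
from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer()
vectors = cv.fit_transform(["midnight sky", "dark sky rises"])

# The learned vocabulary, in alphabetical order
print(cv.get_feature_names_out())  # ['dark' 'midnight' 'rises' 'sky']

# One row of word counts per title
print(vectors.toarray())  # [[0 1 0 1]
                          #  [1 0 1 1]]
```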
`train_test_split`: This function is used for dataset splitting. In machine learning, it's essential to split a dataset into two sets: one for training and another for testing.
`accuracy_score`: This is used to calculate the accuracy score of the model after training.
`MultiOutputClassifier`: Since we are dealing with a multi-output classification problem, we need a more specific strategy. `MultiOutputClassifier` is the most common Scikit-learn approach for this: it fits one copy of the base classifier for each output column.
We now need to specify features and labels for our model.
Adding features and labels
Features and labels are essential in any machine learning model. Features represent the columns used by the model as inputs during training. Labels represent the output or target columns that the model learns to predict. We add them using the following code:
```python
Xfeatures = df['title']
ylabels = df[['type','rating']]
```
From this code, our feature is the `title` column, and we will use it as input for our model. The labels are `type` and `rating`, and they are the outputs of our model. We have two labels because we are dealing with a multi-output classification problem.
The next step is to split our dataset using the `train_test_split` method.
Dataset splitting
To split the dataset into two, use this code:
```python
x_train, x_test, y_train, y_test = train_test_split(Xfeatures, ylabels, test_size=0.3, random_state=7)
```
In the code above, we use `test_size=0.3`. It will split our dataset so that 70% of the dataset is used for training and 30% for testing. We have split our dataset, and we are now ready to build the model.
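As an optional sanity check, we can confirm that the split is roughly 70/30:

```python
# Number of training and testing samples after the split
print(x_train.shape, x_test.shape)
```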
To build this model, we will use a machine learning pipeline, which speeds up model building by automating all the processes involved.
The machine learning pipeline will automate the `CountVectorizer` process. It will also automate model training using the `LogisticRegression` and `MultiOutputClassifier` algorithms.
We will import the `Pipeline` package to implement this pipeline process.
Importing pipeline
To import `Pipeline`, use the following code:

```python
from sklearn.pipeline import Pipeline
```
To build the model using this `Pipeline` package, we need to initialize all the processes involved in building our model. In our case, we have two processes.
The first process is `CountVectorizer`, which converts raw text into vectors of numbers. The second process uses the `LogisticRegression` and `MultiOutputClassifier` algorithms to train the model.
Let's initialize these two processes.
Initializing the processes
These processes run in sequential steps: the output of one process is used as the input of the next, as shown in the code below.
```python
pipe_lr = Pipeline(steps=[('cv', CountVectorizer()),
                          ('lr_multi', MultiOutputClassifier(LogisticRegression()))])
```
Now that we have initialized the processes, let's fit the pipeline to our training dataset. This will enable the model to learn from the dataset. To fit the pipeline, use the following code:
```python
pipe_lr.fit(x_train, y_train)
```
The two processes will run automatically during this stage and produce a trained model, as shown below:
We can calculate the accuracy score of this model using the following code:
```python
pipe_lr.score(x_test, y_test)
```
The accuracy score is shown below:
```
0.8969221004536385
```
The accuracy score for our model is `0.896922`, which represents about 89.69%. It is a good accuracy score, and we can use this trained model to make predictions.
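Note that for a multi-output model, this score is a strict exact-match accuracy: a sample only counts as correct when both outputs are predicted correctly. As a sketch, we can also measure each output separately using the `accuracy_score` function we imported earlier:

```python
# Predict both outputs for the test set, then score each output on its own
y_pred = pipe_lr.predict(x_test)

print(accuracy_score(y_test['type'], y_pred[:, 0]))    # accuracy on type
print(accuracy_score(y_test['rating'], y_pred[:, 1]))  # accuracy on rating
```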
Making predictions
To make a prediction, we need to extract a sample input text. To extract a sample text, run this code:
```python
print(x_test.iloc[0])
```
The output of this sample text is `the midnight sky`. Let's save this text in a variable:

```python
pred1 = x_test.iloc[0]
```
The model will use this input text to make a prediction. The model should classify the input text as either a `Movie` or a `TV Show` and provide its rating.
To make this prediction, run this code:
```python
pipe_lr.predict([pred1])
```
The prediction output is shown below:
```
array([['Movie', 'TV-MA']], dtype=object)
```
From the output above, the model has produced two prediction outputs. It has classified the input text as a `Movie` with a rating of `TV-MA`. Therefore, we have successfully built our multi-output text classification model.
We can also calculate the prediction probabilities of these outputs. This helps us understand how confident the model was in each prediction.
Prediction probabilities
To calculate the probabilities, use the following code:
```python
print(pipe_lr.classes_)
pipe_lr.predict_proba([pred1])
```
The output is shown below:
```
[array(['Movie', 'TV Show'], dtype=object), array(['PG-13', 'R', 'TV-14', 'TV-MA', 'TV-PG', 'TV-Y'], dtype=object)]
[array([[0.74445483, 0.25554517]]),
 array([[0.12310188, 0.07038494, 0.21476461, 0.46916205, 0.10270243,
         0.01988409]])]
```
From the output above, we can see that `Movie` had a higher probability (`0.74445483`) than `TV Show` (`0.25554517`). That's why the model classified the text as a `Movie`.
In the next prediction, `TV-MA` has a higher probability (`0.46916205`) than the other ratings. That's why the model classified the rating as `TV-MA`. Using these prediction probabilities, we can see that our model made the right predictions.
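To make these probabilities easier to read, we can pair each class name with its probability. The snippet below is a small, hypothetical helper rather than part of the original tutorial code:

```python
probs = pipe_lr.predict_proba([pred1])

# Map each class label to its predicted probability for our sample
for classes, p in zip(pipe_lr.classes_, probs):
    print(dict(zip(classes, p[0].round(4))))
```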
Conclusion
In this tutorial, we have learned how to build a multi-output classification model. We started by cleaning our Netflix dataset to ensure that we correctly formatted it before use. We then used the clean dataset to build the multi-output text classification model.
We used the `LogisticRegression` and `MultiOutputClassifier` algorithms to train the model. We implemented all the machine learning processes using the `Pipeline` package, which sped up the process and made our work easier.
Finally, we used our model to make predictions, and the trained model made the right predictions. To get the multi-output classification model we have built in this tutorial, click here.
References
- Google Colab notebook
- Scikit-learn documentation
- Getting started with Neattext
- How to work with machine learning pipelines
- MultiOutputClassifier algorithm
Peer Review Contributions by: Willies Ogola