Building a Spam Detection Model using Scikit-Learn
Spam refers to unsolicited messages sent to a large number of people, often for advertising, fraudulent purposes, or spreading malware. Spam can take the form of comments left on personal sites or emails sent in bulk. <!--more--> Spam detection models filter out these unwanted messages and comments, ensuring that an individual only receives messages and notifications that are relevant to them. When building the spam detection model, we will provide it with a dataset that consists of spam and non-spam comments. The model will learn from this dataset and find relevant patterns that will help it distinguish between spam and non-spam comments.
This tutorial will demonstrate how to build a machine learning model that classifies YouTube comments as spam or non-spam. We will use a dataset that contains comments from popular YouTube channels to train our model. We will implement the model using the Naive Bayes algorithm.
Table of contents
- Prerequisites
- Dataset preparation
- Extracting important columns
- Feature extraction from text
- Model building
- Accuracy score of our model
- Model evaluation
- Making a single prediction
- Making another prediction
- Conclusion
Prerequisites
To follow this tutorial clearly, a reader should:
- Be well equipped with Python programming skills.
- Understand the concepts of machine learning.
- Have some knowledge about natural language processing.
- Know how to work with some of the Scikit-learn algorithms.
- Know how to build a machine learning model using Google Colab notebook.
Dataset preparation
We will use a dataset of comments collected from five popular YouTube channels. We need to prepare this dataset before use. Data preparation involves formatting the dataset correctly to make it easy for the model to use during training.
First, we need to load these datasets into our machine. Let's import the packages that will load our dataset.
import pandas as pd
import numpy as np
We will use `Pandas` to read the datasets and `NumPy` to perform mathematical operations on them. We will have five datasets since the comments were collected from five YouTube channels.
To download the five datasets in a ZIP file, click here. After downloading the ZIP file, extract the individual datasets, which we will load onto our machine.
To load the five datasets, use the following code:
df1 = pd.read_csv("Youtube01-Channel1.csv")
df2 = pd.read_csv("Youtube02-Channel2.csv")
df3 = pd.read_csv("Youtube03-Channel3.csv")
df4 = pd.read_csv("Youtube04-Channel4.csv")
df5 = pd.read_csv("Youtube05-Channel5.csv")
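Equivalently, we can load the files in a loop. This is a minimal sketch that assumes the same five file names used above:
files = ["Youtube01-Channel1.csv", "Youtube02-Channel2.csv", "Youtube03-Channel3.csv", "Youtube04-Channel4.csv", "Youtube05-Channel5.csv"]
# Read each CSV file into its own data frame
dataframes = [pd.read_csv(f) for f in files]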
Now that we have the five datasets, we need to concatenate, or merge, them into a single data frame.
Dataset concatenation
We put the five data frames into a list and then apply the `concat` method to join them together.
frames = [df1,df2,df3,df4,df5]
df_merged = pd.concat(frames)
To view the merged dataset, use the following code:
df_merged
The output is shown below:
From the image above, our dataset has five columns: `COMMENT_ID`, `AUTHOR`, `DATE`, `CONTENT`, and `CLASS`. The columns that we are most interested in are the `CONTENT` and `CLASS` columns.
The `CONTENT` column contains the actual YouTube comments. The `CLASS` column is labeled either `0` or `1`: `0` represents a non-spam comment, and `1` represents a spam comment.
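Before going further, it is worth checking how many comments fall into each class, since a heavily imbalanced dataset would affect how we interpret the model's accuracy later. A quick sketch, assuming the merged data frame from above:
# Count how many comments are labeled non-spam (0) and spam (1)
df_merged["CLASS"].value_counts()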
Since the merged data frame combines five datasets, we need to assign keys to distinguish each dataset.
Assigning keys
Assigning keys enables us to know the YouTube channel that each comment belongs to. We will have five keys to represent the five datasets, as shown below.
keys = ["Channel1","Channel2","Channel3","Channel4","Channel5"]
After initializing the five keys, we need to concatenate these keys into our dataset using the following code:
df_with_keys = pd.concat(frames,keys=keys)
The code above will add the keys to the dataset and group the rows by YouTube channel under a hierarchical index. This makes the dataset easier to inspect and manipulate. To view the dataset with the added keys, use this code:
df_with_keys
The output is shown below:
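Because the keys form the outer level of a hierarchical (MultiIndex) index, we can also slice out the comments of a single channel. A minimal sketch, assuming the keys defined above:
# Select only the rows that came from the first channel
df_with_keys.loc["Channel1"]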
We can save the dataset into a new variable, `df`.
df = df_with_keys
To check the size of the dataset, run the following code:
df.size
The output is shown below:
9780
`df.size` returns the total number of cells in the data frame (rows × columns). Since our data frame has five columns, the `9780` cells correspond to `1956` YouTube comments after combining the five datasets.
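To count the comments directly, we can use `df.shape`, which returns the number of rows and columns:
# shape returns (rows, columns); the row count is the number of comments
df.shape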
Let's check for any missing values in our dataset.
Checking for missing values
To check for missing values, use the following code:
df.isnull().sum()
The output is shown below:
COMMENT_ID 0
AUTHOR 0
DATE 0
CONTENT 0
CLASS 0
dtype: int64
From the output above, there are no missing values. Therefore, our dataset is ready for use.
Extracting important columns
We need to extract the important columns from our dataset. As mentioned earlier, we are interested in only two columns: `CONTENT` and `CLASS`.
The `CONTENT` column contains the actual YouTube comments and will be used as the input for the model. The `CLASS` column contains the `0` and `1` labels and will be used as the output, or target, for the model.
To extract these two columns, use this code:
df_data = df[["CONTENT","CLASS"]]
We now need to specify which column will be used as an input and which one will be used as an output. This is done using the following code:
df_x = df_data['CONTENT']
df_y = df_data['CLASS']
From the code above, `df_x` is the input variable and `df_y` is the output, or target, variable. After specifying our input and output variables, let's perform feature extraction.
Feature extraction from text
Feature extraction is the process of getting important characteristics from raw text. Machine learning models do not understand text and cannot use it directly, which is why we have to perform feature extraction. The extracted features will then be used as inputs for the model.
We have to convert the raw text into a vector of numeric values during feature extraction. The vectors of numeric values represent the original raw text. Machine learning models easily understand numeric values and can use them directly.
This process of converting raw text into vectors of numeric values will be done using the `CountVectorizer` class. `CountVectorizer` is a tool from the Scikit-learn library that converts a collection of text documents into a matrix of token counts.
Let's import `CountVectorizer`:
from sklearn.feature_extraction.text import CountVectorizer
We will then use `CountVectorizer` to perform feature extraction on our input variable, `df_x`.
corpus = df_x
cv = CountVectorizer()
X = cv.fit_transform(corpus)
In the code above, we save the input variable into a new variable, `corpus`. The `fit_transform` method first learns the vocabulary of the corpus (the fit step) and then converts each comment into a vector of token counts (the transform step). Therefore, all the raw text is converted into vectors of numeric values.
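We can also inspect the vocabulary that `CountVectorizer` learned from the corpus. A short sketch; note that recent Scikit-learn versions use `get_feature_names_out()`, while older versions use `get_feature_names()`:
# Number of unique tokens learned from the corpus
print(len(cv.get_feature_names_out()))
# The first ten tokens in the learned vocabulary
print(cv.get_feature_names_out()[:10])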
To view these vectors of numeric values, use the code below. It converts the sparse matrix into a dense array of numbers.
X.toarray()
The output is shown below:
We can now use this vector of numbers to build the model.
Model building
To build our machine learning model, we need to import the packages that will be useful during this process.
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
MultinomialNB
This is the classification method imported from the Naive Bayes family of algorithms. Naive Bayes has other variants, such as `GaussianNB`, but `MultinomialNB` is best suited here because we are working with word-count features. We will use the `MultinomialNB` method to build our spam detection model. For a detailed understanding of the different Naive Bayes methods, click here.
train_test_split
We will use this function to split our dataset into two sets. The model will use the first set for training and the second set for testing.
We will start by splitting the dataset.
Dataset splitting
To split the dataset, use the following code:
X_train, X_test, y_train, y_test = train_test_split(X, df_y, test_size=0.30, random_state=42)
From the code above, we have `test_size=0.30`. This means the algorithm uses 70% of the data to train the model and the remaining 30% to test it.
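We can confirm the split by printing the shapes of the resulting sets, a quick sketch:
# Roughly 70% of the rows go to training and 30% to testing
print(X_train.shape, X_test.shape)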
Let's now build the model using the `MultinomialNB` method. First, we initialize the method as follows:
clf = MultinomialNB()
After initializing this method, we fit the model to the training data. This enables the model to learn by identifying useful patterns in the dataset.
Model fitting
clf.fit(X_train,y_train)
Accuracy score of our model
To calculate the accuracy score of this trained model, use this code:
print("Accuracy of Model",clf.score(X_test,y_test)*100,"%")
The accuracy score is shown below:
Accuracy of Model 91.95046439628483 %
This accuracy score of about 92% is high, so the model has a good chance of making accurate predictions. We can now evaluate this model using the testing dataset.
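Accuracy alone can be misleading if one class dominates the dataset, so it is also worth looking at per-class metrics. A minimal sketch using Scikit-learn's built-in metrics:
from sklearn.metrics import classification_report, confusion_matrix

y_pred = clf.predict(X_test)
# Precision, recall, and F1-score for each class
print(classification_report(y_test, y_pred))
# Rows are true labels, columns are predicted labels
print(confusion_matrix(y_test, y_pred))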
Model evaluation
We will use this model to classify the YouTube comments in the testing dataset as either spam or non-spam.
clf.predict(X_test)
We use the `predict` method to classify all the YouTube comments in the testing dataset. The output is shown below:
From the image above, we can see the model assigned labels to our testing dataset. The labels are either `0` or `1`.
We can use this model to make a single prediction.
Making a single prediction
We will give the model a new comment to classify, as shown below.
comment = ["Check this out I will be giving 50% offer on your first purchase"]
vect = cv.transform(comment).toarray()
The input is the comment "Check this out I will be giving 50% offer on your first purchase". We will use the model to classify this text as either spam (`1`) or non-spam (`0`). We also need to convert the input text into vectors of numeric values using the `cv.transform` method. Note that we use `transform` rather than `fit_transform`, so the comment is encoded with the vocabulary learned during training. Finally, the numeric values are converted into an array of numbers using the `toarray()` method.
To make this prediction, run this code:
clf.predict(vect)
The prediction result is shown below:
array([1], dtype=int64)
The prediction result is `1`, which shows that the YouTube comment above is spam. We can use this model to make another prediction.
Making another prediction
We will follow the same steps as above to make a second prediction.
comment1 = ["Great song Friend, it has really touched my heart"]
vect = cv.transform(comment1).toarray()
clf.predict(vect)
The prediction result is shown below:
array([0], dtype=int64)
The prediction result is `0`, which shows that the comment is non-spam. From these two predictions, we can see that our model can distinguish between spam and non-spam comments.
Conclusion
In this tutorial, we have learned how to build a spam detection model. We started by preparing and correctly formatting our dataset, which consisted of five sets of comments collected from popular YouTube channels. After preparing the dataset, we used it to train our spam detection model. The model was able to distinguish between spam and non-spam comments, which was the goal of this tutorial.
To get this spam detection model in Google Colab, click here.
Peer Review Contributions by: Willies Ogola