Detecting Malicious URL using Machine Learning
A malicious URL is a website link designed to promote virus attacks, phishing attacks, scams, and other fraudulent activity. Clicking a malicious URL can download malware such as trojan horses, ransomware, worms, and spyware onto the user's device.
The end goal of this malware is to access personal information, damage the user's device, or generate financial gain for the attacker. It may also cripple a company's network, leading to heavy losses.
A malicious URL can also lure people into submitting personal information on a fake website. Victims unknowingly share sensitive details with attackers, who then use that information for their own ends. The harm caused by malicious URLs can be severe.
In this tutorial, we will build a machine learning model that can detect these malicious URLs. We will train our model using a dataset of URLs labeled either `bad` or `good`. We will build the model using the Scikit-learn Python library.
Table of contents
- Prerequisites
- Exploring our dataset
- Loading dataset
- Dataset cleaning
- Features and labels
- Importing packages
- Convert the text data into vectors of numbers
- Dataset splitting
- Model building using LogisticRegression
- Fitting algorithm
- Calculating the model's accuracy score
- Making predictions
- Another prediction
- Conclusion
- References
Prerequisites
To understand this tutorial easily, a reader should:
- Have Python programming skills.
- Understand machine learning processes.
- Have some natural language processing knowledge.
- Know some Scikit-learn algorithms.
- Know how to use Google Colab notebooks.
Exploring our dataset
To build this model, we will use a dataset of URLs labeled either `bad` or `good`. The model will learn from this dataset and gain useful knowledge, which it will use to make predictions. A few illustrative rows are shown below.
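The file is a simple two-column CSV. The rows below are hypothetical placeholders that sketch its layout, not actual entries from the file:

```
url,label
example-login-verify.com/account,bad
www.wikipedia.org,good
free-prizes-now.net/claim,bad
www.python.org/downloads,good
```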
The URLs dataset can be downloaded from here.
After successfully downloading the URLs dataset, we can now load this dataset into our notebook.
Loading dataset
We will use the Pandas package to load our dataset. To import Pandas, use the following code:

```python
import pandas as pd
```
We can now load the dataset using the following code:

```python
urls_data = pd.read_csv("urldata.csv")
```
After loading the dataset, let's see how it is structured using the following code:

```python
urls_data.head()
```
The output displays the first five rows of the dataset. It has two columns: the first is `url`, which holds the actual URL links, and the second is `label`, which marks each URL as either `bad` or `good`.
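Before cleaning the data, it is worth checking how many `bad` and `good` URLs the dataset contains, since a heavily skewed class balance would affect training. A quick check using Pandas:

```python
# Count how many URLs fall under each label
print(urls_data['label'].value_counts())
```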
From here, we now need to clean our dataset to make it ready to be used by our model during training.
Dataset cleaning
Dataset cleaning involves removing noise from our dataset. Noise is the unnecessary characters, punctuation, and repetitive words in the text data.
Removing noise from the dataset enables the model to focus on the most important information, which improves performance and leads to more accurate predictions.
In this tutorial, we will first split the text into tokens, then remove repeated tokens from the dataset. Finally, we will remove the `com` token from each URL.
We will create a custom Python function to clean our dataset. The function is shown below.
```python
def makeTokens(f):
    # Make tokens after splitting by slash
    tkns_BySlash = str(f).split('/')
    total_Tokens = []
    for i in tkns_BySlash:
        # Make tokens after splitting by dash
        tokens = str(i).split('-')
        tkns_ByDot = []
        for j in range(0, len(tokens)):
            # Make tokens after splitting by dot
            temp_Tokens = str(tokens[j]).split('.')
            tkns_ByDot = tkns_ByDot + temp_Tokens
        total_Tokens = total_Tokens + tokens + tkns_ByDot
    # Remove redundant tokens
    total_Tokens = list(set(total_Tokens))
    if 'com' in total_Tokens:
        # Remove 'com' since it occurs very often and should not be a feature
        total_Tokens.remove('com')
    return total_Tokens
```
The function above is named `makeTokens`. It splits the text by slash, dash, and dot, which leaves us with clean tokens free of these delimiter characters. The function also removes redundant tokens and, finally, drops the `com` token using `total_Tokens.remove('com')`. It then returns the clean list of tokens.
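As a quick sanity check, here is what the function returns for a hypothetical URL. Note that the exact ordering may vary, since `set()` does not preserve order:

```python
# Tokenize a sample URL; duplicates and the 'com' token are removed
print(makeTokens("www.example-site.com/login"))
# Example output (order may vary):
# ['www.example', 'site.com', 'www', 'example', 'site', 'login']
```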
After this text cleaning, we now need to add features and labels to our dataset.
Features and labels
Features are the unique data points in our dataset that are used as input for the model during training. They are represented by the `url` column, which is our input column.
In machine learning, a label is the model's output after prediction. It is represented by the `label` column. The model's output can be either `bad` or `good`.
To add these features and labels, use this code:
```python
url_list = urls_data["url"]
y = urls_data["label"]
```
Let's now start importing the machine learning packages.
Importing packages
Several Python packages are essential for building our machine learning model. We will import them all now and explain the function of each below.
To import these packages, use this code:
```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
```
The function of each of these packages is as follows:
LogisticRegression
This is a Scikit-learn algorithm that we will use to train our model. This algorithm will enable our model to understand patterns and relationships in our dataset. The model will gain useful knowledge and insight, which it will use to make predictions.
train_test_split
This is the function in Scikit-learn's model selection module that splits data arrays into two subsets: one for training and one for testing.
TfidfVectorizer
This package enables the model to understand and manipulate text data. Raw text is a problem for machines: they cannot consume text in its raw form, so we need to convert it into vectors of numbers that machines can read and understand.
`TfidfVectorizer` converts raw text data into vectors of numbers that represent the original text. The conversion is based on the frequency of occurrence of each word in the text data.
For further reading on how `TfidfVectorizer` works, read this article.
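To make this concrete, here is a minimal, self-contained sketch of `TfidfVectorizer` on a toy corpus; the two sample documents are made up for illustration. On Scikit-learn versions older than 1.0, use `get_feature_names()` instead of `get_feature_names_out()`:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Two toy documents to vectorize
docs = ["good site safe site", "bad site phishing scam"]

demo_vectorizer = TfidfVectorizer()
matrix = demo_vectorizer.fit_transform(docs)

print(demo_vectorizer.get_feature_names_out())  # the learned vocabulary
print(matrix.toarray())  # each row is one document as a vector of TF-IDF weights
```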
Let's now use these packages.
Convert the text data into vectors of numbers
To convert the text data into vectors of numbers, use this code:
```python
vectorizer = TfidfVectorizer(tokenizer=makeTokens)
```
We convert the text using `TfidfVectorizer`, passing `makeTokens` as the tokenizer. `makeTokens` is the function we wrote to clean our text.
After converting text, we will save our vectors of numbers into a new variable using the following code:
```python
X = vectorizer.fit_transform(url_list)
```
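At this point, `X` is a sparse matrix with one row per URL and one column per token in the learned vocabulary. A quick way to confirm the transformation worked (the exact column count depends on the dataset):

```python
# Rows = number of URLs, columns = vocabulary size
print(X.shape)
```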
The next step is to split our dataset.
Dataset splitting
We will use `train_test_split` to split our dataset into two sets: one for model training and one for model testing.
```python
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
```
In this code, we have used `test_size=0.2`. This is the ratio applied when splitting the dataset: 80% of the dataset will be used to train the model and the remaining 20% will be used to test it.
We can now build the model using `LogisticRegression`.
Model building using LogisticRegression
We will initialize the `LogisticRegression` algorithm using the following code:

```python
logit = LogisticRegression()
```
After initializing the algorithm, we will fit it to our training dataset, from which the model will learn.
Fitting algorithm
To fit the algorithm, use this code:
```python
logit.fit(X_train, y_train)
```
This process trains our model. In a notebook, the cell output displays the fitted `LogisticRegression` estimator along with its parameters, confirming that training completed. Let's now calculate the accuracy score of our model.
Calculating the model's accuracy score
To calculate the accuracy score, run this code:
print("Accuracy ",logit.score(X_test, y_test))
The output is shown below:
```
Accuracy 0.96163771063
```
The accuracy score is about 96.164%. This is a very high score and implies that our model was well trained. The model learned a lot from the dataset during the training phase and can now be used to make predictions.
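Accuracy alone can be misleading when classes are imbalanced. As an optional extra check, here is a short sketch using Scikit-learn's standard metrics to print per-class precision and recall:

```python
from sklearn.metrics import classification_report

# Per-class precision, recall, and F1 on the held-out test set
print(classification_report(y_test, logit.predict(X_test)))
```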
Making predictions
To make predictions, we will pass the model several URLs and see whether it classifies each one as `bad` or `good`. The URLs that we will use are shown below.
X_predict = ["https://www.section.io/engineering-education/",
"https://www.youtube.com/",
"https://www.traversymedia.com/",
"https://www.kleinehundezuhause.com",
"http://ttps://www.mecymiafinance.com",
"https://www.atlanticoceanicoilandgas.com"]
We can run predictions on these URLs using the following code:
```python
X_predict = vectorizer.transform(X_predict)
New_predict = logit.predict(X_predict)
```
In the code above, we use the `vectorizer.transform` method to convert the text into vectors of numbers, then apply the `logit.predict` method to make the actual predictions.
To print the prediction results, run this code:
```python
print(New_predict)
```
The prediction results are shown below.
```
['good' 'good' 'good' 'bad' 'bad' 'bad']
```
From the prediction results, the model classified the first three websites as `good`. This is correct, because these are well-known legitimate websites. The last three websites were classified as `bad`.
Let's make another prediction.
Another prediction
The following are the website URLs.
X_predict1 = ["www.buyfakebillsonlinee.blogspot.com",
"www.unitedairlineslogistics.com",
"www.stonehousedelivery.com",
"www.silkroadmeds-onlinepharmacy.com" ]
To make the predictions, use the following code:

```python
X_predict1 = vectorizer.transform(X_predict1)
New_predict1 = logit.predict(X_predict1)
print(New_predict1)
```
The prediction result is shown below.
```
['bad' 'bad' 'bad' 'bad']
```
The model has classified all four URLs as `bad`. Using this model, we are able to classify URLs as either `bad` or `good`.
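To reuse the trained model on a single URL, the two steps above can be wrapped in a small helper. This is a minimal sketch; `classify_url` is a name introduced here for illustration:

```python
def classify_url(url):
    # Vectorize a single URL and return the model's 'good'/'bad' label
    features = vectorizer.transform([url])
    return logit.predict(features)[0]

print(classify_url("https://www.section.io/"))
```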
Conclusion
In this tutorial, we have learned how to detect malicious URLs using machine learning. We started by discussing the negative impact of clicking a malicious URL.
We learned how to clean our dataset to ensure that it is correctly formatted.
The Google Colab notebook used in this tutorial can be found here.
References
- Scikit-learn documentation
- What is a malicious URL?
- Google Colab notebook
- Logistic Regression in Python
Peer Review Contributions by: Ahmad Mardeni