Automated Plagiarism Detection Bot
Plagiarism, taking another person's ideas without proper credit, can feel like someone kidnapped your idea. Fittingly, the word comes from the Latin root plagiarius, which literally means "kidnapper". So it is no surprise that plagiarism is widely considered wrong. <!--more-->
Q: What's worse than someone stealing your work?
A: Someone stealing your work and claiming it's theirs!
Plagiarism Detection
Anyway, I won't discuss the ethical upset and academic dishonesty plagiarism can bring, because that's not what this article is about. If you want to know more about that side of it, I recommend you visit this article.
Overview
In this article, I will split the content into two parts:
- Overview of Text Similarity
  - Brief Introduction
  - Math Formulation
  - Example
- Overview of the Algorithm
  - Installation and Setup
  - Code Walk-through
  - Demo Plagiarism Detector
Let's get to it!
Text Similarity: Formulating the Problem
Text similarity is a measure of the likeness between two textual documents.
We want a numerical value that indicates how "close", or how similar, two text documents are. Let's break the problem down into the steps involved in finding how similar two texts are.
- First, we need to define a textual document in an algebraic model that we can actually do useful calculations with. A word-processor file is not a useful representation, so we will define a model that lets us work with the data.
- Next, once the text document has been converted into that model, we need to define a (mathematical) operation that will serve as a proxy for similarity, and show how that operation applies within the model.
- Finally, we need a normalized numerical value that tells us how similar two texts are. That will indicate whether our method is effective at finding the similarity of text documents.
These steps are rather simple, but how you go about answering each one can vary greatly.
Math Formulation
Step 1:
A popular way of characterizing text documents is the vector space model. The idea is to represent terms as vectors. A term can be anything: a single word, multiple keywords, or even a phrase. Each term counts as a non-zero component along its own, separate dimension.
According to this paper, mathematically we can define a document space as:
$$D_i = (d_{i1}, d_{i2}, \ldots, d_{it})$$
where $D_i$ represents a document within the document space and $d_{it}$ represents the $i^{th}$ document's $t^{th}$ term.
Every index term represents a dimension in our vector space. For example, if we were to use the entire English dictionary as our document, we would have as many dimensions as there are words in the English vocabulary.
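To make this concrete, here is a minimal sketch of the model using a made-up two-document corpus (not one from this article); each unique word is a dimension, and each document becomes a vector of term counts:

# a made-up corpus for illustration
doc_a = "the cat sat on the mat"
doc_b = "the dog sat on the log"

# build the shared vocabulary: one dimension per unique term
vocabulary = sorted(set(doc_a.split()) | set(doc_b.split()))

# represent each document as a term-count vector over that vocabulary
vector_a = [doc_a.split().count(term) for term in vocabulary]
vector_b = [doc_b.split().count(term) for term in vocabulary]

print(vocabulary)  # ['cat', 'dog', 'log', 'mat', 'on', 'sat', 'the']
print(vector_a)    # [1, 0, 0, 1, 1, 1, 2]
print(vector_b)    # [0, 1, 1, 0, 1, 1, 2]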
Now that we have our model let's move on!
Step 2:
Using this model, we can now apply an operation to evaluate a similarity coefficient. In this article we will use a method called cosine similarity, which measures the cosine of the angle between two non-zero vectors.
The mathematical definition, as given on Wikipedia, is:
$$ \text{similarity} = \cos(\theta) = \frac{\mathbf{A} \cdot \mathbf{B}}{\|\mathbf{A}\| \|\mathbf{B}\|} = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^2} \, \sqrt{\sum_{i=1}^{n} B_i^2}} $$
where $A_i$ and $B_i$ are the components of vectors $\mathbf{A}$ and $\mathbf{B}$ respectively.
Step 3:
Finally, the cosine similarity gives an outcome bounded by $[0,1]$ (not $[-1,1]$, because term counts are non-negative, so the document vectors lie in positive space only):
- 1 - very similar (identical orientation)
- 0 - not similar (orthogonal)
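As a quick sanity check, here is a minimal sketch of the formula in plain Python (assuming NumPy is installed), applied to the made-up count vectors from the earlier sketch:

import numpy as np

# cosine similarity straight from the formula: dot(A, B) / (||A|| * ||B||)
def cosine_sim(a, b):
    a, b = np.asarray(a), np.asarray(b)
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# term counts are non-negative, so the result always lands in [0, 1]
vector_a = [1, 0, 0, 1, 1, 1, 2]
vector_b = [0, 1, 1, 0, 1, 1, 2]
print(cosine_sim(vector_a, vector_b))  # 0.75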
Plagiarism Example
For this example, I am going to replicate the example from Selva Prabhakaran. Let us take three documents on the topic of inheritance. Document 1 is a snippet from the educba site on what inheritance in programming is. Document 2 and Document 3 are from the Wikipedia page on inheritance (object-oriented programming). The only difference is that Document 3 is a subsection of Document 2. How do you think their similarities will compare?
Three Document Similarity Example
Image Source: Author -- Earl Potters
From the example above you can see three similar documents that share a central theme, namely inheritance.
With the example above, I quantitatively measure the similarity between the 3 documents using 3 different metrics: total common words (similar to, but not the same as, the Jaccard index), Euclidean distance, and cosine similarity.
We have limited the scope of our quantitative analysis to three keywords: 'inheritance', 'class', and 'object'. We have briefly talked about cosine similarity, but I want to explain what it means in graphical terms.
To explain what I mean, here is a 3D projection of the 3 documents.
3D Document Projection
Image Source: Author -- Earl Potters
As illustrated by this example, doc 2 and doc 3 are the closest in orientation. Conversely, doc 1 and doc 3 are the closest using the Euclidean distance metric.
The graphical representation of common words is a set intersection, which can be drawn as a Venn diagram.
Document Intersection
Image Source: Author -- Earl Potters
You can see three intersecting circles, each containing the set of all words in its document's space. The overlap between two circles is the intersection of the two word sets.
Compare and Contrast
As you can see, all 3 similarity metrics have their own interpretation of similarity.
- Cosine similarity checks the orientation (angle) between two documents.
- Euclidean distance checks the magnitude of the distance between two documents.
- Total common words checks the intersection of two documents' word sets.
Looking at the example, we can see why cosine similarity is a better metric for judging similarity than total common words or Euclidean distance: total common words is heavily biased by file size, and Euclidean distance is similarly biased when comparing documents of different sizes, whereas cosine similarity depends only on orientation.
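To make the comparison concrete, here is a small sketch that computes all three metrics on hypothetical ('inheritance', 'class', 'object') counts for the three documents; the counts below are illustrative stand-ins, not the actual counts behind the figures:

import numpy as np

# hypothetical keyword counts: doc_2 is a long page, doc_3 a subsection of it
doc_1 = np.array([4, 2, 3])
doc_2 = np.array([10, 6, 6])
doc_3 = np.array([5, 3, 3])

def common_words(a, b):
    # size of the count overlap per keyword
    return np.minimum(a, b).sum()

def euclidean(a, b):
    return np.linalg.norm(a - b)

def cosine(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

pairs = {'doc1-doc2': (doc_1, doc_2),
         'doc1-doc3': (doc_1, doc_3),
         'doc2-doc3': (doc_2, doc_3)}
for name, (a, b) in pairs.items():
    print(name, common_words(a, b), round(euclidean(a, b), 2), round(cosine(a, b), 4))

With these stand-in counts, doc_2 and doc_3 are perfectly aligned in orientation (cosine of 1.0) even though they are far apart in Euclidean distance, while doc_1 and doc_3 are closest by Euclidean distance, mirroring the behavior seen in the figures.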
Plagiarism Detector Code
Now we are at the coding step! By now you should know the following things:
- A document can be converted into a Vector space model where the documents are represented as vectors, mostly determined by term frequency.
- Cosine Similarity is an operation on vectors that will allow us to determine the similarity of two documents.
- We can produce a normalized numerical value between 0 and 1 that indicates the similarity between two documents.
Getting Started
In this short overview I will assume you are on an Ubuntu Linux operating system with the Conda package manager.
1. Installing Dependencies
You will need to install the following: pandas and scikit-learn:
$ conda install scikit-learn pandas
2. Verification
To ensure that the packages are installed properly, open up your Python interpreter and run the following code.
# loading python modules
import sklearn
import pandas as pd
# Verifying version
sklearn.show_versions()
pd.show_versions()
You may need to install pytest and hypothesis for pandas to run pd.show_versions(). If so, just run:
$ conda install pytest hypothesis
Code Walkthrough
Now that we have everything set up it's time to code!
1. Load Python Modules
Let's first import the required modules.
# Load Python Modules
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import pandas as pd
2. Defining Data Set
Here we define the data. I have set up a list of (name, data) tuples.
# define data
corpus = [
    ('doc_1', 'This is the first document.'),
    ('doc_2', 'This document is the second document.'),
    ('doc_3', 'And this is the third one.'),
    ('doc_4', 'Is this the first document?')
]
3. Process the Data/Helper Functions
Now we need to process the data. Here is a neat trick that uses zip() to separate the names and data into their own lists.
# split doc_names and doc_data
doc_names, doc_data = zip(*corpus)
Output:
doc_names -> ('doc_1', 'doc_2', 'doc_3', 'doc_4')
doc_data -> ('This is the first document.', \
'This document is the second document.', \
'And this is the third one.', \
'Is this the first document?')
4. Vectorize Data
This is the most crucial step. We need to convert the data into a vector space. Luckily, sklearn has a class called CountVectorizer() that will do the heavy lifting.
# create an instance of the CountVectorizer class, which converts a collection of text documents to a matrix of token counts
vectorizer = CountVectorizer()
# vectorize doc_data
document_term_matrix = vectorizer.fit_transform(doc_data).toarray() -> array([[0, 1, 1, 1, 0, 0, 1, 0, 1],
[0, 2, 0, 1, 0, 1, 1, 0, 1],
[1, 0, 0, 1, 1, 0, 1, 1, 1],
[0, 1, 1, 1, 0, 0, 1, 0, 1]])
# returns full list of tokenized words
tokenized_words = vectorizer.get_feature_names() -> ['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']
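# note: on scikit-learn 1.0 and later, get_feature_names() is deprecated in favor of get_feature_names_out()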
# output pandas table document_term_matrix
df_document_term_matrix = pd.DataFrame(data=document_term_matrix,
columns= tokenized_words,
index=doc_names)
df_document_term_matrix
Image Source: Author -- Earl Potters
Now we have our data set in a model!
With the vectorized data from the previous step, we can calculate the cosine similarity using the cosine_similarity function from sklearn.
5. Create Similarity feature aka Cosine Similarity
# compute pairwise cosine similarity of the document-term matrix with itself, giving the cosine similarity matrix
cosine_matrix = cosine_similarity(document_term_matrix) -> array([[1. , 0.79056942, 0.54772256, 1. ],
[0.79056942, 1. , 0.4330127 , 0.79056942],
[0.54772256, 0.4330127 , 1. , 0.54772256],
[1. , 0.79056942, 0.54772256, 1. ]])
# output pandas table of cosine_matrix
df_cosine_matrix = pd.DataFrame(data=cosine_matrix,
columns= doc_names,
index=doc_names)
df_cosine_matrix
Image Source: Author -- Earl Potters
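Notice that doc_1 and doc_4 score a perfect 1.0 even though one is a statement and the other a question: they contain exactly the same words, and a bag-of-words model ignores word order and punctuation.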
6. Test Feature
Finally, we can print the results and see if we have a reasonable output.
# print the document-term matrix table
print(df_document_term_matrix)
# print the cosine similarity table
print(df_cosine_matrix)
Your output should look like this.
df_document_term_matrix
Image Source: Author -- Earl Potters
df_cosine_matrix
Image Source: Author -- Earl Potters
7. Review/Refactor
Here is the original code refactored into a reusable class.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import pandas as pd
class Plagiarism_Checker():
    def __init__(self, corpus, vectorizer=None):
        self.doc_names, self.doc_data = zip(*corpus)
        self._vectorizer = vectorizer

    @property
    def vectorizer(self):
        if self._vectorizer is None:
            raise TypeError("Vectorizer can't be None")
        if not hasattr(self._vectorizer, '_fit_transform'):
            self.__compute_document_term_matrix()
        return self._vectorizer

    @vectorizer.setter
    def vectorizer(self, value):
        if value is not None:
            self._vectorizer = value
            self.__compute_document_term_matrix()
        else:
            self._vectorizer = value

    def __compute_document_term_matrix(self):
        # cache the fitted document-term matrix on the vectorizer instance
        self._vectorizer._fit_transform = self._vectorizer.fit_transform(self.doc_data)

    def get_document_term_matrix(self):
        count_vector = self.vectorizer._fit_transform.toarray()
        return count_vector

    def get_feature_words(self):
        # use get_feature_names_out() on scikit-learn 1.0 and later
        return self.vectorizer.get_feature_names()

    def get_document_term_matrix_dataframe(self):
        df = pd.DataFrame(data=self.get_document_term_matrix(),
                          columns=self.get_feature_words(),
                          index=self.doc_names)
        return df

    def get_cosine_similarity_dataframe(self):
        # compute cosine similarity matrix
        df = pd.DataFrame(data=cosine_similarity(self.get_document_term_matrix()),
                          columns=self.doc_names,
                          index=self.doc_names)
        return df

class Count_Vectorizer_Detector(Plagiarism_Checker):
    def __init__(self, corpus):
        super().__init__(corpus, vectorizer=CountVectorizer())
BONUS
I added a TF-IDF vectorizer that also fits with the class above.
from sklearn.feature_extraction.text import TfidfVectorizer

class Tdif_Vectorizer_Detector(Plagiarism_Checker):
    def __init__(self, corpus):
        super().__init__(corpus, vectorizer=TfidfVectorizer(smooth_idf=True, use_idf=True))

    def get_tfidf_weights_dataframe(self):
        # idf values; accessing self.vectorizer fits the vectorizer if needed
        df_idf = pd.DataFrame(self.vectorizer.idf_,
                              index=self.get_feature_words(),
                              columns=["idf_weights"])
        # sort ascending
        df_idf = df_idf.sort_values(by=['idf_weights'])
        return df_idf
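As a quick usage sketch (assuming the Plagiarism_Checker class from the previous section is in scope, and reusing the toy corpus from earlier), the TF-IDF variant is a drop-in replacement; its cosine values will differ from the count-based matrix because TF-IDF reweights the terms:

# reuse the toy corpus from the walkthrough
corpus = [
    ('doc_1', 'This is the first document.'),
    ('doc_2', 'This document is the second document.'),
    ('doc_3', 'And this is the third one.'),
    ('doc_4', 'Is this the first document?')
]
tfidf_checker = Tdif_Vectorizer_Detector(corpus)
print(tfidf_checker.get_tfidf_weights_dataframe())      # per-term idf weights
print(tfidf_checker.get_cosine_similarity_dataframe())  # tf-idf cosine matrix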
Final Demo
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import pandas as pd
class Plagiarism_Checker():
    def __init__(self, corpus, vectorizer=None):
        self.doc_names, self.doc_data = zip(*corpus)
        self._vectorizer = vectorizer

    @property
    def vectorizer(self):
        if self._vectorizer is None:
            raise TypeError("Vectorizer can't be None")
        if not hasattr(self._vectorizer, '_fit_transform'):
            self.__compute_document_term_matrix()
        return self._vectorizer

    @vectorizer.setter
    def vectorizer(self, value):
        if value is not None:
            self._vectorizer = value
            self.__compute_document_term_matrix()
        else:
            self._vectorizer = value

    def __compute_document_term_matrix(self):
        # cache the fitted document-term matrix on the vectorizer instance
        self._vectorizer._fit_transform = self._vectorizer.fit_transform(self.doc_data)

    def get_document_term_matrix(self):
        count_vector = self.vectorizer._fit_transform.toarray()
        return count_vector

    def get_feature_words(self):
        # use get_feature_names_out() on scikit-learn 1.0 and later
        return self.vectorizer.get_feature_names()

    def get_document_term_matrix_dataframe(self):
        df = pd.DataFrame(data=self.get_document_term_matrix(),
                          columns=self.get_feature_words(),
                          index=self.doc_names)
        return df

    def get_cosine_similarity_dataframe(self):
        # compute cosine similarity matrix
        df = pd.DataFrame(data=cosine_similarity(self.get_document_term_matrix()),
                          columns=self.doc_names,
                          index=self.doc_names)
        return df

class Count_Vectorizer_Detector(Plagiarism_Checker):
    def __init__(self, corpus):
        super().__init__(corpus, vectorizer=CountVectorizer())

class Tdif_Vectorizer_Detector(Plagiarism_Checker):
    def __init__(self, corpus):
        super().__init__(corpus, vectorizer=TfidfVectorizer(smooth_idf=True, use_idf=True))

    def get_tfidf_weights_dataframe(self):
        # idf values; accessing self.vectorizer fits the vectorizer if needed
        df_idf = pd.DataFrame(self.vectorizer.idf_,
                              index=self.get_feature_words(),
                              columns=["idf_weights"])
        # sort ascending
        df_idf = df_idf.sort_values(by=['idf_weights'])
        return df_idf

if __name__ == "__main__":
    # define document data
    corpus = [
        ('doc_1', 'This is the first document.'),
        ('doc_2', 'This document is the second document.'),
        ('doc_3', 'And this is the third one.'),
        ('doc_4', 'Is this the first document?')
    ]
    obj = Count_Vectorizer_Detector(corpus)
    cosine_matrix = obj.get_cosine_similarity_dataframe()
    print(cosine_matrix)
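Running the demo should print the same cosine matrix we computed step by step earlier; rounded, the output looks roughly like this:

          doc_1     doc_2     doc_3     doc_4
doc_1  1.000000  0.790569  0.547723  1.000000
doc_2  0.790569  1.000000  0.433013  0.790569
doc_3  0.547723  0.433013  1.000000  0.547723
doc_4  1.000000  0.790569  0.547723  1.000000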
Link to my code online is here.
Link to the GitHub code is here.
References:
https://ptabdata.blob.core.windows.net/files/2017/IPR2017-01039/v20_EX1020_Salton,%201975.pdf
https://en.wikipedia.org/wiki/Cosine_similarity
https://www.machinelearningplus.com/nlp/cosine-similarity/
Peer Review Contributions by: Nadiv Gold Edelstein