Sentiment Analysis with Spacy and Scikit-Learn

Sentiment analysis is a subset of natural language processing and text analysis that detects positive or negative sentiment in a text. It helps businesses understand how people perceive their brand and how they feel about different goods or services.

The evaluation is done using reviews on their sites, as well as monitoring online conversations.

Sentiment analysis is used to analyze customer feedback. It helps businesses to determine whether customers are happy or frustrated with their products.

Businesses use this information to change their products to meet customers' needs.

In this tutorial, we will use Spacy to build our sentiment analysis model. We will use three datasets from IMDB, Amazon, and Yelp.

These datasets contain reviews that are either labeled positive or negative.

In addition, the dataset contains movie reviews from the IMDB dataset, product reviews from the Amazon dataset, and local business and social networking site reviews from the Yelp dataset.


Prerequisites

  1. A good understanding of Python.
  2. Some knowledge of machine learning.
  3. Some working knowledge of natural language processing.

NOTE: To follow along easily, use Google Colab.

Introduction

There are different types of sentiment analysis depending on the model goals.

  • Models that focus on the polarity of a given text: positive, neutral, or negative.
  • Models that focus on the feelings and emotions within a text.
  • Models that focus on the intentions and urgency of a customer.

Depending on these goals, they are further classified into the following groups.

  1. Standard Sentiment Analysis.
  2. Fine-grained Sentiment Analysis.
  3. Emotion Detection.
  4. Aspect-based Sentiment Analysis.
  5. Intent Detection.

Standard Sentiment Analysis

It detects the polarity of a given text as positive, negative, or neutral.

For Example:

  • "I love using your product": The polarity is Positive.
  • "Your product has many issues": The polarity is Negative.
  • "I am open to further assistance about your product": The polarity is Neutral.

Fine-grained Sentiment Analysis

It focuses on the polarity of a given text but adds more options or categories, such as:

  • Very positive
  • Positive
  • Neutral
  • Negative
  • Very Negative

For Example:

  • "This is the best product ever": The polarity is Very Positive.
  • "This product is disgusting": The polarity is Very Negative.

For a practical guide on fine-grained sentiment analysis, click here

Emotion Detection

This type detects emotions and feelings in a given text. For example, the emotions can be happiness or anger.

For Example:

  • "This product makes my work easier": This shows Happiness.
  • "It has ruined my schedule and caused me pain": This shows Anger.

For a practical guide on emotion detection, click here

Aspect-based Sentiment Analysis

Aspect-based sentiment analysis (ABSA) is a text analysis technique that categorizes data by aspect and identifies the sentiment attributed to each one.

Aspect-based sentiment analysis can analyze customer feedback by associating specific sentiments with different aspects of a product or service.

Aspects are the attributes or components of a product or service. For example: "The user experience of a new product", "the response time for a query or complaint", or "the ease of integration of new software".

Aspect sentiment analysis is important because it helps companies to sort and analyze customer data, automate processes like customer support tasks, and gain powerful insights from customer reviews.

For a practical guide on aspect-based sentiment analysis, click here

Intent Detection

This focuses on the customer's goals and intention behind a given statement. This is applied in chatbot systems to provide better answers and assistance.

For Example:

  • "This app keeps on crashing. What should I do?": This shows "Need for assistance."

For a practical guide on intent detection, click here

In this tutorial, we will build a standard sentiment analysis model.

Dataset used

As stated, we are working with three datasets: IMDB dataset, Amazon dataset, and Yelp dataset.

The datasets contain reviews of different products or services. The reviews are labeled either 1 to show a positive review or 0 for a negative review.

Our datasets are plain text files, which are easy to read and load into our model.

A snip of the three datasets in text format is shown below:

Amazon dataset snip

To download the Amazon dataset in a text format, click here

IMDB dataset snip

To download the IMDb dataset in a text format, click here

Yelp dataset snip

To download the Yelp dataset in a text format, click here

After downloading all three datasets, let's load them into our machine.

Loading dataset

We use the Pandas package to load our dataset. Pandas is used for data manipulation and analysis.

import pandas as pd

Let's use Pandas to read our three datasets as shown below.

# The files are tab-separated with no header row, so we pass header=None
# to stop Pandas from treating the first review as a column name.
df_yelp = pd.read_table('yelp_labelled.txt', header=None)
df_imdb = pd.read_table('imdb_labelled.txt', header=None)
df_amz = pd.read_table('amazon_cells_labelled.txt', header=None)

Since we have three different datasets, we have to concatenate them together.

frames = [df_yelp,df_imdb,df_amz]

Adding headers

Our dataset files have no headers. We need to name the two columns ourselves.

The first column, named Message, will contain the actual review text, while the second column, named Target, will contain the label of either 1 or 0.

for frame in frames:
    frame.columns = ["Message", "Target"]

Let's print the column names, as shown below:

for frame in frames:
    print(frame.columns)

The output is, as shown below:

Index(['Message', 'Target'], dtype='object')
Index(['Message', 'Target'], dtype='object')
Index(['Message', 'Target'], dtype='object')

We need to assign keys to our datasets to distinguish them after merging. The datasets fall into three groups: Yelp, IMDB, and Amazon.

keys = ['Yelp','IMDB','Amazon']

Merge or concatenate our datasets with the keys

We created three keys in the above section: Yelp, IMDB, and Amazon. We now pass this list of keys when concatenating our datasets.

The keys label each row with the dataset it came from, so we can still tell the three sources apart after merging them into a single dataset.

df = pd.concat(frames,keys=keys)

To see the output of the merged keys and datasets, use this command.

df.head()

Output

The above output shows the keys added to our dataset, along with the two columns, Message and Target, where Target is either 0 or 1.
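Because we passed keys to pd.concat, the merged DataFrame has a MultiIndex. As a quick sketch, we can slice out the rows from a single source and check the overall size:

# Rows that came from the Yelp file only:
print(df.loc['Yelp'].head())

# Total number of reviews across all three datasets:
print(df.shape)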

Now that we have prepared the dataset, we can now remove stop words from the dataset.

Removing stop words

Stop words are a set of commonly used words in a language. Because they appear in almost every document, they carry little classification power and can bias the model.

We remove stop words using Spacy. Let's first install Spacy on our machine.

Since we are using Google Colab in this tutorial, we install Spacy and download its small English model using these commands.

!pip install -U spacy
!python -m spacy download en_core_web_sm

After installing Spacy, let's import this library.

import spacy
from spacy.lang.en.stop_words import STOP_WORDS
nlp = spacy.load('en_core_web_sm')

In the above code, we have imported the following.

Spacy

This is the library we will use to tokenize our text and remove the stop words.

Stopwords

This is Spacy's set of English stop words, which we will remove from the dataset.

We also load the small English model using spacy.load('en_core_web_sm').

Let's put Spacy's stopwords into a list.

stopwords = list(STOP_WORDS)
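As a quick sanity check, we can look at how many stop words Spacy defines; the exact count varies by Spacy version.

print(len(stopwords))   # number of English stop words Spacy defines
print(stopwords[:10])   # a small sample of them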

Let's write a tokenizer function that removes the stopwords.

def spacy_tokenizer(sentence):
    # Tokenize the review, then keep lowercased lemmas that are
    # neither stop words nor punctuation.
    return [token.lemma_.lower().strip() for token in nlp(sentence)
            if not token.is_stop and not token.is_punct]

The spacy_tokenizer function above loops through the tokens in a review and drops the stopwords using token.is_stop. It also removes all the punctuation using token.is_punct. We will pass this function to our vectorizers later.
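As a quick check, we can run the tokenizer on a sample review; the exact lemmas will vary slightly between Spacy model versions.

print(spacy_tokenizer("This product makes my work so much easier!"))
# e.g. ['product', 'make', 'work', 'easy']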

Loading machine learning packages

Let's import all the packages used in building our model. We will use Scikit-learn in building the model.

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.base import TransformerMixin
from sklearn.svm import LinearSVC
from sklearn.pipeline import Pipeline

In the above code, we have imported the following.

  1. CountVectorizer
  2. TfidfVectorizer
  3. accuracy_score
  4. train_test_split
  5. TransformerMixin
  6. LinearSVC
  7. Pipeline

CountVectorizer

This package transforms the texts in our dataset into numeric vectors, where each value is the count of a word's occurrences in a document. The model can work with these numeric vectors, unlike raw text.

For detailed understanding about CountVectorizer, click here
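To make this concrete, here is a minimal sketch of CountVectorizer on a toy corpus. It uses get_feature_names_out, which is available in recent scikit-learn versions; older versions use get_feature_names instead.

from sklearn.feature_extraction.text import CountVectorizer

toy = ["I love this product", "I hate this product"]
cv = CountVectorizer()
counts = cv.fit_transform(toy)

print(cv.get_feature_names_out())  # the vocabulary learned from the corpus
print(counts.toarray())            # one row of word counts per document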

TfidfVectorizer

TF-IDF (term frequency-inverse document frequency) is a statistical measure that evaluates how relevant a word is in a collection of documents.

If a word is common in a given document and common in other documents, it indicates that it has less power when making a prediction.

Conversely, if a word is unique in a document, it shows it has more power in classification and predictive analysis.

For detailed understanding about TfidfVectorizer click here
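A minimal sketch on a toy corpus illustrates this: the words shared by both documents receive lower weights than the words that distinguish them.

from sklearn.feature_extraction.text import TfidfVectorizer

toy = ["the product is great", "the product is terrible"]
tfidf = TfidfVectorizer()
weights = tfidf.fit_transform(toy)

# "great" and "terrible" receive higher weights than the shared words.
print(tfidf.get_feature_names_out())
print(weights.toarray().round(2))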

accuracy_score

This function calculates the model's accuracy when making predictions.

train_test_split

This is used to split our dataset into a training set and a testing set.

TransformerMixin

This is a scikit-learn base class for building custom transformers. We will use it to create a custom text-cleaning step that plugs into our pipeline.

LinearSVC

This is the support vector machine classifier we use to build the model. It fits a linear support vector classifier to our data.

Pipeline

A pipeline is an important aspect of machine learning. It chains steps such as our custom cleaner, CountVectorizer, and LinearSVC into a single unit.

This makes the model-building process faster and easier since all the stages run in sequence as one bundled process.

Let's use these imported packages starting with TransformerMixin.

To implement the data transformer, we need to create a custom class.

Custom transformer class

def clean_text(text):
    # Basic text cleaning: strip surrounding whitespace and lowercase.
    return text.strip().lower()

class predictors(TransformerMixin):
    def transform(self, X, **transform_params):
        # Clean every document before it reaches the vectorizer.
        return [clean_text(text) for text in X]

    def fit(self, X, y=None, **fit_params):
        # Nothing to learn; scikit-learn still requires a fit() method.
        return self

    def get_params(self, deep=True):
        return {}

The predictors class above has the following methods.

transform

It applies clean_text to every document in X, converting the raw reviews into a cleaned form that the vectorizer can consume.

fit

This method is required by the scikit-learn transformer API. Our cleaner has nothing to learn from the data, so it simply returns self.

get_params

This method returns the transformer's parameters. Ours has none, so it returns an empty dictionary.

clean_text

This helper function cleans our dataset by stripping surrounding whitespace and converting all the texts into lower case.

Let's go to the next stages.

Vectorization and classifier

In vectorization, we use CountVectorizer, which converts our text dataset into numeric vectors. We pass it the spacy_tokenizer function we defined earlier, and ngram_range=(1,1) means we use single words (unigrams) as features.

The classifier is the algorithm used to build the model. In this case, we are using LinearSVC, a linear classifier from the support vector machine family.

vectorizer = CountVectorizer(tokenizer = spacy_tokenizer, ngram_range=(1,1))
classifier = LinearSVC()

TfidfVectorizer method

This weights each word by how informative it is across our dataset. We define it here as an alternative vectorizer; the pipeline below uses the CountVectorizer.

tfvectorizer = TfidfVectorizer(tokenizer = spacy_tokenizer)

Features and labels

Features are the independent variables in our dataset that are used as inputs when building our model. In our case, the Message column will be our feature.

Labels are what we want to predict. In our case, we are trying to predict the sentiment of a given review. So the output can be either 1 for positive or 0 for negative.

In our case, the Target column will be the label.

X = df['Message']
ylabels = df['Target']

Dataset splitting

We split our dataset into a training set and a test set. This tutorial uses 75% of our dataset as the training set and 25% as the test set, as shown below:

X_train, X_test, y_train, y_test = train_test_split(X, ylabels, test_size=0.25, random_state=42)
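As a quick sanity check, roughly 75% and 25% of the rows should land in each split.

print(X_train.shape, X_test.shape)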

Let's build our pipeline to automate all these processes.

Creating the pipeline

The pipeline will clean our dataset, vectorize our text into numeric values, and finally classify our reviews as either positive or negative.

pipe = Pipeline([("cleaner", predictors()),
                 ('vectorizer', vectorizer),
                 ('classifier', classifier)])

Let's use our pipeline to fit our model into the training dataset as shown.

pipe.fit(X_train,y_train)

After training, the output is as shown.

Pipeline(memory=None,
     steps=[('cleaner', <__main__.predictors object at 0x7fee6cac3f98>), ('vectorizer', CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ng...ax_iter=1000,
     multi_class='ovr', penalty='l2', random_state=None, tol=0.0001,
     verbose=0))])

Let's now calculate the accuracy score of our model.

Accuracy score

print("Accuracy: ",pipe.score(X_train,y_train))

The output is shown below:

Accuracy:  0.9849726775956285

This shows that our model scores about 98.5% on the training data, which suggests it has learned the patterns in the reviews well.
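Note that pipe.score(X_train, y_train) measures accuracy on the same data the model was trained on, which is usually optimistic. A fairer check, whose exact number depends on your split, is accuracy on the held-out test set:

# accuracy_score was imported earlier from sklearn.metrics.
y_pred = pipe.predict(X_test)
print("Test accuracy: ", accuracy_score(y_test, y_pred))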

Making prediction

We now use our model to see if we can classify a review as either positive or negative.

pipe.predict(["I recommend this movie to watch, it's great"])

The output of the prediction is as shown.

array([1])

The output is 1, which is a positive review.

Let's try another sample text.

example = ["I love this product so much",
 "What an inferior item! I will purchase a new one",
 "I feel happy when using your product!"]

Let's run the prediction and see the output.

pipe.predict(example)

array([1, 0, 1])

This shows that the first sentence in the array was a positive review, the second was negative, and the last was positive.

In this example, all these cases are true. This shows that our model can make accurate predictions.
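To read each prediction alongside its sentence, we can print them together:

# 1 = positive review, 0 = negative review
for text, label in zip(example, pipe.predict(example)):
    print(label, "-", text)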

Conclusion

In this tutorial, we have learned about sentiment analysis with Spacy and Scikit-learn. We started by learning what sentiment analysis is and its importance to a business.

We also discussed the different types of sentiment analysis. In this tutorial, we focused on standard sentiment analysis.

We then moved to dataset cleaning and used the final dataset to build our sentiment analysis model.

Next, we performed all the steps required to build the model and finally used a pipeline approach to automate all the processes involved in model building.

Finally, we used our model to make predictions. For example, our model was able to classify a review as either positive or negative.

Using the above steps, a reader should be able to build a sentiment analysis model using Spacy and Scikit-learn.

To access the implementation for this tutorial in Google Colab, click here

Peer Review Contributions by: Lalithnarayan C

Published on: Oct 13, 2021
Updated on: Jul 15, 2024