Natural Language Processing using TensorFlow and the BERT Model
Natural language processing (NLP) is a subfield of artificial intelligence that enables computers to understand text and spoken words. <!--more--> NLP models can perform essential tasks such as speech recognition, sentiment analysis, intent classification, machine translation, spam filtering, and powering chatbot systems.
In this tutorial, we will build a sentiment analysis model using BERT and TensorFlow.
BERT is a pre-trained model for Natural Language Processing. We will use TensorFlow to create the input, intermediate, and output layers.
Table of contents
- Prerequisites
- Getting started with BERT
- What is Hugging Face Transformers?
- Installing Hugging Face Transformers
- Working with sentiment analysis dataset
- Preprocessing the sentiment analysis dataset
- Creating the function
- Specify the number of sentiment labels
- Performing one-hot encoding
- Creating a map function
- Shuffling the training dataset
- Defining the training dataset
- Model creation
- Adding the layers
- Compiling the initialized neural network
- Fitting the neural network
- Using the model to classify input reviews
- Input a review and print the classification results
- Conclusion
- Further reading
Prerequisites
To follow along, the reader should have some knowledge of:
- Python programming.
- Natural language processing models.
- Text preprocessing.
- How to build a simple model with TensorFlow.
Getting started with BERT
BERT (Bidirectional Encoder Representations from Transformers) is a pre-trained language model available through the Hugging Face Transformers library.
BERT can perform multiple tasks such as question answering systems, text classification, and sentiment analysis.
In this tutorial, we will use BERT to perform sentiment analysis. BERT is pre-trained on a large corpus of raw English text, and we will fine-tune it on labeled sentiment data (a supervised task).
Getting started with BERT is easy: we install the Hugging Face Transformers library, download the pre-trained BERT model from Hugging Face, and finally fine-tune the model to perform sentiment analysis.
What is Hugging Face Transformers?
Hugging Face Transformers provides APIs to download and fine-tune pre-trained models. There are various pre-trained models for NLP tasks, image classification, video classification, and audio classification.
It supports different pre-trained models such as BERT. To see all the supported pre-trained models, click here.
Hugging Face Transformers can easily be integrated with machine learning libraries such as Pytorch and TensorFlow. We will start by installing the Hugging Face Transformers library.
Installing Hugging Face Transformers
To install the Hugging Face Transformers, we use the following code:
!pip install transformers
We also import the other libraries needed for this tutorial:
import numpy as np
from tqdm.auto import tqdm
import tensorflow as tf
- numpy - We will use NumPy to convert the dataset into arrays.
- tqdm - We will use this library to display a progress bar while preprocessing the dataset.
- tensorflow - We will use TensorFlow to train the model. We will import Keras from TensorFlow to add all the layers the model requires.
After the installation process completes, we can now work with the dataset for sentiment analysis.
Working with sentiment analysis dataset
We will use a movie reviews dataset with different sentiment labels to train the sentiment analysis model.
You can download the sentiment analysis dataset from here. The movie review dataset has 5 sentiment labels as follows:
- 0: It represents a negative sentiment/review.
- 1: It represents a somewhat negative sentiment/review.
- 2: It represents a neutral sentiment/review.
- 3: It represents a somewhat positive sentiment/review.
- 4: It represents a positive sentiment/review.
We will read the dataset using Pandas:
import pandas as pd
To read the sentiment analysis dataset, use the code below:
df = pd.read_csv('/content/train.tsv', sep='\t')
The dataset has tab-separated values (TSV), hence the sep='\t' argument. Let's display the dataset in the Google Colab notebook.
df.head()
The image below shows the dataset output:
The dataset has multiple columns, but the model only requires the Phrase and Sentiment columns. The Phrase column contains the actual movie review, and the Sentiment column contains the sentiment labels listed above.
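To focus on these two columns, a quick check such as the one below (a sketch using the df loaded above) displays the phrases alongside their labels and the label distribution:
# Display the two columns the model needs and how the labels are distributed
print(df[['Phrase', 'Sentiment']].head())
print(df['Sentiment'].value_counts())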
Before we use the sentiment analysis data in the pre-trained BERT model, we need to process it into an acceptable format for the model.
Preprocessing the sentiment analysis dataset
A BERT model does not understand the raw text in the Phrase column. We first split the text into smaller words or sub-words known as tokens. The tokens are then converted into word embeddings: numeric word vectors that encode the meaning of each token in a format the model can easily understand.
We will use the BertTokenizer to implement this text preprocessing. It will prepare the text input for the BERT model. Let's import the BertTokenizer.
Import BertTokenizer
We import the library as follows:
from transformers import BertTokenizer
After importing the BertTokenizer, we initialize it as shown below:
tokenizer = BertTokenizer.from_pretrained('bert-base-cased')
The code above initializes the BertTokenizer. It also downloads the pre-trained bert-base-cased vocabulary that the tokenizer uses to perform the preprocessing.
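To see the tokenizer in action, a minimal sketch (the sample sentence is our own example) splits a short review into BERT's WordPiece tokens:
# Example only: split a sample sentence into WordPiece tokens
print(tokenizer.tokenize('The movie was a pleasant surprise'))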
Before we use the initialized BertTokenizer, we need to specify the size of the input IDs and attention mask it produces after tokenization. These parameters are required by the BertTokenizer.
The input IDs contain the numeric IDs of the tokens after tokenization (splitting the text). The attention mask ensures the model only focuses on the real tokens and not the synthesized tokens known as padding tokens.
The sentences in the Phrase column have varying lengths, so the BertTokenizer synthesizes padding tokens to make the lengths of the sentences uniform.
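The short sketch below (the sentence and the small max_length of 10 are only for illustration) shows how the padding appears as zeros in the attention mask:
# Example only: tokenize one short sentence with padding to length 10
sample = tokenizer('Great film', max_length=10, padding='max_length', truncation=True)
print(sample['input_ids'])        # token IDs, padded with zeros up to length 10
print(sample['attention_mask'])   # 1 for real tokens, 0 for padding tokens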
Let's initialize these parameters:
X_input_ids = np.zeros((len(df), 256))
X_attn_masks = np.zeros((len(df), 256))
The X_input_ids array will hold the input IDs. We generate them from df (our dataset), with each sentence padded or truncated to a length of 256. The X_attn_masks array will hold the attention masks. We also generate these from df, and they have the same length.
Let's create a helper function to preprocess the dataset. It will take in the df, the input IDs, the attention masks, and the initialized tokenizer.
Creating the function
We create the function as follows:
def preprocessing_dataset(df, ids, masks, tokenizer):
    for i, text in tqdm(enumerate(df['Phrase'])):
        tokenized_text = tokenizer.encode_plus(
            text,
            max_length=256,
            truncation=True,
            padding='max_length',
            add_special_tokens=True,
            return_tensors='tf'
        )
        ids[i, :] = tokenized_text.input_ids
        masks[i, :] = tokenized_text.attention_mask
    return ids, masks
The function is called preprocessing_dataset. It outputs the dataset in the required format. It takes in the df, the input IDs as ids, the attention masks as masks, and the initialized tokenizer.
The for loop iterates through the Phrase column and tokenizes each review using the tokenizer.encode_plus method.
The function also has the following arguments:
- max_length: It specifies the maximum length of each sentence in the Phrase column. The specified value is 256.
- truncation=True: The sentences in the Phrase column have varying lengths, so longer sentences are truncated to the maximum length (256).
- padding='max_length': It pads shorter sentences with extra tokens so that every sentence has a fixed, uniform length (256).
- add_special_tokens=True: It adds the special tokens that BERT expects, such as the [CLS] and [SEP] markers.
- return_tensors='tf': It ensures that the function outputs the preprocessed text as TensorFlow tensors.
The function finally outputs the ids (input IDs) and masks (attention masks). These output values become the inputs for the BERT model.
We also need to call the function so that it populates all the input IDs and attention masks.
We will use the following code:
X_input_ids, X_attn_masks = preprocessing_dataset(df, X_input_ids, X_attn_masks, tokenizer)
The next step is to specify the number of sentiment labels.
Specify the number of sentiment labels
We create a matrix of zeros that will hold the one-hot encoded sentiment labels:
labels = np.zeros((len(df), 5))
To confirm the shape of the label matrix, run the following code:
labels.shape
The code produces the following output:
(156059, 5)
The output shows that we have 156,059 samples and 5 sentiment labels. The next step is to perform one-hot encoding.
Performing one-hot encoding
One-hot encoding will convert the five sentiment classes in the dataset into a numeric representation that the model understands. We will perform one-hot encoding using the following code:
labels[np.arange(len(df)), df['Sentiment'].values] = 1
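To see what this indexing trick does, here is a tiny standalone example (the toy labels 2, 0, and 4 are our own):
# Example only: one-hot encode three toy labels into a 3 x 5 matrix
toy_labels = np.zeros((3, 5))
toy_labels[np.arange(3), np.array([2, 0, 4])] = 1
print(toy_labels)   # row i has a 1 in the column given by the i-th label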
Create batches of data
We need to create batches of the dataset for easy loading during training. This also speeds up the training process. We will use the TensorFlow tf.data.Dataset utility to build the dataset from our arrays.
dataset = tf.data.Dataset.from_tensor_slices((X_input_ids, X_attn_masks, labels))
To see the shape of each data sample, use this code:
dataset.take(1)
The code will display the shape of one data sample, as demonstrated below:
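If you prefer to print the shapes explicitly instead of relying on the notebook's summary, a minimal sketch using the dataset created above is:
# Print the shapes of the tensors in one data sample
for ids, masks, label in dataset.take(1):
    print(ids.shape, masks.shape, label.shape)   # expected: (256,), (256,), (5,)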
From the above output, each data sample has 256 tokens. It also has five sentiment labels. The next step is to create a map function.
Creating a map function
The map function defines how each dataset element is structured when it is fed to the model. We want the model to take the input IDs and attention mask as inputs and return one of the five sentiment labels as the prediction target.
def SentimentDatasetMapFunction(input_ids, attn_masks, labels):
    return {
        'input_ids': input_ids,
        'attention_mask': attn_masks
    }, labels
The code above will initialize the map function. Let's now call the map function using the code below:
dataset = dataset.map(SentimentDatasetMapFunction)
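A quick way to confirm the new structure (a sketch using the mapped dataset above) is to take one element and print the shapes of its parts:
# Each element is now a (features dictionary, label) pair
for features, label in dataset.take(1):
    print(features['input_ids'].shape, features['attention_mask'].shape, label.shape)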
The next step is to shuffle the training dataset and provide the batch size.
Shuffling the training dataset
We shuffle the dataset randomly so that the model learns from the data rather than memorizing the order of the samples. This helps prevent model bias and gives more accurate sentiment predictions.
We also need to specify the batch size, which determines the number of training samples the model processes in a single training step.
dataset = dataset.shuffle(10000).batch(16, drop_remainder=True)
The dataset.shuffle method shuffles the data using a buffer of 10000 samples. The model will use 16 data samples in each batch. Setting drop_remainder=True drops the final batch if it contains fewer than 16 samples.
The next step is to define how much data the model will use for training. We will specify the ratio that will split the dataset.
Defining the training dataset
We define the training dataset as follows:
p = 0.8
train_size = int((len(df)//16)*p)
The code will define the training dataset to be 80%. The remaining dataset (20%) will be the validation set. Let's split the dataset using this ratio.
training_dataset = dataset.take(train_size)
validation_dataset = dataset.skip(train_size)
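Optionally, we can check how many 16-sample batches ended up in each split. The sketch below uses tf.data's cardinality method; the exact counts depend on the dataset size and batch size:
# Number of batches in the training and validation splits
print(int(training_dataset.cardinality()))    # roughly 7800 batches for this dataset
print(int(validation_dataset.cardinality()))  # roughly 1950 batches for this dataset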
The training dataset is 80% of the data and the validation dataset is 20%. This marks the end of the dataset preparation. Let's move on to model creation.
Model creation
We will use the pre-trained BERT model to create the sentiment analysis model. Let's import the pre-trained BERT model as follows:
from transformers import TFBertModel
After importing, let's initialize the model as follows:
model = TFBertModel.from_pretrained('bert-base-cased')
The code above initializes the TFBertModel. It also downloads the pre-trained bert-base-cased weights that we will fine-tune for sentiment analysis. The next step is to add the input, intermediate, and output layers to the TFBertModel.
Adding the layers
We will use Keras to add all the input, intermediate/hidden, and output layers the model requires. Let's first add the input layers.
Adding input layers
The model will have two input layers. The first layer will handle the input IDs, and the second layer will handle the attention mask. We create the input IDs layer as follows:
input_ids = tf.keras.layers.Input(shape=(256,), name='input_ids', dtype='int32')
The layer will be named input_ids, and it accepts sequences of 256 values because this is the maximum length of the input IDs. We create the attention mask layer as follows:
attn_masks = tf.keras.layers.Input(shape=(256,), name='attention_mask', dtype='int32')
The layer will be named attention_mask, and it also accepts sequences of 256 values, the maximum length of the attention mask. We combine these layers and feed them into the BERT model as follows:
bert_embds = model.bert(input_ids, attention_mask=attn_masks)[1]  # [1] selects the pooled output for the whole sentence
Adding intermediate layers
Intermediate layers are the hidden layers of our neural network. These layers will further fine-tune the BERT model and enhance its performance.
intermediate_layer = tf.keras.layers.Dense(512, activation='relu', name='intermediate_layer')(bert_embds)
We have created a Dense layer as the intermediate layer. It has 512 neurons and is named intermediate_layer. It uses relu as the activation function because the output of this layer ranges between 0 and infinity. It takes the previous bert_embds as input since we are building the model sequentially (layer by layer).
Adding the output layer
We add the output layer as follows:
output_layer = tf.keras.layers.Dense(5, activation='softmax', name='output_layer')(intermediate_layer)
The output layer has five neurons since we have five sentiment labels, and it takes the intermediate_layer as input. It uses softmax as the activation function because we have more than two sentiment classes. This completes the model architecture. We now combine all these layers and initialize the complete neural network.
sentiment_model = tf.keras.Model(inputs=[input_ids, attn_masks], outputs=output_layer)
After initializing the model, we can print the sentiment model summary as follows:
sentiment_model.summary()
It produces the following summary:
It also shows all the input, intermediate, and output layers. The output also shows the following:
- Total params: 108,706,565 - These are all the parameters in the initialized neural network.
- Trainable params: 108,706,565 - These are the parameters that the neural network will update during training.
- Non-trainable params: 0 - These are the frozen parameters that will not be updated during training; in our case there are none.
In the next step, we will define the accuracy metrics, the loss function, and the optimizer for the model.
We define the optimizer as follows:
optim = tf.keras.optimizers.Adam(learning_rate=1e-5, decay=1e-6)
We define the optimizer as Adam from TensorFlow's Keras optimizers. It improves the performance of the initialized neural network and reduces the errors the model makes during training.
We also set the learning_rate, which defines the speed at which the neural network learns. The decay gradually reduces the learning rate as training progresses.
We define the loss function as follows:
loss_func = tf.keras.losses.CategoricalCrossentropy()
We use CategoricalCrossentropy as the loss function because we have five different sentiment categories/classes. It keeps track of the errors in the neural network during training.
We define the accuracy metrics as follows:
acc = tf.keras.metrics.CategoricalAccuracy('accuracy')
We will use CategoricalAccuracy to check the neural network's performance and calculate the accuracy score. We now compile the neural network using these defined parameters.
Compiling the initialized neural network
We use the following code:
sentiment_model.compile(optimizer=optim, loss=loss_func, metrics=[acc])
After compiling the neural network, let's now fit it to the training_dataset and the validation_dataset.
Fitting the neural network
The training_dataset trains the neural network to perform sentiment analysis. The validation_dataset evaluates the model after each epoch so we can monitor how well the trained parameters generalize.
We will output a final model with enhanced performance that can make accurate classifications.
model_training = sentiment_model.fit(
    training_dataset,
    validation_data=validation_dataset,
    epochs=2
)
The sentiment_model.fit method trains the neural network. We pass in the training_dataset and the validation_dataset. The neural network runs for two epochs and produces the following output:
From this output, the final model accuracy score after the 2 epochs is 0.673 (67.3%). You can increase the number of epochs to improve the accuracy score before using the model in production, but the training process will take more time (hours or even days).
For demonstration purposes, this is still a good accuracy score. Let's use the model to classify input reviews.
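You may also want to save the trained weights so that the model can be reloaded later without retraining. A minimal sketch (the checkpoint path is our own choice):
# Save the fine-tuned weights to a checkpoint path of our choosing
sentiment_model.save_weights('sentiment_model_weights')
# Later, rebuild the same architecture and restore the weights:
# sentiment_model.load_weights('sentiment_model_weights')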
Using the model to classify input reviews
We will use the model to classify the input reviews into one of the five sentiment labels. Before we use the trained model, we have to process the input reviews to have the required format.
We will use the same libraries and functions to process the input reviews (we have explained and implemented text preprocessing in the previous sections).
We will follow the same steps:
- Initializing the BertTokenizer. We use the following code:
tokenizer = BertTokenizer.from_pretrained('bert-base-cased')
- Creating the function that preprocesses the input review:
def prepare_data(input_text, tokenizer):
    token = tokenizer.encode_plus(
        input_text,
        max_length=256,
        truncation=True,
        padding='max_length',
        add_special_tokens=True,
        return_tensors='tf'
    )
    return {
        'input_ids': tf.cast(token.input_ids, tf.float64),
        'attention_mask': tf.cast(token.attention_mask, tf.float64)
    }
To make a prediction, we create a function that classifies the processed review into one of the sentiment labels:
def make_prediction(model, processed_data, classes=['Negative', 'A bit negative', 'Neutral', 'A bit positive', 'Positive']):
    probs = model.predict(processed_data)[0]
    return classes[np.argmax(probs)]
The function will make a prediction and classify an input review into the different sentiment labels. Let's now input a review and print the classification results.
Input a review and print the classification results
We use the following code:
input_text = input('Input a review here:')
processed_data = prepare_data(input_text, tokenizer)
result = make_prediction(sentiment_model, processed_data=processed_data)
print(f"Classification results: {result}")
When you run the code, a text input control will appear in Google Colab. You will then be prompted to input a review, as shown below:
You can then type this input review: 'This is the best movie I have ever watched on NetFlix'. After typing the review text, press Enter. The model will then print the following classification results.
Classification results: Positive
The model has classified the input review as Positive. It has made the correct prediction/classification, which shows that the model was well trained and understands sentiment analysis.
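If you want to classify a review without the interactive prompt, you can also pass the text directly; a small sketch (the review text is our own example):
# Example only: classify a hard-coded review instead of prompting for input
review = 'The plot was predictable and the acting was terrible'
processed = prepare_data(review, tokenizer)
print(make_prediction(sentiment_model, processed_data=processed))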
Conclusion
We have learned how to perform natural language processing using TensorFlow and the BERT model. We discussed how to download the pre-trained BERT model from Hugging Face Transformers and how to fine-tune it.
We worked with the sentiment analysis dataset and processed the dataset to have the required format.
Finally, we initialized the neural network using TensorFlow's Keras layers and trained it to perform sentiment analysis. The final model was further fine-tuned and can accurately classify input reviews.
You can download the Python source code here.
Further reading
- Hugging Face Transformers documentation
- Data set preprocessing
- Transformers tokenizer
- BERT based cased
- Hugging Face for NLP
- BERT Tokenizer
- Word Embeddings
- NLP transformers
- TensorFlow documentation
Peer Review Contributions by: Wilkister Mumbi