Multi-class Text Classification using H2O and Scikit-learn
Text classification is an essential task in natural language processing that categorizes texts into predefined classes. A text classification model is trained on a labeled text dataset. The model learns from the training data and then makes predictions on new text. <!--more--> Text classification models perform tasks such as intent detection, topic labeling, sentiment analysis and spam detection.
Multi-class text classification is a text classification task with more than two classes/categories. Each data sample can be classified into one of the classes. However, a data sample cannot belong to more than one class simultaneously.
An example is a model that classifies news headlines into categories such as business, sports, tech, entertainment, and politics.
This tutorial will build a customer-complaints text classifier with five classes.
We will use Scikit-learn for text preprocessing and vectorization, and H2O to automate the model building process using the H2O AutoML algorithm.
Table of contents
- Prerequisites
- H2O library
- Benefits of H2O
- H2O dependencies
- Initializing H2O
- Customer complaints dataset
- Creating a dictionary object
- Dataset splitting
- Text Preprocessing for natural language processing
- Text vectorization
- Converting the train and test sets into an array
- Creating H2O Data Frame
- Adding the target column
- Using H2O AutoML to run multiple models
- Calling the train function
- Performance of the models
- Using the best model
- Conclusion
- References
Prerequisites
To follow along with this article, the reader should:
- Know how to implement Scikit-learn algorithms.
- Understand text preprocessing techniques.
- Know how to build a natural language processing model.
We will use a Google Colab notebook to build the model. Google Colab provides fast CPUs and GPUs. Ensure you connect to the GPU runtime to speed up model building.
Connecting to GPU in Google Colab
To use Google Colab’s GPU, follow the steps below:
- Click the Runtime option.
- Click Change runtime type.
- Then select the GPU option and save.
H2O library
H2O is an open-source machine learning library that provides supervised and unsupervised machine learning algorithms. It is robust and easily scalable.
H2O automates the model building process using H2O AutoML. It selects the best algorithm and performs the model evaluation.
Benefits of H2O
- Saves developers' time. The H2O AutoML algorithm automates most machine learning tasks, saving developers time and increasing productivity.
- H2O builds simple and interactive interfaces during the automation process.
- Simplifies the machine learning process by automating complex machine learning tasks.
- Reduces human error by automating tasks. H2O also detects underlying model errors, so the final model makes more accurate predictions.
- Automatic training and tuning of multiple models. H2O runs multiple models during training, selects the best one, performs the model evaluation, and produces an optimized model that makes accurate predictions.
- It produces a model that is easy to deploy to production.
H2O dependencies
H2O runs on Java and requires a 64-bit Java runtime, so we have to install Java to proceed.
!apt-get install default-jre
!java -version
After installing the dependencies, we can install H2O.
Installing H2O
Use the command below to install H2O:
!pip install h2o
To import H2O, use this code:
import h2o
from h2o.automl import H2OAutoML
We use H2OAutoML to run multiple machine learning algorithms during training and select the best one.
Initializing H2O
Use this code snippet to initialize H2O:
h2o.init()
The snippet above starts a local H2O cluster. The cluster provides the memory and CPUs that we will use for text classification.
The image below shows the running H2O cluster.
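If you need more control over the resources the cluster uses, h2o.init() also accepts resource arguments. A minimal sketch, assuming you want to cap the cluster memory at 4 GB and use all available CPU cores (the default settings work fine for this tutorial):
h2o.init(max_mem_size="4G", nthreads=-1)  # optional: limit memory, use all cores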
Customer complaints dataset
We will train the classification model on the customer complaints dataset. When a new customer complaint arrives, the model will classify it into one of the classes.
You can download the customer complaints dataset here.
We will use Pandas to read the dataset.
import pandas as pd
To read the dataset, use this code snippet below:
df=pd.read_csv('/content/consumer_compliants.csv')
To see the loaded dataset, use this command:
df
The dataset output:
From the image above, the dataset has 18 columns. We are interested in the Product, Company and Consumer complaint narrative columns.
The Company column shows the company each complaint was filed against.
The Consumer complaint narrative column contains the actual customer complaints.
The Product column contains the complaint classes.
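Before going further, it is worth checking whether the Consumer complaint narrative column contains empty entries, since complaints with no narrative text cannot be used for training. A minimal check (the exact count depends on the dataset version you downloaded):
df['Consumer complaint narrative'].isna().sum()  # number of complaints with no narrative text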
To see the complaint classes, run this code snippet:
df['Product'].value_counts()
The complaints classes output:
From the image above, we have five complaint classes. The model will classify a customer complaint into one of these classes.
Company column
To check the number of complaints received by each company, run the code snippet below:
df['Company'].value_counts()
Renaming the Consumer complaint narrative column
We will rename the column to complaints. The new name is shorter and easier to work with during training. To rename the column, use this code:
complaints_df=df[['Consumer complaint narrative','Product','Company']].rename(columns={'Consumer complaint narrative':'complaints'})
To check the dataset with the renamed column, use this code:
complaints_df
The dataset output:
Creating a dictionary object
The dictionary object will encode the complaint classes as integer values between 0 and 5. These integers represent the complaint classes, so we save the mapping in a target variable.
target={'Debt collection':0, 'Credit card or prepaid card':1, 'Mortgage':2, 'Checking or savings account':3, 'Student loan':4, 'Vehicle loan or lease':5}
Let us add the target variable to our dataset.
complaints_df['target']=complaints_df['Product'].map(target)
Use the code snippet below to check the new dataset with the added target variable:
complaints_df
The new dataset output:
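To confirm that the mapping worked as expected, you can compare the encoded values against the original classes and count the samples per class. A quick check (the exact counts depend on the dataset):
complaints_df[['Product', 'target']].drop_duplicates()  # each class with its integer code
complaints_df['target'].value_counts()  # number of complaints per encoded class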
Dataset splitting
We will split the customer complaints dataset into two sets. One set for model training and the other for model testing.
from sklearn.model_selection import train_test_split
To split the dataset, use this code:
X_train, X_test = train_test_split(complaints_df, test_size=0.2, random_state=111)
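You can confirm the split by checking the size of each set. The exact numbers depend on the dataset, but the test set should hold roughly 20% of the rows:
print(X_train.shape, X_test.shape)  # e.g. (n_train, 4) (n_test, 4)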
Text Preprocessing for natural language processing
There are many text preprocessing steps. In this tutorial, we will focus on the following:
- Stemming. Stemming reduces a word to its stem or root by removing affixes. For example, the words “connecting”, “connect”, “connection”, and “connects” are all reduced to the root form “connect”.
- Removing stop words. Stop words are the most common words in a language, but they do not add much information to the text. Examples of stop words are conjunctions, pronouns, and articles. Removing them lets the model focus on words that add value during training.
- Lower casing. It converts the text dataset to lower case.
- Tokenization. Breaking sentences into smaller word units called tokens. This enables the model to understand the sentences by analyzing the word tokens.
- Removing unnecessary characters. The text dataset may contain characters that do not add value to the model. We remove them so that the model focuses on important information.
The Natural Language Toolkit (NLTK) will perform these steps. Install it using this command:
!pip install nltk
Import NLTK using this code snippet:
import nltk
We also download the NLTK package that performs tokenization.
nltk.download('punkt')
punkt is a pre-trained sentence tokenizer.
Stemming
Let us import the SnowballStemmer algorithm that we will use for stemming.
from nltk.stem.snowball import SnowballStemmer
To initialize the stemming algorithm, use this code:
stemmer = nltk.stem.SnowballStemmer('english')
Downloading stop words
We download the English stop words using this code:
nltk.download('stopwords')
stop_words = set(nltk.corpus.stopwords.words('english'))
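A quick sanity check of the stemmer and the stop word list (the values in the comments are what these NLTK components typically return):
print(stemmer.stem('connecting'))  # connect
print('the' in stop_words)         # True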
Let us create a function that performs all the text preprocessing steps, using Python's regular expression (re) module.
import re
def preprocessing(text):
    # Keep words longer than 3 characters that are not just 'X' masks or digits
    tokens = [word for word in nltk.word_tokenize(text)
              if (len(word) > 3 and len(word.strip('Xx/')) > 2
                  and len(re.sub(r'\d+', '', word.strip('Xx/'))) > 3)]
    tokens = map(str.lower, tokens)  # convert every token to lower case
    # Drop stop words and stem the remaining tokens
    stems = [stemmer.stem(item) for item in tokens if item not in stop_words]
    return stems
The preprocessing function applies the following text preprocessing methods: nltk.word_tokenize tokenizes the text, word.strip removes the unnecessary characters, str.lower transforms the text to lower case, and stemmer.stem performs stemming. The function returns the stemmed words.
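We can test the function on a sample complaint before wiring it into the vectorizer. This is only an illustration; the exact tokens depend on the input sentence, but short words, stop words, and digits are dropped while the remaining words are stemmed:
print(preprocessing("I am disputing the incorrect charges on my credit card account"))
# roughly: ['disput', 'incorrect', 'charg', 'credit', 'card', 'account']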
Text vectorization
Text vectorization converts the stemmed words to numerical values called word vectors. We feed vectors to the model during training.
We will use the TfidfVectorizer method for text vectorization.
from sklearn.feature_extraction.text import TfidfVectorizer
Let's initialize the TfidfVectorizer function:
vectorizer_tf = TfidfVectorizer(tokenizer=preprocessing, stop_words=None, max_df=0.75, max_features=1000, lowercase=False, ngram_range=(1,2))
The function has the following parameters:
- tokenizer=preprocessing. The function that performs all the text preprocessing steps.
- stop_words=None. The vectorizer does not remove stop words itself, because the preprocessing function already removes them.
- max_df=0.75. The vectorizer ignores terms that appear in more than 75% of the complaints, since such very frequent terms add little information.
- max_features=1000. We only keep the 1000 most frequent terms because we have a large dataset, which may slow down the vectorization process.
- lowercase=False. The vectorizer does not lowercase the text itself, because the preprocessing function already converts it to lower case.
- ngram_range=(1,2). An n-gram is a contiguous sequence of words or tokens. The vectorizer generates both single words (unigrams) and pairs of consecutive words (bigrams).
We now apply the vectorizer to both the training and testing sets.
Applying TfidfVectorizer
train_vectors = vectorizer_tf.fit_transform(X_train.complaints)
test_vectors = vectorizer_tf.transform(X_test.complaints)
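You can verify the vectorization by checking the shapes of the resulting matrices. Each row is a complaint and each column is one of the 1000 selected features:
print(train_vectors.shape, test_vectors.shape)  # (n_train, 1000) (n_test, 1000)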
Converting the train and test sets into an array
We convert the train and test sets into arrays using the toarray method.
To convert the train set, use this code:
train_df=pd.DataFrame(train_vectors.toarray(), columns=vectorizer_tf.get_feature_names())
test_df=pd.DataFrame(test_vectors.toarray(), columns=vectorizer_tf.get_feature_names())
The code above converts the train set and test set into arrays using the toarray() method. It also labels the columns with the 1000 features selected from the original text data, using the get_feature_names method.
We also need to add the target column to these new data frames (train_df and test_df).
Adding the target column
To add the target column, use this code:
train_df=pd.concat([train_df,X_train['target'].reset_index(drop=True)], axis=1)
test_df=pd.concat([test_df,X_test['target'].reset_index(drop=True)], axis=1)
The concat function merges the data frames with the target column. The final data frames contain the 1000 selected features and the target column.
Creating H2O Data Frame
We will convert our Pandas data frames to H2O frames. H2O will use these frames during algorithm selection and training.
h2o_train_df = h2o.H2OFrame(train_df)
h2o_test_df = h2o.H2OFrame(test_df)
The next step is to convert the target column in the H2O frames to a categorical type.
Converting the target column to a factor
The asfactor() method converts the target column to a factor (categorical) type, so that H2O treats the task as a classification problem rather than regression.
h2o_train_df['target'] = h2o_train_df['target'].asfactor()
h2o_test_df['target'] = h2o_test_df['target'].asfactor()
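To confirm that H2O now treats the target as a categorical column, you can inspect its levels. Assuming the conversion succeeded, this lists the encoded class labels:
print(h2o_train_df['target'].levels())  # e.g. [['0', '1', '2', '3', '4', '5']]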
We are now ready to use H2O AutoML to run multiple models and select the best.
Using H2O AutoML to run multiple models
Let us initialize the H2O AutoML algorithm and its parameters.
aml = H2OAutoML(max_models = 5, seed = 10, exclude_algos = ["StackedEnsemble"], verbosity="info", nfolds=0, balance_classes=True, max_after_balance_size=0.3)
From the code above, we have initialized the H2OAutoML algorithm with the following parameters:
- max_models. The maximum number of models that H2OAutoML will build. Here, it will build five models.
- seed. We use it to ensure model reproducibility.
- exclude_algos. The algorithms that H2OAutoML should not use during model training. H2OAutoML will skip the StackedEnsemble algorithms.
- balance_classes. It handles the imbalanced dataset. We set it to True to balance the classes.
- nfolds=0. The number of folds for k-fold cross-validation. We set it to zero to disable cross-validation, since we provide a separate validation frame.
- max_after_balance_size=0.3. The maximum relative size of the training data after balancing the classes.
Specifying the y and x variables
The x variable contains all the input features used during training. The y variable contains the output/target column.
x=vectorizer_tf.get_feature_names()
y='target'
Calling the train function
The train function will train the models on the training set and evaluate them on the validation set.
aml.train(x = x, y = y, training_frame = h2o_train_df, validation_frame=h2o_test_df)
- x specifies the input features.
- y specifies the target column.
- training_frame contains the training dataset.
- validation_frame contains the testing dataset.
H2OAutoML will run five models and produce the following outputs that show the AutoML progress:
Output showing the best model details:
From the five models, the best model is XGBoost, with a model id of XGBoost_2_AutoML_3_20220322_140825.
Performance of the models
We can also check the performance of all five models using this code:
aml.leaderboard
The leaderboard shows the performance of the five models, listed from the best performing to the least performing. The image shows the listed models:
From the image above, the best model has a model_id of XGBoost_2_AutoML_3_20220322_140825. The least performing model has a model_id of DRF_1_AutoML_3_20220322_140825. Let's use the best model to make predictions.
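Before making predictions, you can also evaluate the leading model directly on the test frame. A minimal sketch using H2O's model_performance method, which reports multi-class metrics such as log loss and the confusion matrix:
perf = aml.leader.model_performance(h2o_test_df)
print(perf)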
Using the best model
We will use the best model from the leaderboard to make predictions on the test data frame (h2o_test_df). The model will classify the vectorized complaints in the test data frame.
pred=aml.leader.predict(h2o_test_df)
The test data frame already contains the complaints vectorized with vectorizer_tf, so we can pass it to the model directly. The aml.leader attribute selects the best model from the leaderboard, and predict classifies the vectorized complaints in the test data frame.
To print the prediction results, use this code:
print(pred)
The prediction output:
The best model has classified the vectorized complaints in the test data frame into the encoded classes (0 to 5). The predict column shows the class to which each complaint has been assigned.
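The predictions are the encoded integers, so you may want to map them back to the original class names. A minimal sketch, assuming the target dictionary created earlier; it converts the H2O prediction frame to a Pandas data frame and inverts the mapping:
inverse_target = {value: key for key, value in target.items()}  # integer code -> class name
pred_df = pred.as_data_frame()
pred_df['predicted_class'] = pred_df['predict'].astype(int).map(inverse_target)
print(pred_df[['predict', 'predicted_class']].head())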
Conclusion
We have learned how to build a multi-class text classification model. We developed the model using Scikit-learn and the H2O library. The tutorial also explained the benefits of H2O and how to install it.
We performed text preprocessing using Natural Language Toolkit. Then, using the clean dataset, we trained a model that classifies customer complaints. H2O runs multiple models, and we used the best model to make predictions.
To get the multi-class text classification model we have trained in this tutorial, click here.
References
- Scikit-learn documentation
- Text-pre-processing-techniques
- H2O AutoML documentation
- H2O GitHub
- Notebook for this tutorial
Peer Review Contributions by: Mercy Meave