Snorkel Python for Labelling Datasets Programmatically
Snorkel is a Python library that is used for data labelling. It programmatically manages and builds training datasets without manual labelling. In machine learning, labels are the target or the output variable in a dataset. This is what the model is attempting to predict.
Instead of humans labelling large datasets manually, Snorkel assigns labels to the extensive training data automatically. This is done using a set of user rules, labelling functions, and other in-built techniques.
Snorkel requires users to come up with functions that contain explicit rules. We will use these rules to label the unlabeled data.
In this tutorial, we will work with an unlabeled dataset that contains a list of sentences. Some of the sentences are questions, while others are general statements.
This tutorial aims to label each sentence as either a question or not a question. If a sentence is a question, it is labelled 1, and a non-question sentence (general statement) is labelled -1. All this will be done programmatically using Snorkel.
Table of contents
- Prerequisites
- How to install Snorkel Python
- Unlabeled dataset
- Load the dataset
- Convert to a data frame
- Dataset splitting
- Define our labelling functions
- Rules
- Keyword lookup function
- Pattern lookup function
- A second pattern lookup function
- Combining the labelling functions
- Building the labelling model
- Adding labels to the unlabeled dataset
- Conclusion
- References
Prerequisites
To follow along easily, the reader should:
- Be familiar with Python programming.
- Know about machine learning modelling.
- Know how to use Google Colab notebooks.
- Have some knowledge of Pandas.
- Be familiar with Scikit-learn.
How to install Snorkel Python
We install Snorkel Python using the following command.
!pip install snorkel
After installing Snorkel, let's start working with our unlabeled dataset.
Unlabeled dataset
Our unlabeled dataset is in text file format.
A snippet of our dataset is shown below.
To download this text file, click here.
Load the dataset
We load this dataset using Pandas. Pandas is a Python package that is used for data manipulation and analysis.
It also allows us to import data from different file formats such as CSV files, text files, JSON files, and SQL databases.
Let's import Pandas.
import pandas as pd
Let's load our dataset. We pass header=None so that the first sentence is not treated as a column header, and convert the single column to a plain Python list so that we can shuffle it later.
data = pd.read_table('unlabeled-dataset.txt', header=None)[0].tolist()
To confirm that the dataset has loaded, run this command.
data
The output is shown below.
Let's shuffle our dataset. Shuffling randomizes the order of the sentences so that the upcoming train/test split is not affected by how the file happens to be ordered. This reduces bias.
To shuffle our dataset, we use a Python package called random.
Let's import the random package.
import random
Let's now shuffle our dataset using the random.shuffle() method.
random.shuffle(data)
To see the output after the dataset is shuffled, run this command.
data
The output below shows the sentences in their new, random order.
Convert to a data frame
A data frame is the representation of data in rows and columns. When data is represented in this form, it is easy for the model to understand and use.
Our dataset will have one column, named sentences, which contains all the unlabeled sentences.
Let's create this column.
df = pd.DataFrame({'sentences':data})
Let's now see our dataset with the sentences column.
df.head()
The output is as shown in the image below.
Let's now split our dataset.
Dataset splitting
We split our dataset into two sets: a train set and a test set. The train set is used during the training phase so that the model can learn from it.
The test set is used to evaluate the general performance of the model. It also checks if the model can make accurate predictions.
Let's import the required method to split our dataset.
from sklearn.model_selection import train_test_split
train_test_split will be used to split our dataset.
df_train,df_test = train_test_split(df,train_size=0.5)
In the code above, we have specified train_size=0.5. This implies that 50% of the dataset will be used for training, and the remaining 50% will be used for testing.
Let's check the total number of sentences in our train set.
print(df_train.shape)
The output is shown below.
(44, 1)
This shows we have a total of 44 sentences in our train set and 1 column.
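We can check the test set the same way; with a 50/50 split, it should also contain 44 sentences.
print(df_test.shape)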
Define our labelling functions
Labelling functions define the rules that the labelling model uses. These rules are used to predict the label of unlabeled data.
Let's import the method that will allow us to come up with labelling functions.
from snorkel.labeling import labeling_function
To build an accurate labelling model, we should write several labelling functions; in this tutorial, we will create three. The imported labeling_function decorator lets us turn ordinary Python functions into labelling functions.
To come up with the best labelling functions, we need to know how we want to label the dataset.
In this case, we want to label our dataset with two labels as follows. A sentence can be labelled as either a question or a general statement. If the sentence is a question, it is labelled 1, and a general statement is labelled -1.
The following are the rules for a sentence to qualify as a question.
Rules
- A sentence should start with one of the following words: why, what, when, who, where, or how.
- A sentence should end with a question mark, ?.
We then need to assign constants for our labels.
Constants for our labels
We will use these constants to label the sentences. QUESTION is used to label sentences that qualify as questions, and ABSTAIN is used to label all other sentences.
QUESTION = 1
ABSTAIN = -1
Using the rules above, we can now come up with the labelling functions.
Keyword lookup function
This function checks whether a sentence contains any of the following question keywords: why, what, when, who, where, and how. The sentence is lower-cased first so that the match is case-insensitive.
If this rule is met, the sentence is labelled QUESTION (1). If it's not met, the sentence is labelled ABSTAIN (-1).
@labeling_function()
def lf_keyword_lookup(x):
    keywords = ["why", "what", "when", "who", "where", "how"]
    return QUESTION if any(word in x.sentences.lower() for word in keywords) else ABSTAIN
We have defined our function lf_keyword_lookup(x) with the help of the @labeling_function() decorator. Inside the function, we state our rule: we loop through the keywords and check whether any of them appears in the sentences field of the given row.
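As a quick sanity check, we can call the labelling function directly on a couple of made-up examples. A Snorkel labelling function can be called on any object that exposes a sentences attribute, so we build one with SimpleNamespace; the two sentences below are invented purely for illustration.
from types import SimpleNamespace

# Made-up example sentences, just for illustration.
print(lf_keyword_lookup(SimpleNamespace(sentences="Where is the station?")))  # 1 (QUESTION)
print(lf_keyword_lookup(SimpleNamespace(sentences="The sky is blue.")))       # -1 (ABSTAIN)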
Let's go to the second labeling function.
Pattern lookup function
This function checks whether a sentence contains the word what followed, later in the sentence, by a question mark, ?. To find this pattern, we use a Python regular expression, which scans through each sentence looking for a match.
For further reading on Python regular expression, read this documentation.
import re
@labeling_function()
def lf_regex_contains_what(x):
    return QUESTION if re.search(r"what.*\?", x.sentences, flags=re.I) else ABSTAIN
First, we have import re, which loads Python's regular expression module. We then use the re.search() method to look for the word what followed by a question mark. Since ? is a special character in regular expressions, we escape it as \? so that it matches a literal question mark. The flags=re.I argument makes the search case-insensitive.
If the condition is met, the sentence is labelled QUESTION (1), and if not, it is labelled ABSTAIN (-1).
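To see how the escaped question mark behaves, here is a small standalone check on two made-up strings:
import re

# "?" is a regex metacharacter, so \? matches a literal question mark.
print(bool(re.search(r"what.*\?", "What time is it?", flags=re.I)))      # True
print(bool(re.search(r"what.*\?", "I know what you did.", flags=re.I)))  # False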
Let's look at the last labelling function, which also uses a pattern lookup.
A second pattern lookup function
This function also uses a Python regular expression. However, it only searches for a question mark, ?, anywhere in the sentence.
import re

@labeling_function()
def lf_regex_contains_question_mark(x):
    return QUESTION if re.search(r"\?", x.sentences) else ABSTAIN
Note that we again escape the question mark as \?; an unescaped ? is a regex quantifier and would not match the literal character.
We now need to apply all these labelling functions to our training set.
Combining the labelling functions
First, we combine all the labelling functions and save them in a single list, lfs. Combining several labelling functions lets the label model weigh their individual outputs and produce better labels than any single function could.
lfs = [lf_keyword_lookup,lf_regex_contains_what,lf_regex_contains_question_mark]
We then import PandasLFApplier. This is a Snorkel class that applies one or more labelling functions to a Pandas DataFrame.
from snorkel.labeling import PandasLFApplier
Let's pass our list of labelling functions, lfs, into the PandasLFApplier.
applier = PandasLFApplier(lfs=lfs)
We then apply the combined labelling functions to the training dataset, df_train. The output, saved in L_train, is a label matrix with one row per training sentence and one column per labelling function; each entry is the label (1 or -1) that the corresponding function assigned to that sentence.
L_train = applier.apply(df=df_train)
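We can inspect the label matrix and summarise how often each labelling function fires. LFAnalysis is a Snorkel helper for exactly this; the shape below follows from our 44 training sentences and three labelling functions.
print(L_train.shape)  # (44, 3): 44 sentences, 3 labelling functions

from snorkel.labeling import LFAnalysis

# Coverage, overlaps, and conflicts for each labelling function.
LFAnalysis(L=L_train, lfs=lfs).lf_summary()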
Now that we have applied all three labelling functions to df_train, it's time to build our labelling model.
Building the labelling model
We need to import the method that we will use to build our model.
from snorkel.labeling.model import LabelModel
The LabelModel class will be used to build our model.
Let's now build the model.
label_model = LabelModel(cardinality=2,verbose=True)
label_model.fit(L_train=L_train,n_epochs=500,log_freq=100,seed=123)
The LabelModel uses the fit() method to fit the model to L_train, the label matrix produced by applying our labelling functions to the training dataset.
During this phase, the model gains knowledge through training. It eventually uses the knowledge gained to make predictions.
We also use the following parameters.
- n_epochs=500 - The number of training iterations the model runs over L_train.
- cardinality=2 - The number of label classes the model supports. In our setup, a sentence ends up labelled 1 (question) or -1 (abstain, i.e. a general statement).
- verbose=True - Prints progress information during training.
- log_freq=100 - How often, in epochs, training progress is logged.
- seed=123 - A seed for the random number generator, which makes the training run reproducible.
After 500 epochs, our model is trained.
Let's now use this model to label the dataset.
Adding labels to the unlabeled dataset
We use this model to add labels to our dataset. Our two labels are QUESTION = 1 and ABSTAIN = -1.
We use the predict() method to make the predictions. It classifies each sentence as either a question or not.
df_train['Labels'] = label_model.predict(L=L_train,tie_break_policy="abstain")
After running the code above, the model should be able to classify the various sentences.
Let's see the prediction results.
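To display some of the labelled sentences, we can print the first rows of the training DataFrame (the exact sentences you see will depend on the shuffle):
df_train.head()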
The output shows that the sentence What's your favorite ice cream topping? was labelled 1, which represents QUESTION.
Another sentence, There is no Ctrl-Z in life., was labelled -1, which marks it as a general statement.
Using the two examples above, we can see that our model made the correct predictions. This shows that our model can successfully assign labels to the unlabeled dataset.
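As a final step, the same pipeline can label the held-out test sentences. This is a minimal sketch that reuses the applier and label model we already built:
# Apply the labelling functions to the test set, then predict labels.
L_test = applier.apply(df=df_test)
df_test['Labels'] = label_model.predict(L=L_test, tie_break_policy="abstain")
df_test.head()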
Conclusion
In this tutorial, we have learned how to label a dataset programmatically using Snorkel. First, we started with data pre-processing. This involved shuffling the dataset and converting it into a data frame with a sentences column.
From there, we split our dataset into two sets so that one set could be used for training and the other for testing. We then created labelling functions that contain the essential rules to be used by the model.
After successfully applying all the labelling functions to our dataset, we started to build our model. In the end, we had a model that could classify various sentences into questions or general statements.
Using this tutorial, a reader should be able to label a dataset programmatically using Snorkel.
To get the notebook for this tutorial, click here.
References
- Google Colab link
- Snorkel documentation
- Scikit-learn documentation
- Introduction to labeling functions
- Python regular expressions
Peer Review Contributions by: Lalithnarayan C