Online Machine Learning with River Python
Online machine learning is a type of machine learning in which data becomes available in sequential order. At each step, the model is updated with the new sample, gradually becoming more accurate and robust. The data is in motion and keeps changing over time. Online learning is best suited to streaming data, where we want to process one sample at a time.
This differs from traditional or offline machine learning, where the entire dataset is available locally and we generate a model only after learning from all of it.
We show an overview of the difference between online and offline machine learning below:
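As a rough illustration, here is a minimal sketch that contrasts the two approaches on a tiny made-up dataset (the feature names and values are purely hypothetical, and scikit-learn is assumed to be available, as it is on Google Colab). The offline model is fitted once on all the data, while the online model is updated one observation at a time.
# Offline: fit once on the entire dataset (scikit-learn style).
from sklearn.linear_model import LogisticRegression

X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]  # toy features
y = [True, False, True, False]                        # toy labels
offline_model = LogisticRegression().fit(X, y)        # learns from all samples at once

# Online: update the model one sample at a time (River style).
from river import linear_model

online_model = linear_model.LogisticRegression()
for features, label in zip(X, y):
    # In River, each sample is a dict mapping feature name to value.
    online_model.learn_one({"f1": features[0], "f2": features[1]}, label)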
In this tutorial, we will use River Python to build our model and simulate streaming data. This will be a text classification model that classifies a given input text as either software or hardware-related.
Table of contents
- Prerequisites
- Introduction
- River Python installation
- Checking methods and attributes
- Loading machine learning packages
- Getting the relevant methods and attributes
- Simulating streaming data
- Building the pipeline
- Pipeline stages initialization
- Visualizing the pipeline
- Get pipeline steps
- Building our model
- Making a prediction
- Prediction probability
- Model accuracy
- Conclusion
- References
Prerequisites
To follow this tutorial, you need the following:
- Python installed on your machine.
- A good understanding of Python.
- A good knowledge of machine learning models.
- Knowledge of how to use Google Colab or Jupyter Notebook.
NOTE: To follow along easily, use Google Colab. It is what we use in this tutorial.
Introduction
As mentioned earlier, we will use River Python to build our model. River Python is well suited for this task for the following reasons:
- It is incremental. The model can be updated after each observation, which makes it suitable for processing streaming data.
- It is adaptive. Since we are dealing with streaming data that keeps changing over time, we need a library that is robust and can work in changing environments.
- It is general-purpose. River Python supports classification, clustering, and regression, covering both supervised and unsupervised machine learning.
- It is efficient and easy to use. It handles streaming data efficiently, and its simple API makes it easy for beginners to follow.
In online machine learning, the model trains on real-time data, processing a single observation at a time. We then update the model as each new sample arrives, increasing its accuracy over time.
The input data is continuously used to improve the current model's knowledge. As we further train the model, it becomes more adaptive and robust.
River Python installation
Since we are using Google Colab, we install river using the following command:
!pip install river
Import River
To use River, we import it into our notebook.
import river
Checking methods and attributes
River Python has various methods and attributes for online machine learning. To list all of them, use the following command:
dir(river)
This is a list of River methods and attributes:
['__all__',
'__builtins__',
'__cached__',
'__doc__',
'__file__',
'__loader__',
'__name__',
'__package__',
'__path__',
'__spec__',
'__version__',
'anomaly',
'base',
'cluster',
'compat',
'compose',
'datasets',
'drift',
'dummy',
'ensemble',
'evaluate',
'expert',
'facto',
'feature_extraction',
'feature_selection',
'imblearn',
'linear_model',
'meta',
'metrics',
'multiclass',
'multioutput',
'naive_bayes',
'neighbors',
'neural_net',
'optim',
'preprocessing',
'proba',
'reco',
'stats',
'stream',
'synth',
'time_series',
'tree',
'utils']
Some of the methods that we will be using are as follows:
- naive_bayes - The algorithm we will use to build our text classification model.
- preprocessing - Used to process the dataset that trains our model.
- metrics - Used to calculate the accuracy score of our model.
- stream - Used to simulate our dataset as streaming data (a small sketch follows this list).
- anomaly - Used to detect errors and anomalies in our model.
- compose - Used to build a pipeline that automates machine learning workflows.
These methods will be very helpful in this tutorial.
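For example, the stream module can turn an in-memory dataset into a stream of one-sample-at-a-time observations. The snippet below is a minimal, hypothetical sketch using a tiny pandas data frame; later in this tutorial we simulate our stream with a plain Python list instead.
import pandas as pd
from river import stream

# A tiny, made-up dataset held in memory.
X = pd.DataFrame({"text": ["my code has errors", "the disk is broken"]})
y = pd.Series(["software", "hardware"])

# iter_pandas yields one (features, label) pair at a time, like a stream.
for xi, yi in stream.iter_pandas(X, y):
    print(xi, yi)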
Loading machine learning packages
To load our machine learning packages, use the following commands:
from river.naive_bayes import MultinomialNB
from river.feature_extraction import BagOfWords, TFIDF
We have imported the following packages:
MultinomialNB
This is a Naive Bayes method used for text classification. It contains the algorithm we will use to build our model.
For a detailed understanding of MultinomialNB, please click here.
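To get a feel for MultinomialNB on its own, here is a minimal sketch that trains it directly on hypothetical word-count dictionaries (the feature dicts below are made up for illustration; in our pipeline, BagOfWords will produce them automatically):
from river.naive_bayes import MultinomialNB

model = MultinomialNB()
# Each sample is a dict of word counts together with its label.
model.learn_one({"code": 2, "bug": 1}, "software")
model.learn_one({"disk": 1, "broken": 1}, "hardware")
print(model.predict_one({"code": 1}))  # expected to print 'software'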
BagOfWords
It is used to extract features from our dataset. Features are the independent variables used as input for our model.
It also converts our text inputs into vectors that are more machine-readable. BagOfWords works the same way as CountVectorizer in offline machine learning.
In offline machine learning, CountVectorizer is used to transform a given text into a vector based on the frequency of each word that occurs in the entire text. This is helpful when we have a text and wish to convert it into a vector for further analysis.
CountVectorizer is also used for very basic preprocessing, like removing punctuation marks and converting all words to lowercase.
Since this tutorial deals with online machine learning, we replace CountVectorizer with BagOfWords, which performs the same function.
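As a quick standalone illustration (the input sentence is made up), BagOfWords turns a piece of text into a dictionary-like mapping of word counts:
from river.feature_extraction import BagOfWords

bow = BagOfWords(lowercase=True)
# Roughly: {'the': 2, 'code': 2, 'has': 1, 'bugs': 1, 'and': 1, 'fails': 1}
print(bow.transform_one("The code has bugs and the code fails"))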
TFIDF
TFIDF stands for Term Frequency-Inverse Document Frequency. It measures how relevant a word is in a given document.
This is done by measuring how frequently words appear in a document. If a word appears frequently in one document but rarely in others, it is more relevant and has more classification power than words that appear in every document.
For a detailed understanding of TFIDF, please click here.
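As a minimal, hypothetical sketch, TFIDF first learns from a few made-up documents and then weighs the words of a new one (the exact numbers depend on the River version and on the documents seen so far):
from river.feature_extraction import TFIDF

tfidf = TFIDF(lowercase=True)
# Let the transformer see a couple of documents first.
for doc in ["the program has bugs", "the disk is broken", "the code works"]:
    tfidf.learn_one(doc)

# Words that appear in many documents (like "the") receive lower weights.
print(tfidf.transform_one("the program is broken"))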
Getting the relevant methods and attributes
To get an overview of the sub-packages and sub-modules that River provides, we use the following Python function:
import pandas as pd

def get_all_attributes(package):
    subpackages = []
    submodules = []
    for i in dir(package):
        # Skip Python's special (dunder) attributes.
        if str(i) not in ["__all__", "__builtins__", "__cached__", "__doc__", "__file__", "__loader__", "__name__", "__package__", "__path__", "__pdoc__", "__spec__", "__version__"]:
            subpackages.append(i)
            res = [j for j in dir(eval("river.{}".format(i)))]
            submodules.append(res)
    df = pd.DataFrame(submodules)
    df = df.T
    df.columns = subpackages
    res_df = df.dropna()
    return res_df
We name this function get_all_attributes to get all the attributes and methods needed to build our model.
The function loops through the package and removes the following: __all__, __builtins__, __cached__, __doc__, __file__, __loader__, __name__, __package__, __path__, __pdoc__, __spec__, and __version__.
We remove them since they are not used to build our model.
The remaining sub-packages are collected using the subpackages.append() method, and the contents of each one with submodules.append().
We then place these sub-packages and sub-modules into a data frame using pd.DataFrame(submodules). The data frame gives us a convenient overview of everything River offers for building our model.
After the function filters out these unnecessary attributes, the remaining methods and attributes are:
['anomaly',
'base',
'cluster',
'compat',
'compose',
'datasets',
'drift',
'dummy',
'ensemble',
'evaluate',
'expert',
'facto',
'feature_extraction',
'feature_selection',
'imblearn',
'linear_model',
'meta',
'metrics',
'multiclass',
'multioutput',
'naive_bayes',
'neighbors',
'neural_net',
'optim',
'preprocessing',
'proba',
'reco',
'stats',
'stream',
'synth',
'time_series',
'tree',
'utils']
To build this data frame of methods and attributes, call the function we defined above:
river_df = get_all_attributes(river)
To view the data frame, use this command:
river_df
The output is a data frame in which each column is a River sub-package and the rows list its sub-modules.
Simulating streaming data
To use River Python, we need streaming data. Streaming data arrives incrementally over time, one sample at a time. To simulate streaming data, we represent our dataset as a list of tuples, as shown below.
We will have two lists of data: A training list and a testing list.
Training list
data = [("my python program is running","software"),
("I tried to run this program, but it has bugs","software"),
("I need a new machine","hardware"),
("the flashdisk is broken","hardware"),
("We need to test our code","software"),
("programming concepts and testing","software"),
("Electrical device","hardware"),
("device drives","hardware"),
("The generator is broken","hardware"),
("im buidling a REST API","software"),
("design the best API so far","software"),
("they need more electrical wiring","hardware"),
("my code has errors","software"),
("i found some program test faulty","software"),
("i broke the car handle","hardware"),
("i tested the user interface code","software")]
This contains a list of texts labeled as either hardware or software related. They will be used to train the model one at a time.
Testing list
test_data = [('he writes programs daily','software'),
('my disk is broken','hardware'),
("program mantainance","software"),
('The drive is full','hardware')]
This test data will be used to test our model and measure the model performance.
Building the pipeline
A machine learning pipeline is used to automate the workflow of a machine learning model.
Machine learning pipelines are made up of sequential steps. These steps involve data extraction, preprocessing, model training, and deployment.
Let's import Pipeline.
from river.compose import Pipeline
Our pipeline will have two stages:
- BagOfWords: It's used for feature extraction and conversion of text inputs into vectors.
- MultinomialNB: Uses the Naive Bayes algorithm in building the model.
Let's initialize these two stages.
Pipeline stages initialization
To initialize our pipeline, use this code snippet.
pipe_nb = Pipeline(('vectorizer',BagOfWords(lowercase=True)),('nb',MultinomialNB()))
This initializes our two stages: BagOfWords as vectorizer and MultinomialNB as nb. We also convert our text to lowercase by setting lowercase=True.
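As a side note, River's compose module also lets us build the same kind of pipeline with the | operator, which names the steps automatically. This is only an alternative style; we stick with the explicit Pipeline above.
# Equivalent pipeline built with the | operator (step names are generated automatically).
pipe_alt = BagOfWords(lowercase=True) | MultinomialNB()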
Visualizing the pipeline
To visualize the initialized pipeline, use this command:
pipe_nb
This displays the pipeline with its two stages.
Get pipeline steps
To get the pipeline steps, use the following command. In this tutorial, we have two steps.
pipe_nb.steps
Output:
OrderedDict([('vectorizer',
BagOfWords (
on=None
strip_accents=True
lowercase=True
preprocessor=None
tokenizer=<built-in method findall of re.Pattern object at 0x7fa35529de00>
ngram_range=(1, 1)
)),
('nb',
MultinomialNB (
alpha=1.
))])
Building our model
Since we are dealing with streaming data, we have to learn from our data one sample at a time.
Our data is in the form of a list of tuples, so we can learn from it one sample at a time by iterating through it with a for loop. This ensures that the model trains on each instance before moving to the next one, as if we were dealing with real-time data that arrives in a stream.
Looping through our dataset
for text,label in data:
    pipe_nb = pipe_nb.learn_one(text,label)
When looping through our dataset, we use the learn_one() method to learn from one tuple at a time in the list. learn_one() starts by learning from the first text and label, which are "my python program is running" and "software" in our dataset.
It will store the knowledge learned and use it when the next data arrives. Since the model remembers the knowledge gained over time, it will be able to adapt to changes in the dataset.
Over time, the model will be more accurate than when we began the training since it will have accumulated knowledge.
Making a prediction
In online machine learning, the model does not wait until the end of training to make predictions. It can predict at this instance and continue training when the next data arrives.
pipe_nb.predict_one("I built an API")
We use predict_one() to make a prediction at this instance.
The output is shown:
'software'
This is true since our text is software related.
Prediction probability
To get the probability of the classification above, use this command:
pipe_nb.predict_proba_one("I built an API")
We use the predict_proba_one() method to get the prediction probabilities at this instance.
The output is as shown:
{'software': 0.732646964375691, 'hardware': 0.2673530356243093}
This shows that the probability of the text being classified as software is higher than that of hardware. Their probabilities are 0.7326 and 0.2673 respectively.
Online machine learning is a continuous process since we are using real-time data that is continuously being generated.
Let's make another prediction.
Another prediction
pipe_nb.predict_one("the hard drive in the computer is damaged")
The prediction output:
'software'
This gives a wrong prediction because it is still early in the learning process. But as we continue training using our dataset, the model will gain more knowledge which it will use when making future predictions.
The aim is for the learning model to adapt to new data without forgetting the existing knowledge.
At the beginning of the training phase, the model has a lower accuracy, but with time the accuracy increases. Let's calculate the model accuracy at this instance.
Model accuracy
We need to get the accuracy of our model at this instance. We use the river.metrics.Accuracy() metric to calculate it.
We loop through our test data, make a prediction for each sample, update the accuracy score with metric.update(), and then let the model learn from that sample.
metric = river.metrics.Accuracy()
for text,label in test_data:
    y_pred_before = pipe_nb.predict_one(text)
    metric = metric.update(label,y_pred_before)
    pipe_nb = pipe_nb.learn_one(text,label)
To get the accuracy score, use the following command:
metric
The output is as shown.
Accuracy: 75.00%
For a model trained on such a small dataset, this is a good accuracy. It shows that our model classified 75% of the test samples correctly.
As we continue training, our model will learn from the dataset and store the knowledge. It will then use this accumulated knowledge to increase the accuracy score when making predictions. This is the goal of any model.
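As a side note, River's evaluate module can run this predict-then-learn loop for us. Below is a minimal sketch, assuming the same pipe_nb pipeline and test_data list defined above; the score it reports depends on what the model has already learned at that point.
from river import evaluate, metrics

# progressive_val_score predicts on each sample first, then learns from it,
# and keeps a running value of the chosen metric.
score = evaluate.progressive_val_score(test_data, pipe_nb, metrics.Accuracy())
print(score)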
Conclusion
In this tutorial, we have learned about online machine learning. We started by stating the difference between online and offline machine learning, which gave us a good working knowledge of this kind of model.
We then explored River Python which is a good library that can handle streaming data. River Python has various methods and attributes that are important in building our machine learning model.
We then built a machine learning pipeline that automated our workflow, from data processing and feature extraction to model building. We built our model using the Naive Bayes algorithm.
We finally used our trained model to make predictions and to check the model's accuracy. The higher the accuracy, the better the model. Our model should be able to classify a given text as either hardware or software-related.
Happy coding!
References
- Code implementation of this tutorial
- Online machine learning
- Online vs offline machine learning differences
- River Python documentation
- Text Classification
Peer Review Contributions by: Willies Ogola