
Predicting Covid-19 Cases Using NeuralProphet

Forecasting on datasets in which time is the independent variable can prove challenging with traditional machine learning methods. In 2017, Facebook (now Meta) released Prophet (previously known as FbProphet), a library that extracts non-linear patterns which may have daily, weekly, or other seasonality. One limitation of Prophet is that it does not capture complex trends well, so it tends to under-fit. The Data Science Core Team at the company thus came up with NeuralProphet, a library based on AR-Net (a simple auto-regressive neural network for time series) and PyTorch.

We will use the global cases dataset from the Johns Hopkins University Center for Systems Science and Engineering (JHU CSSE) Global Covid-19 Data Repository to perform forecasting. This is a very useful dataset as it has a lot of seasonality and complex time series patterns.

About Time-Series Forecasting
Preparing the environment
Importing, understanding and preparing the data
The COVID-19 Dataset
Importing and preparing the data
Training a NeuralProphet Forecaster
Monitoring the training process and evaluating the model
Forecasting using the trained model
Conclusion

Prerequisites

Basic knowledge of Python.
Machine learning basics.
Basic data manipulation skills with Pandas.
Python (with pip, NumPy and Pandas) installed on your computer or an online environment like Google Colab or Kaggle.

Goals

At the end of this tutorial, you will be familiar with:

Understanding Time-series forecasting.
Installation of NeuralProphet.
Formatting the data for forecasting.
Performing Forecasts on a single series using NeuralProphet.
Forecasting on multiple series.

About Time-series forecasting

A time series is a sequence of data points ordered in time. For example, weather data for a location over a period of time will be listed as dates with their corresponding temperature, rainfall amount, atmospheric pressure, and so on.

Traditional statistical methods such as linear regression, ridge regression, and lasso regression, as well as Bayesian techniques, do not produce optimal results on this kind of data. Time-series forecasting is a modeling strategy that predicts future values from past observations, with time as the independent variable. This technique can uncover time-domain relationships such as trends and weekly, monthly, and yearly seasonality.
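To make the idea of seasonality concrete, here is a minimal sketch on synthetic data (the series, the weekend dip of 10, and all variable names are assumptions made up for illustration). It removes the trend with a centered 7-day rolling mean and then averages what is left by day of the week. NeuralProphet models seasonality differently; this only shows what a weekly pattern looks like in the data.

```python
import numpy as np
import pandas as pd

# Synthetic daily series: a linear upward trend plus a dip on weekends.
days = pd.date_range("2022-01-03", periods=28, freq="D")  # 2022-01-03 is a Monday
trend = np.arange(28, dtype=float)
weekend_dip = np.where(days.dayofweek >= 5, -10.0, 0.0)  # Saturday=5, Sunday=6
series = pd.Series(100.0 + trend + weekend_dip, index=days)

# Remove the trend with a centered 7-day rolling mean.
rolling_mean = series.rolling(window=7, center=True).mean()
detrended = series - rolling_mean

# Average the detrended values per weekday to expose the weekly pattern.
weekday_effect = detrended.groupby(detrended.index.dayofweek).mean()
print(weekday_effect.round(2))
```

The weekday averages come out positive and the weekend averages negative, which is the weekly seasonality that was built into the synthetic series.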

Setting up environments

On Google Colab, to install NeuralProphet, run the following command:

pip install neuralprophet[live]

Locally, run the following:

pip install neuralprophet

Note: Online editors like repl.it may fail to run our code due to insufficient memory allocations.
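After installing, it can help to confirm that the packages are visible to the Python interpreter you are using. The helper below is a hypothetical convenience function (not part of NeuralProphet) that relies only on the standard library:

```python
import importlib.util


def is_installed(package_name: str) -> bool:
    """Return True if a top-level package with this name can be located."""
    return importlib.util.find_spec(package_name) is not None


# Report whether the packages used in this tutorial can be found.
for name in ("pandas", "neuralprophet"):
    status = "found" if is_installed(name) else "missing"
    print(f"{name}: {status}")
```

If either package is reported as missing, rerun the pip command in the same environment that runs your notebook or script.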

Importing and preparing the data

We will start by importing the pandas library and the dataset.

import pandas as pd
path='https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_global.csv'
dataset=pd.read_csv(path,engine='python')

The COVID-19 Dataset

We will be using the Covid-19 Global Cases Dataset from COVID-19 Data Repository by the Center for Systems Science and Engineering (CSSE) at Johns Hopkins University. To manually navigate to the CSV file within the repository, follow the path csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_global.csv.

This dataset is a collection of reports from various countries' health departments, agencies, and universities that is updated daily by the CSSE at JHU.

Preparation of the data

Some countries' data is broken down by province or state. We will sum the rows that share a 'Country/Region' value so that each country has a single row. We also need to drop the 'Province/State', 'Lat', and 'Long' columns since they're not useful for this task.

# Drop the non-numeric 'Province/State' column along with 'Lat' and 'Long' before summing
dataset=dataset.drop(columns=['Province/State','Lat','Long'])
dataset=dataset.groupby('Country/Region').sum()
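The effect of this aggregation is easier to see on a tiny frame. The sketch below uses a hypothetical miniature table that mimics the layout of the JHU CSV (the countries X and Y and all counts are made up):

```python
import pandas as pd

# Hypothetical miniature frame with the same column layout as the JHU CSV.
toy = pd.DataFrame(
    {
        "Province/State": ["A", "B", None],
        "Country/Region": ["X", "X", "Y"],
        "Lat": [1.0, 2.0, 3.0],
        "Long": [4.0, 5.0, 6.0],
        "1/22/20": [1, 2, 5],
        "1/23/20": [3, 4, 6],
    }
)

# Drop the columns that should not be summed, then add up rows per country.
by_country = (
    toy.drop(columns=["Province/State", "Lat", "Long"])
    .groupby("Country/Region")
    .sum()
)
print(by_country)
```

Country X, which was split across two provinces, ends up as a single row whose counts are the sums of its provinces.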

For easier fetching during model training, we will rotate the data so that country names are column names and dates are just rows.

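We will do this by saving the dates in a variable dates, transposing the rest of the dataset, and concatenating it with dates as a Date column. Because the dataset reports cumulative totals, we also call diff() on the transposed data to obtain the daily new cases.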

dates=dataset.columns
dataset=dataset.transpose().reset_index(drop=True)
dates=pd.DataFrame(dates)
rotated_dataset= pd.concat([dates, dataset.diff()], axis=1, join='inner')
rotated_dataset=rotated_dataset.rename(columns={0:'Date'})
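As a small illustration of the rotation and of what diff() does, the following sketch applies the same steps to hypothetical cumulative counts for two made-up countries:

```python
import pandas as pd

# Hypothetical cumulative counts, one row per country and one column per date.
cumulative = pd.DataFrame(
    {"1/22/20": [1, 0], "1/23/20": [1, 2], "1/24/20": [4, 2]},
    index=pd.Index(["X", "Y"], name="Country/Region"),
)

# Keep the dates, rotate the table, and turn running totals into daily changes.
dates = pd.DataFrame({"Date": cumulative.columns})
by_date = cumulative.transpose().reset_index(drop=True)
daily = pd.concat([dates, by_date.diff()], axis=1)
print(daily)
```

The first row of each country column is NaN because there is no previous day to subtract, which is why null values need to be dropped before training.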

NeuralProphet expects two columns when fitting data: the dates and the target values. The columns must be named ds and y, respectively. We will also drop all null values from the data.

For the moment, let's use the US data for demonstration.

data=rotated_dataset[['Date','US']].copy()  # copy so later edits do not touch rotated_dataset
data.columns=['ds','y']
data.dropna(inplace=True)
data[1:].head()

Output:

 	      ds 	   y
2 	1/24/20 	1.0
3 	1/25/20 	0.0
4 	1/26/20 	3.0
5 	1/27/20 	0.0
6 	1/28/20 	0.0
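The ds values above are still strings such as 1/24/20. Whether NeuralProphet parses them on its own may depend on the version, so as a defensive step (an assumption, not something the library is known to require) you can convert them to datetimes explicitly. The sample frame reuses the first three rows of the output above:

```python
import pandas as pd

# The first three rows from the output above, with dates still stored as strings.
sample = pd.DataFrame({"ds": ["1/24/20", "1/25/20", "1/26/20"], "y": [1.0, 0.0, 3.0]})

# Parse month/day/two-digit-year strings into proper timestamps.
sample["ds"] = pd.to_datetime(sample["ds"], format="%m/%d/%y")
print(sample.dtypes)
```

Stating the format explicitly avoids any ambiguity between month-first and day-first dates.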

Training the model

We will import the NeuralProphet class from neuralprophet, create a model with default settings, and assign it to the variable m. We will then train the model on data from the previous step by passing three arguments to the fit() method:

The dataset to train on, data,
The frequency of the data, freq, set to 'D' since the data is daily,
The number of training epochs, epochs.

from neuralprophet import NeuralProphet
m = NeuralProphet()
m.fit(data,freq='D',epochs=1000)

After the training process is complete, we can obtain the model's metrics, such as Mean Absolute Error (MAE), from the output of the cell. We can note that the MAE is reduced significantly (~ 90%) during the training process.
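For readers unfamiliar with these metrics, the sketch below shows how MAE and RMSE are conventionally defined. The sample values and function names are made up for illustration; NeuralProphet computes its metrics internally, so this is not its implementation.

```python
import numpy as np


def mae(y_true, y_pred):
    """Mean Absolute Error: the average size of the errors."""
    errors = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs(errors)))


def rmse(y_true, y_pred):
    """Root Mean Squared Error: penalizes large errors more than MAE does."""
    errors = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean(errors ** 2)))


# Hypothetical actual and predicted daily counts.
actual = [100.0, 120.0, 90.0, 110.0]
predicted = [110.0, 115.0, 95.0, 100.0]

print(mae(actual, predicted))   # 7.5
print(rmse(actual, predicted))
```

Both metrics are in the same units as y, so an MAE of 7.5 here means the predictions are off by 7.5 cases per day on average.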

Monitoring the training process and evaluating the model

NeuralProphet can report validation metrics alongside the training metrics. If you installed the [live] version of NeuralProphet and want to monitor the metrics during the training process, split the data into training and testing sets using the model's split_df() method and pass the testing set to fit() as validation data, as shown below:

df_train, df_test = m.split_df(data, freq='D')
metrics = m.fit(df_train, freq='D', epochs=1000, validation_df=df_test, plot_live_loss=True)

The SmoothL1Loss and MAE will be plotted live during training.
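The key idea behind splitting time-series data is that the validation set should come after the training set in time, rather than being sampled at random. The helper below is a hypothetical pandas sketch of that idea; it is not how split_df() is implemented, and the 20% fraction is an arbitrary choice for the example.

```python
import pandas as pd


def chronological_split(df, valid_fraction=0.1):
    """Split a time-ordered frame so the last rows become the validation set."""
    n_valid = max(1, int(len(df) * valid_fraction))
    return df.iloc[:-n_valid], df.iloc[-n_valid:]


# Hypothetical 20-day series in the ds/y layout.
frame = pd.DataFrame(
    {"ds": pd.date_range("2022-01-01", periods=20, freq="D"), "y": range(20)}
)
train_part, valid_part = chronological_split(frame, valid_fraction=0.2)
print(len(train_part), len(valid_part))  # 16 4
```

Every validation date is later than every training date, so the model is always evaluated on data it could not have seen.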

The code cell below displays the model's metrics at the end of training:

metrics.tail(1)

Output:

index,SmoothL1Loss,MAE,RMSE,RegLoss,SmoothL1Loss_val,MAE_val,RMSE_val
1999,0.0029082219263359263,14688.557393324336,21929.236516299832,0.0,NaN,NaN,NaN
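The SmoothL1Loss value is far smaller than MAE and RMSE. A likely explanation (an assumption, not something stated in the output) is that the loss is computed on normalized values while MAE and RMSE are reported in the original units. For reference, the sketch below implements the standard Smooth L1 definition with threshold beta, the same formula PyTorch documents for its SmoothL1Loss, on made-up numbers:

```python
import numpy as np


def smooth_l1(y_true, y_pred, beta=1.0):
    """Quadratic for errors smaller than beta, linear for larger errors."""
    diff = np.abs(np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float))
    loss = np.where(diff < beta, 0.5 * diff ** 2 / beta, diff - 0.5 * beta)
    return float(loss.mean())


# One small error (0.5) and one large error (3.0).
print(smooth_l1([0.0, 0.0], [0.5, 3.0]))  # 1.3125
```

The small error contributes 0.125 (quadratic branch) and the large one 2.5 (linear branch), which makes the loss less sensitive to outliers than a squared error.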

Forecasting

To forecast future values, pass the data and the number of periods (days) to the make_future_dataframe() method. Here we predict the next 14 days. We then call the predict() method on future to perform the forecast. The predictions are in the yhat1 column of the forecast dataframe.

future=m.make_future_dataframe(data,periods=14)
forecast=m.predict(future)
print(forecast)
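Conceptually, the future dataframe covers the 14 dates that follow the last observation. The sketch below reproduces just that date range with pandas. The last observed date of 2022-02-13 is inferred from the forecast shown later in this tutorial, and this is not how make_future_dataframe() is implemented.

```python
import pandas as pd

# Last date in the training data (inferred from the forecast output in this tutorial).
last_observed = pd.Timestamp("2022-02-13")

# The 14 daily timestamps that follow the last observation.
future_dates = pd.date_range(
    start=last_observed + pd.Timedelta(days=1), periods=14, freq="D"
)
print(future_dates[0].date(), "to", future_dates[-1].date())
```

With daily frequency and 14 periods, the range runs from 2022-02-14 to 2022-02-27.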

Let us plot the predictions using the model's built-in plot() method, which is based on matplotlib.

fig1 = m.plot(forecast)

Output:

[Figure: Predicted cases]

Since we are predicting cases, they cannot be floating-point values. We convert the predictions to integers by casting the series of floats to an int datatype, which drops the fractional part.

forecast['yhat1'].astype(int)

The yhat1 values are:

0     377139
1     376244
2     376643
3     376004
4     382333
5     358197
6     340498
7     353060
8     352851
9     353981
10    354114
11    361250
12    337950
13    321113
Name: yhat1, dtype: int64

This means that the model predicts 377,139 cases for the first forecast day, 376,244 for the following day, and so on. We will obtain the predictions with their dates as follows:

results_df=forecast[['ds','yhat1']].copy()  # copy so the assignment below does not modify a view of forecast
results_df['yhat1']=forecast['yhat1'].astype(int)
results_df.columns=['Date','Predicted Cases']
results_df

Output:

index,Date,Predicted Cases
0,2022-02-14 00:00:00,377139
1,2022-02-15 00:00:00,376244
2,2022-02-16 00:00:00,376643
3,2022-02-17 00:00:00,376004
4,2022-02-18 00:00:00,382333
5,2022-02-19 00:00:00,358197
6,2022-02-20 00:00:00,340498
7,2022-02-21 00:00:00,353060
8,2022-02-22 00:00:00,352851
9,2022-02-23 00:00:00,353981
10,2022-02-24 00:00:00,354114
11,2022-02-25 00:00:00,361250
12,2022-02-26 00:00:00,337950
13,2022-02-27 00:00:00,321113
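One detail worth knowing about the integer conversion used above: casting floats with astype(int) drops the fractional part instead of rounding to the nearest integer. The sketch below contrasts the two on hypothetical values:

```python
import pandas as pd

# Hypothetical floating-point predictions.
predictions = pd.Series([2.9, 3.4, 7.6])

truncated = predictions.astype(int)        # drops the fractional part
rounded = predictions.round().astype(int)  # rounds first, then casts

print(truncated.tolist())  # [2, 3, 7]
print(rounded.tolist())    # [3, 3, 8]
```

For case counts in the hundreds of thousands the difference is at most one case per day, so either choice is reasonable here; calling round() first is the safer habit when the values are small.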

Conclusion

We have built a NeuralProphet model and used it to predict Covid-19 cases. In this tutorial, we learned how to install NeuralProphet, import and prepare data for time-series forecasting, train the NeuralProphet forecaster, and forecast using the trained model.

We can serve this model through an API built with Django or Flask, or through a web application framework like Streamlit or Dash. If you run into any issues with NeuralProphet, you can raise an issue on NeuralProphet's GitHub.

You can find the complete code here.

Happy coding!