Univariate Time Series Analysis and Forecasting with ARIMA/SARIMA
A time series is a sequence of data points that occur over regular time intervals. A time series shows all the time-dependent variables in the dataset. An example of time series data is stock prices and weather records. <!--more--> In time series analysis and modeling, we train models to identify patterns in datasets. Time series forecasting involves finding the future values that the time series will take.
A time series can be univariate, bivariate, or multivariate. A univariate time series has only one variable, a bivariate has two variables, and a multivariate has more than two variables.
In this tutorial, we will be dealing with univariate time series modeling. We will first discuss a few concepts in ARIMA and Seasonal ARIMA models. We will then practically implement the ARIMA and Seasonal ARIMA models.
Table of contents
- Prerequisites
- How the ARIMA model works
- what is a stationary time series?
- What is differencing?
- Components of the ARIMA model
- How to check for stationarity
- Augmented Dickey-Fuller test
- How SARIMA works
- Time series dataset
- Loading the dataset
- Visualizing the time series data
- Implementing Augmented Dickey-Fuller test
- Applying the function
- Implementing differencing
- Plotting the new dataset
- Implementing the ARIMA model
- Plotting ACF
- Plotting PACF
- Import the ARIMA model
- Training the ARIMA model
- Testing the ARIMA model using the test dataset
- Implementing the SARIMA model
- Training the SARIMA model
- Testing the SARIMA model using the test dataset
- Conclusion
- References
Prerequisites
To easily follow this tutorial, a reader should know the following:
- Time series concepts
- Time series analysis and modeling
- Time series applications
- Time series patterns
- Time series plots
NOTE: Ensure you implement the ARIMA and SARIMA models in Gooogle Colab.
How the ARIMA model works
Auto-Regressive Integrated Moving Average (ARIMA) is a time series model that uses the information in the past time series values to make future predictions. The information found in the past values will indicate the nature of the future predictions. For example, an ARIMA model can predict future stock prices after observing and analyzing previous stock prices.
The ARIMA model will use the single time-dependent (univariate) variable in the time series to make predictions. ARIMA models only work when the time series is stationary.
What is a stationary time series?
A stationary time series is a series whose properties remain constant over time. These properties are variance, mean, and covariance. Stationary time series do not have trends and repetitive cycles/seasonality. Most time-series is non-stationary, especially stock prices and other financial data.
For ARIMA models to work, we have to make the data stationary. We remove the trends and seasonality. This will make it easier for ARIMA to make a prediction. To make a time series stationary, we apply the differencing technique.
What is differencing?
Differencing is the process of making a time series stationary. It is an essential step in dataset preprocessing. It makes the variance, the covariance, and the mean of the time series constant. It also reduces the repetitive cycles and seasonality components in the time series data.
The differencing technique finds the difference between the current time series value and the previous value. We may get the difference between the time series values once but still not make the time series stationary. In this case, we need to find the difference multiple times until the time series becomes stationary.
Components of the ARIMA model
The ARIMA model is comprised of three components. They are combined to form the final ARIMA model.
The components are as follows:
- Auto Regression
- Integrated
- Moving average
Auto Regression
It uses the previous time series values in the time series to make future predictions. It uses the dependency between a current time series value and past observations.
Integrated
This component performs differencing to make the time series stationary. It subtracts a current time series value from the previous time series value.
Moving Average
It uses errors in past predictions to make future predictions. It uses the dependency between an actual time series value and model errors from previous ones.
When creating an ARIMA model, we pass each component as a parameter using the following standard notations: p
, d
, and q
. They represent the parameters that build the ARIMA model. We initialize the ARIMA model as ARIMA (p,d,q).
The functions of the standard notations are as follows:
-
p: It represents the order of the Auto Regression (AR) component. It represents the number of lag observations found in the ARIMA model. A lag is the time gap between two data points/observations in the time series. We get the best value of
p
using a Partial Autocorrelation Function (PACF) plot. -
d: It is the total differencing steps performed to make the time series stationary. If the time-series data is already stationary, there is no need for differencing.
-
q: It represents the order of the Moving Average (MA) component. It shows the forecast errors that the final ARIMA Model should have. We get the best value of
q
using an AutoCorrelation Function (ACF) plot.
NOTE: An order in a time series is the number of previous observations/time series values that the model uses to make a single prediction.
How to check for stationarity
We will visualize the time series charts and perform statistical tests to check for stationarity.
Visualizing the time series charts
This entails visualizing the time-series plots (line charts) to check for trends or seasonality.
Performing statistical tests
This entails performing statistical tests on the time series to check for stationarity. We use various statistical tools for this testing. In this tutorial, we will use the Augmented Dickey-Fuller tool.
Augmented Dickey-Fuller test
We will implement the Augmented Dickey-Fuller (ADF) test to check for stationarity. The ADF test uses hypothesis testing to check for stationarity. It has a null hypothesis and an alternative hypothesis. The null hypothesis of this test is that the times series is non-stationary. The alternative hypothesis is that the time series is stationary.
The ADF test has an important parameter known as the p-value
that determines whether a time series is stationary. The time series is stationary when the p-value
is less than 0.05.
How SARIMA works
Seasonal Autoregressive Integrated Moving Average (Seasonal ARIMA) is a subset of ARIMA models that supports the direct modeling of time series with seasonality/repeating cycles.
We will also implement SARIMA using the ARIMA models parameters (p,d,q). We then add a few new parameters to handle the repeating cycles or seasonality. The newly added parameters are P
, Q
, D
, and s
. We initialize the SARIMA model as SARIMA(p,d,q)(P, D, Q, s).
D
: It represents the number of seasonal differences steps in the time series.Q
: It represents the order of the seasonal moving average component.P
: It represents the order of the seasonal autoregressive component.s
: It represents the periods in each seasonality component. The number of periods (months) in a year is 12, s=12.
SARIMA handles seasonality using the D
parameter. It performs seasonal differencing. It subtracts data points in the seasonality components. The next step is to start building our models.
Time series dataset
We will prepare the sales dataset. The dataset will build a time series model that predicts monthly champagne sales. You can get the dataset from here. The dataset shows monthly sales from 1964 to 1972. We will build the model on the training dataset and make predictions using the test dataset.
Loading the dataset
We will use the Pandas
library to load the dataset.
import pandas as pd
Use this code to read the data:
df = pd.read_csv('monthly-sales.csv')
To display the first five data points of the dataset, use this code:
df.head()
First five data points:
To display the last five data points of the dataset, use this code:
df.tail()
Last five data points:
From this output, the 105th and the 106th data points/rows have missing values. We will have to drop these data points. We also have to rename the columns.
Renamaing columns
To rename the columns, use this code:
df.columns=["Month","Sales"]
df.head()
The output of the renamed columns:
Drop the data points
To drop the 105th and the 106th data points/rows, use this code:
df.drop(,axis=0,inplace=True)
df.drop(,axis=0,inplace=True)
df.tail()
Output:
We have dropped these rows. The new dataset will now have 104 data points/rows.
Converting the 'month' column
We need to convert the month
column to a DateTime
format. This format allows us to perform time-series analysis. We will use the pd.to_datetime
function.
df['Month']=pd.to_datetime(df['Month'])
To see the converted month
column, run this code:
df.head()
Output:
We also need to set the month
column as the index column. Use this code:
df.set_index('Month',inplace=True)
To view the new changes, run this code:
df.head()
Output:
Visualizing the time series data
We will plot time series data and check for trends or seasonality/repeating cycles. As mentioned earlier, this is a simple method to check for time series stationarity or non-stationarity. We will use Matplotlib
for plotting.
import matplotlib.pyplot as plt
To plot, run this code:
df.plot()
It produces the following line chart:
The line chart plots the sales
against month
. Through visualization, the time series has seasonality or repeating cycles. The spikes and dips keep on repeating during certain months of the year.
We can conclude that the time series is non-stationary since it has seasonality. We still need to perform a statistical test on the time series to prove this non-stationarity. As mentioned earlier, we will use the Augmented Dickey-Fuller test.
Implementing Augmented Dickey-Fuller test
We import the Augmented Dickey-Fuller tool as follows:
from statsmodels.tsa.stattools import adfuller
We initialize the adfuller
function and pass the sales
column as follows:
passing_data=adfuller(df['Sales'])
We will create a function that will check for dataset stationarity. We just need to prove what we have observed earlier on the line chart.
def adf_test(sales):
result=adfuller(sales)
labels = ['Test parameters', 'p-value','#Lags Used','Dataset observations']
for value,label in zip(result,labels):
print(label+' : '+str(value) )
if result[1] <= 0.05:
print("Dataset is stationary")
else:
print("Dataset is non-stationary ")
The ADF test will check for stationarity. The p-value
will determine whether the time series is stationary. When the p-value
of the ADF test is less than 0.05, then the time series is stationary. We then apply the function to the Sales
column to know the ADF test results and get the p-value
.
Applying the function
To apply the function, use this code:
adf_test(df['Sales'])
It produces the following results:
From the output above, we have different output values that show the nature of our dataset. We are only interested in the p-value
result. The p-value
is 0.363915771660247.
This number is greater than 0.05. It implies that the time series is non-stationary. We will have to make the time series stationary using the differencing approach.
Implementing differencing
This approach finds the difference between the current monthly values and the previous monthly values in the time series. We will difference only once, therefore our d=1
.
df['Differencing']=df['Sales']-df['Sales'].shift(12)
This code will difference sales
values.
We again perform the Augmented Dickey-Fuller test to check whether the time series has become stationary.
adf_test(df['Differencing'].dropna())
The test results.
From the test results the p-value
is 2.519620447387081e-10 (0.0000000002519620447387081). This number is < 0.05, therefore the dataset has become stationary. We will plot this new time series to see whether we have removed the seasonal components.
Plotting the new dataset
To plot the new dataset, use this code:
df['Differencing'].plot()
From the output above, we have removed the seasonal components. We can start applying the ARIMA model to the dataset.
Implementing the ARIMA model
As mentioned earlier, we initialize the ARIMA model as ARIMA (p,d,q). So we need to get the values of these parameters. We have already discussed the function of each parameter.
We already know the d=1
. It is because we have performed differencing only once. The next step is to get the best p
and q
values.
Getting the best 'p' and 'q' values
We get the best value of q
using an Autocorrelation Function (ACF) plot.
We get the best value of p
using a Partial Autocorrelation Function (PACF) plot.
Lets import PACF
and ACF
.
from statsmodels.graphics.tsaplots import plot_acf,plot_pacf
import statsmodels.api as sm
We then first plot the ACF as follows:
Plotting ACF
To plot the ACF, use this code:
ax1 = fig.add_subplot(211)
fig = sm.graphics.tsa.plot_acf(df['Differencing'].iloc[13:],lags=40,ax=ax1)
The code above produces the following plot:
We will use this plot to get the best value of q
. From the ACF plot, lag number one stands out. The red arrow shows the lag point. It is slightly above (cuts off) the significance line (the blue line). We will select this lag as the best value of q
. Therefore, q=1
.
To understand how to interpret an ACF plot and get the best value of q
, read this article.
Plotting PACF
To plot the PACF, use this code:
ax2 = fig.add_subplot(212)
fig = sm.graphics.tsa.plot_pacf(df['Differencing'].iloc[13:],lags=40,ax=ax2)
It produces the following plot:
We will use this plot to get the best value of p
. From the PACF plot, we can observe that lags 1 and 13 stand out. The red arrow shows the lag points.
These points are above (cuts off) the significance line (the blue shaded line). We select lag number one as the best value of p
. It is the first lag that is above the blue line. Therefore, p=1
.
Our values will be: p=1
, d=1
and q=1
. The next step is to import the ARIMA model.
Import the ARIMA model
We import the ARIMA model as follows:
from statsmodels.tsa.arima_model import ARIMA
We initialize the model as follows:
model=ARIMA(df['Sales'],order=(1,1,1))
Training the ARIMA model
We use fit
function to train the ARIMA model. The ARIMA model will learn from the time series dataset.
arima_model=model.fit()
We have trained the ARIMA model. We can use it to predict the test dataset. The prediction will show the actual monthly sales and the predicted (forecast) sales.
Testing the ARIMA model using the test dataset
We will use the values from the 90th row to the 103rd row as the test portion/set. We will then use Matplotlib
to show the actual and the predicted (forecast) monthly sales.
df['forecast']=arima_model.predict(start=90,end=103,dynamic=True)
df[['Sales','forecast']].plot(figsize=(12,8))
It produces the following line chart:
From the output above:
- The blue line is the actual monthly sales.
- The orange line is the forecast sales.
The ARIMA model has not performed well since it has not made correct predictions. The orange line is far apart from the blue line. We will now build another time series model using SARIMA to improve the performance.
Implementing the SARIMA model
SARIMA will handle and model time series data with repeating cycles or seasonality. From the earlier ADF test, the dataset has seasonality. We initialize the SARIMA model as SARIMA (p,d,q)(P, D, Q, s). We already have the p
, d
, and q
values.
We can get the P
, D
, Q
in the same way. Thus, p=P, q=Q, and d=D. s=12
since there are 12 months in a year. We initialize the SARIMA model as follows:
model=sm.tsa.statespace.SARIMAX(df['Sales'],order=(1, 1, 1),seasonal_order=(1,1,1,12))
Training the SARIMA model
We use fit
function to train the SARIMA model. The SARIMA model will learn from the time series dataset.
sarima_model=model.fit()
We have trained the SARIMA model. We can use it to predict the test dataset. The prediction will show the actual monthly sales and the forecast sales.
Testing the SARIMA model using the test dataset
We will also use the values from the 90th row to the 103rd row as the test portion/set. We use Matplotlib
to show the actual and the predicted (forecast) sales as follows:
df['forecast']=sarima_model.predict(start=90,end=103,dynamic=True)
df[['Sales','forecast']].plot(figsize=(12,8))
It produces the following line chart:
From the output above:
- The blue line is the actual monthly sales.
- The orange line is the forecast sales.
The time series model had made correct predictions since the two lines are close together. The SARIMA model has performed well as compared to the ARIMA model.
Conclusion
We have built a univariate time series model with ARIMA and SARIMA in Python. We discussed how both the ARIMA and SARIMA models work. We also talked about time-series stationarity and how to make the series stationary using differencing. We covered how to check for stationarity using the Augmented Dickey-Fuller test.
We explored all the components of the ARIMA model and their specific parameters. We then used the autocorrelation Function (ACF) plot and Partial Autocorrelation Function (PACF) plot to get the best p
and q
values. Finally, we implemented both the ARIMA and SARIMA models. The SARIMA model made better predictions.
You can get the Python code for this article from here.
Happy coding!
References
- Interpreting Autocorrelation and Partial Autocorrelation plots
- Autocorrelation and Time Series Methods
- What is Arima and Sarima models
- Introduction to Autocorrelation and Partial Autocorrelation
- Identifying AR and MA terms
- Handling a Non-Stationary Time Series
- Non-Stationary Processes
- ARIMA definition
- Understanding SARIMA Model
Peer Review Contributions by: Wilkister Mumbi