Getting Started with Polynomial Regression in R
Polynomial regression is used when there is a non-linear relationship between the dependent and independent variables. Examples of cases where polynomial regression can be used include modeling population growth and the spread of diseases and epidemics. Such trends are usually regarded as non-linear.
The general form of a polynomial regression model is:
y = β<sub>0</sub> + β<sub>1</sub>X + β<sub>2</sub>X<sup>2</sup> + ... + β<sub>n</sub>X<sup>n</sup> + ε
For example, a polynomial model of degree 2 can be written as:
y = β<sub>0</sub> + β<sub>1</sub>X + β<sub>2</sub>X<sup>2</sup> + ε
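As a quick illustration of how such an equation maps to R (a minimal sketch, not tied to the tutorial's dataset; some_data, x, and y are placeholder names), a degree-2 model could be fitted like this:
# Sketch: fit y = b0 + b1*x + b2*x^2 with lm(); I() protects the squared term in the formula
# 'some_data', 'x', and 'y' are placeholders, not part of this tutorial's dataset
fit = lm(y ~ x + I(x^2), data = some_data)
summary(fit)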
Now that we know what polynomial regression is, let's use this concept to create a prediction model.
Prerequisites
A general understanding of R and the Linear Regression Model will be helpful for the reader to follow along.
Step 1 - Data preprocessing
The dataset used in this article can be found here.
The first step we need to do is to import the dataset, as shown below:
dataset = read.csv('salaries.csv')
This is how our dataset should look:
In the dataset above, we do not need column 1 since it only contains the names of each entry. To remove column 1 from our dataset, we simply run the following code:
dataset = dataset[2:3]
Our dataset should now look like this:
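If you would rather confirm this in code than rely on the screenshot, a quick check such as the one below works (the remaining columns are Level and Salary):
# Quick checks that the trimmed dataset contains only the Level and Salary columns
head(dataset)   # first few rows
str(dataset)    # column names and types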
To determine whether a polynomial model is suitable for our dataset, we make a scatter plot and observe the relationship between salary (the dependent variable) and level (the independent variable).
library(ggplot2)
ggplot() +
geom_point(aes(x = dataset$Level, y = dataset$Salary),
colour = 'red')
Our scatter plot should look as shown below:
From the analysis above, it's clear that the salary and level variables have a non-linear relationship. Therefore, a polynomial regression model is suitable.
The second step in data preprocessing usually involves splitting the data into a training set and a test set. In our case, we will not carry out this step since we are using a small, simple dataset. We also do not need to apply feature scaling separately, since the lm function takes care of it.
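If you are working with a larger dataset and do want a split, a minimal base-R sketch looks like this (the 80/20 ratio and the seed value are just assumptions for illustration):
# A minimal sketch of an 80/20 train/test split using base R
set.seed(123)                                        # assumed seed, for reproducibility
train_index  = sample(seq_len(nrow(dataset)), size = floor(0.8 * nrow(dataset)))
training_set = dataset[train_index, ]
test_set     = dataset[-train_index, ]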
Step 2 - Fitting the polynomial regression model
The polynomial regression model is an extension of the linear regression model. The only difference is that we add polynomial terms of the independent variable (level) to the dataset to form our matrix of features.
This is demonstrated below:
dataset$Level2 = dataset$Level^2
dataset$Level3 = dataset$Level^3
dataset$Level4 = dataset$Level^4
Our new dataset will look like this:
As stated, to fit the polynomial model, we use the lm function, as highlighted below:
poly_reg = lm(formula = Salary ~ ., data = dataset)
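As a side note, the same degree-4 model can be fitted without creating the extra columns by using poly() with raw = TRUE directly in the formula. This is only an equivalent way of writing the model, not the method used in the rest of this article:
# Equivalent degree-4 fit using raw polynomial terms directly in the formula
poly_reg_alt = lm(Salary ~ poly(Level, 4, raw = TRUE), data = dataset)
With this form, predictions only need a Level column in newdata; the rest of the article sticks to the explicit Level2 to Level4 columns.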
After fitting the polynomial model, we use the following code to evaluate how well it fits the data:
summary(poly_reg)
From the results above, the model fits the data very well: the summary reports an R-squared of about 99.53%, which means the model explains nearly all of the variation in Salary.
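If you prefer to read these figures programmatically rather than from the printed summary, they are stored in the summary object:
# Pull the goodness-of-fit measures out of the summary object
summary(poly_reg)$r.squared        # multiple R-squared
summary(poly_reg)$adj.r.squared    # adjusted R-squared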
Step 3 - Visualizing the model
We use the ggplot2 library to visualize our model, as demonstrated below:
library(ggplot2)
x_grid = seq(min(dataset$Level), max(dataset$Level), 0.1)
ggplot() +
geom_point(aes(x = dataset$Level, y = dataset$Salary),
colour = 'red') +
geom_line(aes(x = x_grid, y = predict(poly_reg,
newdata = data.frame(Level = x_grid,
Level2 = x_grid^2,
Level3 = x_grid^3,
Level4 = x_grid^4))),
colour = 'blue') +
ggtitle('Truth or Bluff (Polynomial Regression)') +
xlab('Level') +
ylab('Salary')
Below are the results obtained from this analysis:
From the graph above, we can see that the model is nearly perfect. It fits the data points appropriately. Therefore, we can use the model to make other predictions.
Step 4 - Making predictions using the polynomial regression model
Now that we have developed the model, it's time to make some predictions.
Suppose you would like to predict the salary of an employee whose level is 7.5. To do this, we use the predict() function, as highlighted below.
# Predicting a new result with the polynomial regression
predict(poly_reg, data.frame(Level = 7.5,
Level2 = 7.5^2,
Level3 = 7.5^3,
Level4 = 7.5^4))
Output:
225126.3
Similarly, the salary of an employee with a level of 3.7 is calculated as shown below:
predict(poly_reg, data.frame(Level = 3.7,
                             Level2 = 3.7^2,
                             Level3 = 3.7^3,
                             Level4 = 3.7^4))
The result is:
84363.82
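Writing out every power by hand gets tedious. A small helper function (hypothetical, not part of the original tutorial) can build the newdata frame for any level:
# Hypothetical helper that builds the newdata frame with all polynomial columns
make_level_data = function(level) {
  data.frame(Level  = level,
             Level2 = level^2,
             Level3 = level^3,
             Level4 = level^4)
}

predict(poly_reg, make_level_data(7.5))   # 225126.3, as above
predict(poly_reg, make_level_data(3.7))   # 84363.82, as above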
The next step is to examine the effect of additional degrees on our polynomial model:
dataset$Level5 = dataset$Level^5
Let's build a new model with the Level5 column added and then examine its effect:
poly_reg2 = lm(formula = Salary ~ ., data = dataset)
predict(poly_reg2, data.frame(Level = 7.5,
                              Level2 = 7.5^2,
                              Level3 = 7.5^3,
                              Level4 = 7.5^4,
                              Level5 = 7.5^5))
The employee's salary is now predicted to be 237446, compared to the 225126.3 we obtained from the model with 4 degrees.
Generally, adding more degrees lets the polynomial follow the training data more closely and can improve the fit, but very high degrees risk overfitting, so the degree should be chosen with care.
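One quick way to compare the two fits (a sketch, assuming both poly_reg and poly_reg2 are still in memory) is to put their adjusted R-squared values side by side:
# Compare the degree-4 and degree-5 models on goodness of fit
summary(poly_reg)$adj.r.squared    # degree-4 model
summary(poly_reg2)$adj.r.squared   # degree-5 model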
Conclusion
From this article, you have learned how to analyze data using polynomial regression models in R. You can use this knowledge to build accurate models to predict disease occurrence, epidemics, and population growth.
Happy coding!
Peer Review Contributions by: Wanja Mike