Effects of Feature Scaling on a Machine Learning Model
The datasets that we use for training models in machine learning have unpredictable values that might vary from each other on a broad scale. Numerical values might have big differences amongst themselves, especially when they represent different scales, and this might make it difficult to compare them, i.e., kg, litres, millimetres, miles, pixels, etc. <!--more--> Feature scaling is introduced to solve this challenge. It adjusts the numbers to make it easy to compare the values that are out of each other's scope. This helps increase the accuracy of the models, especially those using algorithms that are sensitive to feature scaling, i.e., Gradient Descent and distance-based algorithms.
There are two techniques of scaling features:
- Normalization - the values are rescaled to range between zero and one.
- Standardization - the values are rescaled to center around the mean with units of standard deviation.
Choosing which to use on your dataset is subjective to your dataset, the machine learning algorithm, and the type of problem you might solve.
In this approach, we will learn how to implement each. We will first build a prediction model without feature scaling, one with standardized features, and lastly, one with normalized features. We will also use the same dataset for the three to compare how our dataset affects the choice of the technique we use.
Table of contents
- Prerequisites
- Importing the libraries
- The initial model
- Using Standard Scaling
- Using Normalization
- Effects of the feature scaling
- Choosing between the two scaling techniques
- Exceptions
- Conclusion
Prerequisites
We will need to have:
- A fundamental knowledge of Python programming language.
- A fundamental knowledge in Machine Learning.
- Jupyter Notebook/ Jupyter Lab/ Google Colab.
Using the given data for training, we will build models to predict if a patient was diagnosed with either disease M
or B
.
Importing the libraries
In our notebook, let us import the following libraries and run the cell:
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder, StandardScaler, MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC,SVC
from sklearn.metrics import classification_report
1.The initial model
(i) Fetching the dataset from github
We will use pandas to read the CSV file from the raw dataset URL:
url = 'https://github.com/Sajeyks/Section-dataset1/blob/main/data.csv?raw=true'
df = pd.read_csv(url,index_col=0)
(ii) Exploring our dataset
We will use df.info
to check what our dataset consists of:
df.info()
Output:
As you can see, our dataset entails 32 columns, where most are floats, and one contains objects. The first 31 columns are also in a good state; hence no need for replacing or removing some rows.
We can also use df.head()
to have a look at the values in the first ten rows of our dataset:
df.head(10)
Output:
(iii) Preparing the dataset
We will now select a set of attributes that can affect the dependent variables (diagnosis) in our dataset:
my_features = ['radius_mean', 'perimeter_mean', 'area_mean', 'compactness_mean', 'concavity_mean', 'concave points_mean',
'perimeter_worst', 'area_worst', 'compactness_worst', 'concavity_worst', 'concave points_worst']
Then we will split the data between y(dependent)
and X(independent)
variables:
y = df.diagnosis
X = df[my_features]
As we saw above, y(diagnosis) is an object data type used to represent the status of the diagnosis. Machine learning algorithms only work with numerical values; hence we need to represent our diagnosis status numerically. To do that, we will use a LabelEncoder
. We have two categories of diagnosis status, M
and B
. The label encoder will change them to categories of 1
and 0
, respectively:
lb = LabelEncoder()
y = lb.fit_transform(df.diagnosis)
If you check y
now, you will see that it transformed into an array of 1's
and 0's
:
print(y)
Output:
After that, we can now go ahead and split our dataset into the training set and testing set in the ratios of 30:70:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)
(iv) Fitting the model
We will use LinearSVC
to build a Support Vector Classifier model that categorizes our data into either one of the two diagnosis statuses:
classifier = LinearSVC()
# fitting model
classifier.fit(X_train,y_train)
# predict
y_predict = classifier.predict(X_test)
# check accuracy
accuracy_b4 = classifier.score(X_test, y_test)
(v) Predicting
We can then print both the prediction and testing sample and try to compare the results:
print(y_predict)
print(y_test)
2.Using Standard Scaling
We will build another model that uses standard scaling
to level the dataset before fitting.
(i) Creating and analyzing dataframe
Use the code below to import the dataset and read then create the second dataframe:
df1 = pd.read_csv(url,index_col=0)
# dataframe summary
df1.head(10)
Take note of how the values across the columns are distributed in a wide range, i.e., some are 122.80 and others 0.11840:
(ii) Implementing standard scaling
To scale properly, we will first collect the names of all the numerical columns in a list then use the list to fit the standard scaler.
col =[]
for col_name in df1.columns:
if df1[col_name].dtype=='object':
pass
else:
col.append(col_name)
# fitting the standard scaler
for i in col:
sc = StandardScaler().fit(df1[[i]])
df1[i] = sc.transform(df1[[i]])
Using standard scaler
from sklearn
, we could rescale our features using the standardization
technique.
Now, if we check our df1
again, you will notice how the range between the values has been reduced:
df1.head(10)
(iii) Splitting our dataset
y1
is given diagnosis status that is already encoded in numerical form, while X1
is assigned the same features as the first model.
lb = LabelEncoder()
y1 = lb.fit_transform(df1.diagnosis)
X1 = df1[my_features]
After that, we can now go on and split our dataset further into the training and testing sets:
x_train1, x_test1, Y_train1, Y_test1 = train_test_split(X1, y1, test_size=0.3, random_state=1)
(iv) Fiting the model
We will fit our model using the same algorithm we used in the first model:
classifier = LinearSVC()
# fit model
classifier.fit(x_train1, Y_train1)
# predicting
Y_predict1 = classifier.predict(x_test1)
# accuracy
accuracy_after_stdScaler = classifier.score(x_test1, Y_test1)
(v) Predicting
To check the predictions, we will use:
print(Y_predict1)
Let us compare the results with the test set:
print(Y_test1)
3.Using Normalization
We will now build the final model that also uses the same training algorithm and dataset, but the dataset will first go under normalization before being used.
(i) Creating a Dataframe
As in the above models, we first need to fetch and our dataset, then store it in a dataframe.
df2 = pd.read_csv(url,index_col=0)
## df2 summary
df2.describe()
Take note of the values:
(ii) Spliting data between dependent (Y) and independent (X) variables
Here we will select the target, features, and label encode our non-numerical target (y2).
lb = LabelEncoder()
y2 = lb.fit_transform(df2.diagnosis)
X2 = df2[my_features]
(iii) Implementing Normalization
The next and also critical part in this model's preparation is normalizing
the dataset, in this case, the features:
norm = MinMaxScaler()
X2 = norm.fit_transform(X2)
Now, let us take a look at our dataset after normalization:
X2_df = pd.DataFrame(X2)
X2_df.describe()
If you compare this description with the one above, you will notice the difference. If you check the min
and max
values for all the columns, they are 0
and 1
, respectively. This means our features are now rescaled between 1 and zero.
(iv) Splitting dataset into training and testing sets
Now we can split our dataset into training and testing sets:
X_train2, X_test2, y_train2, y_test2 = train_test_split(X2, y2, test_size=0.3, random_state=1)
(v) Fitting the model
We will use the same process of fitting as the one used for the previous two models:
classifier = LinearSVC()
# fit model
classifier.fit(X_train2, y_train2)
#predicting
y_predict2 = classifier.predict(X_test2)
#accuracy
accuracy_after_normalization = classifier.score(X_test2, y_test2)
(vi) Predicting
To compare the test values and predicted values:
print(y_predict2)
Also :
print(y_test)
Effects of feature scaling
We used variables to store the accuracy of each model. To check them out, we need to print them:
print("Accuracy 1 :", accuracy_b4, " Accuracy 2 :", accuracy_after_stdScaler, " Accuracy 3:", accuracy_after_normalization)
As you can see above, the accuracy score of a feature scaled
model is higher than that of the initial one. This goes on to demonstrate the effects of feature scaling on the models.
You will also note that the accuracy of the feature-scaled models is consistent while that of the initial(unscaled) model keeps fluctuating despite setting a random state value.
Choosing between the two scaling techniques
When it comes to choosing between normalization
and standardization
, it depends on:
- The dataset property- normalization is preferred if your dataset doesn't follow Gaussian Distribution.
- Performance - you should try to use both and compare which works better for your model. As for us,
Standard Scaling
gives better results.
Exceptions
- Feature scaling is not required when using tree-based algorithms, i.e., Random Forest and Decision Tree.
- When using standardization with a dataset that contains categorical data that is one-hot encoded, you should exclude the encoded columns. Not doing so might lead to your dataset losing its categorical property.
Please find the complete code for this tutorial here.
Conclusion
We have learned the importance of feature scaling
, looked at both the Standard Scaling
technique and Normalization
technique, and learned how to implement each and compared the results of using each. As we have learned from comparing the results, feature scaling
can significantly boost a model's performance. It also helps stabilize the accuracy of a model. You can now choose and implement these feature scaling
techniques in your machine learning projects.
Peer Review Contributions by: Willies Ogola