Linear Discriminant Analysis from Scratch
In this tutorial, we will look into the Linear Discriminant Analysis algorithm, also known as LDA. Be careful when searching for LDA online: the same abbreviation is also used for Latent Dirichlet Allocation, a different algorithm. <!--more--> Linear Discriminant Analysis (LDA) is a supervised learning algorithm used both as a classifier and as a dimensionality reduction technique. We will look at LDA's theoretical concepts and then at its implementation from scratch using NumPy.
Let's get started.
Prerequisites
Installation
We will install the packages required for this tutorial in a virtual environment. We'll use conda to create the virtual environment. For more installation information, refer to the Anaconda Package Manager website.
After installing the Anaconda package manager using the instructions on Anaconda's website, create a new virtual environment by typing the following command in the terminal.
conda create -n lda python=3.6
This will create a virtual environment with Python 3.6.
We'll be installing the following packages: matplotlib, numpy, and sklearn.
Activate the virtual environment using the command `conda activate lda`. After activating the virtual environment, we'll install the above-mentioned packages locally inside it. To use these packages, we must always activate the virtual environment named `lda` before proceeding. If you choose to, you may replace `lda` with a name of your choice for the virtual environment.
To install the packages, we will use the following commands:
- matplotlib:
pip3 install matplotlib
- numpy:
pip3 install numpy
- sklearn (scikit-learn):
pip3 install scikit-learn
Once installed, the following code can be executed seamlessly.
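As an optional sanity check, the imports below should run without errors once the packages are installed. The version numbers printed will vary with your setup.

# Verify that the installed packages import correctly.
import matplotlib
import numpy
import sklearn

print("matplotlib:", matplotlib.__version__)
print("numpy:", numpy.__version__)
print("sklearn:", sklearn.__version__)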
Introduction
In some cases, the dataset's non-linearity prevents a linear classifier from finding an accurate decision boundary. One approach in such cases is to project the lower-dimensional data into a higher dimension, where a linear decision boundary can be found. Consider the following example taken from Christopher Olah's blog.
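To make the idea concrete, here is a small toy sketch (not Olah's example, just illustrative numbers): one-dimensional points that no single threshold can separate become linearly separable once we add a squared feature.

import numpy as np

# Toy 1-D data: class 0 sits in the middle, class 1 on both sides,
# so no single threshold on x separates the two classes.
x = np.array([-3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0])
y = np.array([1, 1, 0, 0, 0, 1, 1])

# Projecting into two dimensions with the extra feature x^2 makes the
# classes separable by a horizontal line (for example, x^2 = 2.5).
X_higher = np.column_stack([x, x ** 2])
print(X_higher)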
The other approach is to keep only the features that add the most value to modeling and prediction. Redundant features are dropped, and the dimensionality is reduced as a result. LDA is one such approach.
It's a supervised learning algorithm that finds a new feature space maximizing the separation between classes. The greater the distance between the classes, the higher the confidence of the algorithm's predictions.
The purpose of dimensionality reduction is to:
- Obtain the most critical features from the dataset.
- Visualize the dataset.
- Have efficient computation with a smaller but essential set of features, which combats the "curse of dimensionality".
Let's say we are given a dataset with n rows and m columns, where n represents the number of data points and m represents the number of features. In other words, m is the data point's dimensionality.
Assuming the target variable has K output classes, the LDA algorithm reduces the number of features to at most K - 1. Hence, the number of features changes from m to K - 1.
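To see this reduction in practice before building our own version, scikit-learn's built-in LinearDiscriminantAnalysis (assuming scikit-learn is installed as described above) can serve as a quick reference. The iris dataset has m = 4 features and K = 3 classes, so at most K - 1 = 2 components remain.

from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Iris: 150 samples, 4 features, 3 classes.
X, y = load_iris(return_X_y=True)

# LDA keeps at most K - 1 = 2 components for a 3-class problem.
lda = LinearDiscriminantAnalysis(n_components=2)
X_reduced = lda.fit_transform(X, y)

print(X.shape)          # (150, 4)
print(X_reduced.shape)  # (150, 2)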
The aim of LDA is:
- Minimize the Intra-Class Variability: Intra-class variability refers to the spread of the data points within a single class. Keeping the points of each class tightly clustered ensures fewer misclassifications.
- Maximize the Distance Between the Means of the Classes: The class means are placed as far apart as possible to ensure high confidence during prediction.
Below is a picture to help explain:
The data points are projected onto a lower-dimensional hyperplane where the above two objectives are met. In the example given above, the number of features required is 2. The scoring metric used to satisfy these goals is called Fisher's discriminant.
The Fisher score is given as:
Fisher score f(x) = (difference of the class means)^2 / (sum of the class variances)
We're maximizing the Fisher score, thereby maximizing the distance between the class means while minimizing the intra-class variability.
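As a small worked example with made-up numbers, the Fisher score for two one-dimensional classes can be computed directly with NumPy:

import numpy as np

# Two toy 1-D classes (illustrative values only).
class_a = np.array([1.0, 2.0, 3.0])
class_b = np.array([7.0, 8.0, 9.0])

# Fisher score: squared difference of the class means
# divided by the sum of the class variances.
numerator = (class_a.mean() - class_b.mean()) ** 2  # (2 - 8)^2 = 36
denominator = class_a.var() + class_b.var()         # 2/3 + 2/3
print(numerator / denominator)                      # 27.0

Well-separated classes with small spread give a large Fisher score, which is exactly what LDA tries to maximize.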
Code
Let's consider the code needed to implement LDA from scratch.
We'll begin by defining a class `LDA` with two methods:
- `__init__`: In the `__init__` method, we initialize the number of components desired in the final output and an attribute to store the eigenvectors.
- `transform`: We'll use Fisher's score to reduce the dimensions of the input data. The Fisher score is computed using covariance matrices. The formula given above only covers two classes in a single dimension, and we'll be coding a multi-dimensional solution, so we'll work with covariance (scatter) matrices instead. The matrices `scatter_t`, `scatter_b`, and `scatter_w` are the covariance matrices: `scatter_w` denotes the intra-class (within-class) covariance, `scatter_b` is the inter-class (between-class) covariance matrix, and `scatter_t` is the total covariance matrix of the whole dataset, which is used to compute `scatter_b` as `scatter_t - scatter_w`. A small sketch of these scatter matrices on toy data follows this list.
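Here is a minimal sketch, on a made-up toy dataset, of how the three scatter matrices are computed and how they relate to each other; the transform method below follows the same steps.

import numpy as np

# A tiny toy dataset: 4 samples, 2 features, 2 classes (illustrative values).
X = np.array([[1.0, 2.0],
              [2.0, 3.0],
              [8.0, 8.0],
              [9.0, 10.0]])
y = np.array([0, 0, 1, 1])

# Total scatter: covariance of the whole dataset, scaled by (n - 1).
scatter_t = np.cov(X.T) * (X.shape[0] - 1)

# Within-class scatter: sum of each class's scaled covariance matrix.
scatter_w = np.zeros((X.shape[1], X.shape[1]))
for label in np.unique(y):
    members = X[y == label]
    scatter_w += np.cov(members.T) * (len(members) - 1)

# Between-class scatter is whatever is left of the total scatter.
scatter_b = scatter_t - scatter_w
print(scatter_w)
print(scatter_b)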
Using the scatter matrices computed above, we can efficiently compute the eigenvectors. The eigenvectors obtained are sorted in descending order of their eigenvalues, and the first `n_components` of them are selected using a slicing operation. If `n_components` is equal to 2, we plot the two components, considering each vector as one axis. The snippet below shows the ordering trick in isolation.
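The reversal-and-slice step works because np.linalg.eigh returns eigenvalues (and their matching eigenvectors) in ascending order. A small self-contained example, using a symmetric matrix chosen purely for illustration:

import numpy as np

# eigh returns eigenvalues in ascending order, so the most important
# directions (largest eigenvalues) sit in the last columns.
A = np.diag([3.0, 1.0, 2.0])
eig_values, eig_vectors = np.linalg.eigh(A)
print(eig_values)  # [1. 2. 3.]

# Reverse the columns, then keep the first n_components columns,
# i.e. the eigenvectors with the largest eigenvalues.
n_components = 2
top = eig_vectors[:, ::-1][:, :n_components]
print(top.shape)   # (3, 2)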
Finally, we load the iris dataset, split it into training and test sets, and perform dimensionality reduction on the training data. The transform method also plots the projected points so we can visualize the result.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_iris

class LDA:
    def __init__(self, n_components=None):
        # Number of components to keep and a placeholder for the eigenvectors.
        self.n_components = n_components
        self.eig_vectors = None

    def transform(self, X, y):
        height, width = X.shape
        unique_classes = np.unique(y)
        num_classes = len(unique_classes)

        # Total scatter matrix: covariance of the full dataset scaled by (n - 1).
        scatter_t = np.cov(X.T) * (height - 1)

        # Within-class scatter: sum of each class's scaled covariance matrix.
        scatter_w = 0
        for i in range(num_classes):
            class_items = np.flatnonzero(y == unique_classes[i])
            scatter_w = scatter_w + np.cov(X[class_items].T) * (len(class_items) - 1)

        # Between-class scatter is the remaining part of the total scatter.
        scatter_b = scatter_t - scatter_w

        # Eigenvectors of pinv(scatter_w).dot(scatter_b); eigh returns them in
        # ascending eigenvalue order, so reverse the columns before slicing off
        # the first n_components.
        _, eig_vectors = np.linalg.eigh(np.linalg.pinv(scatter_w).dot(scatter_b))
        self.eig_vectors = eig_vectors
        pc = X.dot(eig_vectors[:, ::-1][:, :self.n_components])

        # When reducing to two components, plot the projected data.
        if self.n_components == 2:
            if y is None:
                plt.scatter(pc[:, 0], pc[:, 1])
            else:
                colors = ['r', 'g', 'b']
                labels = np.unique(y)
                for color, label in zip(colors, labels):
                    class_data = pc[np.flatnonzero(y == label)]
                    plt.scatter(class_data[:, 0], class_data[:, 1], c=color)
            plt.show()
        return pc

# Load the iris dataset, hold out a test split, and reduce the training data.
data = load_iris()
X, y = data.data, data.target
X_train, X_test, Y_train, Y_test = train_test_split(X, y, test_size=0.2)

LDA_object = LDA(n_components=2)
X_train_modified = LDA_object.transform(X_train, Y_train)
print("Original Data Size:", X_train.shape, "\nModified Data Size:", X_train_modified.shape)
The output of the code should look like the image given below. The iris dataset has 3 classes. Observe the 3 classes and their relative positioning in a lower dimension.
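As an optional sanity check, scikit-learn's built-in implementation (assuming it is installed) should produce a projection of the same shape and a similar class layout, possibly with flipped or rescaled axes. This snippet reuses X_train and Y_train from the code above.

from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Compare our projection's shape against scikit-learn's LDA.
sk_lda = LinearDiscriminantAnalysis(n_components=2)
X_train_sklearn = sk_lda.fit_transform(X_train, Y_train)
print(X_train_sklearn.shape)  # (120, 2), matching X_train_modified.shape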
Conclusion
In this article, we have looked at implementing Linear Discriminant Analysis (LDA) from scratch. I suggest you implement the same on your own and check if you get the same output. Another fun exercise would be to implement the same algorithm on a different dataset. I hope you enjoyed reading this tutorial as much as I enjoyed writing it. Happy learning.
Peer Review Contributions by: Adrian Murage