
    Introduction to Scikit-Learn in Python


    Machine learning has boomed over the past few years, and many graduate students and industry professionals have switched careers into data science or machine learning. An essential ingredient for getting comfortable in this field is knowing your libraries and dependencies.


    A significant chunk of your work goes into choosing the right approach to the problem and manipulating the dataset to fit that approach. This multi-part article introduces the reader to Scikit-Learn, a vital library used to build statistical models that make predictions.

    Prerequisites

    The reader is expected to be familiar with basic libraries like NumPy and Pandas, and with machine learning algorithms such as linear and logistic regression, support vector machines, decision trees, and boosting.

    For a better understanding, the reader is advised to go through the following articles on Python, NumPy, Matplotlib and SciPy.

    Table of contents

    1. Introduction
    2. Installing the Scikit-Learn Library
    3. Dataset Transformations using sklearn
    4. Scikit-Learn for Standardization
    5. Scikit-Learn for Normalization
    6. Scikit-Learn when Encoding Categorical Features
    7. Scikit-Learn when Filling Missing Values
    8. Conclusion
    9. Further Readings

    Introduction

    Scikit-Learn (often referred to as sklearn) provides a wide array of statistical models and machine learning algorithms. Unlike many scientific Python libraries, much of sklearn is written in Python itself rather than C. Despite this, it performs well because it relies on NumPy for high-performance linear algebra and array operations, with some performance-critical parts written in Cython.

    Scikit-Learn started as a Google Summer of Code project and has since made life easier for thousands of Python-centered data scientists worldwide. This part of the series introduces the library and concentrates on one aspect - dataset transformations, a crucial step to complete before building a prediction model.

    Installing the Scikit-Learn library

    Scikit-Learn itself depends on NumPy and SciPy; the following libraries are also commonly installed alongside it: Matplotlib, IPython, SymPy, and Pandas. Let's go ahead and install them from the terminal using pip (the same commands work on Windows, macOS, and Linux).

    pip install numpy
    pip install scipy
    pip install matplotlib
    pip install ipython
    pip install sympy
    pip install pandas
    

    Now that we've installed the dependent libraries, let us install Scikit-Learn.

    pip install scikit-learn
    

    Let's check if Scikit-Learn can be accessed using Python.

    import sklearn
    

    Yes, it works!
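
    If the import runs without errors, the installation succeeded. As a quick extra check, you can print the installed version (the exact number will depend on your environment):

    import sklearn
    
    # Print the installed Scikit-Learn version
    print(sklearn.__version__)
    
    # Output (varies by installation), e.g. 1.5.1
    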

    Dataset transformations using sklearn

    An essential component for building a machine learning algorithm is data. A lot of work goes into preparing the data so that it can be fed to the model. This is called data preprocessing. Data preprocessing tasks can range from a mere change in notation to changing a continuous variable to a categorical variable.

    The sklearn.preprocessing package provides various functions and classes to change the representation of certain variables so that they are better suited for the estimators further down the model pipeline. So, let's go ahead and look at the methods Scikit-Learn offers that help with data preprocessing and transformation. Before that, let's import the sklearn.preprocessing package.

    from sklearn import preprocessing
    

    Scikit-Learn for standardization

    Distance-based models are machine learning algorithms that use distances between data points to measure how similar they are. If two points are close together, one can infer that their feature values are similar, and hence, they can be grouped together. Standardization is an essential step for distance-based models so that one particular feature does not dominate the others.
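
    To see why scale matters, here is a minimal sketch (the income and age values below are made up for illustration) showing how a large-scale feature dominates Euclidean distance until the features are standardized:

    import numpy as np
    from sklearn.preprocessing import StandardScaler
    
    # Hypothetical data: two features on very different scales
    # (annual income in dollars, age in years)
    people = np.array([[50000, 25],
                       [52000, 60],
                       [90000, 26]], dtype=float)
    
    # Raw Euclidean distances: income dominates, so the large age gap
    # between person 0 and person 1 barely matters
    print(np.linalg.norm(people[0] - people[1]))   # ~2000.3
    print(np.linalg.norm(people[0] - people[2]))   # ~40000.0
    
    # After standardization, both features contribute comparably
    people_std = StandardScaler().fit_transform(people)
    print(np.linalg.norm(people_std[0] - people_std[1]))
    print(np.linalg.norm(people_std[0] - people_std[2]))
    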

    A data point x is standardized as follows:

    x_standardized = (x - µ) / σ

    Where µ is the mean of the distribution and σ is its standard deviation. Standardization centers the data around zero and scales it so that the mean is 0 and the standard deviation is 1.

    Note that this does not force the values into a fixed range; it only guarantees a mean of 0 and a standard deviation of 1, so most (but not necessarily all) values end up within a few units of zero. The reader is encouraged to go through this resource to get a better grip on why to standardize your features.

    The following are the temperatures (in Fahrenheit) recorded in Bloomington, Illinois, in the month of January:

    [33.2,33.1,33.1,33.0,32.9,32.9,32.8,32.8,32.7,32.7,32.6,32.6,32.6,32.6,32.5,32.5,32.5,32.6,32.6,32.6,32.7,32.7,32.8,32.9,33.0,33.1,33.2,33.4,33.5, 33.7, 33.9]
    

    Let us try to standardize this vector.

    # Import libraries
    from sklearn.preprocessing import StandardScaler
    import numpy as np
    
    # List of temperatures recorded in Bloomington
    temperatures_list = [33.2,33.1,33.1,33.0,32.9,32.9,32.8,32.8,32.7,32.7,32.6,32.6,32.6,32.6,
                        32.5,32.5,32.5,32.6,32.6,32.6,32.7,32.7,32.8,32.9,33.0,33.1,33.2,33.4,33.5, 33.7, 33.9]
    
    # Convert the list to a NumPy array
    temperatures_np = np.array(temperatures_list).reshape(-1,1)
    
    # Standardize the vector
    temperatures_std = StandardScaler().fit_transform(temperatures_np)
    
    # Print the means
    print("Mean Before Standardizing:",sum(temperatures_list)/len(temperatures_list))
    print("Mean After Standardizing:",sum(temperatures_std.reshape(1,-1)[0])/len(temperatures_std))
    
    # Output:
    # Mean Before Standardizing: 32.896774193548396
    # Mean After Standardizing: -2.6215588839535955e-15
    

    Notice that after standardizing the data, the mean is almost 0.

    In the example above, fit_transform() is used. There are two important methods here - fit() and fit_transform(). fit() computes the mean and standard deviation that are later used for scaling along the feature axis, while fit_transform() computes the mean and standard deviation, scales the vector, and returns a NumPy array of the scaled values. Therefore, standardization can be done either with fit() followed by transform(), or in one single optimized step with fit_transform().
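
    Here is a minimal sketch of the two-step approach using the same January temperatures. The separate transform() call is particularly useful when the same scaling has to be applied later to unseen data (the new_temperatures values below are made up for illustration):

    from sklearn.preprocessing import StandardScaler
    import numpy as np
    
    temperatures_np = np.array([33.2,33.1,33.1,33.0,32.9,32.9,32.8,32.8,32.7,32.7,32.6,32.6,32.6,32.6,
                                32.5,32.5,32.5,32.6,32.6,32.6,32.7,32.7,32.8,32.9,33.0,33.1,33.2,33.4,33.5,33.7,33.9]).reshape(-1,1)
    
    # Step 1: learn the mean and standard deviation
    scaler = StandardScaler()
    scaler.fit(temperatures_np)
    
    # Step 2: apply the learned scaling
    temperatures_std = scaler.transform(temperatures_np)
    
    # The fitted scaler can be reused on new data with the same scaling
    new_temperatures = np.array([32.4, 33.8]).reshape(-1,1)
    print(scaler.transform(new_temperatures))
    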

    Scikit-Learn for normalization

    Normalization (min-max scaling) is another feature scaling technique, used to transform the values of numeric attributes to a common scale (0 to 1). Normalization is typically used when the values do not follow a Gaussian distribution. (Rule of thumb - standardize if the attribute can be modeled as a Gaussian distribution; if not, normalize.)

    Normalization is important because it leaves the model no room to favor one attribute simply because of the scale of its values. This resource by DataScienceDojo explains normalization with an easy-to-understand example.

    A data point x is normalized as follows:

    x_normalized = (x - x_min) / (x_max - x_min)

    Where x_min and x_max are the minimum and maximum values of the feature, respectively.

    Let's try to normalize the data using the same set of values used in the previous example.

    # Import libraries
    from sklearn.preprocessing import MinMaxScaler
    import numpy as np
    
    # List of temperatures recorded in Bloomington
    temperatures_list = [33.2,33.1,33.1,33.0,32.9,32.9,32.8,32.8,32.7,32.7,32.6,32.6,32.6,32.6,
                        32.5,32.5,32.5,32.6,32.6,32.6,32.7,32.7,32.8,32.9,33.0,33.1,33.2,33.4,33.5, 33.7, 33.9]
    
    # Convert the list to a NumPy array
    temperatures_np = np.array(temperatures_list).reshape(-1,1)
    
    # Normalize the vector
    temperatures_norm = MinMaxScaler().fit_transform(temperatures_np)
    
    print("Minimum Value Before Normalization:",min(temperatures_np.reshape(1,-1)[0]))
    print("Maximum Value Before Normalization:",max(temperatures_np.reshape(1,-1)[0]))
    print("Minimum Value After Normalization:",min(temperatures_norm))
    print("Maximum Value After Normalization:",max(temperatures_norm))
    
    # Output:
    # Minimum Value Before Normalization: 32.5
    # Maximum Value Before Normalization: 33.9
    # Minimum Value After Normalization: [0.]
    # Maximum Value After Normalization: [1.]
    

    Scikit-Learn when encoding categorical features

    Almost every dataset has a feature (or more than one) that is categorical in nature. Consider a dataset containing the details of all the passengers of a certain airline. The possible categorical variables in the dataset could be the passenger's gender (male/female) and their seating choice (economy, business, first class). Estimators take in only numerical data, and hence, these categorical features have to be encoded.

    There are two common types of encoding - label encoding and one-hot encoding.

    To summarize with an example, assume a dataset of car information with the feature "Manufacturer", where there are three car manufacturers - Ford, Hyundai, and Tata.

    Label encoding would mean replacing every "Ford" with 0, every "Hyundai" with 1, and every "Tata" with 2. One-hot encoding would instead add three binary features, one per manufacturer, with 1 indicating that the car was made by that manufacturer and 0 otherwise.
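
    Here is a minimal sketch of that car example (the manufacturer list below is made up for illustration), showing both encodings side by side:

    import numpy as np
    from sklearn.preprocessing import LabelEncoder, OneHotEncoder
    
    manufacturers = ["Ford", "Hyundai", "Tata", "Ford", "Tata"]
    
    # Label encoding: one integer per manufacturer
    print(LabelEncoder().fit_transform(manufacturers))
    # Output - [0 1 2 0 2]
    
    # One-hot encoding: one binary column per manufacturer
    manufacturers_2d = np.array(manufacturers).reshape(-1,1)
    print(OneHotEncoder().fit_transform(manufacturers_2d).toarray())
    # Output
    # [[1. 0. 0.]
    #  [0. 1. 0.]
    #  [0. 0. 1.]
    #  [1. 0. 0.]
    #  [0. 0. 1.]]
    

    Now let's apply LabelEncoder to a slightly larger example - a list of rock bands.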

    from sklearn.preprocessing import LabelEncoder
    
    bands = ["Pink Floyd","Led Zeppelin","Pink Floyd","Foo Fighters","Queen","Queen","Pink Floyd","AC/DC","Foo Fighters","Led Zeppelin","Queen",
               "Nirvana","AC/DC","The Doors","Queen","Fleetwood Mac","Nirvana"]
    
    # Invoking an instance of Label Encoder
    label_encoding = LabelEncoder()
    
    # Fit the labels
    encoded = label_encoding.fit(bands)
    
    print(encoded.transform(bands))
    
    # Output - [5 3 5 2 6 6 5 0 2 3 6 4 0 7 6 1 4]
    

    Looking at the output, one can see that the feature has been encoded. But the bare numbers do not mean much on their own. Luckily, the classes_ attribute helps us interpret what these labels stand for.

    # Retrieve the learned classes and print each label with its band
    band_list = encoded.classes_
    
    for label, band in enumerate(band_list):
        print(label, band)
    
    # Output
    # 0 AC/DC
    # 1 Fleetwood Mac
    # 2 Foo Fighters
    # 3 Led Zeppelin
    # 4 Nirvana
    # 5 Pink Floyd
    # 6 Queen
    # 7 The Doors
    

    Note that the labels are assigned in alphabetical order of the band names, starting from 0.
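
    LabelEncoder also provides inverse_transform(), which maps integer labels back to the original strings - a handy sanity check. A minimal sketch, continuing the example above:

    # Decode a few integer labels back into band names
    print(encoded.inverse_transform([5, 3, 0]))
    
    # Output - ['Pink Floyd' 'Led Zeppelin' 'AC/DC']
    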

    If the band_list feature is one-hot encoded instead, each band is represented by a vector of 1's and 0's rather than a single integer label.

    import numpy as np
    from sklearn.preprocessing import OneHotEncoder
    
    band_list = np.array(["AC/DC","Fleetwood Mac","Foo Fighters","Led Zeppelin","Nirvana","Pink Floyd","Queen","The Doors"]).reshape(-1,1)
    
    # Invoking an instance of One Hot Encoder
    one_hot_encoding = OneHotEncoder()
    
    # Fit the labels
    encoded = one_hot_encoding.fit(band_list)
    
    print(encoded.transform(band_list).toarray())
    
    # Output
    # [[1. 0. 0. 0. 0. 0. 0. 0.]
    #  [0. 1. 0. 0. 0. 0. 0. 0.]
    #  [0. 0. 1. 0. 0. 0. 0. 0.]
    #  [0. 0. 0. 1. 0. 0. 0. 0.]
    #  [0. 0. 0. 0. 1. 0. 0. 0.]
    #  [0. 0. 0. 0. 0. 1. 0. 0.]
    #  [0. 0. 0. 0. 0. 0. 1. 0.]
    #  [0. 0. 0. 0. 0. 0. 0. 1.]]
    

    Scikit-Learn when filling missing values

    In most projects, a large share of the time and resources - often quoted as around 70% - goes into collecting and cleaning the dataset. When one deals with a real-life dataset, there are almost always missing values. Cleaning the dataset and handling missing data is important, as many machine learning algorithms cannot handle missing attribute values.

    This is where Scikit-Learn's impute module comes into play. A simple way to deal with missing values is to remove every row that contains one, but that would mean losing valuable-yet-incomplete data. A better way is to replace the missing values with values inferred from the known data - for example, the mean of that column.

    Missing values are encoded with NumPy's NaN (numpy.nan).

    The following are the temperatures (in Fahrenheit) recorded in Bloomington, Illinois, in the month of February:

    [33.2,32.8,32.9,33.0,nan,33.2,33.4,33.1,32.6,32.5,32.5,33.1,33.0,nan,32.7,32.7,32.6,nan,32.6,32.9,32.8,32.8,32.5,32.6,nan,32.6,32.7,32.7,33.5, 33.7,33.9].
    

    Let's try to replace the missing temperatures with their mean.

    import numpy as np
    from sklearn.impute import SimpleImputer
    
    # List of temperatures
    temperatures = [33.2,32.8,32.9,33.0,"NaN",33.2,33.4,33.1,32.6,32.5,32.5,33.1,33.0,"NaN",32.7,32.7,32.6,"NaN",32.6,32.9,32.8,
                    32.8,32.5,32.6,"NaN",32.6,32.7,32.7,33.5, 33.7,33.9]
    
    temperatures_cleaned = []
    
    # Replace the "NaN" strings with np.nan
    for temperature in temperatures:
        if temperature=="NaN":
            temperatures_cleaned.append(np.nan)
        else:
            temperatures_cleaned.append(temperature)
    
    temperatures_np = np.array(temperatures_cleaned).reshape(-1,1)
    
    # Create an instance of the imputer
    imputer_mean = SimpleImputer(missing_values=np.nan,strategy='mean')
    
    # Fit the imputer and transform the array according to the chosen strategy
    temperatures_mean = imputer_mean.fit_transform(temperatures_np)
    
    print(*temperatures_mean, sep=", ")
    
    # Output - [33.2], [32.8], [32.9], [33.], [32.91111111], [33.2], [33.4], [33.1], [32.6], [32.5], [32.5], [33.1], [33.], [32.91111111], 
    #          [32.7], [32.7], [32.6], [32.91111111], [32.6], [32.9], [32.8], [32.8], [32.5], [32.6], [32.91111111], [32.6], 
    #          [32.7], [32.7], [33.5], [33.7], [33.9]
    

    SimpleImputer provides four options for strategy - mean, median, most_frequent, and constant. Since mean was the chosen strategy, the NaN's were replaced with the mean of the temperatures (32.91111111).

    Had most_frequent been the chosen strategy:

    # Create an instance of the imputer
    imputer_most_frequent = SimpleImputer(missing_values=np.nan,strategy='most_frequent')
    
    # Fit the imputer and transform the array according to the chosen strategy
    temperatures_most_frequent = imputer_most_frequent.fit_transform(temperatures_np)
    
    print(*temperatures_most_frequent,sep=", ")
    
    # Output - [33.2], [32.8], [32.9], [33.], [32.6], [33.2], [33.4], [33.1], [32.6], [32.5], [32.5], [33.1], [33.], [32.6], [32.7], 
    #          [32.7], [32.6], [32.6], [32.6], [32.9], [32.8], [32.8], [32.5], [32.6], [32.6], [32.6], [32.7], [32.7], [33.5], [33.7], [33.9]
    

    ... the NaN's would be replaced with the most frequently occurring value (the mode of the feature) - 32.6.
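
    The median strategy works the same way; here is a minimal sketch (median is often preferred over mean when the feature contains outliers):

    # Create an instance of the imputer
    imputer_median = SimpleImputer(missing_values=np.nan,strategy='median')
    
    # Fit the imputer and transform the array according to the chosen strategy
    temperatures_median = imputer_median.fit_transform(temperatures_np)
    
    # The NaN's are replaced with the median of the known temperatures (32.8)
    print(*temperatures_median,sep=", ")
    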

    Opting for constant would require a value for the parameter fill_value.

    # Create an instance of the imputer
    imputer_constant = SimpleImputer(missing_values=np.nan,strategy='constant',fill_value=32.9)
    
    # Fit the imputer and transform the array according to the chosen strategy
    temperatures_constant = imputer_constant.fit_transform(temperatures_np)
    
    print(*temperatures_constant,sep=", ")
    
    # Output - [33.2], [32.8], [32.9], [33.], [32.9], [33.2], [33.4], [33.1], [32.6], [32.5], [32.5], [33.1], [33.], [32.9], [32.7], 
    #          [32.7], [32.6], [32.9], [32.6], [32.9], [32.8], [32.8], [32.5], [32.6], [32.9], [32.6], [32.7], [32.7], [33.5], [33.7], [33.9]
    

    Conclusion

    This article was a brief dive into the multi-faceted world of Scikit-Learn. It is a package every data scientist should understand well and get hands-on experience with. This article aimed to make the reader comfortable with data manipulation using sklearn and should serve as a good starting point for exploring the library further.

    Happy Coding!

    Further readings

    1. Official Docs

    2. Medium

    3. Tutorialspoint

    4. Machine Learning Mastery

    5. Data Camp


    Peer Review Contributions by: Lalithnarayan C

    Published on: Jan 25, 2021
    Updated on: Jul 12, 2024