Genetic Programming Models Using TPOT
Genetic programming is a technique by which models and programs evolve over time. The model starts with poor or unfit parameters and gradually evolves into a superior model that finds the optimal solution. <!--more--> It does this in an automated way to reduce the time spent building and evolving models. It automates the stages of machine learning, from dataset pre-processing to model building.
During the automation process, it uses machine learning pipelines to find the best parameters. In doing so, it produces an optimal machine learning model.
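For context, here is a minimal, illustrative scikit-learn pipeline: a sketch of the kind of structure TPOT searches over. The specific steps chosen here, `StandardScaler` and `LogisticRegression`, are our own example, not TPOT's output:

```python
# A minimal scikit-learn pipeline: a pre-processing step chained to a model.
# TPOT explores many such pipelines and their parameters automatically.
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

pipeline = Pipeline([
    ("scaler", StandardScaler()),     # pre-processing step
    ("model", LogisticRegression()),  # estimator step
])
```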
TPOT automates the machine learning process logically and intelligently. It does this by exploring all machine learning parameters and pipelines. It only uses the ones that will give the best results.
In this tutorial, we will use two approaches to build a machine learning model. In the first approach, we will build models without genetic programming, following all the traditional stages from dataset pre-processing to model building.
In the second approach, we will use TPOT to build our model. It applies all the concepts of genetic programming, and the best algorithm survives.
Table of contents
- Prerequisites
- Getting started with TPOT
- Model building using individual algorithms
- Loading machine learning packages
- Importing exploratory data analysis packages
- Checking for missing values
- Convert the species column
- Adding labels and features
- Building model using logistic regression
- Building using random forest classifier algorithm
- Model building using TPOT
- Exploring TPOT methods and attributes
- Splitting dataset
- Initializing our TPOT application
- Fitting
- Making predictions
- Conclusion
- References
Prerequisites
The reader must:
- Know Python programming.
- Know how to work with machine learning models.
- Be familiar with machine learning lifecycle processes.
- Know how to use Google Colab. We will use Google Colab instead of Jupyter Notebook in this tutorial.
NOTE: Google Colab provides a Graphics Processing Unit (GPU) and a Tensor Processing Unit (TPU), which speed up processing. TPOT is computationally intensive, so it benefits from this extra processing power.
Getting started with TPOT
TPOT is a powerful Python library used to automate the machine learning process. TPOT uses genetic programming.
TPOT uses three concepts during the genetic programming process.
- Selection: TPOT selects the algorithm that will give the best results.
- Crossover: After selecting the algorithms, these algorithms are cross-bred to find a hybrid solution.
- Mutation: Over time these algorithms change and become more advanced. This yields the optimal solution.
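To make these three concepts concrete, here is a minimal, purely illustrative sketch of a genetic loop. The individuals here are plain lists of numbers and the fitness function is a toy; TPOT's real individuals are whole machine learning pipelines, but the loop follows the same idea:

```python
# An illustrative sketch of selection, crossover and mutation.
import random

def fitness(individual):
    # Toy fitness: higher is better when genes are close to 5.
    return -sum((gene - 5) ** 2 for gene in individual)

def select(population, k=2):
    # Selection: keep the k best-scoring individuals.
    return sorted(population, key=fitness, reverse=True)[:k]

def crossover(parent_a, parent_b):
    # Crossover: splice the front of one parent onto the back of the other.
    cut = random.randrange(1, len(parent_a))
    return parent_a[:cut] + parent_b[cut:]

def mutate(individual, rate=0.2):
    # Mutation: randomly nudge some genes.
    return [g + random.uniform(-1, 1) if random.random() < rate else g
            for g in individual]

population = [[random.uniform(0, 10) for _ in range(4)] for _ in range(10)]
for generation in range(20):
    parents = select(population)
    children = [mutate(crossover(*parents)) for _ in range(8)]
    population = parents + children

print(select(population, k=1)[0])  # the best individual found
```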
TPOT is built on top of powerful Python libraries such as NumPy, scikit-learn, pandas and joblib. This makes it powerful for genetic programming and automation.
[Scikit-learn] contains the following algorithms for classification and regression.
- Support Vector Machines.
- Stochastic Gradient Descent.
- K-Nearest Neighbors.
- Naive Bayes.
- Decision Trees.
- Random Forest.
- Logistic Regression.
- Linear Regression.
TPOT iterates through these algorithms and finds the best one. The one with the highest accuracy score is the one that is chosen. It can also combine two or more algorithms to come up with a hybrid algorithm.
To see the power of TPOT, we start by using individual algorithms to build the model.
We will use two algorithms, logistic regression and random forest, when building the model. We will then compare the accuracy score of each algorithm to see which one is better.
In the next section, we will use TPOT to build a machine learning model. TPOT explores all the available classification algorithms and pipelines, and the one with the highest accuracy score is chosen.
Model building using individual algorithms
We will use the iris dataset to train our model. The model classifies the flower species based on the input features. To get this dataset, click here.
A snip of the dataset is shown in the image below.
Initialize the dataset URL
We have to initialize the dataset URL. This enables us to download the dataset into our working directory. Note that `pandas` cannot read a Google Drive *view* link directly, so we assume the direct-download form of the same file ID works for this shared CSV:
file_id = "1gmfoaF14KhMybNThcaU72edgbFm7DO9l"  # ID taken from the shared link
dataset_url = "https://drive.google.com/uc?id=" + file_id
Loading machine learning packages
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
In the above code snippet, we have imported the following:
- `train_test_split`: Used to split the dataset into a `train_set` and a `test_set`. It is used during the training and testing phases.
- `LogisticRegression`: We will use this algorithm when building our model.
- `RandomForestClassifier`: This is the second algorithm used to build our model.

NOTE: We will use the two algorithms `LogisticRegression` and `RandomForestClassifier` and compare their accuracy scores. Later, we will use TPOT to combine all the algorithms when building our model and find the best one.
Importing exploratory data analysis packages
These packages are helpful when it comes to data analysis and manipulation.
import pandas as pd
import numpy as np
We will use `pandas` to import and read our dataset.
df = pd.read_csv(dataset_url)
Dataset columns and rows
Use the following command to view the first few rows and the columns of the dataset.
df.head()
Checking for missing values
We have to check for missing values. Our dataset should not contain any missing values.
df.isnull().sum()
The output is shown below:
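If the check did report missing values, a minimal way to drop the affected rows before training could look like this (the standard iris data normally has none):

```python
# Drop any rows that contain missing values (not needed if the check shows zeros).
df = df.dropna()
```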
Convert the species column
We need to convert the `species` column to numeric values. These numeric values form a dictionary of labels used for prediction.
Labels are the target or the output variable. This is what the model wants to predict.
The labels will be as follows:
- `0` for `setosa`
- `1` for `versicolor`
- `2` for `virginica`
data_set = {species: index for index, species in enumerate(df['species'].unique())}  # e.g. {'setosa': 0, ...}
Output after conversion:
We then add the dictionary labels into our dataset. We will add the three-label dictionary in a new column called `new_label`.
df['new_label'] = df['species'].map(data_set)
To see new columns, run this code snippet:
df.head()
The output will be a dataset with an additional column as shown.
We now need to set our model features and labels as `xfeatures` and `ylabels`.
Adding labels and features
`xfeatures` are the independent variables in our dataset that act as the inputs for our model. Our features are `sepal_length`, `sepal_width`, `petal_length` and `petal_width`.
`ylabels` will be used as the output when making a prediction. Our labels are `0`, `1` and `2`.
xfeatures = df[['sepal_length', 'sepal_width', 'petal_length', 'petal_width']]
ylabels = df['new_label']
After setting features and labels, we can now start building our model.
Building model using logistic regression
Import `cross_val_score`. It is used to test the effectiveness of our model and its ability to make new predictions.
from sklearn.model_selection import cross_val_score
We can now use `cross_val_score` with the `LogisticRegression` algorithm. We will set the number of folds to `10`. The model will iterate ten times and report the average accuracy after the ten iterations. Thus, the model learns from the dataset and improves over time. We build our model using the `xfeatures` and `ylabels` we created earlier.
cv_scores = cross_val_score(LogisticRegression(),xfeatures,ylabels,cv=10)
The model will iterate ten times and output the accuracy score after each iteration. To see the accuracy scores, use the following command.
cv_scores
The accuracy scores outputs:
To get the average score from the above ten iterations, we use the following command.
print(np.mean(cv_scores))
The average score:
0.9707983
Building using random forest classifier algorithm
We will use Random Forest as the second algorithm to model the data. We can then compare the two algorithms' accuracy scores.
rf_cv_scores = cross_val_score(RandomForestClassifier(),xfeatures,ylabels,cv=10)
We use the same ten folds to train this model. To see the scores of this algorithm after each iteration:
rf_cv_scores
Output:
To get the average score of this algorithm:
print(np.mean(rf_cv_scores))
Output:
0.9666666666666666
`LogisticRegression` and `RandomForestClassifier` give accuracy scores of `0.9707983` and `0.9666666666666666` respectively. This shows that `LogisticRegression` is better.
A user would choose `LogisticRegression` when building the model. Still, this might not be the best one because we have only compared two algorithms.
The process of building models using different algorithms is tiring. That's why TPOT is the best solution when dealing with many algorithms.
TPOT combines all the available algorithms for classification and finds the best one. Thus, it automates the process of model building using genetic programming and saves a lot of time from manually comparing all the available algorithms.
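To illustrate why manual comparison gets tiring, here is what the loop might look like if we extended it to a few more scikit-learn classifiers. The extra classifiers chosen here are our own illustration, and the code reuses the `xfeatures` and `ylabels` created earlier:

```python
# Manually comparing several classifiers: the tedious process TPOT automates.
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
import numpy as np

models = {
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "RandomForestClassifier": RandomForestClassifier(),
    "KNeighborsClassifier": KNeighborsClassifier(),
    "DecisionTreeClassifier": DecisionTreeClassifier(),
}
for name, model in models.items():
    scores = cross_val_score(model, xfeatures, ylabels, cv=10)
    print(name, np.mean(scores))  # average 10-fold accuracy per algorithm
```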
In the next section, we will see how to build a machine learning model using TPOT.
Model building using TPOT
To use TPOT, we first install it into our machine using the following command.
!pip install tpot
We then import TPOT as shown.
import tpot
Exploring TPOT methods and attributes
We explore the package to know the available methods we can use in model building.
dir(tpot)
It will list all the available methods and attributes found in TPOT.
In the output, we see the available methods. The `TPOTClassifier` method is what we are interested in. We use `TPOTClassifier` when building our classifier model. Let's split our dataset before we begin.
Splitting dataset
We split our dataset into a `train_set` and a `test_set`:
- 70% of our data is the `train_set`.
- 30% is the `test_set`.
x_train,x_test,y_train,y_test = train_test_split(xfeatures,ylabels,test_size=0.3,random_state=42)
In the code above:
- `x_train` and `y_train` are used in the training phase.
- `x_test` and `y_test` are used in the testing phase.
Initializing our TPOT application
We initialize our TPOT application using the `TPOTClassifier()` method. First, we pass the following parameters:
- `generations=5`: The number of iterations TPOT will run while searching for an optimal pipeline. Here we set it to 5.
- `random_state=42`: Makes the dataset split and the pipeline search reproducible.
- `verbosity=2`: Prints progress and information about the TPOT operation.
from tpot import TPOTClassifier  # needed so TPOTClassifier can be called directly

tpot = TPOTClassifier(generations=5, verbosity=2, random_state=42)
After initializing TPOT, we fit our model into our dataset.
Fitting
We fit `x_train` and `y_train` using the `tpot` object's `fit()` method. This allows `tpot` to learn from the `train_set` dataset. It understands patterns and gains knowledge, which it eventually uses to make predictions.
tpot.fit(x_train,y_train)
We will have an optimization process. TPOT will iterate 5 times to find the optimal pipeline.
NOTE: Make sure you use Google Colab for this process to run faster. By the end of the five iterations, we will get an output with the best algorithm to use for model building. TPOT will also give us the highest accuracy scored by our model.
This is useful since it saves the user's time by automating the whole process. During this optimization process, TPOT uses the concepts of genetic programming. As a result, it eventually finds the best algorithm.
The best algorithm is kept, and the poorly performing ones are dropped. After the optimization process, the output is as shown.
The best algorithm chosen is `KNeighborsClassifier`. It was automatically selected from all the available classification algorithms, which saves the time of comparing the algorithms by hand.
TPOT also tells us the exact parameters used to achieve this result. For example, the chosen pipeline uses `n_neighbors=20`.
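Once fitted, the TPOT object can also score the held-out test set and export the winning pipeline as a standalone Python script. This is a small follow-up sketch; the file name below is our own choice:

```python
# Evaluate the optimized pipeline on the held-out test set.
print(tpot.score(x_test, y_test))

# Export the best pipeline as a standalone script (file name is arbitrary).
tpot.export('tpot_iris_pipeline.py')
```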
The accuracy scores after the `5` iterations are as follows: `0.9714285714285713`, `0.9714285714285715`, `0.9714285714285715`, `0.9714285714285715` and `0.9714285714285715`.
TPOT takes the highest score as the model accuracy. Thus, the final model accuracy will be `0.9714285714285715`.
Making predictions
We can use our optimized model to make predictions. The input sample contains flower lengths and widths.
To predict, the input must be in an array, as shown. The `reshape(1, -1)` call turns the array into a single row, which is the shape scikit-learn expects for one sample.
example = np.array([6.3,3.5,5.6,2.4]).reshape(1,-1)
We can now make predictions using the `predict()` method, as shown.
tpot.predict(example)
The prediction outcome is as shown.
array([2])
The outcome is `2`, which represents `virginica`. This shows that we can make accurate predictions using our TPOT application.
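To report the species name instead of the numeric label, we can invert the `data_set` dictionary we built earlier. A small sketch:

```python
# Invert the label dictionary to map numeric predictions back to species names.
index_to_species = {index: species for species, index in data_set.items()}
print(index_to_species[tpot.predict(example)[0]])  # expected: 'virginica'
```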
Conclusion
In this tutorial, we learned about genetic programming. Genetic programming allows the model to evolve, which enables us to get the best algorithm for the model. We also learned about TPOT, an important library used to automate machine learning through genetic programming.
We started by preparing our dataset. We then used this dataset to build a model using two algorithms. We then compared the two algorithms to find the best one.
Finally, we used TPOT to find the best algorithm to use when building a machine learning model. Through genetic programming, the best algorithm was `KNeighborsClassifier`.
This tutorial shows us how TPOT is a useful tool. It also shows us how genetic programming can be applied to find the optimal algorithm.
To get the Google Colab notebook for this tutorial, click here.
Happy coding!
References
- Implementation for this tutorial
- Scikit-learn Documentation
- Genetic programming
- Pandas Documentation
- NumPy Documentation
- TPOT Documentation
Peer Review Contributions by: Lalithnarayan C