Classification of GitHub Issues using Machine Learning
Classification of GitHub issues involves analyzing issues and assigning them labels using machine learning models. GitHub provides in-built labels such as bug, help wanted, enhancement, and question. We can also create custom labels such as revision needed.
<!--more-->
These labels help reviewers know the status and progress of a given open GitHub issue, which improves collaboration.
To classify GitHub issues, we will train a machine learning model on a dataset of labeled GitHub issues. The model will learn from this dataset and use what it learns to assign a label to a new issue. For example, if an issue asks for a revision or an enhancement, the model will assign either the revision needed or the enhancement label.
In this tutorial, we will implement the model using the Scikit-Learn library. We will then test the model using the Streamlit GitHub repository to see if the model can make accurate predictions.
Table of contents
- Prerequisites
- GitHub issues dataset
- Loading dataset
- Dataset preparation
- Creating data frames
- Adding column names
- Concatenating data frames
- Convert dataset to CSV
- Text cleaning
- Adding features and labels
- Dataset splitting
- Importing machine learning packages
- Importing Pipeline package
- Pipeline stages
- Pipeline fitting
- Calculating the accuracy score
- Making predictions
- Conclusion
- References
Prerequisites
To follow along easily, a reader should:
- Know Python programming.
- Have some knowledge about machine learning models.
- Know some concepts of natural language processing.
- Know how to use some of the Scikit-learn algorithms in building machine learning models.
- Use a Google Colab notebook for easy code reproducibility.
GitHub issues dataset
We will use the GitHub issues dataset to train our model. The dataset contains GitHub issues from popular GitHub repositories with their respective labels. We need to download the dataset so that we can use it.
You can view the dataset here.
To download the dataset, run the following command:
!wget https://tickettagger.blob.core.windows.net/datasets/dataset-labels-top3-30k-real.txt
NOTE: We will be using Google Colab to run the code snippets.
A snippet of the downloaded dataset is shown below.
From the image above, we can see that the dataset is in plain-text format. It is unorganized and not correctly formatted, so our model will not be able to understand it easily. We, therefore, need to prepare the dataset and format it correctly before using it.
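To get a feel for the raw format before any processing, you can print the first few lines of the downloaded file (a minimal sketch; the file name matches the one downloaded above):
# Peek at the first three raw lines of the downloaded dataset
with open("dataset-labels-top3-30k-real.txt") as f:
    for _ in range(3):
        print(f.readline().strip())
Each line should begin with a label marker such as __label__bug, followed by the text of the issue. These markers are what we will split on when preparing the dataset.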
Before we can prepare our dataset, we will load the dataset into our working notebook.
Loading dataset
To load the dataset, we will use the pandas library. Let's import pandas as shown:
import pandas as pd
Pandas helps us load the dataset and manipulate the data.
To load the downloaded dataset, use the following code:
df = pd.read_csv("dataset-labels-top3-30k-real.txt",header=None)
To see the structure of our dataset, use the following code:
df.head()
The output of the dataset is shown below:
Let's now prepare this dataset.
Dataset preparation
We will start by extracting the labels from the dataset. This dataset has three labels: enhancement, question, and bug.
To extract the labels, use this code:
df_new = df[0].str.split(r'(__label__enhancement)|(__label__bug)|(__label__question)',expand=True)
The code above splits each row on the three label markers in our dataset. To see the new data frame with the labels separated, use this code:
df_new.head()
The output is shown below:
From the image above, the dataset has columns numbered from 0 to 4.
- The column numbered 1 represents GitHub issues with __label__enhancement.
- The column numbered 2 represents GitHub issues with __label__bug.
- The column numbered 3 represents GitHub issues with __label__question.
- The column numbered 4 represents the actual GitHub issues (the issue titles).
Creating data frames
Next, we need to format the dataset further by creating a separate data frame for each label. This gives us three data frames, each organizing the issues for one label into rows and columns.
To create the three data frames, run this code:
enh_df = df_new[df_new[1] == '__label__enhancement'][[1,4]]
bug_df = df_new[df_new[2] == '__label__bug'][[2,4]]
question_df = df_new[df_new[3] == '__label__question'][[3,4]]
To have a look at the structure of the three data frames, use the following code:
__label__enhancement data frame
enh_df.head()
This code outputs all the GitHub issues with the __label__enhancement label, as shown below:
__label__bug data frame
bug_df.head()
This code outputs all the GitHub issues with the __label__bug label, as shown below:
__label__question data frame
question_df.head()
This code outputs all the GitHub issues with the __label__question label, as shown below:
The next step in the dataset preparation process is to add descriptive column names.
Adding column names
Currently, the columns of the data frames are numbered. Instead, we need to add descriptive column names that are more human-readable.
Each data frame has two columns: the first column will be named label and the second column will be named description.
The label column represents the labels, while the description column represents the GitHub issues.
To add the column names, use this code:
enh_df.columns = ['label','description']
bug_df.columns = ['label','description']
question_df.columns = ['label','description']
We have now prepared and formatted each data frame separately. We can concatenate the three data frames into a single data frame that is correctly labeled and easy for the model to use.
Concatenating data frames
To concatenate the data frames, use this code:
df = pd.concat([enh_df, bug_df, question_df])
To see the structure of this new dataset, use this code:
df.head()
The output is shown below:
We also need to remove the prefix __label__ from the labels. This makes the label column more readable.
To remove the __label__ prefix, use this code:
df['label'] = df['label'].str.replace('__label__', '')
To see the new dataset with the __label__ prefix removed, run this code:
df.head()
The output is shown below:
Finally, we will convert the dataset from text format into CSV format.
Convert dataset to CSV
A comma-separated values (CSV) file is easier to work with and load than the raw text file.
df.to_csv("github-issues-dataset-labels.csv")
The code above saves our dataset in CSV format as github-issues-dataset-labels.csv. We now need to clean the dataset by removing stopwords and converting the text to lowercase.
Text cleaning
Text cleaning involves removing stopwords and converting the text to lowercase.
Stopwords are the most commonly used words in a given language. Because they are so common, they carry little information during model training. Removing them allows the model to focus on the words that add value during training.
We convert all the text to lowercase to make it uniform. To perform the text cleaning, we will use NeatText.
NeatText is a Python library with in-built functions that help with text cleaning. Let's install NeatText as shown:
!pip install neattext
After installing NeatText, let's import the functions that will be used for text cleaning:
import neattext.functions as nfx
To remove the stopwords and convert the text into lower case, use the following code:
df['description_clean'] = df['description'].apply(lambda x: nfx.remove_stopwords(str(x).lower()))
In the code above, the nfx.remove_stopwords function removes the stopwords and the lower method converts the text to lowercase. The cleaned text is stored in a new description_clean column, and the dataset is now ready for use.
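To see what this cleaning step does, here is a minimal sketch that applies the same two operations to a single made-up example string:
import neattext.functions as nfx

# A made-up example issue title used only for illustration
sample = "The app cannot set the Plotly theme because Streamlit overrides it"

# Lowercase the text, then strip common stopwords such as "the" and "it"
print(nfx.remove_stopwords(sample.lower()))
The remaining words are the ones that carry most of the meaning, which is what we want the model to learn from.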
Adding features and labels
Another crucial step in machine learning is to add features and labels to our dataset.
Features are all the columns that are used as input for our model during training. Labels are the columns that are used as the output of the model during prediction.
Our feature will be the description_clean column, and the label will be the label column.
Xfeatures = df['description_clean']
ylabels = df['label']
The code above selects the features and labels. The next step is to split our dataset into two sets: one for training and one for testing.
Dataset splitting
To split our dataset, let's import the machine learning package that will be used for dataset splitting.
from sklearn.model_selection import train_test_split
Let's use the train_test_split function to split our dataset into two sets.
x_train,x_test,y_train,y_test = train_test_split(Xfeatures,ylabels,test_size=0.3,random_state=42)
In the code above, we have specified test_size=0.3. This ratio splits our dataset so that 70% is used for training and 30% is used for testing.
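As a quick, optional sanity check, you can confirm the sizes of the two sets:
# The train set should hold roughly 70% of the samples and the test set roughly 30%
print(len(x_train), len(x_test))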
After splitting the dataset, we can start building our model. To build our model, let's import the machine learning packages that we will use.
Importing machine learning packages
Let's import all the important machine learning packages using the following code:
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
from sklearn.feature_extraction.text import CountVectorizer
Let's explore the function of each imported package as follows:
DecisionTreeClassifier
This is the Scikit-learn algorithm that we will use to build our GitHub issue classification model. We have chosen this algorithm because it handles multi-class classification problems such as ours and produces a model that is easy to interpret.
accuracy_score
We use this function to get the accuracy score of the model after training. The accuracy score is the fraction of test samples that the model classifies correctly.
CountVectorizer
This package enables machine learning models to work with text. Machine learning models cannot use raw text directly; they work with numbers. CountVectorizer converts the raw text into vectors of numbers that represent the original text.
To further understand how CountVectorizer works behind the scenes to convert raw text into vectors of numbers, click here.
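As a quick illustration of what CountVectorizer produces, here is a minimal sketch that vectorizes two made-up issue titles:
from sklearn.feature_extraction.text import CountVectorizer

# Two made-up issue titles used only for illustration
docs = ["fix login bug", "add dark mode feature"]

cv = CountVectorizer()
X = cv.fit_transform(docs)

# The vocabulary learned from the two titles
print(cv.get_feature_names_out())

# Each row is one title, each column the count of one vocabulary word
print(X.toarray())
Each issue title becomes a row of word counts over a shared vocabulary, which is the kind of numeric input our classifier needs.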
We will use these packages to build our model. To make the process of building our machine learning model easier and faster, we will use the Pipeline package.
The Pipeline package automates all the stages and processes used in building the model. To use the Pipeline package, we will import it from Scikit-learn.
Importing Pipeline package
We import the package using the following code:
from sklearn.pipeline import Pipeline
To automate the process of model building with the Pipeline package, we initialize all the stages used in building the model. Once initialized, the stages run automatically in sequence.
We have two stages as follows:
- CountVectorizer converting the input text to vectors of numbers.
- Using the DecisionTreeClassifier algorithm to train the model.
We can now initialize the two stages using the following code:
Pipeline stages
pipe_dt = Pipeline(steps=[('cv',CountVectorizer()),('dt',DecisionTreeClassifier())])
The code above initializes the two stages in sequential order. We can now fit the pipeline to our train set so that it learns from the training data.
Pipeline fitting
We fit the pipeline as follows:
pipe_dt.fit(x_train,y_train)
This process will train our model and produce the following output:
Our model is now trained using the machine learning pipeline. We can now calculate the accuracy score of this model.
Calculating the accuracy score
The accuracy score is the fraction of predictions the model gets right. The higher the accuracy score, the better the chance of the model making an accurate prediction.
Let's calculate the accuracy score using the following code:
pipe_dt.score(x_test,y_test)
After running this code, the accuracy score is shown in the output below:
0.866
This represents 86.6%. With this accuracy score, our model has a good chance of making accurate predictions.
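The accuracy_score function we imported earlier can compute the same value explicitly from the model's predictions (a minimal sketch):
# Predict labels for the test set and compare them with the true labels;
# this should match the value returned by pipe_dt.score above
y_pred = pipe_dt.predict(x_test)
print(accuracy_score(y_test, y_pred))
We will now use this model to make predictions.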
Making predictions
In making predictions, we test our model's ability to classify GitHub issues using the three labels. The model will predict whether a given GitHub issue is an enhancement, a bug, or a question.
To test the model, we will use the Streamlit GitHub repository to see if the model can make accurate predictions. This repository has different issues that our model can classify.
We can extract some of the issues (issue title) from the repository for the model to make predictions.
issue1 = "st.file_uploader returns HTTP code 400 with invalid session_id when deployed on Kubernetes"
issue2 = "Cannot set Plotly theme because Streamlit overrides user values"
After extracting the two issues, let's use the model to make predictions. The model will analyze each issue title and assign a label to it.
We will use the following code to predict the first issue.
pipe_dt.predict([issue1])
The prediction output is shown below:
array(['bug'], dtype=object)
The model has correctly predicted the issue from the Streamlit GitHub repository as a bug.
Let's make another prediction.
pipe_dt.predict([issue2])
The prediction output is shown below:
array(['enhancement'], dtype=object)
The model has correctly predicted the issue as an enhancement.
These two predictions show that our model can classify GitHub issues accurately. The model can be further deployed and used in production to classify GitHub issues.
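If you want to reuse the trained pipeline outside this notebook, one common approach is to serialize it with joblib. This is a minimal sketch; the file name and the new issue title are only examples:
import joblib

# Save the trained pipeline to disk
joblib.dump(pipe_dt, "github_issue_classifier.joblib")

# Later, load it back and classify a new (made-up) issue title
model = joblib.load("github_issue_classifier.joblib")
print(model.predict(["Add support for custom themes"]))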
Conclusion
In this tutorial, we have learned how to classify GitHub issues using machine learning. This involves analyzing open GitHub issues using models and assigning labels.
We started with dataset preparation, ensuring the dataset was correctly formatted. After this stage, we cleaned the dataset by removing stopwords and converting the text to lowercase.
Finally, we used the clean dataset to build our model. After training the model, we used it to make predictions. Our model was able to predict whether a given GitHub issue is an enhancement, a bug, or a question. This model is ready to be deployed and used in production.
To get the GitHub issues classification model, click here.
References
- Google Colab notebook
- Streamlit GitHub repository
- Scikit-learn official documentation
- Text cleaning with NeatText
- GH Archive
Peer Review Contributions by: Srishilesh P S