Building an NLP classifier: Example with Firefox Issue Reports

DistilBERT vs LSTM, with data exploration

Teemu Kanstrén
24 min read · May 15, 2021
This is what the article is about. Robot photo by Phillip Glickman on Unsplash. Developer persona, beetle, and fox images by Open Clipart; tortoise by creozavr, ladybug by Prawny, bee by gustavorezende, other beetle by Clker; all of the latter on Pixabay. Composition by author.

Machine learning (ML) techniques for Natural Language Processing (NLP) offer impressive results these days. Libraries such as Keras, PyTorch, and HuggingFace NLP make the application of the latest research and models in the area a (relatively) easy task. In this article, I implement and compare two different NLP based classifier model architectures using the Firefox browser issue tracker data.

I previously built a similar issue report classifier. Back then, I found the deep-learning-based LSTM (Long Short-Term Memory) model architecture performed very well for the task. I later added the Attention mechanism on top of the LSTM, improving the results. This LSTM with Attention is the first model type I chose for this article.

The second model type is Transformers. Transformers have become very popular in NLP over the past few years, and their popularity has spawned many variants. In this article, I wanted to get a feel for how they compare to the LSTM approach I applied before.

I use the HuggingFace (HF) DistilBERT as the Transformer model. For the LSTM, I use Keras with its LSTM and open-source Attention layers. I train them both to predict what component an incoming Firefox bug report should be assigned to. The idea would be to assist in bug report triaging tasks.

Getting the Data

As said, I am using the issue tracker for the Firefox browser as the source of my data. Since I am in no way affiliated with Mozilla or the Firefox Browser, I just used their Bugzilla REST API to download it. In a scenario where you would actually work within your company or with a partner organization, you could likely just ask internally for a direct dump from the database. But sometimes you just have to find a way. In this case it was quite simple, with a public API to support downloads. You can find the downloader code on my Github. It is just a simple script to download the data in chunks.
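For a rough idea, a chunked download along these lines is possible with the public Bugzilla REST API. This is a sketch based on the public API documentation, not my exact script; the field names and parameters are assumptions, and the description text would need a separate per-bug request (it is stored as the first comment), which I leave out here:

```python
# Hedged sketch of chunked downloading via the public Bugzilla REST API;
# not the author's exact downloader script.
import json
import time
import requests

URL = "https://bugzilla.mozilla.org/rest/bug"
FIELDS = "id,summary,component,creation_time,status,resolution,severity"

def download_chunk(offset, limit=500):
    params = {"product": "Firefox", "include_fields": FIELDS,
              "limit": limit, "offset": offset}
    resp = requests.get(URL, params=params)
    resp.raise_for_status()
    return resp.json()["bugs"]

bugs, offset = [], 0
while True:
    chunk = download_chunk(offset)
    if not chunk:
        break
    bugs.extend(chunk)
    offset += len(chunk)
    time.sleep(1)  # be polite to the public API

with open("firefox_bugs.json", "w") as f:
    json.dump(bugs, f)
```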

Exploring and Cleaning the Data

The overall data I use contains the following attributes for each issue:

  • ID: A unique issue number, apparently across all Firefox/Mozilla product issues (not just the Browser)
  • Summary: A summary description of the issue. This is one of the text fields I will use as features for the classifier.
  • Description: A longer description of the issue. This is the other text field I will use as features for the classifier.
  • Component: The software component in the Firefox Browser architecture that the issue was assigned to. This will be my prediction target, and thus the label to train the classifier on.
  • Duplicate issues number, if any
  • Creator: email, name, user id, etc.
  • Severity: trivial to critical/blocker, or S1 to S4. Or some random values. Seems like a mess when I looked at it :).
  • Last time modified (change time)
  • Keywords: apparently, you can pick any keywords you like. I found 1680 unique values.
  • Status: One of resolved, verified, new, unconfirmed, reopened, or assigned
  • Resolution: One of Duplicate, fixed, worksforme, incomplete, invalid, wontfix, expired, inactive, or moved
  • Open/Closed status: 18211 open, 172552 closed at the time of my download.

Selecting Features for the Predictor

My goal was to build a classifier assigning bug reports to components, based on their natural language descriptions. The first field I considered for this was naturally the description field. However, another similarly interesting field for this purpose is the summary field.

By looking at the values, I can quite simply see there are some issue reports with no (or empty) description:

There are 2973 issue reports with missing description. Image by author.

I considered an option of using the summary field as another extra feature set. Almost all issue reports have a summary:

There are 4 issue reports with missing summary. Image by author.
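Such a check boils down to a couple of lines of pandas. Here is a sketch; the dataframe and the column names summary and description are assumptions based on the attribute list above:

```python
# Sketch of checking for missing summary/description texts.
import pandas as pd

df = pd.read_json("firefox_bugs.json")  # assumed location of the downloaded data

missing_description = df["description"].isna() | (df["description"].str.strip() == "")
missing_summary = df["summary"].isna() | (df["summary"].str.strip() == "")
print(missing_description.sum())  # 2973 in my dataset
print(missing_summary.sum())      # 4 in my dataset
```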

I would need a way to build a classifier so each issue report could be classified, meaning they all should have the required features. Clearly description or summary alone is not sufficient here. One option I considered was to use the summary if the description is missing, or if the description classifier is not very confident in its classification.

A related paper I found used a combination of summary + description as the feature set. Combining these two gives a set where all items at least have some features to predict on:

Joining summary + description as a single feature set to have no issue with missing features. Image by author.

The strange-looking apply function in the figure above simply adds the summary and description texts together to produce text_feature. With this, all issue reports have a non-empty text_feature as a result. This is what I used as the features for the classifier (tokenized words from text_feature).
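As a sketch, an equivalent combination (whether done via apply or plain string concatenation) could look like this, continuing from the pandas snippet above; the fillna calls handle the missing values:

```python
# Combine summary and description into a single text_feature column;
# missing values are replaced with an empty string first.
df["text_feature"] = (df["summary"].fillna("") + " " +
                      df["description"].fillna("")).str.strip()

assert (df["text_feature"].str.len() > 0).all()  # every issue now has some features
```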

Selecting the Components to Predict

To predict a component to assign an issue report to, I need the list of components. Getting this list is simple enough:

Getting the number of target components (52) for the issue reports. Image by author.

There are 52 components that have something assigned to them. The simple thing to do would be to just train a classifier to predict any of these 52 based on the text features. But as software evolves, some components may become obsolete, and trying to assign something to them at the current time might be completely pointless if the component is no longer relevant.

Let's see the details:

Getting the count of issue reports for all components. Image by author.
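In pandas terms this is a one-liner (a sketch; the component column name is an assumption):

```python
# Issue counts per component, most common first; the length of this series
# also gives the number of distinct components (52 here).
component_counts = df["component"].value_counts()
print(len(component_counts))
print(component_counts)
```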

This gives the following list:

General 64153
Untriaged 18207
Bookmarks & History 13972
Tabbed Browser 9786
Address Bar 8120
Preferences 6882
Toolbars and Customization 6701
Theme 6442
Sync 5930
Menus 4440
Session Restore 4205
Search 4171
New Tab Page 4132
File Handling 3471
Extension Compatibility 3371
Security 3277
Shell Integration 2610
Installer 2504
PDF Viewer 2193
Keyboard Navigation 1808
Messaging System 1561
Private Browsing 1494
Downloads Panel 1264
Disability Access 998
Protections UI 984
Migration 964
Site Identity 809
Page Info Window 731
about:logins 639
Site Permissions 604
Enterprise Policies 551
Pocket 502
Firefox Accounts 451
Tours 436
WebPayments UI 433
Normandy Client 409
Screenshots 318
Translation 212
Remote Settings Client 174
Top Sites 124
Headless 119
Activity Streams: General 113
Normandy Server 91
Nimbus Desktop Client 83
Pioneer 73
Launcher Process 67
Firefox Monitor 61
Distributions 56
Activity Streams: Timeline 25
Foxfooding 22
System Add-ons: Off-train Deployment 15
Activity Streams: Server Operations 5

The number of issues per component varies widely, and the distribution is highly skewed (General alone holds over a third of all the reports). Specifically, some components have very few reports, and any classifier would likely do quite poorly on them.

First, let's see if I can find any components that might be obsolete and could be removed. This can happen over time as the software evolves and some features (and their components) are dropped. One way to look at this is to find components that have had no activity for a long time. The following should show the latest date when a component had an issue created for it:

Getting a sorted list of latest issue creation time per component. Image by author.
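A query along these lines produces the list below (a sketch; creation_time is the assumed column name for the Bugzilla creation timestamp):

```python
# Latest issue creation time per component, sorted oldest-first;
# ISO timestamps sort chronologically even as strings.
latest_created = (df.groupby("component")["creation_time"]
                    .max()
                    .sort_values())
print(latest_created)
```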
component
Activity Streams: Timeline 2016-09-14T14:05:46Z
Activity Streams: Server Operations 2017-03-17T18:22:07Z
Activity Streams: General 2017-07-18T18:09:40Z
WebPayments UI 2020-03-24T13:09:16Z
Normandy Server 2020-09-21T17:28:10Z
Extension Compatibility 2021-02-05T16:18:34Z
Distributions 2021-02-23T21:12:01Z
Disability Access 2021-02-24T17:35:33Z
Remote Settings Client 2021-02-25T17:25:41Z
System Add-ons: Off-train Deployment 2021-03-23T13:58:13Z
Headless 2021-03-27T21:15:22Z
Migration 2021-03-28T16:36:08Z
Normandy Client 2021-04-01T21:14:52Z
Firefox Monitor 2021-04-05T16:47:26Z
Translation 2021-04-06T17:39:27Z
Tours 2021-04-08T23:26:27Z
Firefox Accounts 2021-04-10T14:17:25Z
Enterprise Policies 2021-04-13T02:38:53Z
Shell Integration 2021-04-13T10:01:39Z
Pocket 2021-04-15T02:55:12Z
Launcher Process 2021-04-15T03:10:09Z
PDF Viewer 2021-04-15T08:13:57Z
Site Identity 2021-04-15T09:20:25Z
Nimbus Desktop Client 2021-04-15T11:16:11Z
Keyboard Navigation 2021-04-15T14:40:13Z
Installer 2021-04-15T15:06:46Z
Page Info Window 2021-04-15T19:24:28Z
about:logins 2021-04-15T19:59:31Z
Site Permissions 2021-04-15T21:33:40Z
Bookmarks & History 2021-04-16T09:43:36Z
Downloads Panel 2021-04-16T11:39:07Z
Protections UI 2021-04-16T13:25:27Z
File Handling 2021-04-16T13:40:56Z
Foxfooding 2021-04-16T14:08:35Z
Security 2021-04-16T15:31:09Z
Screenshots 2021-04-16T15:35:25Z
Top Sites 2021-04-16T15:56:26Z
General 2021-04-16T16:36:27Z
Private Browsing 2021-04-16T17:17:21Z
Tabbed Browser 2021-04-16T17:37:16Z
Sync 2021-04-16T18:53:49Z
Menus 2021-04-16T19:42:42Z
Pioneer 2021-04-16T20:58:44Z
New Tab Page 2021-04-17T02:50:46Z
Messaging System 2021-04-17T14:22:36Z
Preferences 2021-04-17T14:41:46Z
Theme 2021-04-17T17:45:09Z
Session Restore 2021-04-17T19:22:53Z
Address Bar 2021-04-18T03:10:06Z
Toolbars and Customization 2021-04-18T08:16:27Z
Search 2021-04-18T09:04:40Z
Untriaged 2021-04-18T10:36:19Z

The list above is sorted by time, and all three components related to “Activity Streams” had their last issues created 4–5 years ago. With this, I added them to the list of components to remove from the dataset. It seems pointless to assign any new issues to them with this timeline.

The Activity Streams: Timeline component was also one of the components in the earlier list with the fewest issues assigned to it. The other two components with very few issues created were Foxfooding and System Add-ons: Off-train Deployment. Since the issues for a component are listed in chronological order, looking at the last few in both of these should give some insight on their recent activity.

First System Add-ons: Off-train Deployment:

Last 5 issue reports for “System Add-ons: Off-train Deployment”. Image by author.

The above figure shows that the last real reported issue is from 2019, and the few after that are some kind of clones of old issues, made for some purpose other than reporting actual issues. So I dropped System Add-ons: Off-train Deployment from the dataset as well.

Foxfooding is described in the Firefox issue tracker as collecting issues for later triaging. Looking into it, it only shows recent issues; I guess the older ones may have been triaged already. Without further knowledge, I left it in the dataset. With better access to domain experts, I might have removed it, as it sounds like the actual issues in it could belong to many other components (and be moved there after triaging). But I expect it is not a big deal, as it only has a few issues in it.

A few other components also had a somewhat longer period since their last issue report. To get a better idea of how active these components have been over time, I plotted their issue counts per month. For example, WebPayments UI:

Plotting number of issues report over a month for “Webpayments UI”. Image by author.
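A plot along those lines could be produced like this (a sketch, assuming creation_time parses as a datetime):

```python
# Sketch of a per-month issue count plot for a single component.
import matplotlib.pyplot as plt
import pandas as pd

wp = df[df["component"] == "WebPayments UI"].copy()
wp["creation_time"] = pd.to_datetime(wp["creation_time"])
monthly_counts = wp.set_index("creation_time").resample("M").size()
monthly_counts.plot(kind="bar", figsize=(12, 4),
                    title="WebPayments UI issues per month")
plt.show()
```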

WebPayments UI seems to have started quietly, gotten some attention, and quieted down again. The last of this attention was in March 2020, a bit over a year before the time of writing. I don’t know if it is still relevant, so I just left it in.

Finally, the list of components I removed for training as a result of this short analysis were the following:

  • System Add-ons: Off-train Deployment
  • Activity Streams: Timeline
  • Activity Streams: Server Operations
  • Activity Streams: General

The rest of the components seemed more likely to be relevant still. This left me with 48 target components from the original 52. I trained the initial set of models with these 48 target components. After a little bit of looking at the results, I removed one more component. Did you figure out which one it is already?

It is Untriaged, because Untriaged is just a list of issues that have not yet been assigned to other components. From a machine learning perspective, these issues are unlabeled. As far as I can see, keeping them in the training set can only confuse the trained classifier. So for further training iterations I also removed the issues assigned to Untriaged, leaving 47 target components (labels).
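With that, dropping these components and the Untriaged issues from the data is straightforward (a sketch, continuing from the earlier pandas snippets):

```python
# Remove the four obsolete components and the unlabeled Untriaged issues.
drop_components = [
    "System Add-ons: Off-train Deployment",
    "Activity Streams: Timeline",
    "Activity Streams: Server Operations",
    "Activity Streams: General",
    "Untriaged",
]
df = df[~df["component"].isin(drop_components)].reset_index(drop=True)
print(df["component"].nunique())  # 47 target components remain
```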

In data analysis, it is easy to get sidetracked by the next shiny thing, a bit like Little Red Riding Hood in the forest, I guess. In that vein, some interesting facts can also be found by looking at the oldest reports with the Untriaged component/tag:

Oldest open issues in Untriaged state. Image by author.

The above list shows the oldest open and untriaged issue is from over 20 years ago (when writing this). It discusses the correct way to abbreviate “seconds”. In my experience, this is exactly how issue trackers in projects tend to evolve over time. No-one wants to say this issue does not matter and close it, yet no-one wants to make the effort or decide what to do with it. Or take the heat for the slightly irrelevant decision. Or maybe it is just forgotten.

A bunch of others in that list have also been waiting for a few years, and if I remove the is_open requirement from the query, there are many more very old issues in untriaged status. Issue trackers in general evolve this way, I guess. At least that is what I have seen, and it sort of makes sense. Like my storage closet: junk accumulates, and it is easier to just leave it than do something about it.

Finally, just one more query to show the oldest issue created per component:

Date of oldest issue created per component. Cut in 2014 to save space. Image by author.

The above list actually seems to give a kind of history of the development of the project. As I said, it is easy enough to get lost in data exploration like this, but in a practical setting I should be focusing more on the task at hand. So let's get back to building the issue report classifier.

Training the Models

In this section I will briefly describe the models I used, and the general process I applied. For some assurance on training stability, I repeated the training 10 times for both models with randomly shuffled data. The overall dataset I used has 172556 rows of data (issue reports) after removing the five components discussed above.

A Look at the Models

First, the models.

Keras LSTM

A notebook setting up the Keras LSTM model, and running it can be found on my Github. The Keras summary function shows the structure:

Keras model with bi-directional LSTM layers and Attention layers. Image by author.

The input layer takes in a maximum of 512 word tokens. These feed into a GloVe-based word-embedding layer, which converts the input token sequence into 300-dimensional embedding vectors. This is followed by a bi-directional LSTM layer with 128 units. A self-attention layer follows, feeding its output into another bi-directional LSTM layer with 64 units. The output from there goes into a weighted attention layer, which passes it to the final Dense output layer. These are many fancy words if you are not familiar with them, but worry not: it is simple to use in the end, and the practical results will be presented shortly.

Keras LSTM model, but with boxes and arrows. Image generated by Keras as instructed by author.

I recommend checking the layer docs for more information if interested.
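To make this a bit more concrete, here is a minimal sketch of how such a model could be wired up in Keras. I am assuming the keras-self-attention package (SeqSelfAttention, SeqWeightedAttention) for the attention layers; the vocabulary size and the random embedding matrix are placeholders, as the real model would use pre-trained GloVe vectors:

```python
# Minimal sketch of the described bi-LSTM + attention architecture.
import numpy as np
from tensorflow.keras.layers import Input, Embedding, Bidirectional, LSTM, Dense
from tensorflow.keras.models import Model
from keras_self_attention import SeqSelfAttention, SeqWeightedAttention

seq_len = 512        # maximum number of word tokens per issue report
vocab_size = 100000  # placeholder vocabulary size
num_labels = 47      # target components after filtering
embedding_matrix = np.random.rand(vocab_size, 300)  # GloVe vectors in the real model

inputs = Input(shape=(seq_len,))
x = Embedding(vocab_size, 300, weights=[embedding_matrix], trainable=False)(inputs)
x = Bidirectional(LSTM(128, return_sequences=True))(x)   # first bi-LSTM layer
x = SeqSelfAttention(attention_activation="sigmoid")(x)  # self-attention
x = Bidirectional(LSTM(64, return_sequences=True))(x)    # second bi-LSTM layer
x = SeqWeightedAttention()(x)                            # weighted attention pooling
outputs = Dense(num_labels, activation="softmax")(x)

model = Model(inputs, outputs)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```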

HuggingFace DistilBERT

The HuggingFace DistilBERT is a bit more of a black box. A notebook for the training and testing is also on my Github. The Keras summary for it gives:

Model summary given by Keras for the HF DistilBERT model. Image by author.

I guess it is some kind of custom TensorFlow implementation that appears as a single layer from the Keras viewpoint. This matches my previous attempts at reading up on transformer architectures, where every explanation quickly diverges from the high-level name into all the in-depth details, leaving me wondering why no one can provide an understandable, intuitive intermediate view. Anyway. The visual representation of it in terms of boxes and arrows in Keras is even better:

Model visualization by Keras for the HF DistilBERT. Image generated by Keras as instructed by author.

It’s all just a single box. I guess this is what you would call a black-box model (just color the box black to finish it..) :).
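For reference, a minimal sketch of setting up such a model with the HuggingFace transformers library could look like this (the TensorFlow variants of the classes; the checkpoint name and details are illustrative rather than my exact notebook code):

```python
# Sketch of DistilBERT for sequence classification with HuggingFace transformers.
from transformers import DistilBertTokenizerFast, TFDistilBertForSequenceClassification

tokenizer = DistilBertTokenizerFast.from_pretrained("distilbert-base-uncased")
model = TFDistilBertForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=47)

# Tokenize the combined summary + description text, truncated to 512 tokens
enc = tokenizer(["example issue report text"], truncation=True,
                padding=True, max_length=512, return_tensors="tf")
out = model(enc)   # out.logits has shape (batch_size, num_labels)
model.summary()
```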

Sampling the Data

I sampled the dataset in each case into 3 parts. The model was trained on 70% of the data (training set), or 124126 issue reports. 20%, or 31032 issue reports (validation set), were used for evaluating model performance during training. The remaining 10%, or 17240 issue reports (test set), were used to evaluate the final model after training finished. The sampling in each case was stratified, producing an equal proportion of the different target components in each of the train, validation, and test sets.

I repeated the sampling into these 3 sets 10 times, with different randomization in selecting items for each set. Another option would have been to do 10-fold splits and validations, which would be more systematic, but the sampling I used worked for my purposes. Luckily I am not writing a research paper for Reviewer 2 today, so let's pretend that is fine.
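One such stratified 70/20/10 split could be done with scikit-learn roughly as follows (a sketch; the seed would vary between the 10 repetitions):

```python
# Sketch of one stratified 70/20/10 train/validation/test split.
from sklearn.model_selection import train_test_split

seed = 0  # different value for each of the 10 runs
train_df, rest_df = train_test_split(
    df, test_size=0.30, stratify=df["component"], random_state=seed)
# Split the remaining 30% into validation (20% of total) and test (10% of total)
val_df, test_df = train_test_split(
    rest_df, test_size=1/3, stratify=rest_df["component"], random_state=seed)
```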

Training Accuracy over Epochs

The following figures illustrate the training loss and accuracy, for training the models on one set of the sampled data for both models. First the HuggingFace training for DistilBERT:

DistilBERT transformer evaluation loss and accuracy over training epochs. Image by author.

I set the HuggingFace trainer to evaluate the model every 500 steps, which produces the high-granularity graph above. With the amount of data I used and a batch size of 16, the number of steps in HF training over 3 epochs was 22881. At every 500th step an evaluation was performed, showing as a point on the graph above. As the figure shows, training was quite consistent but leveled off at around epoch 2.
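Roughly, the evaluate-every-500-steps setup corresponds to something like the following with the transformers Trainer API. This is an illustrative sketch rather than my exact notebook code; model, train_dataset, and val_dataset are assumed to be defined already (with the PyTorch variant of the model in this snippet):

```python
# Sketch of training arguments for evaluation every 500 steps, batch size 16,
# 3 epochs; values mirror the text above, other details are illustrative.
from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir="distilbert-firefox-components",
    num_train_epochs=3,
    per_device_train_batch_size=16,
    evaluation_strategy="steps",
    eval_steps=500,
    load_best_model_at_end=True,
)
trainer = Trainer(model=model, args=training_args,
                  train_dataset=train_dataset, eval_dataset=val_dataset)
trainer.train()
```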

The Keras LSTM training metrics per epoch are shown below:

LSTM training and validation set loss and accuracy over epochs. Epoch 2 = 1.0 due to 0-index. Image by author.

In this figure, epoch 0.0 is epoch 1, 1.0 is epoch 2, and so on. This is simply due to how I used a 0-indexed array for it. The name test in the figure actually refers to my validation set in this case; I always get the terms confused, sorry about that. More importantly, I trained this model for 4 epochs, and each point in the figure shows the evaluation result after an epoch. Validation performance always peaked at epoch 2 for this model.

Results

Results are perhaps the most interesting part. The following table shows 10 different runs where the LSTM and Transformer classifiers were trained on different dataset variants as described before:

10 different runs of train + predict with top 1 and top 5 accuracy. Image by author.

Each row in the table is a separate training on different random shuffles of the data. The table has the following columns:

  • hf_best_epoch: the epoch where the lowest loss (for validation set) was recorded for HF. In Keras this was always epoch 2, so I did not include a column for it.
  • hf_val_loss: the validation set loss at hf_best_epoch as given by HF.
  • hf_val_acc: the validation set accuracy at same point as hf_val_loss.
  • k_val_loss: the validation set loss at end of best epoch as given by Keras.
  • k_val_acc: the validation set accuracy at same point as k_val_loss.
  • hf1_test_acc: HF accuracy of using the best model to predict the target component, and only taking the top prediction.
  • k1_test_acc: same as hf1_test_acc but for the Keras model.
  • hf5_test_acc: same as hf1_test_acc, but considering if any of the top 5 predictions match the correct label. Think of it as providing the triaging user with 5 top component suggestions to assign the issue to.
  • k5_test_acc: same as hf5_test_acc but for the Keras model. (A small sketch of how such a top-k accuracy can be computed follows this list.)
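The top-k accuracy itself is simple to compute from the predicted class probabilities; here is a small sketch with NumPy (the toy arrays are just to show the shapes involved):

```python
# Sketch of top-k accuracy from prediction probabilities; probs has shape
# (num_issues, num_components), labels holds the true label indices.
import numpy as np

def top_k_accuracy(probs, labels, k=5):
    top_k = np.argsort(probs, axis=1)[:, -k:]        # k most probable components
    hits = np.any(top_k == labels[:, None], axis=1)  # is the true label among them?
    return hits.mean()

# Toy example just to show the call shapes
probs = np.random.rand(4, 47)
labels = np.array([3, 10, 0, 46])
print(top_k_accuracy(probs, labels, k=5))
```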

Comparing Accuracy in Transformers vs LSTM

The results from the above table are quite clear, and the Transformer version outperforms the LSTM version in each case. The difference for top-1 prediction accuracy is about 0.72 vs 0.75, or 72% vs 75% in favor of the Transformer architecture. For the top-5 the difference is about 96% vs 97% accuracy. The difference in loss is about 0.91 vs 0.84, again favoring the Transformer.

In a research study promoting these results, such differences would be a big deal. However, in a practical situation their significance depends on the target domain. Here, my aim was to build a classifier to help the triaging process by suggesting components to assign new issue reports to. In this case, a few misses, or a difference of 96% vs 97% in top-5 accuracy, may not be that big a deal.

Additionally, besides this classification performance, other considerations may also be relevant. For example, the LSTM in general trains faster and requires fewer resources (such as GPU memory). This and similar issues might also be important tradeoffs in practice.

A Deeper Look at Misclassifications

Beyond blindly looking at accuracy values, or even loss values, it is often quite useful to look a bit deeper at what the classifier got right and what it got wrong; that is, what is getting misclassified. Let's see.

In the following, I will present multiple tables across the models and their misclassifications. These tables hold the following columns:

  • total: total number of issue reports for this component in the entire dataset
  • test_total: number of issue reports for this component in the test set
  • fails_act: number of issues for this component that were misclassified as something else. For example, there were 1000 issue reports that were actually for component General but classified as something else.
  • fails_pred: number of issues predicted for this component that were actually for another component. For example, there were 1801 issues predicted as General whose correct label was some other component. (A sketch of how these failure counts can be derived follows this list.)
  • total_pct: the total column value divided by the total number of issues (172556). The percentage this components represents from all the issues.
  • test_pct: same as total_pct but for the test set.
  • act_pct: how many percent of test_total is fails_act.
  • pred_pct: how many percent of test_total is fails_pred.
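The fails_act and fails_pred counts can be derived from the test set predictions roughly like this (a sketch; the label arrays here are toy placeholders standing in for the real test set labels and model predictions):

```python
# Sketch of deriving per-component failure counts from test predictions.
import numpy as np
import pandas as pd

# Toy placeholders; in practice these come from the test set and the model
actual_labels = np.array(["General", "Sync", "Menus", "General"])
predicted_labels = np.array(["Sync", "Sync", "General", "General"])

results = pd.DataFrame({"actual": actual_labels, "predicted": predicted_labels})
fails = results[results["actual"] != results["predicted"]]
fails_act = fails.groupby("actual").size()      # misclassified issues per true component
fails_pred = fails.groupby("predicted").size()  # wrong predictions per predicted component
print(fails_act)
print(fails_pred)
```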

Keras LSTM Top-1 Prediction Misclassifications

First, failure statistics for the Keras LSTM classifier, and its top-1 predictions:

Keras LSTM classifier failure stats for top-1 prediction. Image by author.

Predicting only the most common label is often used as the most basic baseline reference. In this case we could then expect an accuracy of 37% if we always predicted each issue report to belong to the General component, because the table above shows it holds 37% of all issues.

Something that I find slightly unexpected here is that even though General is so much more common in the set of issues, it does not have an overly large proportion of misclassified issues attributed to it (act_pct + pred_pct). Often such a dominant label in the training set also dominates the predictions, but it is nice to see it is not too bad in this case.

Instead, there are others in that list that stand out more. For example, Shell Integration looks quite bad, with 83% (217 of 261) of its actual test set issues being misclassified as some other component (act_pct). One might consider this to be due to it having such a small number of issues in the training set, but many components with even fewer issues do much better. For example, the one visible in the table above, Installer, has only a 32% (act_pct) fail rate.

To analyze these causes deeper, I would look in a bit more detail at the strongest misclassifications (in terms of component probabilities) for Shell Integration, and try to determine the cause of mixup. Perhaps some feature engineering would be in order, to preprocess certain words or tokens differently. But this article is long enough as it is, so not going there.

Something a bit more generic that I looked into further in the statistics is pairs of misclassifications. The following list shows the most common misclassified pairs in the test set. For example, the top row shows 204 issues for Tabbed Browser being predicted as General, and similarly, 134 General issues predicted as Tabbed Browser. Clearly, the two seem to be mixed up often. Our friend Shell Integration also seems to be commonly mixed with General. And so on.

Pairs of most common misclassifications for LSTM top-1. Image by author.

Overall, the biggest General component also dominates the above list, as one might have expected. Maybe because it is general, large, and thus randomly holds a bit of everything..

Keras LSTM Top-5 Predictions Misclassifications

Besides top-1 predictions, I also collected top-5 predictions. Top-5 means taking the 5 predicted components for an issue, and considering the prediction correct if any of these 5 is the expected (correct) label. The following shows similar stats for the top-5 as before for top-1:

Keras LSTM classifier failure stats for top-5 prediction. Image by author.

This is otherwise similar to top-1, but the fails_pred column has much larger values, because each issue report had 5 predictions counted for it. So even if one of the 5 was correct, the other 4 would still be counted in fails_pred here.

For the rest of the values, the numbers are clearly better for top-5 than for the top-1 results. For example, General has a fails_act value of only 3, while in top-1 it was 1000. Likely because of its dominant size, it gets into many top-5s. This drop from 1000 to 3 is a big improvement, but the overall top-5 accuracy is still only 97% and not 99.9%, because the components with fewer instances still get a larger share of misclassifications. For example, Shell Integration still has an act_pct of about 17%, even with top-1 relaxed to top-5. Much better, but still not near 0% fails.

HuggingFace DistilBERT Top-1 Prediction Misclassifications

Let’s move to DistilBERT results for the top-1 prediction. Again for details, see my Github notebook, and two scripts to go with it. The overall values are clearly better for this model vs the LSTM as was illustrated by the general statistics earlier. However, I see a slightly different trend in this vs the LSTM. This model seems to balance better.

HF DistilBERT classifier failure stats for top-1 prediction. Image by author.

The number of failed predictions for the dominating General component is higher than for the LSTM model. For example, fails_act here is 1063, while in the LSTM top-1 results it was 1000. However, as the overall accuracy was quite a bit better for this model (72% LSTM vs 75% here), this is balanced out by the other components having better scores. This is what I mean by more balanced: the errors are less concentrated on a few components.

For example, the fails_act for the nasty Shell Integration is down to 185 for DistilBERT here, while it was 217 in the LSTM results. Still not great, but much better than the LSTM. Most of the other components are similarly lower, with a few exceptions. So this model seems to be overall more accurate, but also more balanced.

To further compare, here are the most commonly misclassified pairs for this model:

Pairs of most common misclassifications for DistilBERT top-1. Image by author.

This shows a similar trend, where the total number of misclassifications is lower, but also no single pair dominates the list as strongly as in the LSTM results. And again, the pair of Tabbed Browser and General is at the top, getting mixed with each other, along with General being part of almost every pair in the list. Looking into these associations in more detail would definitely be on my list if I were taking this classifier further.

HuggingFace DistilBERT Top-5 Prediction Misclassifications

And the results for the DistilBERT top-5:

HF DistilBERT classifier failure stats for top-5 prediction. Image by author.

Similar to top-1, this one has a higher fails_act for General but mostly lower for the others, leading to what seems to be a more balanced result, along with the higher overall accuracy (96% LSTM vs 97% this model/DistilBERT).

Short Summary on Model Differences

The results I presented could likely be optimized quite a bit. I did not do any hyperparameter optimization or model architecture tuning, so my guess is that both the LSTM and the Transformer model could be optimized further. It might also be useful to try the other HF transformer variants. However, my experience is that the optimization gains are not necessarily huge. My goal for this article was to build a practical classifier from scratch, and compare the trendy Transformer architecture to my previous experiments with Attention LSTMs. For this, I find the current results are fine.

Overall, I find the results show that the Transformer in general gives better classification performance, but also is more balanced. The LSTM still produces good results, but if resources were available I would go with the Transformer.

One point I find interesting in the model differences is that the LSTM seems to do slightly better for the dominant General component, while the Transformer seems to do better on the other ones. This analysis was based on one pair of the 10 variants I trained, so with more time and resources, looking at different training variants would likely give more confidence still.

One way to apply such differences in model performance would be to build an ensemble learner. In such a combined model, both the LSTM and Transformer models would contribute to the prediction. These models would seem to be good candidates, since they have mutual diversity, meaning one produces better results in some parts of the data, and the other in a different part. In practical production systems, ensembles can be overly complex for relatively small gains. However, in something like Kaggle competitions, where fractions of a percentage point matter, this would be a good insight to look into.
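As a simple illustration, a soft-voting ensemble could average the predicted class probabilities of the two models (a sketch with placeholder arrays; the equal weights are illustrative and could be tuned on validation data):

```python
# Sketch of a simple soft-voting ensemble over the two models' probabilities.
import numpy as np

lstm_probs = np.random.rand(3, 47)  # placeholder for Keras model.predict() output
bert_probs = np.random.rand(3, 47)  # placeholder for softmaxed DistilBERT logits

ensemble_probs = 0.5 * lstm_probs + 0.5 * bert_probs
predicted_components = ensemble_probs.argmax(axis=1)
print(predicted_components)
```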

Predictor in Action

The article so far is very much text and data, and would make for a very boring PowerPoint slide set to sell the idea. More concrete, live, and practical demonstrations are often more interesting.

Back when I presented my Qt company classifier based on their issue tracker data, I had compared some older classifiers (RF, SVM, TF-IDF, …) and the LSTM on similar data. Back then, I found the LSTM classifier produced surprisingly good results (similar to here). I made a presentation of this, showed some good results, and people naturally had the question: how does it actually work on issue reports it has not seen before?

One way to address this question is to explain that the training data did not include the validation or the test set, and thus the measurements already show exactly how well it does on data it has not seen before. But not everyone is so familiar with machine learning terminology or concepts. To better address the question, I opened the issue tracker for that day, took one of the newest, still untriaged issues (no component assigned), copied its text, and ran the classifier live on that text. I then asked if people thought the result was good. We can simulate that here as well.

I took an issue report filed on May 13th (today, in 2021), as my training dataset was downloaded on the 21st of April. As opposed to a live demo, you just have to trust me on that :).

For this experiment, I picked issue number 1710955. The component label it has been assigned by the developers is Messaging System. The following shows the top-5 predictions for the issue using the HF DistilBERT model. First using only the summary field as the features, followed by using both summary + description as features.

HF DistilBERT predicted components for issue 1710955. Image by author.

The top line above shows the Messaging System as the top prediction, which is correct in this case. Second place goes to New Tab Page. As shown by the middle line, with just the summary field, the Messaging System is correctly predicted as the top component at 98.6% probability, followed by New Tab Page at 84.7%. The third and final line above shows that when the description field is added to the features, the predicted top-5 remains the same but the classifier becomes more confident, with Messaging System given 99.1% probability and New Tab Page dropping to 62.5% in second place.

The summary field text for this issue report is simply “Change MR1 upgrade onboarding to Pin then Default then Theme screens”. Check the issue report for more details and the description. I think it's quite impressive to predict the component from such a short summary text, although I guess some words in it might be quite specific.
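For completeness, producing such top-5 suggestions for a new piece of text looks roughly like the following, continuing from the earlier DistilBERT sketch; component_names, a mapping from label index to component name, is an assumed helper:

```python
# Sketch of top-5 component suggestions for a new issue report text.
import tensorflow as tf

text = "Change MR1 upgrade onboarding to Pin then Default then Theme screens"
enc = tokenizer(text, truncation=True, max_length=512, return_tensors="tf")
probs = tf.nn.softmax(model(enc).logits, axis=-1).numpy()[0]
for idx in probs.argsort()[-5:][::-1]:
    print(component_names[idx], f"{probs[idx]:.3f}")
```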

While this was a very good result for a random issue I picked, it is maybe not as believable as picking one in a live demonstration. In a live demonstration, I could also ask people to write a real or imaginary bug report right there, run the classifier on it, and ask their opinion on the correctness of the classification. But, as usual, the last time I tried that it was a bit difficult to get anyone to volunteer, and I had to pick one myself. But at least you can do it live.

Conclusions

Well, that's it. I downloaded the bug report data, explored it, trained the classifiers, compared the results, and dug a bit deeper. I found a winner in Transformers, along with building myself an increased understanding of, and insight into, the benefits of each model.

In the end, writing this article was an interesting exercise in trying out the Transformer architecture on a real-world dataset, and gaining insight in comparison to the LSTM architecture I used before. Would someone use something like this in the real world? I don’t know. I think such applications depend on the scale and the real needs. In very large companies, with very large products and development departments, I can see it providing useful assistance. Or it could be integrated into an issue tracker platform for added functionality. Think about the benefit of being able to advertise that your product uses the latest deep-learning models and AI to analyze your issues.. :)

Besides comparing to my previous work on building a similar bug classifier for the Qt company issue tracker, I also found a paper on similar issue tracker classification, called DeepTriage, from around the same time I wrote my previous study. I considered using it for comparison, but their target was to predict the developer to assign the issue to, so I left that out. It was still useful for providing some insight on where to find accessible issue trackers, and how they used the data for features.

However, when I went searching the internet for the word DeepTriage after that paper, I found another paper with the same name: DeepTriage for cloud incident triaging. Yes, in my previous articles on metamorphic testing of machine-learning-based systems I noted how many DeepXXX articles there are. Apparently it is now so popular to name your research DeepXXX that names even get re-used. In any case, this was a paper from Microsoft, talking about the cost of and need for fast triaging in datacenters, imbalanced datasets, and all that. And how they have used this type of technique in Azure since 2017, with thousands of teams. So, as I said, this likely becomes useful once your scale and real-world requirements catch up.

Well, my goal was to play with Transformers and maybe provide some example of building a classifier from real-world NLP data. I think I did ok. Cheers.
