Building an NLP classifier: Example with Firefox Issue Reports

DistilBERT vs LSTM, with data exploration

This is what the article is about. Robot photo by Phillip Glickman on Unsplash. Developer persona, beetle, and fox images by Open Clipart; tortoise by creozavr; ladybug by Prawny; bee by gustavorezende; other beetle by Clker; all on Pixabay. Composition by author.

Getting the Data

As said, I am using the issue tracker for the Firefox browser as the source of my data. Since I am in no way affiliated with Mozilla or the Firefox browser, I just used their Bugzilla REST API to download the data. In a scenario where you actually work within a company or with a partner organization, you could likely just ask for a direct dump from the database. But sometimes you have to find a way, and in this case it was quite simple, since a public API supports downloads. You can find the downloader code on my GitHub; it is just a simple script that downloads the data in chunks.
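To give an idea of the approach, here is a minimal sketch of paging through the Bugzilla REST API with the requests library. The field selection and page size are my assumptions for illustration; the full script is on my GitHub.

import requests

URL = "https://bugzilla.mozilla.org/rest/bug"
params = {
    "product": "Firefox",  # limit results to Firefox issues
    "include_fields": "id,summary,component,creation_time,status,resolution",
    "limit": 500,   # page size (assumed; Bugzilla may cap this server-side)
    "offset": 0,
}
bugs = []
while True:
    chunk = requests.get(URL, params=params).json()["bugs"]
    if not chunk:   # an empty page means we have everything
        break
    bugs.extend(chunk)
    params["offset"] += params["limit"]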

Exploring and Cleaning the Data

The overall data I use contains the following attributes for each issue:

  • Summary: A summary description of the issue. This is one of the text fields I will use as features for the classifier.
  • Description: A longer description of the issue. This is the other text field I will use as features for the classifier.
  • Component: The software component in the Firefox Browser architecture that the issue was assigned to. This will be my prediction target, and thus the label to train the classifier on.
  • Duplicates: the numbers of duplicate issues, if any
  • Creator: email, name, user id, etc.
  • Severity: trivial to critical/blocker, or S1 to S4, plus some seemingly random values. It looked like a bit of a mess to me :).
  • Last time modified (change time)
  • Keywords: apparently, you can pick any keywords you like. I found 1680 unique values.
  • Status: One of resolved, verified, new, unconfirmed, reopened, or assigned
  • Resolution: One of duplicate, fixed, worksforme, incomplete, invalid, wontfix, expired, inactive, or moved
  • Open/Closed status: 18211 open, 172552 closed at the time of my download.

Selecting Features for the Predictor

My goal was to build a classifier that assigns a bug report to a component based on its natural-language description. The first field I considered for this was naturally the description field. However, another similarly interesting field for this purpose is the summary field.

There are 2973 issue reports with missing description. Image by author.
There are 4 issue reports with missing summary. Image by author.
Joining summary + description into a single feature, so missing values are no longer an issue. Image by author.
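In code, the missing-value check and the join can look roughly like this (a sketch, assuming the data sits in a pandas DataFrame df with columns summary and description):

import pandas as pd

print(df["description"].isna().sum())  # 2973 reports lack a description
print(df["summary"].isna().sum())      # 4 reports lack a summary

# concatenate the two text fields, treating missing values as empty strings
df["text"] = (df["summary"].fillna("") + " " +
              df["description"].fillna("")).str.strip()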

Selecting the Components to Predict

To predict a component to assign an issue report to, I need the list of components. Getting this list is simple enough:

Getting the number of target components (52) for the issue reports. Image by author.
Getting the count of issue reports for all components. Image by author.
General 64153
Untriaged 18207
Bookmarks & History 13972
Tabbed Browser 9786
Address Bar 8120
Preferences 6882
Toolbars and Customization 6701
Theme 6442
Sync 5930
Menus 4440
Session Restore 4205
Search 4171
New Tab Page 4132
File Handling 3471
Extension Compatibility 3371
Security 3277
Shell Integration 2610
Installer 2504
PDF Viewer 2193
Keyboard Navigation 1808
Messaging System 1561
Private Browsing 1494
Downloads Panel 1264
Disability Access 998
Protections UI 984
Migration 964
Site Identity 809
Page Info Window 731
about:logins 639
Site Permissions 604
Enterprise Policies 551
Pocket 502
Firefox Accounts 451
Tours 436
WebPayments UI 433
Normandy Client 409
Screenshots 318
Translation 212
Remote Settings Client 174
Top Sites 124
Headless 119
Activity Streams: General 113
Normandy Server 91
Nimbus Desktop Client 83
Pioneer 73
Launcher Process 67
Firefox Monitor 61
Distributions 56
Activity Streams: Timeline 25
Foxfooding 22
System Add-ons: Off-train Deployment 15
Activity Streams: Server Operations 5
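The two queries behind the output above are one-liners in pandas (a sketch; the column name component is an assumption about the DataFrame):

print(df["component"].nunique())       # 52 target components
print(df["component"].value_counts())  # issue count per component, sorted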
Getting a sorted list of latest issue creation time per component. Image by author.
component
Activity Streams: Timeline 2016-09-14T14:05:46Z
Activity Streams: Server Operations 2017-03-17T18:22:07Z
Activity Streams: General 2017-07-18T18:09:40Z
WebPayments UI 2020-03-24T13:09:16Z
Normandy Server 2020-09-21T17:28:10Z
Extension Compatibility 2021-02-05T16:18:34Z
Distributions 2021-02-23T21:12:01Z
Disability Access 2021-02-24T17:35:33Z
Remote Settings Client 2021-02-25T17:25:41Z
System Add-ons: Off-train Deployment 2021-03-23T13:58:13Z
Headless 2021-03-27T21:15:22Z
Migration 2021-03-28T16:36:08Z
Normandy Client 2021-04-01T21:14:52Z
Firefox Monitor 2021-04-05T16:47:26Z
Translation 2021-04-06T17:39:27Z
Tours 2021-04-08T23:26:27Z
Firefox Accounts 2021-04-10T14:17:25Z
Enterprise Policies 2021-04-13T02:38:53Z
Shell Integration 2021-04-13T10:01:39Z
Pocket 2021-04-15T02:55:12Z
Launcher Process 2021-04-15T03:10:09Z
PDF Viewer 2021-04-15T08:13:57Z
Site Identity 2021-04-15T09:20:25Z
Nimbus Desktop Client 2021-04-15T11:16:11Z
Keyboard Navigation 2021-04-15T14:40:13Z
Installer 2021-04-15T15:06:46Z
Page Info Window 2021-04-15T19:24:28Z
about:logins 2021-04-15T19:59:31Z
Site Permissions 2021-04-15T21:33:40Z
Bookmarks & History 2021-04-16T09:43:36Z
Downloads Panel 2021-04-16T11:39:07Z
Protections UI 2021-04-16T13:25:27Z
File Handling 2021-04-16T13:40:56Z
Foxfooding 2021-04-16T14:08:35Z
Security 2021-04-16T15:31:09Z
Screenshots 2021-04-16T15:35:25Z
Top Sites 2021-04-16T15:56:26Z
General 2021-04-16T16:36:27Z
Private Browsing 2021-04-16T17:17:21Z
Tabbed Browser 2021-04-16T17:37:16Z
Sync 2021-04-16T18:53:49Z
Menus 2021-04-16T19:42:42Z
Pioneer 2021-04-16T20:58:44Z
New Tab Page 2021-04-17T02:50:46Z
Messaging System 2021-04-17T14:22:36Z
Preferences 2021-04-17T14:41:46Z
Theme 2021-04-17T17:45:09Z
Session Restore 2021-04-17T19:22:53Z
Address Bar 2021-04-18T03:10:06Z
Toolbars and Customization 2021-04-18T08:16:27Z
Search 2021-04-18T09:04:40Z
Untriaged 2021-04-18T10:36:19Z
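This listing can be produced with a simple groupby (again a sketch, assuming a creation_time column):

latest = df.groupby("component")["creation_time"].max().sort_values()
print(latest)  # oldest "latest issue" first, highlighting inactive components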
Last 5 issue reports for “System Add-ons: Off-train Deployment”. Image by author.
Plotting the number of issue reports per month for “WebPayments UI”. Image by author.
  • Activity Streams: Timeline
  • Activity Streams: Server Operations
  • Activity Streams: General
Oldest open issues in Untriaged state. Image by author.
Date of oldest issue created per component. Cut in 2014 to save space. Image by author.

Training the Models

In this section I will briefly describe the models I used and the general process I applied. To get some assurance of training stability, I repeated the training 10 times for both models with randomly shuffled data. The overall dataset has 172556 rows of data (issue reports) after removing the five components discussed above.

A Look at the Models

First, the models.

Keras model with bi-directional LSTM layers and Attention layers. Image by author.
Keras LSTM model, but with boxes and arrows. Image generated by Keras as instructed by author.
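As a rough illustration of this kind of architecture, here is a minimal Keras sketch: an embedding layer, a bi-directional LSTM, and a self-attention layer over its outputs. All hyperparameters here are placeholder assumptions, not my exact values; the 47 output classes are the 52 components minus the 5 dropped ones.

import tensorflow as tf
from tensorflow.keras import layers

vocab_size, seq_len, n_components = 20000, 512, 47  # assumed sizes

inputs = layers.Input(shape=(seq_len,), dtype="int32")
x = layers.Embedding(vocab_size, 128, mask_zero=True)(inputs)
x = layers.Bidirectional(layers.LSTM(64, return_sequences=True))(x)
x = layers.Attention()([x, x])           # self-attention over the LSTM outputs
x = layers.GlobalAveragePooling1D()(x)   # pool the attended sequence
outputs = layers.Dense(n_components, activation="softmax")(x)

model = tf.keras.Model(inputs, outputs)
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])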
Model summary given by Keras for the HF DistilBERT model. Image by author.
Model visualization by Keras for the HF DistilBERT. Image generated by Keras as instructed by author.
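The DistilBERT side is mostly standard HuggingFace usage. A minimal sketch with the public distilbert-base-uncased checkpoint (the example issue text is made up):

from transformers import (DistilBertTokenizerFast,
                          TFDistilBertForSequenceClassification)

tokenizer = DistilBertTokenizerFast.from_pretrained("distilbert-base-uncased")
model = TFDistilBertForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=47)  # 52 components minus the 5 dropped

enc = tokenizer(["Firefox crashes when opening a PDF"],
                truncation=True, padding=True, return_tensors="tf")
logits = model(enc).logits  # shape (batch_size, num_labels)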

Sampling the Data

I sampled the dataset in each case into three parts. The model was trained on 70% of the data (training set), or 124126 issue reports. 20%, or 31032 issue reports (validation set), were used to evaluate model performance during training. The remaining 10%, or 17240 issue reports (test set), were used to evaluate the final model after training finished. The sampling was stratified in each case, producing equal proportions of the target components across the train, validation, and test sets.
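One way to produce such a stratified 70/20/10 split is with scikit-learn (a sketch; my exact splitting code may differ):

from sklearn.model_selection import train_test_split

train_df, rest_df = train_test_split(
    df, train_size=0.7, stratify=df["component"], random_state=42)
# 2/3 of the remaining 30% -> 20% validation, 10% test
val_df, test_df = train_test_split(
    rest_df, train_size=2/3, stratify=rest_df["component"], random_state=42)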

Training Accuracy over Epochs

The following figures illustrate the training loss and accuracy over epochs, for one set of the sampled data. First the HuggingFace training for DistilBERT:

DistilBERT transformer evaluation loss and accuracy over training epochs. Image by author.
LSTM training and validation set loss and accuracy over epochs. Epoch 2 shows as 1.0 because the axis is 0-indexed. Image by author.

Results

Results are perhaps the most interesting part. The following table shows 10 different runs where the LSTM and Transformer classifiers were trained on different dataset variants as described before:

10 different runs of train + predict with top 1 and top 5 accuracy. Image by author.
  • hf_val_loss: the validation set loss at hf_best_epoch as given by HF.
  • hf_val_acc: the validation set accuracy at same point as hf_val_loss.
  • k_val_loss: the validation set loss at end of best epoch as given by Keras.
  • k_val_acc: the validation set accuracy at same point as k_val_loss.
  • hf1_test_acc: HF test-set accuracy when using the best model to predict the target component and counting only the top prediction.
  • k1_test_acc: same as hf1_test_acc but for the Keras model.
  • hf5_test_acc: same as hf1_test_acc, but counting a prediction as correct if any of the top 5 predictions matches the correct label. Think of it as providing the triaging user with the top 5 component suggestions to assign the issue to (see the sketch after this list).
  • k5_test_acc: same as hf5_test_acc but for the Keras model.
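For clarity, top-5 accuracy can be computed from the predicted class probabilities like this (a sketch with illustrative names):

import numpy as np

def top_k_accuracy(probs, labels, k=5):
    # indices of the k highest-probability components for each row
    top_k = np.argsort(probs, axis=1)[:, -k:]
    return np.mean([label in row for row, label in zip(top_k, labels)])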

Comparing Accuracy in Transformers vs LSTM

The results in the above table are quite clear: the Transformer version outperforms the LSTM version in every run. The difference in top-1 prediction accuracy is about 0.72 vs 0.75, or 72% vs 75%, in favor of the Transformer architecture. For top-5 the difference is about 96% vs 97% accuracy. The difference in loss is about 0.91 vs 0.84, again favoring the Transformer.

A Deeper Look at Misclassifications

Beyond blindly looking at accuracy values, or even loss values, it is often quite useful to look a bit deeper at what the classifier got right and what it got wrong. That is, what is getting misclassified. Let’s see.

  • test_total: number of issue reports for this component in the test set
  • fails_act: number of issues for this component that were misclassified as something else. For example, 1000 issue reports were actually for the component General but were classified as something else.
  • fails_pred: number of issues predicted for this component that were actually for another component. For example, 1801 issues were predicted as General but their correct label was some other component.
  • total_pct: the total column value divided by the total number of issues (172556), i.e. the percentage this component represents of all issues.
  • test_pct: same as total_pct but for the test set.
  • act_pct: fails_act as a percentage of test_total.
  • pred_pct: fails_pred as a percentage of test_total.
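The misclassification pairs in the figures below can be collected with a small pandas snippet (a sketch; y_true and y_pred are illustrative names for the actual and predicted labels):

import pandas as pd

pairs = pd.DataFrame({"actual": y_true, "predicted": y_pred})
miss = pairs[pairs["actual"] != pairs["predicted"]]
print(miss.value_counts().head(10))  # most frequent (actual, predicted) pairs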
Keras LSTM classifier failure stats for top-1 prediction. Image by author.
Pairs of most common misclassifications for LSTM top-1. Image by author.
Keras LSTM classifier failure stats for top-5 prediction. Image by author.
HF DistilBERT classifier failure stats for top-1 prediction. Image by author.
Pairs of most common misclassifications for DistilBERT top-1. Image by author.
HF DistilBERT classifier failure stats for top-5 prediction. Image by author.

Predictor in Action

The article so far is very much text and data, and would make for a rather boring PowerPoint slide set to sell the idea. More concrete, live, and practical demonstrations are often more interesting.
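For example, getting the top-5 component suggestions for a new issue report takes only a few lines (a sketch; tokenizer, model, and component_names refer to the fine-tuned setup above, and the variable names are illustrative):

import tensorflow as tf

text = issue_summary + " " + issue_description
enc = tokenizer([text], truncation=True, padding=True, return_tensors="tf")
probs = tf.nn.softmax(model(enc).logits, axis=-1).numpy()[0]
for i in probs.argsort()[-5:][::-1]:
    print(f"{component_names[i]}: {probs[i]:.3f}")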

HF DistilBERT predicted components for issue 1710955. Image by author.

Conclusions

Well, that’s it. I downloaded the bug report data, explored it, trained the classifiers, compared the results, and dug a bit deeper. I found a winner in the Transformer, and along the way built an increased understanding of, and insight into, the benefits of each model.

