Building an NLP classifier: Example with Firefox Issue Reports

DistilBERT vs LSTM, with data exploration

This is what the article is about. Robot photo by Phillip Glickman on Unsplash. Developer persona, Beetle, and Fox images by Open Clipart, tortoise by creozavr, ladybug by Prawny, bee by gustavorezende, other beetle by Clker. On Pixabay for the last N after the Unsplash. Composition by author.

Getting the Data

Exploring and Cleaning the Data

There are 2973 issue reports with missing description. Image by author.
There are 4 issue reports with missing summary. Image by author.
Joining summary + description as a single feature set to have no issue with missing features. Image by author.
Getting the number of target components (52) for the issue reports. Image by author.
Getting the count of issue reports for all components. Image by author.
General                                 64153
Untriaged 18207
Bookmarks & History 13972
Tabbed Browser 9786
Address Bar 8120
Preferences 6882
Toolbars and Customization 6701
Theme 6442
Sync 5930
Menus 4440
Session Restore 4205
Search 4171
New Tab Page 4132
File Handling 3471
Extension Compatibility 3371
Security 3277
Shell Integration 2610
Installer 2504
PDF Viewer 2193
Keyboard Navigation 1808
Messaging System 1561
Private Browsing 1494
Downloads Panel 1264
Disability Access 998
Protections UI 984
Migration 964
Site Identity 809
Page Info Window 731
about:logins 639
Site Permissions 604
Enterprise Policies 551
Pocket 502
Firefox Accounts 451
Tours 436
WebPayments UI 433
Normandy Client 409
Screenshots 318
Translation 212
Remote Settings Client 174
Top Sites 124
Headless 119
Activity Streams: General 113
Normandy Server 91
Nimbus Desktop Client 83
Pioneer 73
Launcher Process 67
Firefox Monitor 61
Distributions 56
Activity Streams: Timeline 25
Foxfooding 22
System Add-ons: Off-train Deployment 15
Activity Streams: Server Operations 5
Getting a sorted list of latest issue creation time per component. Image by author.
component
Activity Streams: Timeline 2016-09-14T14:05:46Z
Activity Streams: Server Operations 2017-03-17T18:22:07Z
Activity Streams: General 2017-07-18T18:09:40Z
WebPayments UI 2020-03-24T13:09:16Z
Normandy Server 2020-09-21T17:28:10Z
Extension Compatibility 2021-02-05T16:18:34Z
Distributions 2021-02-23T21:12:01Z
Disability Access 2021-02-24T17:35:33Z
Remote Settings Client 2021-02-25T17:25:41Z
System Add-ons: Off-train Deployment 2021-03-23T13:58:13Z
Headless 2021-03-27T21:15:22Z
Migration 2021-03-28T16:36:08Z
Normandy Client 2021-04-01T21:14:52Z
Firefox Monitor 2021-04-05T16:47:26Z
Translation 2021-04-06T17:39:27Z
Tours 2021-04-08T23:26:27Z
Firefox Accounts 2021-04-10T14:17:25Z
Enterprise Policies 2021-04-13T02:38:53Z
Shell Integration 2021-04-13T10:01:39Z
Pocket 2021-04-15T02:55:12Z
Launcher Process 2021-04-15T03:10:09Z
PDF Viewer 2021-04-15T08:13:57Z
Site Identity 2021-04-15T09:20:25Z
Nimbus Desktop Client 2021-04-15T11:16:11Z
Keyboard Navigation 2021-04-15T14:40:13Z
Installer 2021-04-15T15:06:46Z
Page Info Window 2021-04-15T19:24:28Z
about:logins 2021-04-15T19:59:31Z
Site Permissions 2021-04-15T21:33:40Z
Bookmarks & History 2021-04-16T09:43:36Z
Downloads Panel 2021-04-16T11:39:07Z
Protections UI 2021-04-16T13:25:27Z
File Handling 2021-04-16T13:40:56Z
Foxfooding 2021-04-16T14:08:35Z
Security 2021-04-16T15:31:09Z
Screenshots 2021-04-16T15:35:25Z
Top Sites 2021-04-16T15:56:26Z
General 2021-04-16T16:36:27Z
Private Browsing 2021-04-16T17:17:21Z
Tabbed Browser 2021-04-16T17:37:16Z
Sync 2021-04-16T18:53:49Z
Menus 2021-04-16T19:42:42Z
Pioneer 2021-04-16T20:58:44Z
New Tab Page 2021-04-17T02:50:46Z
Messaging System 2021-04-17T14:22:36Z
Preferences 2021-04-17T14:41:46Z
Theme 2021-04-17T17:45:09Z
Session Restore 2021-04-17T19:22:53Z
Address Bar 2021-04-18T03:10:06Z
Toolbars and Customization 2021-04-18T08:16:27Z
Search 2021-04-18T09:04:40Z
Untriaged 2021-04-18T10:36:19Z
Last 5 issue reports for “System Add-ons: Off-train Deployment”. Image by author.
Plotting number of issues report over a month for “Webpayments UI”. Image by author.
Oldest open issues in Untriaged state. Image by author.
Date of oldest issue created per component. Cut in 2014 to save space. Image by author.

Training the Models

Keras model with bi-directional LSTM layers and Attention layers. Image by author.
Keras LSTM model, but with boxes and arrows. Image generated by Keras as instructed by author.
Model summary given by Keras for the HF DistilBERT model. Image by author.
Model visualization by Keras for the HF DistilBERT. Image generated by Keras as instructed by author.
DistilBERT transformer evaluation loss and accuracy over training epochs. Image by author.
LSTM training and validation set loss and accuracy over epochs. Epoch 2 = 1.0 due to 0-index. Image by author.

Results

10 different runs of train + predict with top 1 and top 5 accuracy. Image by author.
Keras LSTM classifier failure stats for top-1 prediction. Image by author.
Pairs of most common misclassifications for LSTM top-1. Image by author.
Keras LSTM classifier failure stats for top-5 prediction. Image by author.
HF DistilBERT classifier failure stats for top-1 prediction. Image by author.
Pairs of most common misclassifications for DistilBERT top-1. Image by author.
HF DistilBERT classifier failure stats for top-5 prediction. Image by author.

Predictor in Action

HF DistilBERT predicted components for issue 1710955. Image by author.

Conclusions

--

--

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store
Teemu Kanstrén

PhD. Technology research and software engineering. Typically I write too long, because I try to understand something myself.