By Alessio Placitelli, Ali Almossawi, and Rebecca Weiss
Cross-posted from Medium

We have just released the Firefox Hardware Report, a public report on the hardware used by the Firefox release user base. You can read the announcement here. The Firefox team believes this report will be particularly valuable to developers, especially those who build for the web: knowing which platforms and hardware are in use helps inform decisions about building and upgrading applications.

As you may know, Firefox is built not just by Mozilla’s paid contributors but by an amazing community of volunteers. When it comes to data, we believe our users contribute as well, by sharing information about their client hardware and activity. The Firefox Hardware Report is a way to demonstrate the value of the data our users have provided.

This article describes how the Firefox Hardware Report works, from the way device-level data is measured to the process by which that data is prepared and processed to produce the report.

How and why we collect hardware data

The Firefox browser has a built-in system called Telemetry that measures, among other things, browser behavior and platform details about the machines running the browser. This information is collected automatically on the Firefox desktop general release channel, and users can disable the collection if they choose to.

Our primary motivation for collecting hardware data is to identify and fix performance issues that lead to a poor browsing experience. This is complicated by the sheer variety of clients and machines in the wild, each with its own attributes and environment. Determining the causes of poor performance from device-level data at our scale means collecting a lot of data about hangs and crashes, along with many different measurements of browser activity and client hardware.

We handle this data in accordance with Mozilla’s Data Privacy Principles and privacy policy. When we propose new types of data to collect, we conduct a data collection review process and seek to describe the following:

1. What are we trying to measure?

2. What are our intentions with this data? How do we intend to use it?

3. The proposed source code for the measurement’s implementation (of course!)

We do this to ensure that we aren’t accidentally collecting more data than needed and that the intent behind each measurement is documented. You can check what we are collecting in Telemetry by reading these docs. Users can control their participation in this data collection via their preference settings.

Ingesting Firefox-scale data

Once device-level data reaches our servers, it is cleaned and aggregated into derived views, which are distributed internally as named data sets. The Firefox Hardware Report accesses the reported measures through the Longitudinal data set: an aggregated view of 180 days of data for a sample of clients, equivalent to roughly 1% of the Firefox client base. Using the Longitudinal data set has a few advantages over processing the raw measurements: it allows faster access to the data, and it has already been pre-processed to discard bad device-level submissions (e.g. a corrupted data ping or invalid measurements). The raw summaries generated from the Longitudinal data undergo additional processing to prevent individual devices from being identified.
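To make this concrete, here is a minimal sketch of how a report job might query the Longitudinal data set on a Spark cluster. The table name “longitudinal” matches how the data set is exposed on Mozilla’s analysis clusters, but the column names below are illustrative assumptions, not the exact production schema.

    # A minimal sketch, assuming a Spark cluster where the Longitudinal data
    # set is registered as a table named "longitudinal"; the column names
    # here are illustrative, not the exact production schema.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("hardware-report").getOrCreate()

    # Each row aggregates up to 180 days of submissions for a single client;
    # we only need the hardware-related fields.
    hardware = spark.sql("""
        SELECT client_id, os_name, os_version, cpu_model, gfx_vendor_id
        FROM longitudinal
    """)

    # Bad submissions (e.g. corrupted pings) have already been filtered out
    # upstream; we additionally drop clients missing the fields we report on.
    valid = hardware.dropna(subset=["os_name", "cpu_model"])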

Collapsing the data

Reporting the marginal distribution of each hardware configuration first requires that we enumerate all the distinct configurations and count them over a predefined set of dimensions. For example, we count how many clients have a specific CPU model or a 64-bit OS (a toy version of this counting is sketched after the list below). We chose to report aggregate counts for two reasons:

1. We don’t want to accidentally report on an individual user.

2. Visually reporting aggregated counts is more digestible.
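As promised above, here is a toy illustration (not the production pipeline) of enumerating and counting configurations over a predefined set of dimensions; the per-client records and field names are made up for the example.

    from collections import Counter

    # Hypothetical per-client records; the real dimensions come from Telemetry.
    clients = [
        {"cpu_vendor": "GenuineIntel", "os_arch": "x86-64"},
        {"cpu_vendor": "GenuineIntel", "os_arch": "x86"},
        {"cpu_vendor": "AuthenticAMD", "os_arch": "x86-64"},
    ]

    DIMENSIONS = ("cpu_vendor", "os_arch")

    # One count per distinct combination of dimension values.
    counts = Counter(tuple(c[dim] for dim in DIMENSIONS) for c in clients)
    # counts[("GenuineIntel", "x86-64")] == 1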

Even so, a simple aggregation may still expose extremely rare configurations. That’s where configuration collapsing comes into play: configurations that occur less than 1% of the time within the sample population are collapsed into a single category. For example, if the vendor name of a graphics adapter is so rare that we only observe it a handful of times, we attribute those observations to a generic reporting group called “Other”. Even when these configurations represent only a couple of clients, we still have to count them, because it’s important to preserve the marginal distributions as accurately as we can. After all, the sum of all the rare and unique configurations could end up being the most popular group overall.
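Here is a sketch of this collapsing step, under the assumption that counts is a mapping from a configuration value (e.g. a graphics adapter vendor name, or a tuple of dimension values as built above) to a client count.

    def collapse_rare(counts, threshold=0.01):
        """Merge values seen in less than `threshold` of clients into "Other"."""
        total = sum(counts.values())
        collapsed = {}
        for value, count in counts.items():
            if count / total < threshold:
                # Rare values are not dropped: they accumulate into a single
                # catch-all group, so the marginal distribution stays accurate.
                collapsed["Other"] = collapsed.get("Other", 0) + count
            else:
                collapsed[value] = count
        return collapsed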

To preserve as much information as possible about configurations that would otherwise disappear into “Other”, we employ different collapsing strategies depending on the type of data we’re looking at. For operating systems, if an observed combination of OS name and version is rare, we try to report the name alone and set the version to “Other”. Screen resolutions pose a similar problem: the resolution space is very wide, and some resolutions differ by only a few pixels. We merge these nearly identical resolutions by rounding both the screen width and height to the nearest hundred pixels. If, after applying these data-specific collapsing strategies, we still end up with uncommon configurations, we add those observations to “Other”.
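The snippets below are illustrative versions of these data-specific strategies; the exact rules and helper names are simplified assumptions, not the production code.

    def collapse_os(name, version, common_pairs):
        """Report the full (name, version) pair only when it is common enough."""
        if (name, version) in common_pairs:
            return name, version
        return name, "Other"  # keep the OS name, hide the rare version

    def round_resolution(width, height):
        """Round both dimensions to the nearest hundred pixels so resolutions
        that differ by only a few pixels fall into the same bucket."""
        return round(width / 100) * 100, round(height / 100) * 100

    # e.g. 1364x768 and 1366x768 both become 1400x800.
    assert round_resolution(1366, 768) == (1400, 800)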

The last step we take before reporting is to avoid publishing raw counts for each data point. Instead, we report the counts as ratios: decimal fractions in the [0, 1] range describing how often a class of data occurs within the considered sub-sample. For example, a ratio of 0.12 for feature X indicates that it occurred 12% of the time within the considered population.
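Continuing the sketch from the previous snippets, converting the collapsed counts to ratios is straightforward:

    def to_ratios(counts):
        """Turn counts into decimal fractions in the [0, 1] range."""
        total = sum(counts.values())
        return {value: count / total for value, count in counts.items()}

    # e.g. {"GenuineIntel": 0.88, "Other": 0.12}: "Other" configurations
    # occurred 12% of the time in the considered population.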

Final form of the data set

At this point, the data can be stored in its aggregated form. We chose to make it available as a JSON file containing weekly summaries, so that visualizations can be built with relatively little effort. The most recent data can be found on the Firefox Hardware Report site itself, and the current format is described in the project’s GitHub repository. That repository also hosts the source code used to generate the aggregated data.
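As a quick sketch of consuming the published file, the snippet below fetches and iterates over the weekly summaries. The URL and field names are placeholders, not the real ones; check the GitHub repository for the actual location and schema.

    import json
    import urllib.request

    # Placeholder URL; the real file is linked from the report site.
    URL = "https://example.org/hwsurvey-weekly.json"

    with urllib.request.urlopen(URL) as response:
        weeks = json.load(response)

    # Each entry is assumed to be one weekly summary of ratios, e.g.
    # {"date": "2016-12-05", "osName_Windows-10": 0.45, ...}
    for week in weeks:
        print(week.get("date"))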

Thanks for reading!

We are very excited to finally release the Hardware Report, but we know there’s much more work to do. Looking ahead, we will explore further improvements based on the feedback we hear from you.

Most of the work related to the Firefox Hardware Report is tracked through meta bug 1228054 on Bugzilla.

Interested in taking part in this effort? Get in touch with comments or suggestions, or simply let us know what you’d love to see next!