The data our Firefox users share with us is the key to identifying and fixing the performance issues that lead to a poor browsing experience. Collecting it is not enough if we don’t receive the data in an acceptable time frame. My esteemed colleague Chris already wrote about this a couple of times: data latency sucks. But we can fix that.
Why is there latency, anyway?
The bulk of the measurements we collect (histograms, scalars, events, …) are sent through the main-ping. This ping is generated at different times during the browsing session, including shutdown. The “shutdown” main-ping accounts for about 80% of all the pings we receive but, once generated, it is not sent to our servers until the next Firefox restart. Depending on the user’s habits and the day of the week, this could be anything from a few minutes to a few days (see the CDF plot below): way too much! One of my team’s goals for this year is to reduce this latency, allowing developers to make decisions and iterate quickly.
With bug 1310703, we landed a new binary called the pingsender. From the second browsing session on, when Firefox shuts down, the pingsender is spawned as a separate process and attempts to send the shutdown main-ping once. If it fails, the ping will still be picked up by Firefox next time it starts. We are deliberately ignoring the first browsing session while we investigate the nasty business of “bot” machines: they usually create a profile, spawn Firefox, then shut it down and never use that profile again.
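To make that flow a bit more concrete, here is a minimal Python sketch of the decision logic described above. It is not the actual Firefox code: `PendingPings`, `handle_shutdown_ping`, `flush_on_startup` and the `send_once` callable are all hypothetical names made up for illustration, and `send_once` merely stands in for spawning the separate pingsender process.

```python
from dataclasses import dataclass, field


@dataclass
class PendingPings:
    """Hypothetical stand-in for the on-disk store of pings awaiting upload."""
    saved: list = field(default_factory=list)


def handle_shutdown_ping(ping, session_number, store, send_once):
    """Persist the shutdown ping, then try a single immediate send if allowed.

    `send_once` is a hypothetical callable standing in for the separate
    pingsender process; it returns True when the upload succeeded.
    """
    # Always persist first, so the ping survives a failed or skipped send.
    store.saved.append(ping)

    # The first browsing session is deliberately skipped ("bot" profiles).
    if session_number < 2:
        return "deferred"

    # Exactly one attempt, no retries at shutdown.
    if send_once(ping):
        store.saved.remove(ping)
        return "sent"
    return "deferred"


def flush_on_startup(store, send_once):
    """On the next start, try to send whatever is still pending."""
    still_pending = [p for p in store.saved if not send_once(p)]
    sent = len(store.saved) - len(still_pending)
    store.saved = still_pending
    return sent
```

The point of saving the ping before attempting the send is that a failed (or skipped) attempt costs nothing: the ping is simply picked up at the next startup, exactly as it was before the pingsender existed.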
Did it work?
Landing the pingsender made the situation much better on the Nightly channel! But does this hold up on the Beta channel too? Let’s look at the cumulative distribution function of the submission delay (delay between data recording on the client and it hitting our servers) for the pings in that channel.
We are now able to receive 85% of the shutdown main pings within an hour from their generation (green curve), compared to only 25% without the pingsender (blue curve). Moreover, we’re able to get as much as 95% of the pings within 8 hours, a threshold that is not reached before 90 (NINETY!) hours without the pingsender.
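For readers who want to run this kind of check on their own data, the two numbers quoted above are just two ways of reading an empirical CDF: "what fraction arrived within X hours" and "how long until fraction Y has arrived". The sketch below assumes the submission delays are already available as a plain list of hours; the function names and the sample values are invented for illustration and are not our real Telemetry data.

```python
import math


def fraction_within(delays_hours, threshold_hours):
    """Empirical CDF at a point: share of pings received within the threshold."""
    if not delays_hours:
        raise ValueError("need at least one delay")
    received = sum(1 for d in delays_hours if d <= threshold_hours)
    return received / len(delays_hours)


def delay_at_fraction(delays_hours, fraction):
    """Smallest observed delay by which at least `fraction` of pings arrived."""
    if not delays_hours:
        raise ValueError("need at least one delay")
    ordered = sorted(delays_hours)
    index = math.ceil(fraction * len(ordered)) - 1
    return ordered[max(index, 0)]


# Made-up sample delays, in hours, purely for illustration.
sample = [0.1, 0.2, 0.5, 0.9, 2.0, 6.0, 7.5, 30.0, 90.0, 0.3]
print(fraction_within(sample, 1.0))     # share received within one hour
print(delay_at_fraction(sample, 0.95))  # delay needed to reach 95% of pings
```

Computing both functions separately for pings sent with and without the pingsender gives the two curves being compared.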
In addition to the latency analysis, we took a deep dive into the data to make sure that we are not introducing any undesired side effects: the full analysis is available here. Most of the issues with the pingsender were ironed out in Nightly (thanks users!). We found that builds using the pingsender take a few milliseconds more to shut down, up to 4ms on Nightly and up to 8ms on Beta. Even though this is below the latest known (to me!) user-perceivable visual presentation delay, which is 13ms, we made sure not to slow down the OS shutdown (in case Firefox is still open) by disabling the pingsender when the OS is shutting down.
Moaar speed, less latency!
Using the pingsender to send the shutdown main-ping looks very promising, but there’s still room for improvement. For example, we could enable the pingsender on the first browsing session once our study on “bot” profiles is complete. In addition to speeding up the shutdown main-ping, we’re putting effort into reducing the overall Telemetry latency by introducing a couple of new pings: the new-profile ping and the upcoming update ping.
A special thanks goes to everyone who worked on this for the past months: contributors, colleagues and the great QA team!