Extending Glean: build re-usable types for new use-cases

(“This Week in Glean” is a series of blog posts that the Glean Team at Mozilla is using to try to communicate better about our work. They could be release notes, documentation, hopes, dreams, or whatever: so long as it is inspired by Glean. You can find an index of all TWiG posts online.)

The last blog post: This Week in Glean: Cargo features – an investigation by Jan-Erik.

The philosophy of Glean has always been to offer higher-level metric types that map semantically to what developers want to measure: a Timespan metric type, for instance, will require developers to declare the resolution they want the time measured in. It is more than just a number. The build-time generated APIs will then offer a set of operations, start() and stop(), to allow developers to take the measurements without caring about the implementation details or about the consistency of times across platforms. By design, a Timespan will record time consistently and predictably on iOS, Android and even desktop. This also empowers the rest of the Glean ecosystem, especially pipeline and tooling, to know about the quality guarantees of the types, their format and, potentially, ways to aggregate and visualize them.
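To make the idea more concrete, here is a minimal sketch in Python of what a Timespan-like type could look like. It is not the Glean SDK implementation: `TimespanSketch`, `TimeUnit` and the `value()` accessor are hypothetical names made up for illustration. The sketch assumes a monotonic clock is used, so a measurement can never go backwards.

```python
import time
from enum import Enum


class TimeUnit(Enum):
    """Hypothetical resolutions, expressed as nanoseconds per unit."""
    NANOSECOND = 1
    MILLISECOND = 1_000_000
    SECOND = 1_000_000_000


class TimespanSketch:
    """Illustrative only: records one elapsed time in a declared resolution."""

    def __init__(self, time_unit, clock=time.monotonic_ns):
        self.time_unit = time_unit
        self._clock = clock  # injectable, so the sketch can be tested
        self._start = None
        self._elapsed_ns = None

    def start(self):
        # Ignore a second start() while a measurement is in progress.
        if self._start is None:
            self._start = self._clock()

    def stop(self):
        # Ignore stop() without a matching start().
        if self._start is None:
            return
        self._elapsed_ns = self._clock() - self._start
        self._start = None

    def value(self):
        """Elapsed time truncated to the declared resolution, or None."""
        if self._elapsed_ns is None:
            return None
        return self._elapsed_ns // self.time_unit.value
```

A caller would create `TimespanSketch(TimeUnit.MILLISECOND)`, call `start()` before the operation and `stop()` after it, and never touch a raw timestamp. The resolution is part of the declaration, which is what lets downstream tooling know what the recorded number means.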

Why “can’t we just” land code? Why a process?

Time is one of the hardest illusions to constrain to code, and everyone has different assumptions about it (time zones? daylight saving time? …). It’s not uncommon to see time differences being measured one way on one platform and in subtly different ways on others, wreaking havoc on analyses with the dreaded “the operation took -1 milliseconds to complete”. Metric types that are meant to be generally useful must not end up answering business questions with wrong or useless data, hence the need for a process to minimise the risk of that happening.
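The “-1 milliseconds” problem is easy to reproduce. The Python sketch below uses made-up clock readings (the values and the helper name `elapsed_ms` are assumptions for illustration only): a wall clock can be stepped backwards between two readings, for example by a time sync, while a monotonic clock such as Python's `time.monotonic_ns()` cannot go backwards.

```python
import time


def elapsed_ms(start_ms, end_ms):
    """Naive duration: simply subtract two clock readings."""
    return end_ms - start_ms


# Made-up wall-clock readings, in milliseconds since the epoch. Between the
# two readings the system clock was stepped back by a time sync.
wall_start = 1_600_000_000_500
wall_end = 1_600_000_000_499

print(elapsed_ms(wall_start, wall_end))  # -1

# A monotonic clock is not affected by system clock adjustments, so the
# difference between two successive readings is never negative.
mono_start = time.monotonic_ns()
mono_end = time.monotonic_ns()
monotonic_elapsed_ms = (mono_end - mono_start) // 1_000_000
```

Which clock to read is exactly the kind of detail a metric type should decide once, for every platform, instead of leaving it to each call site.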

The initial set of metric types was created by analysing the usage of our legacy telemetry system in both Firefox Desktop and mobile platforms. We found that counting things was a common pattern, and we designed the Counter metric type. We found that sometimes developers wanted to count related things, and we designed Labeled Counter metric types. We know, however, that there might have been other use-cases we were not aware of at design time in addition to novel use-cases popping up.
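As a rough illustration of these two patterns, here is a small Python sketch. The class names `CounterSketch` and `LabeledCounterSketch` and the `snapshot()` helper are hypothetical and only loosely modelled on the idea of the metric types. The real Glean APIs are generated at build time and differ in their details.

```python
from collections import defaultdict


class CounterSketch:
    """Illustrative only: counts how many times something happened."""

    def __init__(self):
        self._value = 0

    def add(self, amount=1):
        # A counter only goes up: ignore zero or negative amounts.
        if amount <= 0:
            return
        self._value += amount

    def value(self):
        return self._value


class LabeledCounterSketch:
    """Illustrative only: a family of related counters, one per label."""

    def __init__(self):
        # A counter is created lazily the first time a label is used.
        self._counters = defaultdict(CounterSketch)

    def __getitem__(self, label):
        return self._counters[label]

    def snapshot(self):
        """Return the current value of every label as a plain dict."""
        return {label: c.value() for label, c in self._counters.items()}


errors = LabeledCounterSketch()
errors["network"].add()
errors["network"].add()
errors["parse"].add(3)
print(errors.snapshot())  # {'network': 2, 'parse': 3}
```

Counting related things under one labeled metric keeps them together for analysis, rather than spreading them over many unrelated counters.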

In order to empower developers to use Glean for such use-cases, we kicked off a process to request changes to existing metric types or even create new ones.

The “lean” in Glean: a lightweight process

From a requester’s point of view, the process is meant to be lean and straightforward:

1. File a bug in the Data platforms & tools::Glean Metric Types component (requires a Bugzilla account) and fill in the provided form.

Yes, that’s it! It’s a numbered list with only one item!

The Bugzilla form that shows up when filing a new bug is designed to provide all the information we think is needed to move the decision-making process forward:

Proposal for changing an existing or adding a new Glean metric type
Who is the individual/team requesting this change?
Example: Alessio Placitelli, Glean SDK team.

Is this about changing an existing metric type or creating a new one?
Example: creating a new metric type.

Can you describe the data that needs to be recorded?
Example: We need to record whether or not some feature was enabled in a product.

Can you provide a raw sample of the data that needs to be recorded (this is in the abstract, and not any particular implementation details about its representation in the payload or the database)?
Example: feature is available -> true to represent that the feature was enabled, feature is available -> false to represent that the feature was not enabled.

What is the business question/use-case that requires the data to be recorded?
Example: we need to understand how many users turn on/off the feature we are working on.

How would the data be consumed?
Example: we would like to see aggregated counts available on the GLAM dashboard.

Why are existing metric types not enough?
Example: no metric type exists to record boolean values. Events could be used instead, sending an event every time the feature is flipped on or off. This would make it harder to consume on available tools such as GLAM, requiring us to write custom SQL code.

What is the timeline by which the data needs to be collected?
Example: Q4 2021.

From there on, here’s what happens, roughly:

The triage owner creates a discussion template and adds the request details to it (within 6 business days).
All the stakeholders of the Glean ecosystem (Data Science, Data Stewards, Data Tools, Pipeline and SDK! Whoa, many teams!) are flagged for feedback on the document.
Teams required to do the work provide an estimate of the effort required (e.g. how much time and effort are needed to create the APIs? How much for tooling support?).
The Glean end-to-end tech lead goes through the feedback from the stakeholders and decides whether or not the request is in line with the product strategy.
The decision is documented on the initial bug along with the discussion document, which is made publicly available at this point.

Of course, I’m leaving out most of the nitty-gritty details. You can find them in the process wiki, if you’re curious!

If you have questions about the process or need help, please do reach out to the #Glean channel on Matrix or send us an email.