Let’s face it: if you’re reading this, chances are you are receiving the dreaded voice messages more often than you would like. I like the romantic feeling behind voice messages on Instant Messaging platforms: you can feel the nuances of the sender’s voice and quickly decipher their mood. However, as you start receiving voice messages more often, that romantic feeling collapses under the weight of a stack of these:

A stack of voice messages in Signal

I get why people send voice messages: it saves them time, allowing them to record their voice while doing other things. Some even go as far as sending a voice message just to say “yes” or “no”. Sadly, while it can be convenient to listen to a message while doing something else, there are, in my opinion, some major drawbacks:

  • searching them is awful; you need to go through the voice messages and listen to them in order to find what you need;
  • voice messages can’t be skimmed like text messages can; while it’s true that most IM apps allow playing voice messages back at a faster speed, it’s not the same.

CoquiSTT + Mozilla Common Voice to the rescue

And then it finally occurred to me: everybody is shouting Machine Learning all over the place. Let’s try to put it to good use, shall we? What if we had a way to use speech-to-text ML algorithms to transcribe the received audio messages and show their text instead?

By using the fantastic STT APIs from Coqui and the Italian model trained by Mozilla Italia on the Mozilla Common Voice data, I was able to stitch together a prototype of Signal Desktop that does just that. I picked Signal both because I like it quite a lot and because, since the desktop version is basically built with web tech, I felt very familiar with it. The Coqui STT community and its developers were quite helpful and friendly throughout the whole process; what’s not to love?

The full source code of this experiment is available on my GitHub fork (stt_audio branch). Here’s how it behaves:

Signal Desktop using Coqui STT to transcribe audio messages

Adding speech-to-text to Signal Desktop

The first thing I did was to fork Signal Desktop and, before making any change to its source code, make sure I was able to build and run it locally. Their Contributor Guidelines have a nice step-by-step guide that got me through the process, with minimal hiccups due to node-pre-gyp. I then set up a staging environment using data from my real Signal install: I had just received a couple of audio messages, so I used them as testing material!

With a development version of Signal Desktop up and running, along with some sample audio, I started tinkering with STT.

Step 1 - Adding the CoquiSTT dependency and downloading a model

Adding the CoquiSTT dependency was as easy as running yarn add stt@1.3.0 (truth is, I worked with version 1.2.0 and had to hack around to add Electron 16 support to CoquiSTT, but that resulted in my first contribution to CoquiSTT, which went live in 1.3.0!). I then downloaded the Mozilla Italian model from the Coqui Models page and unpacked it in Signal-Desktop/models/it.

Step 2 - Building an abstraction around the low-level STT APIs

The core of the integration lives in the SpeechToText.ts file, which exports two main functions: start() and getText(url).
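
Before diving into the two functions, here is a minimal sketch of the imports and module-level state they rely on; the names (BASE_MODEL_PATH, activeModel, audioContext) match the snippets below, but the exact declarations and paths in the stt_audio branch may differ:

// Assumed imports; the exact import paths in the branch may be different.
import path from 'path';
import * as coquiSTT from 'stt';
import * as log from '../logging/log';

// The model files downloaded in Step 1 live in Signal-Desktop/models/it;
// the relative path used here is an assumption for this sketch.
const BASE_MODEL_PATH = path.join(__dirname, '..', '..', 'models', 'it');

// Initialized by start() and used by getText().
let activeModel: coquiSTT.Model | undefined;
let audioContext: AudioContext | undefined;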

The start function gets called at startup, when Signal loads other resources as well (e.g. stickers, emojis). It’s responsible for loading the model downloaded in the previous step using the Model class, and for setting up the conversion mechanism so that the sample rate of the voice messages matches the one used by the downloaded model. The beauty of the “Web as a platform” is that we can perform this conversion automagically by creating an AudioContext with a sampleRate that matches the model’s: whenever the context is used, the sample rate conversion takes place automatically.

export async function start(): Promise<void> {
  log.info(`SpeechToText.start: loading model at ${BASE_MODEL_PATH} with Coqui v ${coquiSTT.Version()}`);
  try {
    // Yes, file names are hardcoded for the sake of building a working PoC fast :-)
    activeModel = new coquiSTT.Model(path.join(BASE_MODEL_PATH, "model.tflite"));
    activeModel.enableExternalScorer(path.join(BASE_MODEL_PATH, "it-mzit-1-prune-kenlm.scorer"));
    log.info(`SpeechToText.start: model at ${BASE_MODEL_PATH} successfully loaded`);
  } catch (e) {
    log.error("SpeechToText.start: failed to load model", e);
    activeModel = undefined;
    return;
  }

  // Create an audio context for future processing.
  audioContext = new AudioContext({
    // Use the model's sample rate so that the decoder will resample
    // for us when calling `getText`.
    sampleRate: activeModel.sampleRate()
  });
  audioContext.suspend();
}

The getText(url) function takes a URL pointing to the audio sample (the actual voice message), fetches it, and then uses the previously initialized audio context to decode it into a raw audio buffer with a sample rate that matches the model’s. That’s as simple as calling decodeAudioData (again, the beauty of the Web as a platform). The Coqui STT APIs expect the audio values to be an array of 16-bit, mono raw audio samples, but the audio context returns a Float32Array, so some conversion is needed before the audio can be processed.

The converted, raw buffer can then be fed to the loaded model by creating a stream, feeding it with the buffer, and then waiting for the transcribed text to be computed.

export async function getText(url: string): Promise<string> {
  if (!activeModel || !audioContext) {
    throw new Error(
      'SpeechToText.start() must be successfully called before transcribing messages'
    );
  }

  log.info(`SpeechToText.getText: transcribing ${url} with model ${BASE_MODEL_PATH}`);
  const response = await fetch(url);
  const raw = await response.arrayBuffer();
  const audioBuffer = await audioContext.decodeAudioData(raw);

  if (audioBuffer.sampleRate !== activeModel.sampleRate()) {
    // In practice, this should never happen. The audio context should do its magic
    // for us to prevent it.
    throw new Error(
      `SpeechToText.getText: message rate ${audioBuffer.sampleRate}, model rate ${activeModel.sampleRate()}`
    );
  }

  // Convert the Float32 samples from the decoder to the 16-bit integer
  // samples expected by the Coqui STT APIs, then feed them to the model.
  const processedData = converFloat32ToInt16(audioBuffer.getChannelData(0));
  const modelStream = activeModel.createStream();
  modelStream.feedAudioContent(Buffer.from(processedData.buffer));
  return modelStream.finishStream();
}
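
The converFloat32ToInt16 helper is not shown in the snippet above; here is a minimal sketch of what such a conversion could look like, assuming the standard [-1, 1] range produced by decodeAudioData (the actual helper in the branch may differ):

// Convert Web Audio Float32 samples (in the [-1, 1] range) into the 16-bit
// signed integer samples expected by the Coqui STT APIs.
function converFloat32ToInt16(samples: Float32Array): Int16Array {
  const result = new Int16Array(samples.length);
  for (let i = 0; i < samples.length; i += 1) {
    // Clamp each sample, then scale it to the 16-bit signed integer range.
    const s = Math.max(-1, Math.min(1, samples[i]));
    result[i] = s < 0 ? s * 0x8000 : s * 0x7fff;
  }
  return result;
}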

Step 3 - Wiring things together

The easiest way I found to stitch everything together was to add a specialized React effect in MessageAudio.tsx to perform the speech-to-text transcription for messages that have “audio attachments”. This was as easy as calling the previously discussed getText function and providing it with the URL of the audio attachment. I slightly extended the HTML of the component showing the message in Signal so that it can display the transcribed text once it’s available. See this commit.
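
As a rough idea of what that wiring could look like, here is a sketch of such an effect; the prop and state names (audioUrl, transcript) and the SpeechToText import are assumptions for illustration, and the actual code in the commit may differ:

// Sketch of an effect inside MessageAudio.tsx.
const [transcript, setTranscript] = React.useState<string | undefined>(undefined);

React.useEffect(() => {
  let cancelled = false;

  SpeechToText.getText(audioUrl)
    .then(text => {
      // Ignore the result if the component was unmounted in the meantime.
      if (!cancelled) {
        setTranscript(text);
      }
    })
    .catch(() => {
      // Transcription is best-effort: leave the transcript empty on failure.
    });

  return () => {
    cancelled = true;
  };
}, [audioUrl]);

// The component's JSX can then render {transcript} below the audio controls
// once it becomes available.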

Conclusions and potential improvements

This was a fun experiment, but it is nowhere near production ready. I really wish this were a polished feature of the products I use the most! While the current implementation relies on models being downloaded, the major upside is that the raw, unencrypted audio streams never have to leave the local machine: the speech-to-text computation happens completely offline. Aside from the transcription quality, there are a few companion features that I believe would make such a capability amazing:

  • once a transcription is computed, we should save it locally and not automatically re-trigger STT the next time (unless the user requests it, for example when a new model is available); this would save processing time and make loading huge conversations much faster (see the sketch after this list);
  • with transcriptions safely stored on disk, we could even think about making audio messages searchable by indexing the transcribed text. That would make my days so much better;
  • allow users to tweak the transcribed text, manually fixing wrong transcriptions and making searches more effective; who knows, maybe in the future this could feed back into the model via federated learning or one-shot distributed learning.
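
As for the first point, here is a minimal sketch of what such caching could look like on top of the getText function above; the in-memory Map is just for illustration, and a real implementation would persist the transcription in Signal’s message database instead:

// Hypothetical cache keyed by the attachment URL.
const transcriptionCache = new Map<string, string>();

export async function getTextCached(url: string): Promise<string> {
  const cached = transcriptionCache.get(url);
  if (cached !== undefined) {
    return cached;
  }

  // Compute the transcription once and remember it for future lookups.
  const text = await getText(url);
  transcriptionCache.set(url, text);
  return text;
}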