Speech-to-Text Transcription Model using Deep Speech
Speech-to-text models have made users more comfortable when using online voice services. Businesses can now use speech recognition models in their operations. <!--more-->
Speech-to-text transcription is a subset of natural language processing that is used to convert speech to text. The speech may be in the form of audio or video files. The model analyses the speech and converts it to the corresponding text.
Speech-to-text models are applied in various areas, such as:
- Subtitle generation in audio and video files.
- Medical sector to convert spoken words to text.
- Online voice services. It is applied in businesses that use online customer support.
- Automatic generation of documents: producing text documents from audio instead of typing.
In this tutorial, we will use the Deep Speech library to build the model. Deep Speech takes a digital audio file as input and outputs the corresponding text.
Table of contents
- Prerequisites
- Types of transcription models
- Getting started with Deep Speech
- Installing dependencies
- Importing important packages
- Adding model file paths
- Initializing models
- Simulating streaming audio
- Function to read the audio file
- Function to perform the transcription
- Downloading an audio file
- Playing the audio file
- Calling the function
- Conclusion
Prerequisites
To follow along with this tutorial, a reader should:
- Be familiar with Python programming.
- Understand machine learning models.
- Use Google Colab notebook.
Types of transcription models
In speech-to-text transcription, we have two types of models: streaming transcription and batch transcription.
Streaming transcription
This is a type of transcription that deals with real-time audio or video. In streaming transcription, the input audio is broken into chunks, and the model outputs the transcribed text in real-time as the audio is processed. It is effective when you have live events and want the transcribed text in real-time.
Streaming transcription can be applied in the following areas:
- Live podcasts, webinars, and videos.
- Live online events and teleconferencing.
- Real-time phone calls.
Batch transcription
This is a type of transcription that deals with offline audio and video files. It is best suited when we have large, pre-recorded video and audio files. The model only produces the transcribed text when it has finished processing the entire input, as the sketch below illustrates.
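Here is a minimal sketch of batch transcription with Deep Speech, assuming the package and the pre-trained acoustic model from the *Getting started with Deep Speech* section below have already been installed and downloaded, and that `speech.wav` is a 16 kHz mono WAV file (the sample file we download later in this tutorial):

```python
import wave
import numpy as np
from deepspeech import Model

# Load the pre-trained acoustic model (downloaded later in this tutorial).
model = Model('deepspeech-0.8.2-models.pbmm')

# Read the whole WAV file into a 16-bit integer array.
with wave.open('speech.wav', 'rb') as w:
    audio = np.frombuffer(w.readframes(w.getnframes()), dtype=np.int16)

# stt() processes the entire file at once and returns the final transcript,
# in contrast to the streaming API used in the rest of this tutorial.
print(model.stt(audio))
```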
In this tutorial, we will be dealing with streaming transcription of audio data.
Getting started with Deep Speech
Deep Speech is a library used for speech-to-text transcription. It uses deep neural networks to convert speech spectrograms into a text transcript and is built using the TensorFlow framework. We install Deep Speech using the following command:
```bash
!pip install deepspeech==0.8.2
```
In the code above, we have installed Deep Speech version 0.8.2.
Deep Speech is made up of two pre-trained models that we have to download: the acoustic model and the language model.
Acoustic model
It is a pre-trained model that converts the raw audio into a spectrogram, analyses the spectrogram, and maps it to the corresponding text transcript. This is the model that performs the actual speech recognition.
To download the acoustic model, run this code:
```bash
!wget https://github.com/mozilla/DeepSpeech/releases/download/v0.8.2/deepspeech-0.8.2-models.pbmm
```
Language model
It is a pre-trained scorer that fine-tunes and guides the acoustic model. It refines the output text and makes sure it is valid: it checks the grammar and the contextual relationships of words in a sentence, which increases the accuracy of the transcription output.
To download the language model, run this code:
```bash
!wget https://github.com/mozilla/DeepSpeech/releases/download/v0.8.2/deepspeech-0.8.2-models.scorer
```
We also need to install some more dependencies.
Installing dependencies
We will install the I/O libraries, which provide cross-platform audio and video support. This will enable us to perform audio and video processing on any operating system.
For more information about these libraries, read this documentation.
To install all the I/O libraries, use this code:
```bash
!apt install libasound2-dev portaudio19-dev libportaudio2 libportaudiocpp0 ffmpeg
```
The next step is to import the packages and functions we will use to build the transcription model.
Importing important packages
To import the necessary packages and functions, use this code:
```python
from deepspeech import Model
import numpy as np
import os
import wave
from IPython.display import Audio
```
- `Model` - The class used to initialize our downloaded pre-trained acoustic model.
- `numpy` - Converts the audio file into an array and performs mathematical computations on it.
- `os` - Enables us to interact with the operating system and perform core operating-system functions from within the notebook.
- `wave` - Enables us to read the WAV audio format.
- `IPython.display` - Enables us to display and play the audio file in the Google Colab notebook.
Adding model file paths
We add the file paths for the acoustic and language models so that we can use them:
```python
model_file_path = 'deepspeech-0.8.2-models.pbmm'
lm_file_path = 'deepspeech-0.8.2-models.scorer'
```
After adding the file paths, we also initialize the following hyperparameters:
```python
beam_width = 100
lm_alpha = 0.93
lm_beta = 1.18
```
In the code above, `beam_width` sets how many candidate transcripts the decoder keeps while searching for the best one; a larger beam width can improve accuracy at the cost of speed. `lm_alpha` and `lm_beta` are the language model weights: `lm_alpha` controls how strongly the scorer influences the transcript, while `lm_beta` is a bonus applied for each word that is inserted.
Next, let's now add the parameters above into our model.
Initializing models
We use the following code:
```python
model = Model(model_file_path)
model.enableExternalScorer(lm_file_path)
model.setScorerAlphaBeta(lm_alpha, lm_beta)
model.setBeamWidth(beam_width)
```
In the code above, we have loaded the acoustic model, enabled the external language model (scorer), set the scorer weights, and set the beam width so that we can begin using the models.
Simulating streaming audio
The audio data will be broken into chunks. This simulates data being received from a real-time recording and enables us to achieve real-time transcription. We first create a stream that will hold the model's state as we feed it audio:
```python
stream = model.createStream()
```
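For reference, the stream object returned by `createStream()` exposes a small API in the DeepSpeech Python bindings. The sketch below walks through its lifecycle using a throwaway stream; the `demo_stream` name and the silent audio are our own illustration, not part of the tutorial's pipeline:

```python
import numpy as np

demo_stream = model.createStream()
# Feed a short burst of silence (16-bit PCM samples) just to show the calls.
demo_stream.feedAudioContent(np.zeros(1024, dtype=np.int16))
# Partial transcript of the audio fed so far (empty for silence).
print(demo_stream.intermediateDecode())
# Finalize the stream and get the full transcript; the stream cannot be
# reused after this call.
print(demo_stream.finishStream())
```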
The next step is to create a function used to read the audio file.
Function to read the audio file
The function is created using the following code:
```python
def read_audio_file(filename):
    # Open the WAV file in read-binary mode.
    with wave.open(filename, 'rb') as w:
        rate = w.getframerate()        # sample rate of the audio
        frames = w.getnframes()        # total number of audio frames
        buffer = w.readframes(frames)  # raw audio bytes
    return buffer, rate
```
The function is named `read_audio_file`. The audio file we will use in this model is in the WAV audio format, and the `wave.open` method is used to read it. The function also retrieves the audio frame rate and the number of frames using the `getframerate()` and `getnframes()` methods. The `buffer` variable holds the raw audio bytes read from the file.
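As a quick sanity check (this snippet is our addition; `speech.wav` is the sample file we download in a later step), we can confirm that the file's sample rate matches the rate the model expects. The DeepSpeech API reports the expected rate via `model.sampleRate()`, which is 16,000 Hz for the 0.8.2 models:

```python
buffer, rate = read_audio_file('speech.wav')
# Deep Speech models are trained at a fixed sample rate; feeding audio at a
# different rate degrades accuracy, so check the file before transcribing.
if rate != model.sampleRate():
    print(f'Resample needed: file is {rate} Hz, model expects {model.sampleRate()} Hz')
```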
We now need to create another function that will be used to perform the transcription.
Function to perform the transcription
The function is created using the following code:
```python
def real_time_transcription(audio_file):
    buffer, rate = read_audio_file(audio_file)
    offset = 0
    batch_size = 8196  # size of each audio chunk in bytes
    text = ''
    while offset < len(buffer):
        end_offset = offset + batch_size
        chunk = buffer[offset:end_offset]
        # Convert the raw bytes into 16-bit integer samples.
        data16 = np.frombuffer(chunk, dtype=np.int16)
        # Feed the chunk into the model and decode the audio seen so far.
        stream.feedAudioContent(data16)
        text = stream.intermediateDecode()
        print(text)
        offset = end_offset
    return text
```
The function is named `real_time_transcription` and takes the audio file path as its parameter. It calls the `read_audio_file` function, which reads the WAV audio file and returns the `buffer` and `rate`. We set `batch_size=8196`; the batch size represents the size of each audio chunk, so one broken audio chunk will be 8196 bytes.

The function also has a `while` loop. The `while` loop is used to iterate through the audio buffer. During each iteration, it breaks off a chunk of the required size (8196 bytes).

The `feedAudioContent` method feeds the audio chunks into the model for transcription. `intermediateDecode()` decodes the audio the model has seen so far and produces the corresponding text. Finally, the function prints each partial transcript and returns the latest one.
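One detail worth noting: the function above never finalizes the stream we created earlier. After `real_time_transcription` has consumed all the audio, we can call `finishStream()` (part of the DeepSpeech streaming API) to obtain the definitive transcript and release the stream's resources. This step is our addition to the tutorial's flow:

```python
# After real_time_transcription() has fed the whole file, finalize the
# stream to get the definitive transcript and free its resources.
final_text = stream.finishStream()
print(final_text)
```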
Before we call this function, we need to download a sample audio file that the model will use.
Downloading an audio file
To download the audio file, use the following command:
```bash
!wget -O speech.wav https://github.com/EN10/DeepSpeech/blob/master/man1_wb.wav?raw=true
```
To see if the audio file has been added to our working directory, use this command:
```bash
!ls
```
The output lists the contents of the working directory: the two model files (acoustic and language) and the downloaded audio file saved as `speech.wav`.
Playing the audio file
To play the audio file in Google Colab, use this code:
```python
Audio('speech.wav')
```
When we run this code, Colab provides an audio player with a play button that we can press to listen to the audio. We will pass `speech.wav` as the argument when we call the `real_time_transcription` function.
Calling the function
To call the `real_time_transcription` function, run this code:
```python
real_time_transcription('speech.wav')
```
This code will produce a text transcript in real-time. The model prints a progressively longer partial transcript as each chunk is processed, showing that it was able to do real-time transcription. Together, the acoustic and language models produce an accurate transcription.
Conclusion
In this tutorial, we have learned how to do streaming transcription using Deep Speech. We discussed the types of transcription models and explored the pre-trained models that Deep Speech provides. We then used the acoustic and language models to build a model that was able to do real-time transcription.
To access this notebook, click here.
Peer Review Contributions by: Collins Ayuya