wav2letter: A Facebook AI Research (FAIR) Automatic Speech Recognition Toolkit

Jan. 3, 2018, 8:15 a.m. By: Kirti Bakshi


Machine learning, or artificial intelligence more broadly, is essential to Facebook. It helps people discover new content and connect with the stories they care about most. Facebook's applied machine learning engineers and researchers develop algorithms that rank ads, feeds, and search results, and that power new text-understanding systems which keep spam and misleading content at bay.

wav2letter, from Facebook AI Research (FAIR), is a simple and efficient end-to-end Automatic Speech Recognition (ASR) system. ASR has been a major machine learning challenge for decades. FAIR's ongoing research in this area examines the use of deep learning models for multilingual and low-resource scenarios, as well as for distant and noisy recording conditions. The original authors of this implementation of wav2letter are Ronan Collobert, Christian Puhrsch, Gabriel Synnaeve, Neil Zeghidour, and Vitaliy Liptchinsky.

wav2letter implements the architectures proposed in two papers:

  • Wav2Letter: an End-to-End ConvNet-based Speech Recognition System.

  • Letter-Based Speech Recognition with Gated ConvNets.

For those who wish to start transcribing speech right away, pre-trained models are provided for the Librispeech dataset.


Requirements:

  • A computer running macOS or Linux.

  • Torch.

  • For training on CPU: Intel MKL.

  • For training on GPU: NVIDIA CUDA Toolkit (cuDNN v5.1 for CUDA 8.0).

  • For reading audio files: Libsndfile. It should be available in any standard distribution.

  • For standard speech features: FFTW. It should also be available in any standard distribution.
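Before installing, it can help to check which of these prerequisites are already present. The following is a minimal Python sketch, not part of wav2letter itself. The function name check_prerequisites is hypothetical, and the check only covers the operating system and the two shared libraries; it does not look for Torch, Intel MKL, or CUDA. It also relies on the standard library's ctypes.util.find_library, which may miss libraries installed in non-standard locations.

```python
import ctypes.util
import platform


def check_prerequisites():
    """Report which of the listed prerequisites look available on this machine."""
    system = platform.system()  # e.g. "Linux", "Darwin" (macOS), "Windows"
    return {
        "supported_os": system in ("Linux", "Darwin"),
        # find_library returns a library name or path, or None if nothing was found.
        "libsndfile": ctypes.util.find_library("sndfile") is not None,
        "fftw": ctypes.util.find_library("fftw3") is not None,
    }


if __name__ == "__main__":
    for name, found in check_prerequisites().items():
        print(f"{name}: {'found' if found else 'not found'}")
```

A "not found" result is only a hint: the library may still be installed somewhere the lookup does not search.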

Wav2Letter: an End-to-End ConvNet-based Speech Recognition System - The Paper:

The accompanying paper presents a simple end-to-end model for speech recognition that combines a convolutional-network-based acoustic model with graph decoding. The model is trained to output letters directly from transcribed speech, without any need for forced alignment of phonemes.
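To make the idea of "outputting letters" concrete, here is a heavily simplified Python sketch of turning per-frame letter scores into text. It is an illustration only and is not the paper's decoder: it uses greedy best-path decoding (best letter per frame, then merging consecutive repeats) instead of graph decoding. The function name greedy_letters, the toy alphabet, the "|" word-boundary symbol, and the scores are all assumptions made for this example. Doubled letters would need extra handling that the sketch leaves out.

```python
def greedy_letters(frame_scores, alphabet):
    """Pick the best-scoring letter in each frame, then merge consecutive repeats."""
    best_path = [
        max(range(len(alphabet)), key=lambda i: scores[i])
        for scores in frame_scores
    ]
    letters = []
    previous = None
    for index in best_path:
        if index != previous:  # a run of the same letter counts once
            letters.append(alphabet[index])
        previous = index
    return "".join(letters)


# Toy input: six frames, four possible letters ("|" marks a word boundary here).
alphabet = ["|", "a", "c", "t"]
frame_scores = [
    [0.1, 0.2, 2.0, 0.0],  # best: c
    [0.0, 0.3, 1.5, 0.1],  # best: c
    [0.2, 1.9, 0.1, 0.0],  # best: a
    [0.0, 0.4, 0.2, 2.2],  # best: t
    [0.1, 0.0, 0.3, 1.7],  # best: t
    [1.8, 0.2, 0.0, 0.1],  # best: |
]
print(greedy_letters(frame_scores, alphabet))  # prints "cat|"
```

No frame-level alignment is supplied anywhere: the letters fall out of the scores themselves, which is the property the paper relies on.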

The paper also introduces an automatic segmentation criterion for training from sequence annotation without alignment. According to the authors, the criterion is on par with CTC while being simpler.
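A criterion of this kind can be sketched in a few lines of Python. The version below is based on the paper's general description (no blank label, letter-to-letter transition scores, and normalization over all possible letter sequences) and is not the authors' implementation. All function names are hypothetical. It assumes that transitions[prev][cur] holds the score of moving from letter prev to letter cur, that emissions[t][c] is the unnormalized score of letter c at frame t, and that the target contains no two identical consecutive letters.

```python
import math


def logadd(values):
    """Numerically stable log(sum(exp(v))) over log-domain scores."""
    values = list(values)
    peak = max(values)
    if peak == -math.inf:
        return -math.inf
    return peak + math.log(sum(math.exp(v - peak) for v in values))


def target_paths_score(emissions, transitions, target):
    """Log-sum of the scores of every frame-level path that spells `target`.

    At each frame a path either stays on its current target letter
    or advances to the next one; no alignment is given in advance.
    """
    alpha = [-math.inf] * len(target)
    alpha[0] = emissions[0][target[0]]
    for frame in emissions[1:]:
        new_alpha = []
        for pos, letter in enumerate(target):
            stay = alpha[pos] + transitions[letter][letter]
            move = -math.inf
            if pos > 0:
                move = alpha[pos - 1] + transitions[target[pos - 1]][letter]
            new_alpha.append(frame[letter] + logadd([stay, move]))
        alpha = new_alpha
    return alpha[-1]


def all_paths_score(emissions, transitions):
    """Log-sum of the scores of every possible letter path (the normalizer)."""
    num_letters = len(emissions[0])
    beta = list(emissions[0])
    for frame in emissions[1:]:
        beta = [
            frame[cur] + logadd(
                beta[prev] + transitions[prev][cur] for prev in range(num_letters)
            )
            for cur in range(num_letters)
        ]
    return logadd(beta)


def asg_style_loss(emissions, transitions, target):
    """Negative log-likelihood of the target under global normalization."""
    return all_paths_score(emissions, transitions) - target_paths_score(
        emissions, transitions, target
    )


# Toy example: three frames, two letters, target sequence "letter 0, letter 1".
emissions = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]
transitions = [[0.0, 0.0], [0.0, 0.0]]
print(asg_style_loss(emissions, transitions, [0, 1]))
```

The loss is never negative, because the paths that spell the target are a subset of all paths. Training would push it toward zero by raising the scores of the target paths relative to all others.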

The paper reports competitive word error rates on the Librispeech corpus with Mel-Frequency Cepstral Coefficient (MFCC) features, and promising results from the raw waveform.
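For readers unfamiliar with the metric, word error rate is commonly computed as the word-level edit distance between the reference transcript and the system output, divided by the number of reference words. The Python sketch below shows that standard definition. The function name word_error_rate and the example sentences are made up for illustration, and the paper's exact scoring setup (for instance, its text normalization) is not reproduced here.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")

    # previous_row[j]: edits needed to turn the reference words seen so far
    # into the first j hypothesis words.
    previous_row = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        current_row = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution = previous_row[j - 1] + (ref_word != hyp_word)
            deletion = previous_row[j] + 1
            insertion = current_row[j - 1] + 1
            current_row.append(min(substitution, deletion, insertion))
        previous_row = current_row
    return previous_row[-1] / len(ref_words)


# One missing word out of six reference words.
print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
```

A lower value is better, and the value can exceed 1.0 when the output contains many inserted words.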

The results show that the AutoSegCriterion can be faster than CTC while being as accurate. The approach breaks free from forced alignment and HMM/GMM pre-training, and it is also less computationally intensive than RNN-based approaches: on average, one LibriSpeech sentence is processed in less than 60 ms by the ConvNet, and the decoder runs at 8.6x on a single thread.

For more detail, a link to the PDF of the paper is provided below:

Wav2Letter: an End-to-End ConvNet-based Speech Recognition System: Click Here.

For More Information: GitHub