Speech-recognition app to convert MP3 to text?

Does anyone know of an application that can convert audio to text? I'm running Ubuntu 12.04 LTS.


The software you can use is Vosk-api, a modern speech recognition toolkit based on neural networks. It supports 7+ languages and works on a variety of platforms, including Raspberry Pi and mobile.

First convert the file to the format the recognizer requires (16 kHz mono WAV):

ffmpeg -i file.mp3 -ar 16000 -ac 1 file.wav
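
If you want to double-check the conversion before running the recognizer, here is a minimal sketch using only Python's standard wave module. It assumes the recognizer wants mono, 16-bit PCM at 16000 Hz (which is what the ffmpeg command above should produce by default); check_wav is a helper name made up for this answer, not part of vosk.

```python
import wave


def check_wav(path, rate=16000):
    """Return a list of problems found; an empty list means the file looks fine."""
    problems = []
    with wave.open(path, "rb") as wf:
        if wf.getnchannels() != 1:
            problems.append("expected 1 channel, got %d" % wf.getnchannels())
        if wf.getsampwidth() != 2:
            # Sample width is in bytes, so 2 means 16-bit samples.
            problems.append("expected 16-bit samples, got %d-bit" % (8 * wf.getsampwidth()))
        if wf.getframerate() != rate:
            problems.append("expected %d Hz, got %d Hz" % (rate, wf.getframerate()))
    return problems
```

Calling check_wav("file.wav") after the ffmpeg step should return an empty list; anything else tells you which ffmpeg option to revisit.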

Then install vosk-api with pip:

pip3 install vosk

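Then fetch the examples and a model, and run the recognizer (test.wav is the sample file shipped in the repository; substitute your own converted file):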

git clone https://github.com/alphacep/vosk-api
cd vosk-api/python/example
wget https://alphacephei.com/kaldi/models/vosk-model-small-en-us-0.3.zip
unzip vosk-model-small-en-us-0.3.zip
mv vosk-model-small-en-us-0.3 model
python3 ./test_simple.py test.wav > result.json

The result is stored in JSON format.
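
If you would rather call the recognizer from your own script and get plain text, here is a minimal sketch along the lines of what the example script does. It assumes the model was unpacked into ./model as above and that the WAV is 16 kHz mono. transcribe and join_results are names made up for illustration; Model, KaldiRecognizer, AcceptWaveform, Result and FinalResult come from the vosk package.

```python
import json
import wave


def join_results(json_chunks):
    """Join the "text" field of each JSON result string, skipping empty ones."""
    texts = (json.loads(chunk).get("text", "") for chunk in json_chunks)
    return " ".join(t for t in texts if t)


def transcribe(wav_path, model_dir="model"):
    """Run vosk over a WAV file and return the recognized text as one string."""
    # Imported here so join_results can be used without vosk installed.
    from vosk import Model, KaldiRecognizer

    chunks = []
    with wave.open(wav_path, "rb") as wf:
        rec = KaldiRecognizer(Model(model_dir), wf.getframerate())
        while True:
            data = wf.readframes(4000)
            if not data:
                break
            # AcceptWaveform returns true when an utterance has ended.
            if rec.AcceptWaveform(data):
                chunks.append(rec.Result())
        # Flush whatever is left after the last chunk of audio.
        chunks.append(rec.FinalResult())
    return join_results(chunks)
```

With this, print(transcribe("test.wav")) should give the text without the surrounding JSON.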

The same directory also contains an srt subtitle output example, which is easier to evaluate and can be directly useful to some users:

python3 -m pip install srt
python3 ./test_srt.py test.wav

The example file in the repository contains three sentences, spoken with a perfect American English accent and perfect sound quality, which I transcribe as:

one zero zero zero one
nine oh two one oh
zero one eight zero three

The "nine oh two one oh" is said very fast, but is still clear. The "z" of the second-to-last "zero" sounds a bit like an "s".

The SRT generated above reads:

1
00:00:00,870 --> 00:00:02,610
what zero zero zero one

2
00:00:03,930 --> 00:00:04,950
no no to uno

3
00:00:06,240 --> 00:00:08,010
cyril one eight zero three

so we can see that several mistakes were made, presumably in part because we, unlike the recognizer, know in advance that all the words are digits.

Next I tried vosk-model-en-us-aspire-0.2, which is listed at https://alphacephei.com/vosk/models and is a 1.4 GB download, compared to 36 MB for vosk-model-small-en-us-0.3:

mv model model.vosk-model-small-en-us-0.3
wget https://alphacephei.com/vosk/models/vosk-model-en-us-aspire-0.2.zip
unzip vosk-model-en-us-aspire-0.2.zip
mv vosk-model-en-us-aspire-0.2 model

and the result was:

1
00:00:00,840 --> 00:00:02,610
one zero zero zero one

2
00:00:04,026 --> 00:00:04,980
i know what you window

3
00:00:06,270 --> 00:00:07,980
serial one eight zero three

which got one more word correct.
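
To put a number on the comparison, one option is a word-level edit distance against my own transcription. This is only a rough sketch: word_errors is a helper written for this answer, not part of vosk, and the reference is simply what I hear in the audio.

```python
def word_errors(reference, hypothesis):
    """Word-level Levenshtein distance between two strings."""
    ref, hyp = reference.split(), hypothesis.split()
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            curr.append(min(
                prev[j] + 1,             # word missing from the hypothesis
                curr[j - 1] + 1,         # extra word in the hypothesis
                prev[j - 1] + (r != h),  # substitution, or match at no cost
            ))
        prev = curr
    return prev[-1]


reference = ["one zero zero zero one",
             "nine oh two one oh",
             "zero one eight zero three"]
small = ["what zero zero zero one",
         "no no to uno",
         "cyril one eight zero three"]
aspire = ["one zero zero zero one",
          "i know what you window",
          "serial one eight zero three"]

small_errors = sum(word_errors(r, h) for r, h in zip(reference, small))
aspire_errors = sum(word_errors(r, h) for r, h in zip(reference, aspire))
print(small_errors, aspire_errors)
```

On the transcripts above this counts 7 errors out of 15 words for the small model and 6 for the larger one.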

Tested on vosk-api 7af3e9a334fbb9557f2a41b97ba77b9745e120b3.


I know this is old, but to expand on Nikolay's answer and hopefully save someone some time in the future: to get an up-to-date version of pocketsphinx working, you need to compile it from the GitHub or SourceForge repository (I'm not sure which is kept more up to date). Note that -j8 means run up to 8 jobs in parallel; if you have more CPU cores you can increase the number.

git clone https://github.com/cmusphinx/sphinxbase.git
cd sphinxbase
./autogen.sh
./configure
make -j8
make -j8 check
sudo make install
cd ..
git clone https://github.com/cmusphinx/pocketsphinx.git
cd pocketsphinx
./autogen.sh
./configure
make -j8
make -j8 check
sudo make install
cd ..

Then download the newest versions of cmusphinx-en-us-....tar.gz and en-70k-....lm.gz from https://sourceforge.net/projects/cmusphinx/files/Acoustic%20and%20Language%20Models/US%20English/ and unpack them:

tar -xzf cmusphinx-en-us-....tar.gz
gunzip en-70k-....lm.gz

Then you can finally proceed with the steps from Nikolay's answer:

ffmpeg -i book.mp3 -ar 16000 -ac 1 book.wav
pocketsphinx_continuous -infile book.wav \
    -hmm cmusphinx-en-us-8khz-5.2 -lm en-70k-0.2.lm \
    2>pocketsphinx.log >book.txt

Sphinx works alright. I wouldn't rely on it to make a readable version of the text, but it's good enough that you can search it if you're looking for a particular quote. That works especially well if you use a search tool like Recoll (http://www.lesbonscomptes.com/recoll/), which is built on Xapian, accepts wildcards and doesn't require exact search expressions.
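
If you don't want to set up a search tool, a rough fuzzy search over the transcript is also possible with Python's standard difflib. This is just a sketch of the idea, not what Recoll or Xapian do internally; find_quote is a made-up helper, and it assumes the transcript fits in memory.

```python
import difflib


def find_quote(transcript, quote, top=3):
    """Return the `top` word windows most similar to `quote` as (score, word_index, text)."""
    words = transcript.split()
    n = len(quote.split())
    scored = []
    # Slide a window with as many words as the quote over the transcript.
    for i in range(max(1, len(words) - n + 1)):
        window = " ".join(words[i:i + n])
        score = difflib.SequenceMatcher(None, quote.lower(), window.lower()).ratio()
        scored.append((score, i, window))
    # Highest similarity first; ties keep their original order.
    scored.sort(key=lambda item: -item[0])
    return scored[:top]
```

For example, find_quote(open("book.txt").read(), "some half-remembered phrase") lists the closest passages even when the recognizer got a word or two wrong. It is slow on long books, since it compares every window.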

Hope this helps.


If you are looking to convert speech to text, you could try opening the Ubuntu Software Center and searching for Julius.

Description

"Julius" is a high-performance, two-pass large vocabulary continuous speech recognition (LVCSR) decoder software for speech-related researchers and developers.

Another option, which isn't in the Software Center, is Simon:

... is an open-source speech recognition program and replaces the mouse and keyboard.

Reference Links

http://julius.sourceforge.jp/en_index.php

http://sourceforge.net/projects/speech2text/

http://simon-listens.org/index.php?id=122&L=1