How to use Google's YouTube speech recognition without uploading videos to YouTube?
I have a lot of lecture video content that I would like subtitles for. YouTube automatically generates subtitles for videos under certain conditions (those conditions are still somewhat of a mystery to me).
I would like to use this speech recognition technology outside of YouTube. Uploading every video just to get a transcript is too time consuming. I also don't think YouTube will caption videos longer than about 30 minutes (most of mine are), and I don't think it will caption videos that are not publicly listed, which is a problem because this is premium content meant to be sold.
Perfect scenario: There is a program that I can run from my desktop to get the transcript out of these videos and it is of equal or better quality than YouTube's and has the time codes similar to an SRT or the XML that YouTube generates [How to get YouTube subtitles].
Acceptable scenario: There are some tricks I can do to force YouTube to transcribe the videos, whether set to private or public, and despite length.
Doable scenario: There is a library or something that I can use to code my own program. I am good with C# and okay with C++ (But I really prefer C#).
Solution 1:
Google implemented the Web Speech API (for both speech recognition and synthesis) in Chrome, and you can use it if you are a developer. It is presumably backed by the same speech recognition engine that YouTube uses to generate closed captions on some videos. Maybe you'll find code to interact with it.
The data flow would probably be:
A video file => extract and convert audio => send it to the Google API => get the text => write it into an SRT file.
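The last step of that flow (text with timings => SRT) can be sketched without any external service. This is a minimal illustration in Python, assuming the recognizer has already returned timed segments; the `Segment` type and the function names are hypothetical and not part of any Google API.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    """One recognized phrase with its timing (hypothetical container)."""
    start: float  # seconds from the start of the video
    end: float    # seconds from the start of the video
    text: str


def srt_timestamp(seconds: float) -> str:
    """Format seconds as the HH:MM:SS,mmm timestamp used by SRT."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def to_srt(segments) -> str:
    """Render segments as numbered SRT cues separated by blank lines."""
    blocks = []
    for index, seg in enumerate(segments, start=1):
        blocks.append(
            f"{index}\n"
            f"{srt_timestamp(seg.start)} --> {srt_timestamp(seg.end)}\n"
            f"{seg.text.strip()}\n"
        )
    return "\n".join(blocks)
```

Writing the result to disk is then just `open("lecture.srt", "w", encoding="utf-8").write(to_srt(segments))`.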
EDIT: there doesn't seem to be an official API page, other than the W3C spec. So here are more links:
- http://www.sitepoint.com/experimenting-web-speech-api/
- http://www.smashingmagazine.com/2014/12/05/enhancing-ux-with-the-web-speech-api/
These examples are about using the API from inside Chrome, but you can also query Google's online speech recognition engine directly. For instance, Jasper, a speech-recognizing personal assistant for the Raspberry Pi, lets you choose Google as the speech recognition engine.
Solution 2:
There's a tool called "autosub" (see agermanidis/autosub on GitHub) that does precisely this, although it uses the older Google speech API. The tool uses ffmpeg to extract the audio into FLAC files and then sends the FLAC files to Google for transcription. It produces an SRT or VTT file.
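To illustrate the audio extraction step, here is a rough Python sketch that builds an ffmpeg command for cutting a mono FLAC clip out of a video. It is not autosub's actual code; the function names, the 16 kHz sample rate, and the file names are assumptions chosen for the example. The ffmpeg options used (`-ss`, `-t`, `-vn`, `-ac`, `-ar`, `-c:a`) are standard ones.

```python
import subprocess


def flac_extract_command(video_path, flac_path, start=None, duration=None,
                         sample_rate=16000):
    """Build the ffmpeg argument list for extracting mono FLAC audio."""
    cmd = ["ffmpeg", "-y"]                 # -y: overwrite the output file
    if start is not None:
        cmd += ["-ss", str(start)]         # seek to the start offset (seconds)
    cmd += ["-i", video_path]
    if duration is not None:
        cmd += ["-t", str(duration)]       # limit the clip length (seconds)
    cmd += [
        "-vn",                             # drop the video stream
        "-ac", "1",                        # downmix to one channel
        "-ar", str(sample_rate),           # resample (assumed 16 kHz)
        "-c:a", "flac",                    # encode the audio as FLAC
        flac_path,
    ]
    return cmd


def extract_flac(video_path, flac_path, **kwargs):
    """Run ffmpeg; requires ffmpeg to be installed and on the PATH."""
    subprocess.run(flac_extract_command(video_path, flac_path, **kwargs),
                   check=True)
```

For example, `extract_flac("lecture.mp4", "chunk.flac", start=30, duration=10)` would write ten seconds of audio starting at the 30 second mark.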
The accuracy is low in part because of the older Google API. There is a more recent API ("Cloud Speech REST API" at https://cloud.google.com/speech/docs/apis ). This API is pretty simple and at some point, I was going to fork autosub to use that.
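As a rough idea of what a call to the newer REST API involves, the Python sketch below builds a request body and reads the transcripts out of a response. The endpoint URL and the field names follow the v1 `speech:recognize` method as I understand it, so treat them as assumptions and check them against the current documentation. The helper names are hypothetical, and the sketch deliberately stops short of sending the request.

```python
import base64

# Assumed v1 endpoint; an API key or OAuth token is needed to call it.
RECOGNIZE_URL = "https://speech.googleapis.com/v1/speech:recognize"


def build_recognize_request(flac_bytes, language_code="en-US",
                            sample_rate=16000):
    """Build the JSON body for a synchronous recognition request."""
    return {
        "config": {
            "encoding": "FLAC",
            "sampleRateHertz": sample_rate,
            "languageCode": language_code,
            # Ask for per-word timings, which are needed to build captions.
            "enableWordTimeOffsets": True,
        },
        "audio": {
            # Audio is sent inline as base64 text.
            "content": base64.b64encode(flac_bytes).decode("ascii"),
        },
    }


def transcripts(response):
    """Return the top transcript of each result in a parsed response."""
    return [
        result["alternatives"][0]["transcript"]
        for result in response.get("results", [])
        if result.get("alternatives")
    ]
```

The body would be POSTed as JSON to `RECOGNIZE_URL`. Synchronous requests only accept short clips, so a lecture would have to be split into chunks first.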
The alternative is to upload to YouTube and download the VTT file when captioning is completed. The complication with this is that YouTube produces very fine-grained captions (a couple of words each) rather than, say, a sentence per caption. This makes it harder to check the captions when doing a manual scan.
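One way to make such fine-grained captions easier to scan is to merge neighbouring cues before reviewing them. The Python sketch below is only an illustration: it assumes the VTT file has already been parsed into `(start, end, text)` tuples in seconds, and the `max_gap` and `max_chars` thresholds are arbitrary values to tune.

```python
def merge_cues(cues, max_gap=0.5, max_chars=80):
    """Merge consecutive cues separated by short pauses.

    cues: iterable of (start, end, text) tuples sorted by start time.
    A cue is appended to the previous one when the pause between them is
    at most max_gap seconds and the merged text fits in max_chars.
    """
    merged = []
    current = None
    for start, end, text in cues:
        text = text.strip()
        if not text:
            continue  # skip empty cues
        if (
            current is not None
            and start - current[1] <= max_gap
            and len(current[2]) + 1 + len(text) <= max_chars
        ):
            # Extend the running cue to cover this fragment.
            current = (current[0], end, current[2] + " " + text)
        else:
            if current is not None:
                merged.append(current)
            current = (start, end, text)
    if current is not None:
        merged.append(current)
    return merged
```

This merges by pause length and line length rather than by punctuation, so the merged cues are phrase-like chunks, not guaranteed sentences.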
Solution 3:
The easiest way is this: go to Google Docs, open a new text document, select "Voice typing" from the Tools menu, then play your recording. Yes. It's THAT EASY! (And it supports multiple languages.)
Otherwise you can use a local webpage with HTML5 like this: https://www.labnol.org/software/add-speech-recognition-to-website/19989/