How to separate voice and background music from a video file
Unless they're separate audio tracks in your video, not easily. What you'll probably have to do is extract the audio track from the video into a separate file, edit the audio file with a dedicated tool, then remux the result back into the video.
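As a rough sketch of the extract-then-remux workflow described above, the two ffmpeg invocations can be built like this. The file names (video.mp4, audio.wav, edited.wav, output.mp4) and the helper function names are placeholders chosen for illustration, and the choice of AAC for the final audio track is an assumption; adjust it to whatever your container expects.

```python
import subprocess


def extract_audio_cmd(video, wav):
    """ffmpeg command that drops the video (-vn) and writes 16-bit PCM audio."""
    return ["ffmpeg", "-i", video, "-vn", "-c:a", "pcm_s16le", wav]


def remux_cmd(video, edited_wav, output):
    """ffmpeg command that keeps the original video stream untouched
    and swaps in the edited audio, encoded as AAC."""
    return [
        "ffmpeg", "-i", video, "-i", edited_wav,
        "-map", "0:v",   # video from the first input
        "-map", "1:a",   # audio from the second input
        "-c:v", "copy",  # do not re-encode the video
        "-c:a", "aac",
        output,
    ]


def run_pipeline():
    """Run both steps; requires ffmpeg on PATH and the input files to exist."""
    subprocess.run(extract_audio_cmd("video.mp4", "audio.wav"), check=True)
    # ... edit audio.wav in the tool of your choice and save it as edited.wav ...
    subprocess.run(remux_cmd("video.mp4", "edited.wav", "output.mp4"), check=True)
```

Copying the video stream means only the audio goes through a lossy encode on the way back in.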
The demux/remux part is easy. The difficult part is isolating the background music. You'll probably have to experiment with different effects, all of which will most likely either lose a significant amount of audio fidelity, leave some dialogue behind, or both. What's more, you'll be re-encoding the result into a new MP3/AAC file, and between the re-encoding and the audio processing, your output is likely to sound noticeably worse than the original.
You may have better results by trying to re-master the background music and replacing the audio track in the movie file entirely.
Spleeter
Spleeter is a Python library that can separate music and vocals from a mixed audio source. It is machine-learning based and offers several pretrained models that differ in the number of stems they extract.
It supports the following separations:
- Vocals (singing voice) / accompaniment separation (2 stems)
- Vocals / drums / bass / other separation (4 stems)
- Vocals / drums / bass / piano / other separation (5 stems)
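A minimal sketch of calling Spleeter from Python for the separations listed above, assuming the spleeter package is installed (it downloads the pretrained model on first use). The helper names model_name and separate are made up for this example, as are the file and directory names; Separator and separate_to_file come from Spleeter itself.

```python
def model_name(stems):
    """Map a stem count to the name of the matching pretrained Spleeter model."""
    if stems not in (2, 4, 5):
        raise ValueError("Spleeter ships 2-, 4- and 5-stem models")
    return f"spleeter:{stems}stems"


def separate(audio_path, output_dir, stems=2):
    """Split audio_path into stems and write them under output_dir."""
    # Imported here so the helper above can be used without Spleeter installed.
    from spleeter.separator import Separator

    separator = Separator(model_name(stems))
    separator.separate_to_file(audio_path, output_dir)


# Example (hypothetical paths): separate("audio.wav", "output", stems=2)
```

With the 2-stem model, the accompaniment file is the background music and the vocals file is the voice. You would typically find both in a subfolder of the output directory named after the input file.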
Audacity
Audacity, a free and open-source cross-platform audio editor, can do this using the Vocal Reduction and Isolation effect. You should first extract the audio from the video file, e.g. using ffmpeg:
ffmpeg -i video.mp4 -c:a pcm_s16le audio.wav
And then load the audio.wav file into Audacity.
If you only want to get background music, select the Remove Vocals option; if you want the opposite, choose Isolate Vocals.
Note that this is never going to sound perfect. Vocal isolation is a hard task, because everything you hear has been mixed down into two channels. An algorithm will never be as good as your brain at isolating different sound sources. For best results, your audio source should be a stereo file with the vocals panned dead-center. The effect might also produce false positives, removing other instruments in the process.
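To illustrate why center panning matters, here is a sketch of the classic center-channel cancellation trick: subtract the right channel from the left, so anything that is identical in both channels cancels out. This is not what Audacity's effect does internally (that effect is more sophisticated); it is only a crude stand-in, and the function names are invented for this example. It assumes a 16-bit stereo WAV such as the one produced by the ffmpeg command above.

```python
import array
import sys
import wave


def cancel_center(left, right):
    """Return (left - right) / 2 per sample; center-panned content cancels."""
    return [(l - r) / 2 for l, r in zip(left, right)]


def cancel_center_wav(src_path, dst_path):
    """Write a mono WAV holding the left-minus-right signal of a 16-bit stereo WAV."""
    with wave.open(src_path, "rb") as src:
        if src.getnchannels() != 2 or src.getsampwidth() != 2:
            raise ValueError("expected a 16-bit stereo WAV file")
        rate = src.getframerate()
        samples = array.array("h", src.readframes(src.getnframes()))
    if sys.byteorder == "big":
        samples.byteswap()  # WAV data is little-endian

    # Samples are interleaved: L0, R0, L1, R1, ...
    side = cancel_center(samples[0::2], samples[1::2])
    out = array.array("h", (int(s) for s in side))
    if sys.byteorder == "big":
        out.byteswap()

    with wave.open(dst_path, "wb") as dst:
        dst.setnchannels(1)
        dst.setsampwidth(2)
        dst.setframerate(rate)
        dst.writeframes(out.tobytes())
```

The same subtraction also removes any instrument that happens to sit in the center of the mix (often bass and kick drum), which is one way the false positives mentioned above arise.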