Algorithms for determining the key of an audio sample
I am interested in determining the musical key of an audio sample. How would (or could) an algorithm go about trying to approximate the key of a musical audio sample?
Antares Autotune and Melodyne are two pieces of software that do this sort of thing.
Can anyone give a bit of a layman's explanation about how this would work? To mathematically deduce the key of a song by analysing the frequency spectrum for chord progressions etc.
This topic interests me a lot!
Edit - brilliant sources and a wealth of information to be found from everyone who contributed to this question.
Especially from: the_mandrill and Daniel Brückner.
Solution 1:
It's worth being aware that this is a very tricky problem and if you don't have a background in signal processing (or an interest in learning about it) then you have a very frustrating time ahead of you. If you're expecting to throw a couple of FFTs at the problem then you won't get very far. I hope you do have the interest as it is a really fascinating area.
Initially there is the problem of pitch recognition, which is reasonably easy to do for simple monophonic instruments (eg voice) using a method such as autocorrelation or harmonic sum spectrum (eg see Paul R's link). However, you'll often find that this gives the wrong results: you'll often get half or double the pitch that you were expecting. This is called pitch period doubling or octave errors and it occurs essentially because the FFT or autocorrelation has an assumption that the data has constant characteristics over time. If you have an instrument played by a human there will always be some variation.
Some people approach the problem of key recognition as being a matter of doing the pitch recognition first and then finding the key from the sequence of pitches. This is incredibly difficult if you have anything other than a monophonic sequence of pitches. If you do have a monophonic sequence of pitches then it's still not a clear cut method of determining the key: how you deal with chromatic notes, for instance, or determining whether it's major or minor. So you'd need to use a method similar to Krumhansl's key finding algorithm.
So, given the complexity of this approach, an alternative is to look at all the notes being played at the same time. If you have chords, or more than one instruments then you're going to have a rich spectral soup of many sinusoids playing at once. Each individual note is comprised of multiple harmonics a fundamental frequency, so A (at 440Hz) will be comprised of sinusoids at 440, 880, 1320... Furthermore, if you play an E (see this diagram for pitches) then that is 659.25Hz which is almost one and a half times that of A (actually 1.498). This means that every 3rd harmonic of A coincides with every 2nd harmonic of E. This is the reason that chords sound pleasant, because they share harmonics. (as an aside, the whole reason that western harmony works is due to the quirk of fate that the twelfth root of 2 to the power 7 is nearly 1.5)
If you looked beyond this interval of a 5th to major, minor and other chords then you'll find other ratios. I think that many key finding techniques will enumerate these ratios and then fill a histogram for each spectral peak in the signal. So in the case of detecting the chord A5 you would expect to find peaks at 440, 880, 659, 1320, 1760, 1977. For B5 it'll be 494, 988, 741, etc. So create a frequency histogram and for every sinusoidal peak in the signal (eg from the FFT power spectrum) increment the histogram entry. Then for each key A-G tally up the bins in your histogram and the ones with the most entries is most likely to be your key.
That's just a very simple approach but may be enough to find the key of a strummed or sustained chord. You'd also have to chop the signal into small intervals (eg 20ms) and analyse each one to build up a more robust estimate.
EDIT:
If you want to experiment then I'd suggest downloading a package like Octave or CLAM which makes it easier to visualise audio data and run FFTs and other operations.
Other useful links:
- My PhD thesis on some aspects of pitch recognition -- the maths is a bit heavy going but chapter 2 is (I hope) quite an accessible introduction to the different approaches of modelling musical audio
- http://en.wikipedia.org/wiki/Auditory_scene_analysis -- Bregman's Auditory Scene analysis which though not talking about music has some fascinating findings about how we perceive complex scenes
- Dan Ellis has done some great papers in this and similar areas
- Keith Martin has some interesting approaches
Solution 2:
I have worked at the problem of transcribing polyphonic CD recordings into scores for more than two years at university. The problem is notoriously hard. The first scientific papers related to the problem date back to the 1940s and up to today there are no robust solutions for the general case.
All the basic assumption you usually read are not exactly right and most of them are wrong enough that they become unusable for everything but very simple scenarios.
The frequencies of overtones are not multiples of the fundamental frequency - there are non-linear effects so that the high partials drift away from the expected frequency - and not only a few Hertz; it is not unusual to find the 7th partial where you expected the 6th.
Fourier transformations do not play nice with audio analysis because the frequencies one is interested in are spaced logarithmically while the Fourier transformation yields linearly spaced frequencies. At low frequencies you need high frequency resolution to separate neighboring pitches - but this yields bad time resolution and you lose the ability the separate notes played in quick succession.
An audio recording does (probably) not contain all the information needed to reconstruct the score. A large part of our music perception happens in our ears and brain. That is why some of the most successful systems are expert systems with large knowledge repositories about the structure of (western) music that only rely to a small portion on signal processing to extract information from the audio recording.
When I am back home I will look through the papers I have read and pick the 20 or 30 most relevant ones and add them here. I really suggest to read them before you decide to implement something - as stated before most common assumptions are somewhat incorrect and you really don't want to rediscover all this things found and analyzed for more than 50 year while implementing and testing.
It's a hard problem, but it's much fun, too. I would really like to hear what you tried and how well it worked.
For now you may have a look at the Constant Q transform, Cepstrum and Wigner(–Ville) distribution. There are also some good papers on how to extract the frequency from shifts in the phase of short time Fourier spectra - this allows to use very short windows sizes (for high time resolution) because the frequency can be determined with a precision several 1000 times larger than the frequency resolution of the underlying Fourier transformation.
All this transformations fit the problem of audio processing much better than ordinary Fourier transformations. For improving the results of basic transformations have a look at the concept of energy reassignment.
Solution 3:
You can use the Fourier Transform to calculate the frequency spectrum from an audio sample. From this output, you can use the frequency values for particular notes to turn this into a list of notes heard during the sample. Choosing the strongest notes heard per sample over a series of samples should give you a decent map of the different notes used, which you can compare to the different musical scales to get a list of the possible scales that contain that combination of notes.
To help decide which particular scale is being used, make a note (no pun intended) of the most frequently heard notes. In Western music, the root of the scale is typically the most common note heard, followed by the fifth, and then the fourth. You can also look for patterns such as common chords, arpeggios, or progressions.
Sample size will probably be important here. Ideally, each sample will be a single note (so that you don't get two chords in one sample). If you filter out and concentrate on the low frequencies, you may be able to use the volume spikes ("clicks") normally associated with percussion instruments in order to determine the song's tempo and "lock" your algorithm to the beat of the music. Start with samples that are a half-beat in length and adjust from there. Be prepared to throw out some samples that don't have a lot of useful data (such as a sample taken in the middle of a slide).
Solution 4:
As far as I can tell from this article, various keys each have their own common frequencies, so it likely analyzes the audio sample to detect what the most common notes and chords are. After all, you can have multiple keys that have the same configuration of sharps and flats, with the difference being the note that the key starts on and thus the chords that such keys, so it seems how often the significant notes and chords appear would be the only real way you could figure that sort of thing out. I don't really think you can get a layman's explanation of the actual mathematical formulas without leaving out a lot of information.
Do note that this is coming from somebody who has absolutely no experience in this area, with his first exposure being the article linked in this answer.