Python audio analysis: find real time values of the strongest beat in each meter

You would do well to call the process of parsing audio to identify its sonic archetypes acoustic fingerprinting

Audio has a time dimension so to witness your "major sounds" requires listening to the audio for a period of time ... across a succession of instantaneous audio samples. Audio can be thought of as a time series curve where for each instant in time you record the height of the audio curve digitized into PCM format. It takes wall clock time to hear a given "major sound". Here your audio is in its natural state in the time domain. However the information load of a stretch of audio can be transformed into its frequency domain counterpart by feeding a window of audio samples into a fft api call ( to take its Fourier Transform ).

A powerfully subtle aspect of taking the FFT is it removes the dimension of time from the input data and replaces it with a distillation while retaining the input information load. As an aside, if the audio is periodic once transformed from the time domain into its frequency domain representation by applying a Fourier Transform, it can be reconstituted back into the same identical time domain audio curve by applying an inverse Fourier Transform. The data which began life as a curve which wobbles up and down over time is now cast as a spread of frequencies each with an intensity and phase offset yet critically without any notion of time. Now you have the luxury to pluck from this static array of frequencies a set of attributes which can be represented by a mundane struct data structure and yet imbued by its underlying temporal origins.

Here is where you can find your "major sounds". To a first approximation you simply stow the top X frequencies along with their intensity values and this is a measure of a given stretch of time of your input audio captured as its "major sound". Once you have a collection of "major sounds" you can use this to identify when any subsequent audio contains an occurrence of a "major sound" by performing a difference match test between your pre stored set of "major sounds" and the FFT of the current window of audio samples. You have found a match when there is little or no difference between the frequency intensity values of each of those top X frequencies of the current FFT result compared against each pre stored "major sound"

I could digress by explaining how by sitting down and playing the piano you are performing the inverse Fourier Transform of those little white and black frequency keys, or by saying the muddied wagon tracks across a spring rain swollen pasture is the Fourier Transform of all those untold numbers of heavily laden market wagons as they trundle forward leaving behind an ever deepening track imprinted with each wagon's axle width, but I won't.

Here are some links to audio fingerprinting

Audio fingerprinting and recognition in Python https://github.com/worldveil/dejavu

Audio Fingerprinting with Python and Numpy http://willdrevo.com/fingerprinting-and-audio-recognition-with-python/

Shazam-like acoustic fingerprinting of continuous audio streams (github.com) https://news.ycombinator.com/item?id=15809291

https://github.com/dest4/stream-audio-fingerprint

Audio landmark fingerprinting as a Node Stream module - nodejs converts a PCM audio signal into a series of audio fingerprints. https://github.com/adblockradio/stream-audio-fingerprint

https://stackoverflow.com/questions/26357841/audio-matching-audio-fingerprinting