cocktail party algorithm SVD implementation ... in one line of code?

Solution 1:

I was trying to figure this out as well, 2 years later. But I got my answers; hopefully it'll help someone.

You need 2 audio recordings. You can get audio examples from http://research.ics.aalto.fi/ica/cocktail/cocktail_en.cgi.

reference for implementation is http://www.cs.nyu.edu/~roweis/kica.html

ok, here's code -

[x1, Fs1] = audioread('mix1.wav');
[x2, Fs2] = audioread('mix2.wav');
xx = [x1, x2]';
yy = sqrtm(inv(cov(xx')))*(xx-repmat(mean(xx,2),1,size(xx,2)));
[W,s,v] = svd((repmat(sum(yy.*yy,1),size(yy,1),1).*yy)*yy');

a = W*xx; %W is unmixing matrix
subplot(2,2,1); plot(x1); title('mixed audio - mic 1');
subplot(2,2,2); plot(x2); title('mixed audio - mic 2');
subplot(2,2,3); plot(a(1,:), 'g'); title('unmixed wave 1');
subplot(2,2,4); plot(a(2,:),'r'); title('unmixed wave 2');

audiowrite('unmixed1.wav', a(1,:), Fs1);
audiowrite('unmixed2.wav', a(2,:), Fs1);

enter image description here

Solution 2:

x(t) is the original voice from one channel/microphone.

X = repmat(sum(x.*x,1),size(x,1),1).*x)*x' is an estimation of the power spectrum of x(t). Although X' = X, the intervals between rows and columns are not the same at all. Each row represents the time of the signal, while each column is frequency. I guess this is an estimation and simplification of a more strict expression called spectrogram.

Singular Value Decomposition on spectrogram is used to factorize the signal into different components based on spectrum information. Diagonal values in s are the magnitude of different spectrum components. The rows in u and columns in v' are the orthogonal vectors that map the frequency component with the corresponding magnitude to X space.

I don't have voice data to test, but in my understanding, by means of SVD, the components fall into the similar orthogonal vectors are hopefully be clustered with the help of unsupervised learning. Say, if the first 2 diagonal magnitudes from s are clustered, then u*s_new*v' will form the one-person-voice, where s_new is the same of s except all the elements at (3:end,3:end) are eliminated.

Two articles about the sound-formed matrix and SVD are for your reference.