Gradient descent on non-convex function works. How?
Solution 1:
I was able to reach Brandyn Webb (aka S. Funk) and he clarified some of my questions. The ideas that led to the SGD approach to SVD can be found in this paper
http://courses.cs.washington.edu/courses/cse528/09sp/sanger_pca_nn.pdf
where Sanger, building on work of Oja
http://users.ics.aalto.fi/oja/Oja1982.pdf
describes how to find the top N eigenvectors with a method called the Generalized Hebbian Algorithm (GHA), implemented as a neural network, and proves that it converges with probability 1. That is, training this particular network ends up doing PCA (very cool).
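To make the GHA concrete, here is a minimal sketch of Sanger's update rule on toy data (my own illustration; the dimensions, learning rate, and synthetic covariance are made up): each neuron learns Hebbian-style minus the projections onto the neurons before it, and the weight rows settle on the top eigenvectors of the data covariance.

    import numpy as np

    rng = np.random.default_rng(0)

    # Toy zero-mean data with a known covariance: 3 strong directions, rest weak.
    n, k, n_samples = 10, 3, 20000
    basis = np.linalg.qr(rng.normal(size=(n, n)))[0]
    scales = np.array([5.0, 3.0, 2.0] + [0.5] * (n - 3))
    X = (rng.normal(size=(n_samples, n)) * scales) @ basis.T

    W = rng.normal(scale=0.1, size=(k, n))       # k output neurons, n inputs
    lr = 1e-3
    for x in X:
        y = W @ x                                # neuron outputs
        # Sanger's rule: Hebbian term minus projections onto "earlier" neurons.
        W += lr * (np.outer(y, x) - np.tril(np.outer(y, y)) @ W)

    # Rows of W should align with the top-k eigenvectors of the data covariance.
    C = X.T @ X / n_samples
    top = np.linalg.eigh(C)[1][:, ::-1][:, :k]
    for i in range(k):
        print(f"|<w_{i}, e_{i}>| = {abs(W[i] @ top[:, i]):.3f}")   # close to 1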
Webb was researching NNs at the time (way before Netflix):
I was looking at a reciprocal arrangement between neurons in cortex and had worked out an implied learning algorithm which struck me as something that might have an analytical solution. [..] My original incremental formulation (not too different) from a few years before that was, rather than a gradient, just an expansion of matrix multiplication based on the matrix in question being the sum of outer products of individual sample vectors, and was, therefore, exactly the same as a common numeric approach to finding eigenvectors (just keep squaring the matrix until it settles to many copies of your first eigenvector [i.e. the power method]; then repeat with that one removed, and so on) -- for sufficiently small learning rate, anyway... The result is practically the same as the gradient derivation; it is my assumption that it has the same properties.
Later Gorrell and Webb realized the formulation looked like it was solving an SVD problem (again, before the Netflix Prize). Webb first started using the SGD derivation for Netflix; he mentioned that he "went for the direct gradient method on the spur of the moment because it was faster to derive that on a napkin than to find my old paper".
The question can be asked: "how much SVD is Funk SVD?". Some on the Net call it "Funk's heuristic approach to SVD" and say that "in missing data form it's a factor model not SVD". Another says "in the case of a partially observed matrix [..] the factorized solution U V' is not truly an SVD – there is no constraint that the matrices U and V be orthogonal".
Webb does not agree. To the first two, he responded:
I'd personally probably just call it incremental approximate SVD. Factor model is a broad category; it's useful to retain reference to the connection to SVD even if it's no longer SVD since it helps (down the road) to understand some of the properties.
and to the last:
Any regularization, whether implicit or explicit (implicit via not over-training) likely violates orthogonality, but if the underlying algorithm is run to convergence with adequately low learning rate and no regularization, it will settle on orthogonal matrices. Look to Sanger's GHA proof for this, but it's trivially understandable as reverse inhibition removing projections of existing vectors from the dataset which means the residual gradient necessarily slides into a subspace that lacks any projection on the other vectors (i.e., orthogonal).
Once you add regularization, or any of the non-linearities that the incremental approach allows you to conveniently insert, then all bets are off. However, assuming the primary gradient is still the majority influence, you can guarantee a degree of orthogonality, ergo the "approximate SVD" depiction.
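As a sanity check of that orthogonality argument (my own toy sketch, not Webb's code, on a fully observed matrix with no regularization): fit one rank-1 factor by plain gradient descent, subtract it from the data, fit a second factor on the residual, and the two left vectors come out nearly orthogonal.

    import numpy as np

    rng = np.random.default_rng(0)
    A = rng.normal(size=(40, 30))            # fully observed toy matrix

    def fit_rank1(R, lr=0.01, steps=20000):
        """Plain gradient descent on ||R - u v^T||^2 for a single factor pair."""
        u = rng.normal(scale=0.1, size=R.shape[0])
        v = rng.normal(scale=0.1, size=R.shape[1])
        for _ in range(steps):
            E = R - np.outer(u, v)           # residual error
            u, v = u + lr * E @ v, v + lr * E.T @ u
        return u, v

    # Fit factors one at a time, removing each from the data before the next
    # (the "inhibition" Webb describes).
    u1, v1 = fit_rank1(A)
    u2, v2 = fit_rank1(A - np.outer(u1, v1))

    cos = abs(u1 @ u2) / (np.linalg.norm(u1) * np.linalg.norm(u2))
    print(f"|cos(u1, u2)| = {cos:.1e}")      # close to zero, i.e. nearly orthogonal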
Overall, we can say there are connections between the GHA, the power method, the gradient-based solution, and SVD; it is perhaps because of this that Webb could dive into a solution without being scared off by the non-convexity of the problem.
Whatever route was taken during the invention of this method, however, we can look at the error function for recommendations and see through its gradients that SGD can minimize it, which empirical tests confirm. The error function is non-convex, but, as Bottou and LeCun suggest, the absence of theory should not stop us from using a method. Moreover, people have started looking at stochastic approaches, and their theoretical guarantees are much closer now, as seen here
http://arxiv.org/pdf/1308.3509
The paper above also discusses the connection between the power method and SGD, by the way.
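For concreteness, here is a minimal sketch of that error function and the SGD step in the spirit of Funk's method (my own toy code; the synthetic data, names, and hyperparameters are made up, and Funk's original implementation trained one feature at a time rather than all k at once).

    import numpy as np

    rng = np.random.default_rng(0)

    # Synthetic (user, item, rating) triples; only ~30% of the matrix is observed.
    n_users, n_items, k = 50, 40, 3
    U_true = rng.normal(size=(n_users, k))
    V_true = rng.normal(size=(n_items, k))
    obs = [(u, i, float(U_true[u] @ V_true[i]))
           for u in range(n_users) for i in range(n_items) if rng.random() < 0.3]

    # Objective: sum over observed (u, i) of (r_ui - p_u . q_i)^2 plus an L2 penalty.
    # It is non-convex in (P, Q) jointly, yet SGD on it works well in practice.
    P = rng.normal(scale=0.1, size=(n_users, k))
    Q = rng.normal(scale=0.1, size=(n_items, k))
    lr, reg = 0.01, 0.02

    for epoch in range(50):
        rng.shuffle(obs)
        for u, i, r in obs:
            e = r - P[u] @ Q[i]                   # error on one observed rating
            pu = P[u].copy()                      # snapshot before updating
            P[u] += lr * (e * Q[i] - reg * pu)    # gradient step for the user factors
            Q[i] += lr * (e * pu - reg * Q[i])    # gradient step for the item factors

    rmse = np.sqrt(np.mean([(r - P[u] @ Q[i]) ** 2 for u, i, r in obs]))
    print(f"train RMSE: {rmse:.3f}")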
Note: the power method finds a single eigenvalue/eigenvector pair, namely the eigenvector for the eigenvalue of largest magnitude. It can, however, be used to find all the other eigenvectors by "removing" the newly found eigenvector from the matrix A (through a process called deflation) and re-running the power method on A, which yields the second eigenvector, and so on; a small sketch follows the links below.
http://www.maths.qmul.ac.uk/~wj/MTH5110/notes/MAS235_lecturenotes1.pdf
http://heim.ifi.uio.no/~tom/powerandqrslides.pdf
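Here is a small sketch of that note (my own illustration; the matrix is a made-up symmetric positive semidefinite one so the eigenvalues come out largest first):

    import numpy as np

    rng = np.random.default_rng(0)

    B = rng.normal(size=(6, 6))
    A = B @ B.T                              # symmetric PSD test matrix

    def power_method(M, steps=2000):
        """Repeatedly apply M to a random vector; it settles on the top eigenvector."""
        v = rng.normal(size=M.shape[0])
        for _ in range(steps):
            v = M @ v
            v /= np.linalg.norm(v)
        return v, v @ M @ v                  # eigenvector and its eigenvalue

    M = A.copy()
    for i in range(A.shape[0]):
        v, lam = power_method(M)
        print(f"eigenvalue {i + 1}: {lam:.4f}")
        M = M - lam * np.outer(v, v)         # "remove" the found pair (deflation)

    print("numpy check:", np.sort(np.linalg.eigvalsh(A))[::-1].round(4))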