Isolating frequencies and phases of superposed cosine functions

First of all, discrete Fourier series do not really help here since these functions are not even periodic in general. Instead, the parameters can be recovered by using a linear recurrence. If you sample such a function at equally spaced points, then any five consecutive values satisfy a fixed linear relation. So take $$f(k) = A \cos(\omega_1 k + \varphi_1) + B \cos(\omega_2 k + \varphi_2)$$ for $k \in \mathbb Z$. Assume the generic case where $\cos(\omega_1) \neq \cos(\omega_2)$ and both are unequal to $\pm 1$. Then there exist unique $c_1, c_2 \in \mathbb R$ such that $$f(k) + c_1 f(k+1) + c_2 f(k+2) + c_1 f(k+3) + f(k+4) = 0$$ for all $k\in \mathbb Z$. In particular, these coefficients can be derived from any six consecutive values of $f$ by solving two linear equations in $c_1, c_2$. The characteristic polynomial of this recurrence factors as $$1 + c_1 x + c_2 x^2 + c_1 x^3 + x^4 = (x^2 - 2 \cos(\omega_1) x + 1)(x^2 - 2 \cos(\omega_2) x + 1),$$ so that $c_1 = -2(\cos(\omega_1) + \cos(\omega_2))$ and $c_2 = 2 + 4\cos(\omega_1)\cos(\omega_2)$. This recovers $\cos(\omega_1)$ and $\cos(\omega_2)$ as the solutions of the quadratic equation $$4x^2 + 2 c_1 x + c_2 - 2 = 0.$$ Now $f$ can be fitted as a linear combination of the four functions $$\cos(\omega_1 k), \sin(\omega_1 k), \cos(\omega_2 k), \sin(\omega_2 k),$$ noting that $$\cos(\omega k + \varphi) = \cos(\varphi) \cos(\omega k) - \sin(\varphi) \sin(\omega k).$$
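
Below is a minimal numerical sketch of this procedure in Python/NumPy, assuming noise-free samples at integer points; the amplitudes, frequencies and phases are made-up values chosen only for illustration.

```python
import numpy as np

# Hypothetical example parameters, chosen only for illustration.
A, B = 1.3, 0.7          # amplitudes
w1, w2 = 0.9, 2.1        # frequencies
p1, p2 = 0.4, -1.1       # phases

k = np.arange(12)
f = A * np.cos(w1 * k + p1) + B * np.cos(w2 * k + p2)

# Two instances of the recurrence
#   f(k) + c1 f(k+1) + c2 f(k+2) + c1 f(k+3) + f(k+4) = 0   (k = 0, 1)
# give two linear equations in (c1, c2) from the six samples f(0), ..., f(5).
M = np.array([[f[1] + f[3], f[2]],
              [f[2] + f[4], f[3]]])
rhs = -np.array([f[0] + f[4], f[1] + f[5]])
c1, c2 = np.linalg.solve(M, rhs)

# cos(w1) and cos(w2) are the roots of 4x^2 + 2 c1 x + (c2 - 2) = 0.
disc = np.sqrt((2 * c1) ** 2 - 16 * (c2 - 2))
cos_w = (-2 * c1 + np.array([1.0, -1.0]) * disc) / 8
w_hat = np.sort(np.arccos(cos_w))          # recovered frequencies in [0, pi]

# Fit f as a linear combination of cos/sin at the recovered frequencies.
X = np.column_stack([np.cos(w_hat[0] * k), np.sin(w_hat[0] * k),
                     np.cos(w_hat[1] * k), np.sin(w_hat[1] * k)])
coef, *_ = np.linalg.lstsq(X, f, rcond=None)
print(w_hat)   # ~ [0.9, 2.1]
print(coef)    # ~ [A cos(p1), -A sin(p1), B cos(p2), -B sin(p2)]
```

With noisy samples, the two equations for $c_1, c_2$ could instead be solved in a least-squares sense over more than six consecutive values.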


This is a standard question and I believe there are many solution approaches. I like the Bayesian techniques described in the book "Bayesian Spectrum Analysis and Parameter Estimation" by Larry Bretthorst (who was a PhD student of E. T. Jaynes) because they deal with noise quite gracefully.

The basic setting is the following: the observed data is $(x_1, y_1), \dots, (x_n, y_n)$, for which we use the model $$y_i = f(x_i) + \epsilon_i$$ where $$f(x) = A_1 \cos(\omega_1 x + \Phi_1) + A_2 \cos(\omega_2 x + \Phi_2)$$ and $\epsilon_1, \dots, \epsilon_n$ are errors, commonly modeled as independently distributed according to a zero-mean normal distribution $N(0, \sigma^2)$. One can assume that $x_1, \dots, x_n$ (which need not be equally spaced) are deterministic (or one can do the analysis conditional on the observed $x_1, \dots, x_n$). The goal is to recover the unknown parameters $\omega_1, \omega_2, A_1, A_2, \Phi_1, \Phi_2$ (note that the noise level $\sigma$ is also an unknown parameter) from the data $(x_1, y_1), \dots, (x_n, y_n)$.

It is convenient to reparametrize the function $f$ as $$f(x) = a_1 \cos(\omega_1 x) + b_1 \sin(\omega_1 x) + a_2 \cos(\omega_2 x) + b_2 \sin (\omega_2 x)$$ where $a_i = A_i \cos(\Phi_i)$ and $b_i = -A_i \sin(\Phi_i)$. The model can then be rewritten as $$Y = X(\omega_1, \omega_2) \beta + \epsilon$$ where $Y$ is the vector with components $y_1, \dots, y_n$, $X$ is the $n \times 4$ matrix whose columns are the evaluations of the four functions $\cos(\omega_1 x)$, $\sin(\omega_1 x)$, $\cos(\omega_2 x)$ and $\sin(\omega_2 x)$ at $x = x_1, \dots, x_n$, $\beta$ is the $4 \times 1$ vector with components $a_1, b_1, a_2, b_2$, and $\epsilon$ is the $n \times 1$ noise vector with components $\epsilon_1, \dots, \epsilon_n$. If $\omega_1$ and $\omega_2$ are known, this is a simple linear model, which leads to easy estimation of $a_1, b_1, a_2, b_2$ by least squares. When $\omega_1$ and $\omega_2$ are unknown, this is a nonlinear model.
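
As a concrete sketch of the known-frequency case (the data, frequencies and coefficients below are made up purely for illustration), the design matrix $X(\omega_1, \omega_2)$ and the least-squares estimate of $\beta$ can be formed as follows:

```python
import numpy as np

def design_matrix(omega1, omega2, x):
    """n x 4 matrix with columns cos(w1 x), sin(w1 x), cos(w2 x), sin(w2 x)."""
    return np.column_stack([np.cos(omega1 * x), np.sin(omega1 * x),
                            np.cos(omega2 * x), np.sin(omega2 * x)])

# Simulated data with made-up parameters (x need not be equally spaced).
rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 20, size=100))
omega1, omega2, sigma = 1.1, 2.7, 0.3
beta_true = np.array([1.0, -0.5, 0.8, 0.2])           # (a1, b1, a2, b2)
y = design_matrix(omega1, omega2, x) @ beta_true + sigma * rng.normal(size=x.size)

# With known frequencies the model is linear: estimate beta by least squares.
beta_hat, *_ = np.linalg.lstsq(design_matrix(omega1, omega2, x), y, rcond=None)
print(beta_hat)   # close to beta_true
```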

Now, to estimate the unknown parameters in a Bayesian approach, we need to make prior probabilistic assumptions about the unknown parameters $\omega_i, a_i, b_i$ (for $i=1,2$) and $\sigma$. A simple assumption reflecting prior ignorance about the values of these parameters is: $$\omega_1, a_1, b_1, \omega_2, a_2, b_2, \log \sigma \overset{\text{i.i.d.}}{\sim} \text{Unif}(-C, C)$$ where $C$ is a very large positive constant (here Unif refers to the uniform distribution). The exact value of $C$ should not have a major effect on the results below (as long as $C$ is large).

The modeling assumptions allow us to calculate the conditional distribution of the parameters $$\theta = (\omega_1, \omega_2, a_1, b_1, a_2, b_2, \sigma) = (\omega_1, \omega_2, \beta, \sigma)$$ given the observed data (this is called the posterior distribution) using Bayes' rule: $$\begin{aligned}& p(\theta \mid \text{data}) \propto p(\theta) p(\text{data}\mid \theta) \\ & = \frac{(2C)^{-7}}{\sigma} I\{-C < \omega_1,\omega_2, a_1, b_1, a_2, b_2, \log \sigma < C\} \\ &\quad\, \times \left(\frac{1}{\sqrt{2\pi}\sigma} \right)^n \exp \left(-\frac{1}{2\sigma^2} \|Y - X(\omega_1, \omega_2) \beta\|^2 \right) \\ &\propto I\{-C < \omega_1,\omega_2, a_1, b_1, a_2, b_2, \log \sigma < C\} \sigma^{-n-1} \exp \left(-\frac{1}{2\sigma^2} \|Y - X(\omega_1, \omega_2) \beta\|^2 \right). \end{aligned}$$ This posterior density (suitably normalized so it integrates to one) captures uncertainty in the parameters $\theta$ after observing the data. The indicator term is generally irrelevant, as the term $$\sigma^{-n-1} \exp \left(-\frac{1}{2\sigma^2} \|Y - X(\omega_1, \omega_2) \beta\|^2 \right)$$ will usually be negligibly small outside the set $-C < \omega_1,\omega_2, a_1, b_1, a_2, b_2, \log \sigma < C$. Let us drop the indicator to get the simpler expression $$p(\theta \mid \text{data}) \propto \sigma^{-n-1} \exp \left(-\frac{1}{2\sigma^2} \|Y - X(\omega_1, \omega_2) \beta\|^2 \right).$$

If we are not interested in certain parameters, we can simply integrate them away. Suppose we are only interested in the frequency parameters $\omega_1$ and $\omega_2$; we can then integrate out $\beta$ and $\sigma$. To integrate over $\beta$, write (with $X = X(\omega_1, \omega_2)$): $$\|Y - X \beta\|^2 = \|Y - X \hat{\beta}\|^2 + \|X \hat{\beta} - X \beta\|^2 = \|Y - X \hat{\beta}\|^2 + (\beta - \hat{\beta})^T X^T X (\beta - \hat{\beta})$$ where $\hat{\beta} = (X^T X)^{-1} X^T Y$ is the least squares estimate from regressing $Y$ on $X$, and also note $$\int \exp\left(-\frac{1}{2 \sigma^2} (\beta - \hat{\beta})^T X^T X (\beta - \hat{\beta}) \right) d\beta = (\sigma\sqrt{2 \pi})^p |X^T X|^{-1/2}$$ where $p = 4$ (assuming that $X$ has full column rank; otherwise, one can drop some columns of $X$) and $|\cdot|$ denotes the determinant. This gives $$p(\omega_1, \omega_2, \sigma \mid \text{data}) \propto \sigma^{-n+p-1} \exp \left(-\frac{\|Y - X \hat{\beta}\|^2}{2 \sigma^2}\right) |X^TX|^{-1/2}.$$ Note that the dependence on $\omega_1, \omega_2$ on the right hand side is through the matrix $X = X(\omega_1, \omega_2)$ and the vector $\hat{\beta}$. Next, $\sigma$ can easily be integrated away from the above (using $\int_0^\infty \sigma^{-m-1} e^{-S/(2\sigma^2)} \, d\sigma \propto S^{-m/2}$, here with $m = n - p$ and $S = \|Y - X \hat{\beta}\|^2$) to obtain $$p(\omega_1, \omega_2 \mid \text{data}) \propto \|Y - X \hat{\beta}\|^{-n+p} |X^T X|^{-1/2}.$$ This density allows estimation of $\omega_1, \omega_2$ as well as uncertainty quantification. Practically, this can be implemented as follows (see the code sketch after the list):

  1. Take a grid of values for $\omega_1, \omega_2$. For each value of $\omega_1, \omega_2$ in the grid, form the matrix $X = X(\omega_1, \omega_2)$ and perform least squares of $Y$ on $X$ to calculate $\hat{\beta}$. Then calculate the value of $h(\omega_1, \omega_2) := \|Y - X \hat{\beta}\|^{-n+p} |X^T X|^{-1/2}$. Do this for each pair of values $\omega_1,\omega_2$ in the grid and then normalize $h(\cdot, \cdot)$ so it sums to one over all $\omega_1,\omega_2$ in the grid.
  2. Plot the function $h(\omega_1, \omega_2)$ for $\omega_1,\omega_2$ in the grid. Typically, it will have a single sharp maximum at some $\hat{\omega}_1, \hat{\omega}_2$. You can then take $\hat{\omega}_1, \hat{\omega}_2$ to be accurate point estimates of $\omega_1, \omega_2$.
  3. After obtaining $\hat{\omega}_1, \hat{\omega}_2$, do least squares of $Y$ on $\hat{X} := X(\hat{\omega}_1, \hat{\omega}_2)$ to obtain estimates of $a_1, b_1, a_2, b_2$.
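
Here is a rough implementation sketch of these three steps, using simulated data with made-up parameters and working with $\log h$ for numerical stability:

```python
import numpy as np

def design_matrix(omega1, omega2, x):
    return np.column_stack([np.cos(omega1 * x), np.sin(omega1 * x),
                            np.cos(omega2 * x), np.sin(omega2 * x)])

def log_h(omega1, omega2, x, y, p=4):
    """log of ||Y - X beta_hat||^(-n+p) |X^T X|^(-1/2), up to an additive constant."""
    X = design_matrix(omega1, omega2, x)
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    _, logdet = np.linalg.slogdet(X.T @ X)
    n = len(y)
    return (-n + p) * np.log(np.linalg.norm(y - X @ beta_hat)) - 0.5 * logdet

# Simulated data with made-up parameters.
rng = np.random.default_rng(1)
x = np.sort(rng.uniform(0, 20, size=200))
y = (1.0 * np.cos(1.1 * x + 0.3) + 0.6 * np.cos(2.7 * x - 0.8)
     + 0.3 * rng.normal(size=x.size))

# Step 1: evaluate log h on a grid (restrict to w1 < w2 to drop the symmetric duplicate).
grid = np.linspace(0.05, 3.2, 200)
logh = np.full((grid.size, grid.size), -np.inf)
for i, w1 in enumerate(grid):
    for j, w2 in enumerate(grid):
        if w1 < w2:
            logh[i, j] = log_h(w1, w2, x, y)
h = np.exp(logh - logh.max())
h /= h.sum()                                  # normalized over the grid

# Step 2: point estimates at the maximum of h.
i, j = np.unravel_index(np.argmax(h), h.shape)
w1_hat, w2_hat = grid[i], grid[j]

# Step 3: least squares at the estimated frequencies gives (a1, b1, a2, b2).
beta_hat, *_ = np.linalg.lstsq(design_matrix(w1_hat, w2_hat, x), y, rcond=None)
print(w1_hat, w2_hat, beta_hat)
```

Since $h$ is symmetric in $(\omega_1, \omega_2)$, the grid is restricted to $\omega_1 < \omega_2$; a finer grid (or a local optimizer started at the grid maximum) can be used to refine the estimates.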

This method generalizes in a straightforward manner to deal with a larger number $k$ of sinusoids, $\sum_{j=1}^k A_j \cos(\omega_j x + \Phi_j)$ (although the grid search for estimating $\omega_1, \dots, \omega_k$ will be computationally intensive for larger $k$). The book ("Bayesian Spectrum Analysis and Parameter Estimation" by Bretthorst) also deals with the case of unknown $k$.