Why does calculating matrix inverses, roots, etc. using the spectrum of a matrix work?
Two classic books where you can find all the details about this stuff:
- Gantmacher, The Theory of Matrices, Chelsea.
- Lancaster, Tismenetsky, The Theory of Matrices, Academic Press.
Actually, for "hand" calculations, this works through the Jordan canonical form: you find the Jordan canonical form of your matrix, together with the change of basis matrix
$$ A = SJS^{-1} \ . $$
Then you prove that, for any polynomial $p(t)$, you have
$$ p(A) = S p(J) S^{-1} \ . $$
Hence, since $f(A)$ means $p(A)$ for a polynomial $p(t)$ that agrees with $f(t)$ on the spectrum of $A$ (more on this below),
$$ f(A) = S f(J) S^{-1} $$
and you only need to compute $p(J)$ for Jordan matrices $J$.
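To make this concrete, here is a minimal symbolic check of the identity $p(A) = S\,p(J)\,S^{-1}$ with sympy; the matrix $A$ and the polynomial $p$ below are throwaway examples of my own, not taken from the references above.

```python
# Sketch: verify p(A) = S p(J) S^{-1} for one concrete polynomial and matrix.
from sympy import Matrix, eye

A = Matrix([[5, 4],
            [-1, 1]])        # characteristic polynomial (t - 3)^2; A is NOT diagonalizable

S, J = A.jordan_form()       # sympy returns S, J with A = S * J * S**(-1)

def p(M):
    """p(t) = t**3 - 2*t + 1, evaluated at a square matrix M."""
    return M**3 - 2*M + eye(M.rows)

print(p(A) == S * p(J) * S.inv())   # True
```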
You compute $p(J)$ for a Jordan matrix as follows: first, if you have a diagonal block matrix
$$ J = \begin{pmatrix} J_1 & 0 & \dots & 0 \\ 0 & J_2 & \dots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \dots & J_r \end{pmatrix} $$
you can easily prove that
$$ p(J) = \begin{pmatrix} p(J_1) & 0 & \dots & 0 \\ 0 & p(J_2) & \dots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \dots & p(J_r) \end{pmatrix} $$
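For example (a small worked case of my own, with $p(t)=t^2$): if $J_1 = (2)$ is a $1\times 1$ block and $J_2 = \begin{pmatrix} 3 & 0 \\ 1 & 3 \end{pmatrix}$, then
$$ p(J) = \begin{pmatrix} 4 & 0 & 0 \\ 0 & 9 & 0 \\ 0 & 6 & 9 \end{pmatrix} \ , $$
that is, just $p(J_1)$ and $p(J_2)$ sitting along the diagonal.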
So, again, on the one hand,
$$ f(J) = \begin{pmatrix} f(J_1) & 0 & \dots & 0 \\ 0 & f(J_2) & \dots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \dots & f(J_r) \end{pmatrix} $$
and, on the other hand, you just need to know $p(J)$ when $J$ is a single Jordan block. If
$$ J = \begin{pmatrix} \lambda & 0 & 0 & \dots & 0 & 0 \\ 1 & \lambda & 0 & \dots & 0 & 0 \\ 0 & 1 & \lambda & \dots & 0 & 0 \\ \vdots & \vdots & \ddots & \ddots & \vdots & \vdots \\ 0 & 0 & \dots & 1 & \lambda & 0 \\ 0 & 0 & \dots & 0 & 1 & \lambda \end{pmatrix} $$
is an $r\times r$ Jordan block, then
$$ p(J) = \begin{pmatrix} p(\lambda ) & 0 & 0 & \dots & 0 & 0 \\ p'(\lambda)/ 1! & p(\lambda) & 0 & \dots & 0 & 0 \\ p''(\lambda)/ 2! & p'(\lambda)/ 1! & p(\lambda) & \dots & 0 & 0 \\ \vdots & \vdots & \ddots & \ddots & \vdots & \vdots \\ p^{(r-2)}(\lambda)/(r-2)! &p^{(r-3)}(\lambda)/(r-3)! & \dots & p'(\lambda)/ 1! & p(\lambda) & 0 \\ p^{(r-1)}(\lambda)/(r-1)! &p^{(r-2)}(\lambda)/(r-2)! & \dots & p''(\lambda)/2! & p'(\lambda)/ 1! & p(\lambda) \end{pmatrix} $$
Hence, again, everything is in fact expressed in terms of $f$:
$$ f(J) = \begin{pmatrix} f(\lambda ) & 0 & 0 & \dots & 0 & 0 \\ f'(\lambda)/ 1! & f(\lambda) & 0 & \dots & 0 & 0 \\ f''(\lambda)/ 2! & f'(\lambda)/ 1! & f(\lambda) & \dots & 0 & 0 \\ \vdots & \vdots & \ddots & \ddots & \vdots & \vdots \\ f^{(r-2)}(\lambda)/(r-2)! &f^{(r-3)}(\lambda)/(r-3)! & \dots & f'(\lambda)/ 1! & f(\lambda) & 0 \\ f^{(r-1)}(\lambda)/(r-1)! &f^{(r-2)}(\lambda)/(r-2)! & \dots & f''(\lambda)/2! & f'(\lambda)/ 1! & f(\lambda) \end{pmatrix} $$
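As a quick worked illustration (my own example, not from the books above): for $f(t) = e^t$ and a $3\times 3$ Jordan block $J$ with eigenvalue $\lambda$, every derivative equals $e^\lambda$, so
$$ f(J) = e^J = \begin{pmatrix} e^\lambda & 0 & 0 \\ e^\lambda & e^\lambda & 0 \\ e^\lambda/2 & e^\lambda & e^\lambda \end{pmatrix} = e^\lambda \begin{pmatrix} 1 & 0 & 0 \\ 1 & 1 & 0 \\ 1/2 & 1 & 1 \end{pmatrix} \ . $$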
And, in this version of the story, you don't actually need to know the polynomial $p(t)$ associated with your function $f(t)$ and matrix $A$. It is not difficult to find it anyway: it's called the Lagrange-Sylvester polynomial, which is a sort of mixture between the classic Lagrange interpolation polynomial and a Taylor expansion.
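A hand-made illustration of what that polynomial looks like: if $A$ is $2\times 2$ with a single Jordan block for the eigenvalue $\lambda$, then $p$ only has to satisfy $p(\lambda) = f(\lambda)$ and $p'(\lambda) = f'(\lambda)$, so you can take
$$ p(t) = f(\lambda) + f'(\lambda)(t - \lambda) \ , \qquad\text{hence}\qquad f(A) = p(A) = f(\lambda)\, I + f'(\lambda)\,(A - \lambda I) \ . $$
If instead $A$ is diagonalizable with distinct eigenvalues $\lambda_1, \dots, \lambda_n$, then $p$ is just the classic Lagrange interpolation polynomial through the points $(\lambda_i, f(\lambda_i))$.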
EDIT
Nevertheless, it seems that I forgot to answer the most important question: "Why does this whole thing actually work?"
That is, why does defining
$$ f(A) = p(A) $$
for some polynomial $p(t)$ that agrees with $f(t)$ on the spectrum of $A$ make sense at all? In other words, why can we legitimately call $p(A)$ (the computable thing) the "value" of $f(t)$ at the matrix $A$?
Because of the following:
Theorem. (Gantmacher, chapter V, $\S 4$, theorem 2). If the function $f(t)$ can be expanded in a power series in the circle $\vert t - \lambda \vert < r$,
$$ f(t) = \sum_{n=0}^\infty a_n(t-\lambda)^n \ , $$
then this expansion remains valid when the scalar argument $t$ is replaced by a matrix $A$ whose eigenvalues lie within the circle of convergence.
That is, under the hypotheses of the theorem, you have
$$ f(A) = \sum_{n=0}^\infty a_n(A-\lambda I)^n \ , $$
where the $f(A)$ on the left-hand side means $p(A)$, the value of the Lagrange-Sylvester polynomial at $A$.
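Here is a small numerical illustration of the theorem (with $\lambda = 0$ and $f(t) = e^t$); the matrix $A$ is my own example, and scipy's matrix exponential serves as the reference value.

```python
# Partial sums of the Taylor series  sum_n A**n / n!  converge to expm(A).
import numpy as np
from scipy.linalg import expm

A = np.array([[0.0, 1.0],
              [-1.0, 0.0]])           # eigenvalues +/- i; e^t is entire, so any A would do

partial_sum = np.zeros_like(A)
term = np.eye(2)                      # current term A**n / n!, starting at n = 0
for n in range(1, 30):
    partial_sum += term
    term = term @ A / n

print(np.allclose(partial_sum, expm(A)))   # True
```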
So, why not define $f(A)$ as this last power series (i.e., the Taylor series of $f(t)$)? Well, because then you would have to spend a long time on convergence issues of matrix series... And you would end up, finally, at the same point: relying on the Jordan canonical form for actual computations. So the Lagrange-Sylvester device allows you to get rid of convergence issues, provided you're willing to accept $f(A) = p(A)$ as a sound definition.
When the matrix $A$ is not diagonalizable, working with the "spectrum" becomes technically involved, but even then, via Cauchy's integral formula from complex analysis, you can make sense of $\sin(A)$, etc., in a single formula not involving any eigenvalues or diagonalization process: when all eigenvalues of $A$ are contained in the disk $D_R$ of radius $R$ around $0$, then $$\sin(A)={1\over2\pi i}\int_{\partial D_R}{\sin z\over z-A}\ dz\ ,$$ where ${1\over z-A}$ stands for the resolvent $(zI-A)^{-1}$.
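Here is a rough numerical check of that contour formula, discretizing $\partial D_R$ with equally spaced points; the matrix $A$ is my own example (double eigenvalue $3$, not diagonalizable), so by the Jordan-block formula above the exact value is $\sin(A) = \sin(3)\,I + \cos(3)\,(A - 3I)$.

```python
# Approximate (1 / 2 pi i) * integral over |z| = R of  sin(z) (zI - A)^{-1} dz.
import numpy as np

A = np.array([[5.0, 4.0],
              [-1.0, 1.0]])                  # double eigenvalue 3, not diagonalizable
I2 = np.eye(2)
R, N = 10.0, 2000                            # contour radius and number of sample points

theta = 2 * np.pi * np.arange(N) / N
z = R * np.exp(1j * theta)                   # points on the circle |z| = R
dz = 1j * z * (2 * np.pi / N)                # dz = i R e^{i theta} d theta

integral = sum(np.sin(zk) * np.linalg.inv(zk * I2 - A) * dzk for zk, dzk in zip(z, dz))
sin_A = (integral / (2j * np.pi)).real

print(np.allclose(sin_A, np.sin(3) * I2 + np.cos(3) * (A - 3 * I2)))   # True
```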
Back to the basic question of what $f(A)$ should mean: in the one-dimensional case, a linear map $A:\ {\mathbb R}\to{\mathbb R}$ is just a scaling $t\mapsto \lambda t$ by a certain factor $\lambda$, and the only reasonable meaning $f(A)$ could have is that $f(A)$ scales by the factor $f(\lambda)$.
In the $n$-dimensional diagonalizable case the space $V={\mathbb R}^n$ where $A$ acts can be written as a direct sum $V=\oplus_{i=1}^n V_i$ of the one-dimensional eigenspaces $V_i:=\{x| Ax=\lambda_i x\}$. This means that $A$ acts as $n$ independent one-dimensional scalings $t\mapsto \lambda_i t$ in parallel. If you are given a function $f$ whose domain $\Omega\subset{\mathbb C}$ contains the $\lambda_i$, then $f(A)$ restricted to one of the $V_i$ is just the map $$f(A)\restriction V_i: \quad t\mapsto f(\lambda_i) t\ ,$$ where $t$ is a "local" coordinate in $V_i$, denoted by $t_i$ when we look at all of these at the same time.
Now, since all this makes sense when we look at $A$ as being a linear map $V\to V$, by general principles this also makes sense in the world of matrices. In particular: If $T$ is the matrix whose columns are the eigenvectors of $A$, i.e., the basis vectors of the $V_i$ above, and $D={\rm diag}(\lambda_1,\ldots,\lambda_n)$ is the diagonal matrix containing the eigenvalues, then $$f(A)=T\ {\rm diag}\bigl(f(\lambda_1),\ldots,f(\lambda_n)\bigr)\ T^{-1}\ .$$
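A minimal numerical sketch of this last formula, with a diagonalizable matrix of my own choosing, $f(t) = e^t$, and scipy's matrix exponential as the reference:

```python
# Diagonalizable case: f(A) = T diag(f(lambda_1), ..., f(lambda_n)) T^{-1}.
import numpy as np
from scipy.linalg import expm

A = np.array([[4.0, 1.0],
              [2.0, 3.0]])                   # distinct eigenvalues 2 and 5, hence diagonalizable

lam, T = np.linalg.eig(A)                    # columns of T are eigenvectors of A
f_of_A = T @ np.diag(np.exp(lam)) @ np.linalg.inv(T)

print(np.allclose(f_of_A, expm(A)))          # True: agrees with the matrix exponential
```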