How to follow matrix operations in proofs?

This is a good question which you have already answered for yourself.

Mathematicians do this a lot: carefully write things down, and try playing with concrete matrices.

After a while (a long while sometimes) you see how it works. Then you can parse similar expressions more quickly.

There really is no shortcut.


Let's visualize it.

We have the scalar expression:
$$\boldsymbol\theta^T \mathbf x_i - y_i = \begin{bmatrix}&&\boldsymbol\theta^T&&\end{bmatrix}\begin{bmatrix} \\ \\ \mathbf x_i \\ \\ \\ \end{bmatrix} - y_i \tag 1$$

Transposing this expression gives the same scalar:
$$(\boldsymbol\theta^T \mathbf x_i - y_i)^T = \mathbf x_i^T \boldsymbol\theta - y_i = \begin{bmatrix} & & \mathbf x_i^T & & \end{bmatrix}\begin{bmatrix}\\\\\boldsymbol\theta\\\\\\\end{bmatrix} - y_i \tag 2$$

Stack the rows $\mathbf x_i^T$ into a matrix $X$ to extend this into matrices and vectors:
$$\begin{bmatrix}\\X \boldsymbol\theta - \mathbf y \\\\\end{bmatrix} = \begin{bmatrix} & & \mathbf x_1^T & & \\ &&\vdots\\ & & \mathbf x_n^T & & \\\end{bmatrix} \begin{bmatrix}\\\\\boldsymbol\theta\\\\\\\end{bmatrix} - \begin{bmatrix}\\\mathbf y\\\\\end{bmatrix} \tag 3$$

The definition of the norm says:
$$\sum_i |a_i|^2 = \|\mathbf a\|^2 = \mathbf a^T \mathbf a \tag 4$$

Substitute our expression into the norm:
$$\sum_{i=1}^n |\boldsymbol\theta^T \mathbf x_i - y_i|^2 = \sum_{i=1}^n |(X \boldsymbol\theta - \mathbf y)_i|^2 = \|X \boldsymbol\theta - \mathbf y\|^2 = (X \boldsymbol\theta - \mathbf y)^T (X \boldsymbol\theta - \mathbf y) \tag 5$$
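Each step above is a computational fact you can check numerically. A quick sketch in Python with numpy, on made-up data (all names and shapes here are my own choices, not anything from the original question):

```python
import numpy as np

rng = np.random.default_rng(0)  # made-up data for the check
theta = rng.normal(size=3)

# (1)/(2): a scalar equals its own transpose, so theta^T x == x^T theta
x1 = rng.normal(size=3)
assert np.isclose(theta @ x1, x1 @ theta)

# (3): stacking the x_i^T as rows of X makes (X theta)_i == x_i^T theta
X = rng.normal(size=(5, 3))     # row i plays the role of x_i^T
for i in range(5):
    assert np.isclose((X @ theta)[i], X[i] @ theta)

# (5): the sum of squared residuals equals the quadratic form
y = rng.normal(size=5)
r = X @ theta - y
assert np.isclose(sum((X[i] @ theta - y[i]) ** 2 for i in range(5)), r @ r)
```

If any of the identities were wrong, one of the assertions would fail; running it is a cheap sanity check, not a proof.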


The first sum $$\sum_{i=1}^n|\theta^Tx_i-y_i|^2$$ is nothing other than the expanded form of the norm squared of some vector (which in this case can be regarded as a distance, but I want to keep things simple). The norm squared can be expressed as the scalar product of that vector with itself. Let me explain: take a vector $\mathbf{v}\in\mathbb{R}^3$ defined as $$\mathbf{v} = (x,y,z).$$ From the Pythagorean theorem, we know that the length squared of this vector is $$\|\mathbf{v}\|^2 = x^2+y^2+z^2.$$ Mathematically speaking, the length and the norm are the same thing (the norm is just more general). We can define the norm of a vector with the scalar product in this manner: $$\|\mathbf{v}\|^2=\mathbf{v}\cdot \mathbf{v} = \mathbf{v}^T\mathbf{v}= \left(\begin{matrix}x&y&z\end{matrix}\right)\left(\begin{matrix}x\\y\\z\end{matrix}\right) = x^2+y^2+z^2 = \sum_{i=1}^3 x_i^2,$$ from basic matrix multiplication, where I've written $(x,y,z)=(x_1,x_2,x_3)$.
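The $\mathbb{R}^3$ example is concrete enough to check directly. A minimal sketch with numpy, using a made-up vector of my own:

```python
import numpy as np

# Check that ||v||^2 = v . v = v^T v for a concrete 3-vector
v = np.array([1.0, 2.0, 3.0])              # plays the role of (x, y, z)

pythagoras = v[0]**2 + v[1]**2 + v[2]**2   # x^2 + y^2 + z^2 = 14
dot = v @ v                                # v^T v, via matrix multiplication
norm_sq = np.linalg.norm(v) ** 2           # ||v||^2

assert np.isclose(pythagoras, dot)
assert np.isclose(dot, norm_sq)            # all three are 14 (up to float error)
```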

So back to our example $$\sum_{i=1}^n|\theta^Tx_i-y_i|^2.$$ In this case the vector whose norm we're taking has $i$-th component $v_i = \theta^Tx_i-y_i$, which is the difference (component by component) of the two vectors $X\boldsymbol\theta$ and $\mathbf{y}$, where $X$ is the matrix whose $i$-th row is $x_i^T$. So the vector itself is to be written as $$\mathbf{v} = X\boldsymbol\theta-\mathbf{y}.$$ Now, knowing what I told you earlier, the norm squared of this vector is $$\mathbf{v}^T \mathbf{v}= (X\boldsymbol\theta-\mathbf{y})^T(X\boldsymbol\theta-\mathbf{y}).$$


Most mathematicians would make the jump from the first line to the second line instantly, not because they are smart, but because they have seen that sort of thing before. The first couple of times you encounter $\|\mathbf{v}\|^2 = \mathbf{v}^T\mathbf{v}$ you have to puzzle through it. Sooner rather than later you simply recognize it.

You can doubtless think of similar examples in your software background. When you are learning a new programming language it might all seem mysterious. You might need to spend a lot of time understanding what a single line of code does, so understanding a large program in that language seems out of reach. But, as you become fluent in that language you start to pick up on common idioms. Rather than puzzling over every single line you develop an ability to stand back and instantly see what an entire block of code does. Similarly, the more math you do the more you develop the ability to stand back and see the overall flow of a proof or computation. It never becomes as easy as reading a novel, but it ceases to be an endless stream of new enigmas.

You can also use your ability to program as a tool to help you assimilate the math. The equality of those two lines is a computational fact. You can write two functions, one which does the computation on the first line, and one that does the computation on the second line. Verify that they give the same output (up to floating-point error) for various inputs. That wouldn't constitute a proof, but it would give you more insight into what is going on. For linear algebra, something like R, which has transpose and matrix multiplication built into the language itself, would be a good choice. Python with numpy is another good option for quickly translating mathematical ideas into code.
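A sketch of that two-function experiment in Python with numpy (the function names and the random test data are my own invention):

```python
import numpy as np

def loss_componentwise(X, theta, y):
    """First line: sum over i of (theta^T x_i - y_i)^2."""
    return sum((theta @ x_i - y_i) ** 2 for x_i, y_i in zip(X, y))

def loss_matrix(X, theta, y):
    """Second line: (X theta - y)^T (X theta - y)."""
    r = X @ theta - y
    return r @ r

# Compare the two on a few random inputs
rng = np.random.default_rng(42)
for _ in range(3):
    X = rng.normal(size=(6, 4))   # rows are the x_i^T
    theta = rng.normal(size=4)
    y = rng.normal(size=6)
    assert np.isclose(loss_componentwise(X, theta, y), loss_matrix(X, theta, y))
```

The first function mirrors the summation notation line by line; the second is the one-shot vectorized form. Seeing them agree on random inputs is exactly the kind of empirical confidence-builder described above.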