A path to truly understanding probability and statistics
Solution 1:
As someone who started out their career thinking of statistics as a messy discipline, I'd like to share my epiphany regarding the matter. For me, the insight came from Linear Algebra, so I would urge you to push in that direction.
Specifically, once you realize that the sum of squares, $\sum_i X_i^2$, and sum of products, $\sum_i X_i Y_i$, are both inner products (aka dot products), you realize that nearly all of statistics can be thought of as various operations from linear algebra.
If you sample $n$ values from a population, you have an $n$-dimensional vector. The sample mean (as a vector) is the projection of this vector onto the all-ones vector $\mathbf{1}$. The standard deviation is, up to a factor of $1/\sqrt{n-1}$, the length of the projection onto the $(n-1)$-dimensional hyperplane normal to $\mathbf{1}$ (finally an intuitive reason for the "$n-1$" in the denominator!). Specifically, for the sample variance $s^2$ of a sample $X$, here is the linear algebra:
First, we work with deviations from the mean. The mean in linear algebra terms is
$\bar{X}=\frac{\langle X,\mathbf{1}\rangle}{\langle \mathbf{1},\mathbf{1}\rangle} \mathbf{1}$
where $\langle \cdot, \cdot \rangle$ is the inner product and $\mathbf{1}$ is the $n$-dimensional ones vector. Then the deviation from the mean is
$x = X - \bar{X}$
Note that $x$ is constrained to an $(n-1)$-dimensional subspace. The usual equation for variance is
$s^2 = \dfrac{\sum_i (X_i - \bar{X})^2}{n-1}$
For us, that's
$s^2 = \dfrac{\langle x, x \rangle}{\langle \mathbf{1}, \mathbf{1} \rangle}$
which, without going into too much detail (too late), is a normalized squared deviation. The trick is that the $\mathbf{1}$ in the denominator is now the $(n-1)$-dimensional ones vector, since $x$ is confined to an $(n-1)$-dimensional subspace, so $\langle \mathbf{1}, \mathbf{1} \rangle = n-1$.
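As a quick sanity check, the whole derivation can be carried out with nothing but inner products and compared against the built-in mean and variance. A minimal sketch in Python with NumPy (the sample values are made up for illustration):

```python
import numpy as np

X = np.array([2.0, 4.0, 6.0, 8.0])   # hypothetical sample
ones = np.ones_like(X)               # the all-ones vector

# Sample mean as the projection of X onto the ones vector
X_bar = (X @ ones) / (ones @ ones) * ones

# Deviation vector: lies in the hyperplane normal to the ones vector
x = X - X_bar

# Sample variance: squared length of the deviation vector over n-1
s2 = (x @ x) / (len(X) - 1)

print(np.isclose(X_bar[0], X.mean()))   # True
print(np.isclose(s2, X.var(ddof=1)))    # True
```

Note that `ddof=1` tells NumPy to use the $n-1$ denominator, matching the geometric picture above.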
The other good example is that correlation between two samples is related to the angle between them in that $n$-dimensional space. To see this, consider that the angle between two vectors $v$ and $w$ is:
$\theta = \arccos \dfrac{\langle v, w \rangle}{\|v\|\|w\|}$
where $\|\cdot\|$ is vector length. Compare this to one of the forms of the Pearson correlation and you will see that $r = \cos \theta$, provided $v$ and $w$ are the centered samples (the deviations from their respective means).
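This identity is easy to verify numerically. A small NumPy sketch (with made-up sample values): center both samples, compute the cosine of the angle between the centered vectors, and compare it with NumPy's Pearson correlation:

```python
import numpy as np

X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # hypothetical samples
Y = np.array([2.0, 1.0, 4.0, 3.0, 5.0])

# Center both samples (project out the all-ones direction)
x = X - X.mean()
y = Y - Y.mean()

# Cosine of the angle between the centered vectors
cos_theta = (x @ y) / (np.linalg.norm(x) * np.linalg.norm(y))

# Pearson correlation coefficient from NumPy
r = np.corrcoef(X, Y)[0, 1]

print(np.isclose(cos_theta, r))  # True
```

So perfectly correlated samples point in the same direction ($\theta = 0$), uncorrelated samples are orthogonal, and anticorrelated samples point in opposite directions.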
There are many other examples, and these have barely been explained here, but I just hope to give an impression of how you can think in these terms.
Solution 2:
My humble contribution to your book list: Linear Algebra Done Right by Axler. It's a brilliant book that makes a lot of abstract things very clear. It has been recommended to me many times.
Also, I recently found a book entitled Statistical Methods: The Geometric Approach. I haven't read all of it yet, but it gives a very basic introduction to probability from a linear algebra perspective, which I think is very intuitive (much easier on the eyes, I feel, than sigmas with a bunch of random indices).
(Sorry, I'm too noob on this website to post a comment.)