Solution 1:

We work with fields of numbers, such as $\Bbb Q$, the field of rational numbers, $\Bbb R$, the field of real numbers, and $\Bbb C$, the field of complex numbers. What is a field? It's a set in which two operations - addition and multiplication, each invertible (apart from division by zero) - interact. Elementary algebra is simply the study of that interaction.

What's a linear function defined on one of these fields? It's a function that is compatible with the two operations. If $f(x+y)=f(x)+f(y)$ and $f(cx)=cf(x)$, then the algebraic structure of the domain is preserved when you apply $f$. (That's as long as $f$ is invertible; I'm glossing over some details.) Essentially, such a function simply takes the field and scales it, possibly flipping it around as well. In the complex field, the picture is a little more... complex, but fundamentally the same.
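For instance (to make "scaling" concrete): on $\Bbb R$ itself these two rules already force $f$ to be multiplication by a single constant, since $$ f(x) = f(x \cdot 1) = x\,f(1), $$ so every linear map $\Bbb R \to \Bbb R$ is $x \mapsto ax$ with $a = f(1)$ - a pure stretch, with a flip if $a$ is negative.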

The most intuitive vector spaces - finite dimensional ones over our familiar fields - are basically just multiple copies of the base field, set at "right angles" to each other. Invertible linear functions now just scale, reflect, rotate and shear this basic picture, but they preserve the algebraic structure of the space.
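To make "scale, reflect, rotate and shear" concrete, here are representative examples in $\Bbb R^2$: $$ \begin{pmatrix} 2 & 0 \\ 0 & 2 \end{pmatrix}, \quad \begin{pmatrix} 1 & 0 \\ 0 & -1 \end{pmatrix}, \quad \begin{pmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{pmatrix}, \quad \begin{pmatrix} 1 & 1 \\ 0 & 1 \end{pmatrix} $$ - a scaling by $2$, a reflection across the horizontal axis, a rotation by $\theta$, and a horizontal shear. Every invertible linear map of the plane can be built by composing maps of these basic kinds.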

Now, we often work with transformations that do more complicated things than this, but if they are smooth transformations, then they "look like" linear transformations when you "zoom in" at any point. To analyze something complicated, you have to simplify it in some way, and a good way to simplify working with some weird non-linear transformation is to describe and study the linear transformations that it "looks like" up close.
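In symbols, that "looks like" statement is exactly what the derivative records: near a point $a$, a smooth map satisfies $$ f(a+h) \approx f(a) + Df(a)\,h, $$ where $Df(a)$ is a linear map (the Jacobian matrix) and the error is small compared to $\|h\|$. Zooming in at $a$ literally replaces $f$ by the linear map $Df(a)$, up to a constant shift.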

This is why we see linear problems arise so frequently. Some situations are modeled by linear transformations, and that's great. However, even situations modeled by non-linear transformations are often approximated with appropriate linear maps. The first and roughest way to approximate a function is with a constant, but we don't get a lot of mileage out of that. The next fancier approach is to approximate with a linear function at each point, and we do get a lot of mileage out of that. If you want to do better, you can use a quadratic approximation. These are great for describing, for instance, critical points of multi-variable functions. Even the quadratic description, however, uses tools from linear algebra.
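Concretely, the hierarchy of approximations around a point $a$ is $$ f(a+h) \;\approx\; \underbrace{f(a)}_{\text{constant}} + \underbrace{\nabla f(a)^T h}_{\text{linear}} + \underbrace{\tfrac12\, h^T \nabla^2 f(a)\, h}_{\text{quadratic}}, $$ and at a critical point the linear term vanishes, so the local behavior is governed by the Hessian $\nabla^2 f(a)$ - whose eigenvalues are analyzed with, again, linear algebra.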


Edit: I've thought more about this, and I think I can speak further to your question from the comments: "why does the property of linearity make linear functions so 'rigid'?"

Consider restricting a linear function on $\Bbb R$ to the integers. The integers are a nice, evenly spaced, discrete subset of $\Bbb R$. After applying a linear map, their image is still a nice, evenly spaced, discrete subset of $\Bbb R$. Take all the points with integer coordinates in $\Bbb R^2$ or $\Bbb R^3$, and the same thing is true. You start with evenly spaced points all in straight lines, and after applying a linear map, you still have evenly spaced points, all in straight lines. Linear maps preserve lattices, in a sense, and that's precisely because they preserve addition and scalar multiplication. Keeping evenly spaced things evenly spaced, and keeping straight lines straight, seems to be a pretty good description of "rigidity".
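The computation behind this is one line: a lattice point is $m\,e_1 + n\,e_2$ with $m, n \in \Bbb Z$, so its image under a linear map $T$ is $$ T(m\,e_1 + n\,e_2) = m\,T(e_1) + n\,T(e_2), $$ i.e. the same integer combinations, just of the new vectors $T(e_1)$ and $T(e_2)$ - still an evenly spaced grid of points on straight lines, merely skewed (and non-degenerate as long as $T$ is invertible).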

Does that help at all?

Solution 2:

I'll give my two cents, from an applied perspective: what makes linearity so powerful is that linear operations are easily invertible. Many, many, many problems in mathematics boil down to having to solve for $x$ in some relation of the form $$ y = f(x). $$ There is of course no general method of computing $x = f^{-1}(y)$ for arbitrary $f$, but if $f$ is linear, i.e. $$ y = Ax $$ for some matrix $A$, and $A$ is invertible, then we can simply do some arithmetic and compute $$ x = A^{-1}y. $$ Even if the problem is overconstrained and there is no exact solution, we can still use linear algebra to compute a pseudo-inverse: $$ x = (A^TA)^{-1}A^Ty, $$ and get the least-squares solution (the best we can hope for), minimizing $\|y-Ax\|_2^2$.
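Here is a minimal numerical sketch of that least-squares computation (the matrix and data below are made up purely for illustration):

```python
import numpy as np

# An overconstrained system: 5 equations, 2 unknowns (made-up data).
A = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0],
              [1.0, 4.0],
              [1.0, 5.0]])
y = np.array([1.1, 1.9, 3.2, 3.9, 5.1])

# Least-squares solution via the normal equations (A^T A) x = A^T y,
# i.e. x = (A^T A)^{-1} A^T y as above (A has full column rank here).
x_normal = np.linalg.solve(A.T @ A, A.T @ y)

# In practice one uses a dedicated least-squares routine, which avoids
# forming A^T A explicitly and is numerically better behaved.
x_lstsq, *_ = np.linalg.lstsq(A, y, rcond=None)

print(x_normal, x_lstsq)  # the two solutions agree (up to rounding)
```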

Maybe this begs the question "well then why can we invert linear functions so easily?", but I think this can be explained by the fact that field addition and multiplication are invertible, by definition, and linear maps are composed of nothing but addition and multiplication. It seems pretty natural to me that transformations composed of the fundamentally invertible field operations $(+, \cdot)$ will be invertible by e.g. back-substitution (in non-degenerate cases, of course). Note that linear algebra is ubiquitous in applications, while module theory is not --- the only difference is that a module's scalars come from a ring, where multiplication need not be invertible!
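As a tiny illustration of the back-substitution point: to solve the triangular system $2x + y = 5$, $3y = 6$, you only ever subtract and divide - $y = 6/3 = 2$, then $x = (5 - y)/2 = 3/2$ - and both steps are possible precisely because every nonzero field element has a multiplicative inverse.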


I think this line of reasoning also addresses the question:

An idea of why linear problems, or linearization, shows up so frequently

The reason is that we frequently need to invert things, and often the only way to go about that is to compute a linear approximation and invert that. Two examples:

  • The extended Kalman filter takes a state update and observation model \begin{align*}x_k &= f(x_{k-1}) + w_k \\ z_k &= h(x_k) + v_k \end{align*} and linearizes it to \begin{align*} x_k &= Fx_{k-1} + w_k \\ z_k &= H x_k + v_k \end{align*} where $F$ and $H$ are the Jacobians of $f$ and $h$ evaluated at the current state estimate. This makes it possible to compute the Kalman gain, which requires an inversion: $$ K_k = P_{k|k-1} H_k^T(H_k P_{k|k-1}H_k^T + R_k)^{-1}. $$
  • Newton's method in optimization requires solving $$ \frac{\partial}{\partial \delta} f(x_k + \delta) = 0 $$ for $\delta$ at each step. Taking a quadratic approximation $m_k(\delta) = f_k + \nabla f_k^T \delta + \tfrac12\, \delta^T \nabla^2 f_k\, \delta$ makes this equation linear, and we are able to invert and solve for the optimal step: $$ \delta = - (\nabla^2 f_k)^{-1} \nabla f_k. $$ (A minimal numeric sketch of one such step follows this list.)
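A minimal numeric sketch of one Newton step in NumPy - the objective here is an arbitrary made-up choice, just to show the "linearize, then invert" pattern:

```python
import numpy as np

# Toy objective f(x, y) = (x - 1)**4 + (y + 2)**2 (illustrative only),
# with its gradient and Hessian written out by hand.
def grad(x):
    return np.array([4 * (x[0] - 1) ** 3, 2 * (x[1] + 2)])

def hess(x):
    return np.array([[12 * (x[0] - 1) ** 2, 0.0],
                     [0.0, 2.0]])

x_k = np.array([3.0, 0.0])

# Newton step: solve the *linear* system (Hessian) * delta = -gradient,
# i.e. delta = -(Hessian)^{-1} gradient, as in the bullet above.
delta = np.linalg.solve(hess(x_k), -grad(x_k))
x_next = x_k + delta
print(x_next)  # moves toward the minimizer at (1, -2)
```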

Solution 3:

Linear problems are so very useful because they describe small deviations, displacements, signals, etc. very well, and because they admit unique solutions. For sufficiently small $x$, $f(x) = a_0 + a_1 x + a_2 x^2 + a_3 x^3 + \ldots$ can be very well approximated by $a_0 + a_1 x$. Even the simplest nonlinear equation, $x^2 - 3 = 0$, has two real solutions, making analysis more difficult. A linear equation has a single solution (when one exists).
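The classic example is $\sin x = x - \frac{x^3}{6} + \ldots \approx x$ for small $x$, which is exactly what turns the pendulum equation into the linear (and exactly solvable) simple harmonic oscillator for small swings.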

Solution 4:

I think this question misses the point slightly. It's not about why the axioms of linearity make linear functions so well understood; it's more (in my view) about why such simple operations fully characterize what we think of as linear.

Here is a pretty fluffy answer, but something that might help(?)

There are three ways to understand why linear functions are so desirable.

  1. For a linear function, where you are headed is not determined by where you are.

That is, say for optimization: whether you should increase or decrease your parameter depends on whether you are before or after a peak. This is not so for linear functions, and, unlike for other functions, the effect of increasing your parameter by $\Delta x$ does not depend on where you are either.

This is succinctly the fact that $f(x+\Delta x)=f(x)+f(\Delta x)$ for all choices of $x$.
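Compare with something nonlinear like $g(x) = x^2$: there $$ g(x + \Delta x) - g(x) = 2x\,\Delta x + (\Delta x)^2, $$ which very much depends on where you are, whereas for a linear $f$ the increment $f(x+\Delta x) - f(x) = f(\Delta x)$ is the same everywhere.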

It makes sense, then, that we would use linear functions in a variety of problems to decide what is going on locally - and this is the key insight of the derivative.

  2. Linear transformations have a nontrivial and easy-to-describe geometry.

This is not any more difficult than why one computes Riemann integrals (which are a linearization in their own right) by looking at the areas of rectangles, or measures the angle at which two curves intersect by intersecting their tangent lines. Linear things just have a clear geometry that well approximates large classes of objects. This geometry is one that can scale, too, which is essential to our geometric pictures: if we transform something, the picture should not depend on a choice of co-ordinate axes, which is to say that $kf(x)=f(kx)$.

  3. Dimension plays a different role in (finite-dimensional) linear algebra; that is, when things behave linearly, the linear geometry (and math) of $\mathbb R^4$ is not that different from that of $\mathbb R^3$. This is basically a consequence of the fact that $V \cong k^n$ for the ground field $k$, where the latter is read as a $k$-vector space. This imposes a certain homogeneity on a vector space (and on the linear functions on it), since vector spaces are "homogeneous" as you go up in dimension: essentially just more copies of $k$.

Basically: if you solve a problem in a given dimension for one vector space, you've got it for all vector spaces of that dimension, as long as you can transform one into the other.