Where do the higher order terms in Taylor series come from?
To simplify, let's discuss first what happens near $x_0 = 0$.
The way the local linear approximation works is that we are looking for the line that "best approximates" the function near $0$. This means that we want a function $y=a+bx$ with the property that the error between $y=a+bx$ and $y=f(x)$ goes to $0$ faster than $x$ does as $x$ approaches $0$. That is, we want $$\lim_{x\to 0}\frac{f(x) - (a+bx)}{x} = 0.$$ A bit of work shows that the only $a$ and $b$ that can work are $a=f(0)$ and $b=f'(0)$: $$\begin{align*} \lim_{x\to 0}\frac{f(x)-(a+bx)}{x} &= \lim_{x\to 0}\frac{f(x)-f(0)+f(0)-(a+bx)}{x}\\ &= \lim_{x\to 0}\frac{f(x)-f(0)}{x-0} + \lim_{x\to 0}\frac{f(0)-(a+bx)}{x}\\ &= \lim_{x\to 0}\frac{f(x)-f(0)}{x-0} + \lim_{x\to 0}\frac{f(0)-a}{x} - \lim_{x\to 0}\frac{bx}{x}. \end{align*}$$ If the function is differentiable, then the first limit is $f'(0)$; the last limit is $b$; and if we want the limit in the middle to exist, we need $f(0)=a$. So the only way the overall limit can equal $0$ is if $b=f'(0)$ and $a=f(0)$.
This tells us that the line of "best approximation" to $f(x)$ near zero is $y = f(0) + f'(0)x$, which is precisely the local linear approximation.
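For a concrete check, take $f(x)=e^x$: the recipe gives $a=e^0=1$ and $b=e^0=1$, so the line of best approximation is $y=1+x$, and indeed $$\lim_{x\to 0}\frac{e^x-(1+x)}{x} = \lim_{x\to 0}\frac{e^x-1}{x} - 1 = 1-1 = 0.$$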
(The fraction $\frac{f(x)-(a+bx)}{x}$ is the "relative error": the numerator measures the error, and by dividing it by $x$ we take into account how large the quantity itself is. If I tell you that I measured a distance and I was off by 2 miles, you don't know whether it was a very good approximation or a very bad one; it would be very good if I were measuring the distance to the moon, and very bad if I were measuring the length of my desk. The denominator "normalizes" the measurement so that it tells us how big the error is relative to how big the thing we are measuring is.)
Now, the advantage of the local linear approximation is that it is very simple; the disadvantage is that it can get bad pretty quickly; for example, the local linear approximation of $y=\cos x$ at $0$ is $y=1$, which gets "bad" very quickly.
So we may want to approximate with something else which, while still easy, has a better chance of being a good approximation.
One possibility is to go from linear to quadratic: let's try to approximate $y=f(x)$ with a quadratic function, $y = a+bx+cx^2$; again, in order for the approximation to be very good, we want the error to go to zero faster than $x$ goes to $0$. And since now we have a parabola, we can require that the curvature of the parabola at $0$ be the same as the curvature of $y=f(x)$ at $0$ (this is what messes us up with $y=\cos x$: it has big curvature, but the local linear approximation is flat).
Now, for this parabola to approximate $y=f(x)$ well near $0$, we are going to want their local linear approximations to be the same (if they have different local linear approximations, then we can't expect $y=a+bx+cx^2$ to be "close to" $f(x)$). The local linear approximation to $y=a+bx+cx^2$ at $x=0$ is given by $a+bx$, so we want $a+bx = f(0)+f'(0)x$; that is, $a=f(0)$ and $b=f'(0)$.
To get the same curvature, we want $f''(0) = y''(0)$. Since $y''(0) = 2c$, we want $c = \frac{1}{2}f''(0)$. So our $y$ will be $$y = f(0) + f'(0)x + \frac{f''(0)}{2}x^2.$$
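Going back to the example of $y=\cos x$: we have $\cos 0 = 1$, $\cos'0 = -\sin 0 = 0$, and $\cos''0 = -\cos 0 = -1$, so the parabola of best approximation at $0$ is $$y = 1 + 0\cdot x - \frac{1}{2}x^2 = 1 - \frac{x^2}{2},$$ which, unlike the flat line $y=1$, actually bends downward with the cosine curve.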
In fact, this is indeed a very good approximation: it has the same value as $f(x)$ at $x=0$, and the relative error goes to zero, since $$\begin{align*} \lim_{x\to 0}\frac{f(x) - (f(0)+f'(0)x + \frac{1}{2}f''(0)x^2)}{x} &= \lim_{x\to 0}\frac{f(x) - (f(0)+f'(0)x)}{x} - \lim_{x\to 0}\frac{f''(0)x^2}{2x}\\ &= 0 - \lim_{x\to 0}\frac{1}{2}f''(0)x = 0. \end{align*}$$ But more than that: the relative error between the derivatives goes to $0$: $$\begin{align*} \lim_{x\to 0}\frac{f'(x) - (f'(0)+f''(0)x)}{x} &= \lim_{x\to 0}\frac{f'(x)-f'(0)}{x}- f''(0)\\ &= f''(0)-f''(0) = 0. \end{align*}$$
So not only does $y=f(0) + f'(0)x + \frac{1}{2}f''(0)x^2$ have the same value and the best possible relative error, but its derivative is also the best possible approximation to the derivative of $f(x)$; so the graph will have a very similar shape near $0$.
If we then go on to a degree $3$ approximation, $y=a+bx+cx^2+dx^3$, and we want the relative error to go to $0$, we are going to need $a=f(0)$ and $b=f'(0)$. If we want the relative error in the derivative to go to $0$, we are going to need $c=\frac{1}{2}f''(0)$ as before. What if we want the relative error of the second derivative to go to $0$ as well? The second derivative of $y=a+bx+cx^2+dx^3$ is $2c+6dx = f''(0) + 6dx$. So we have: $$\begin{align*} \lim_{x\to 0}\frac{f''(x) - (f''(0)+6dx)}{x} &= \lim_{x\to 0}\frac{f''(x)-f''(0)}{x} - \lim_{x\to 0}6d\\ &= f'''(0) - 6d. \end{align*}$$ For this to be equal to $0$, we need $6d = f'''(0)$, or $d = \frac{1}{6}f'''(0)$.
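For instance, with $f(x)=\sin x$ we get $\sin 0 = 0$, $\sin'0 = \cos 0 = 1$, $\sin''0 = -\sin 0 = 0$, and $\sin'''0 = -\cos 0 = -1$, so the degree $3$ approximation is $$y = x - \frac{x^3}{6},$$ which does noticeably better than $y=x$ once you move away from $0$.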
If we then go to a degree $4$ approximation and ask that the relative error, the relative error of the derivatives, the relative error of the second derivatives, and the relative error of the third derivatives all go to $0$, then we find that we need the function $y= f(0) +f'(0)x + \frac{1}{2}f''(0)x^2 + \frac{1}{6}f'''(0)x^3 + ex^4$, where $f^{(4)}(0) = 24e$, leading to $e = \frac{1}{24}f^{(4)}(0)$.
It is not hard to then see, inductively, that the coefficient we get for $x^n$ if we proceed this way will always be $\frac{f^{(n)}(0)}{n!}$.
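To see where the factorial comes from in a single step of that induction: if the approximation ends in $a_nx^n$, then differentiating it $n-1$ times turns that last term into $n!\,a_n x$, so asking that the relative error of the $(n-1)$st derivatives go to $0$ forces $$\lim_{x\to 0}\frac{f^{(n-1)}(x) - \bigl(f^{(n-1)}(0) + n!\,a_n x\bigr)}{x} = f^{(n)}(0) - n!\,a_n = 0,$$ that is, $a_n = \frac{f^{(n)}(0)}{n!}$.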
You can repeat the same idea around any point $x_0$, but if you do it directly it turns out that you cannot easily use the coefficients you found for, say, the linear approximation, in the quadratic approximation; you end up with a system of equations instead of simply repeating the old coefficients. The simple way of "fixing" this is to shift everything to $0$, solve the problem at $0$, and then shift it back.
So to solve the problem at $x_0$, we consider instead $g(x) = f(x+x_0)$, because then approximating $f$ near $x_0$ is the same as approximating $g$ near $0$. Moreover, $g'(x) = f'(x+x_0)(x+x_0)' = f'(x+x_0)$, $g''(x) = f''(x+x_0)$, etc. So, from the work above, we see that the degree $n$ approximation to $g$ near $0$ that has the best possible relative error, best possible relative error between derivatives, best possible relative error between second derivatives, etc., is $$g(0) + g'(0)x + \frac{1}{2}g''(0)x^2 + \cdots + \frac{1}{n!}g^{(n)}(0)x^n.$$ But from the above, this is the same as $$f(x_0) + f'(x_0)x + \frac{1}{2}f''(x_0)x^2 +\cdots + \frac{1}{n!}f^{(n)}(x_0)x^n.$$ This is the local approximation to $f(x+x_0)$ near $x=0$. Replacing $x$ with $x-x_0$ (the "old $x$" was near $0$, so the new $x$ needs to be near $x_0$, and $f((x-x_0)+x_0)=f(x)$), we get that for $x$ near $x_0$, $$f(x)\approx f(x_0) + f'(x_0)(x-x_0) + \frac{1}{2}f''(x_0)(x-x_0)^2 + \cdots + \frac{1}{n!}f^{(n)}(x_0)(x-x_0)^n,$$ which is exactly the formula for the $n$th Taylor polynomial approximation to $f(x)$ near $x_0$.
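As a quick illustration of the shift, take $f(x)=\ln x$ and $x_0=1$: then $g(x)=\ln(1+x)$, with $g(0)=0$, $g'(0)=1$, and $g''(0)=-1$, so for $x$ near $1$ we get $$\ln x \approx (x-1) - \frac{1}{2}(x-1)^2.$$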
Sam L. raises the fair question of why we would want to assume that our approximations are polynomials. If you are going via the "make the relative error go to $0$; make the relative error of the relative error go to $0$", etc., you will discover that you are naturally led to polynomials.
Here's another motivation: when we encounter a function which we want to integrate and for which we are unable to find an antiderivative, we end up with two possible approaches:
1. Attempt to approximate the value of the integral via Riemann sums; in essence, the integral is a limit, so we can approximate the limit by using terms of the sequence whose limit we are trying to compute. This leads to the left, right, midpoint, trapezoid, and Simpson approximations, and other methods of numerical approximation to the integral using the function $f$.
2. Find a function $\mathcal{F}$ which approximates $f$ but is easy to integrate. (In fact, this is in part what is behind Simpson's rule, which approximates the function $f$ on each subinterval by a quadratic.) Since polynomials are generally easy to integrate, trying to find polynomials that are good approximations to $f$ seems like a good idea; that is, the class of polynomials is a good proving ground for "good approximations to $f$", because they are easy to integrate.
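As a rough illustration of the second approach: $e^{-x^2}$ has no elementary antiderivative, but replacing it by the first few terms of its Taylor expansion at $0$ gives $$\int_0^1 e^{-x^2}\,dx \approx \int_0^1\left(1 - x^2 + \frac{x^4}{2} - \frac{x^6}{6}\right)dx = 1 - \frac{1}{3} + \frac{1}{10} - \frac{1}{42} \approx 0.7429,$$ already quite close to the true value of about $0.7468$.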
Making several assumptions, you can argue as follows.
First of all, to simplify the reasoning, we consider the case $x_0=0$. One way to get such coefficients is to assume that $f$ is "well approximated" by polynomials, which could mean, for example: for any $n\in\mathbb{N}$ we have $$ f(x)=a_0+a_1x+a_2x^2+\ldots+a_nx^n+R(x) \qquad (1), $$ where $R(x)/x^n\to 0$ as $x\to 0$.
Now, supposing that the function $R$ is $n$ times differentiable at the origin, by computing successive derivatives of both sides of (1) at $x=0$ we get that the $a_j$ have to be the Taylor coefficients.
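Sketching that computation: the assumption $R(x)/x^n\to 0$, together with the differentiability of $R$, forces $R(0)=R'(0)=\cdots=R^{(n)}(0)=0$; and differentiating the polynomial part of (1) exactly $j$ times and evaluating at $x=0$ kills every term except the one coming from $a_jx^j$, which contributes $j!\,a_j$. So $f^{(j)}(0) = j!\,a_j$, i.e. $a_j = \frac{f^{(j)}(0)}{j!}$.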
I think the most naive approximation method leads to polynomials all by itself - without explicitly trying to approximate our initial function $f$ by polynomials at all:
The most primitive approximation of a continuous function around $0$ is simply given by $f(x) \approx f(0)$. Of course we are often not very satisfied with this, so it is natural to ask what the error $f(x) - f(0)$ looks like for growing $x$, i.e. we want to compare $f(x)-f(0)$ to $x$ - that is, we want to investigate the expression
$$err_1(x) = \frac{f(x) - f(0)}{x}$$
If this error function is continuous at $0$, then we can use the same idea of approximation as above and obtain $err_1(x) \approx err_1(0) = f'(0)$. Solving for $f(x)$, we obtain the approximation
$$f(x) = f(0) + err_1(x) x \approx f(0) + f'(0) x$$
If this approximation is still not good enough, then we can go on to investigate the error in the approximation of the first error, i.e.
$$err_2(x) = \frac{err_1(x) - err_1(0)}{x} = \frac{\frac{f(x)-f(0)}{x} - f'(0)}{x} = \frac{f(x) - f(0) - f'(0)x}{x^2}$$
If this is continuous at $0$, then - as above - we can make the approximation
$$err_2(x) \approx err_2(0) = \lim_{x\to0} \frac{f(x) - f(0) - f'(0)x}{x^2} = \frac{f''(0)}{2}$$
So solving for $f(x)$ we obtain:
$$f(x) = f(0) + f'(0)x + err_2(x)x^2 \approx f(0) + f'(0) x + \frac{f''(0)}{2}x^2$$
and so on.
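One more step of the same recipe, just to watch the factorial appear: $$err_3(x) = \frac{err_2(x) - err_2(0)}{x} = \frac{f(x) - f(0) - f'(0)x - \frac{f''(0)}{2}x^2}{x^3},$$ and (assuming the limit exists) $$err_3(x) \approx err_3(0) = \lim_{x\to 0}\frac{f(x) - f(0) - f'(0)x - \frac{f''(0)}{2}x^2}{x^3} = \frac{f'''(0)}{6},$$ which, after solving for $f(x)$, produces the cubic term $\frac{f'''(0)}{6}x^3$.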
So all we ever do is approximate the error of the error of the error in the most crude way and hope that each step will lead to a better approximation of our initial function (which is exactly what one would do naively, I think).
In particular, it need not even be clear a priori that this leads to a polynomial approximation. (I'm only writing this because other answers seem to assume a certain form of the approximation and then show that this approximation is exactly the Taylor polynomial - it is, however, not clear to me where the motivation for using polynomials rather than other functions comes from...)
I'll contribute my grain of sand.
When I studied Taylor polynomials, this was the idea behind them (or at least what I took away):
Suppose $f(x)$ is an $n$ times differentiable function, continuous in a neighborhood of $x=a$. Suppose we build another polynomial function $p(x)$ such that $f^{(k)}(a) = p^{(k)}(a)$ for $k=0,\dots,n$. Then $p$ should approximate $f$ with a good degree of accuracy. Why is this so? If the functions have the same value at $a$, they agree at that point. If they have the same derivative, then they approach $(a,f(a))$ in a similar manner. If they have the same second derivative, then they have similar curvature and their derivatives behave similarly in that neighborhood.
The idea is that each higher order term "extends" the neighborhood in which the functions are more and more like each other, and provides a higher accuracy (this can be explained by finding a formula for the error produced by the approximation, which is very interesting theory too).
As a bonus, you can inductively check that if we build up a polynomial around $x=a$
$$p_a^n(x) = a_0 + a_1 (x-a)+a_2(x-a)^2+\cdots+a_n(x-a)^n$$
then by differentiating and applying our condition we'll find
$$a_n = \frac{f^{(n)}(a)}{n!}$$
for every coefficient of the polynomial.
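The check itself is short: after differentiating $p_a^n$ exactly $k$ times, every term of degree less than $k$ has vanished, every term of degree greater than $k$ still carries a factor of $(x-a)$ and so is $0$ at $x=a$, and the degree-$k$ term has become the constant $k!\,a_k$. Equating with $f^{(k)}(a)$ gives $$a_k = \frac{f^{(k)}(a)}{k!}.$$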
Introducing some more theory, we can talk about Landau's $o$ notation. The notation is basically the following:
We say $f(x)=o(g(x))$ as $x \to a$, read "$f$ is little $o$ of $g$ when $x$ approaches $a$" if
$$\lim\limits_{x \to a} \frac{f(x)}{g(x)} = 0$$
If you recall comparing infinitesimals, this means that $f$ is of smaller order than $g$ at $a$.
We can use this notation to write $\sin x \sim x$, as
$$\sin x = x + o(x) \text{ for } x \to 0$$
This means
$$\lim\limits_{x \to 0} \frac{\sin x -x}{x} = 0$$
$$\lim\limits_{x \to 0} \frac{\sin x}{x}-1 = 0$$
$$\lim\limits_{x \to 0} \frac{\sin x}{x}=1 $$
Which I guess isn't exactly news.
So, how can we use this with Taylor's polynomial approximations?
Let $p_a^n(x)$ be the Taylor polynomial of order $n$ around $a$ of an $n$ times differentiable function. Then we can write
$$f(x)=p_a^n(x)+o((x-a)^n)$$
This means that the difference $f-p$ is of smaller order than $(x-a)^n$ for $x \to a$, or that the error is small in comparison to $(x-a)^n$.
As an example
$$\cos x = 1-\frac{x^2}{2}+o(x^2)$$
(That is why, when handling limits involving terms of degree at most two, we can substitute $\cos x$ with the given expression.)
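For example, this substitution settles limits such as $$\lim_{x\to 0}\frac{1-\cos x}{x^2} = \lim_{x\to 0}\frac{\frac{x^2}{2} + o(x^2)}{x^2} = \frac{1}{2} + \lim_{x\to 0}\frac{o(x^2)}{x^2} = \frac{1}{2},$$ where the $o(x^2)$ term contributes nothing in the limit.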
I hope this helps you, but remember, I'm not a professor like Arturo is. I welcome reviews or edits of this by anyone who feels they can improve the ideas or the wording.