Motivation for the Weierstrass approximation theorem?
Solution 1:
To see why we use Bernstein polynomials, first note that what we essentially want is to approximate the identity with polynomials. This is easier to understand if you are familiar with convolutions and with approximations of the Dirac delta $\displaystyle\delta(x)$. We want a sort of convolution, or weighted average, something like $(p_{n,k}\ast f)(x)=\sum_{j=0}^n p_{n,j}(x-y_j)f(y_j)$, of a sequence of polynomials $p_{n,k}$ of degree $n$ with a continuous function $f$, say on $[0,1]$, such that $p_{n,k}\ast f\to f$ as $n\to\infty$.
Being a weighted average, it should satisfy $(p_{n,k}\ast 1)(x)=\sum_{j=0}^n p_{n,j}(x-y_j)=1$. So we want to write $1$ as a sum of polynomials of degree $n$. The easiest way to do this is to notice that $1=1-x+x$, hence $\displaystyle 1^n=(1-x+x)^n=\sum_{k=0}^n {n\choose k}(1-x)^{n-k}x^k$. Now we want to insert $f$ in here somehow so that when $n$ is large, $(p_{n,k}\ast f)(x)$ is approximately equal to $f(x)$. If you plot $\displaystyle{n\choose k}(1-x)^{n-k}x^k$, you will see that it is largest when $\frac{k}{n}\approx x$, and that for large $n$, $\displaystyle{n\choose k}(1-x)^{n-k}x^k\approx 0$ when $\frac{k}{n}$ is far from $x$. So put $\displaystyle(p_{n,k}\ast f)(x)=\sum_{k=0}^n {n\choose k}(1-x)^{n-k}x^k f\Big(\frac{k}{n}\Big)$. For large $n$ the summands are $\approx 0$ when $\frac{k}{n}$ is far from $x$, so they hardly contribute to the sum, and in the summands that do contribute we have $f\big(\frac{k}{n}\big)\approx f(x)$ because $f$ is continuous. Hence for large $n$, $\displaystyle(p_{n,k}\ast f)(x)=\sum_{k=0}^n {n\choose k}(1-x)^{n-k}x^k f\Big(\frac{k}{n}\Big)\approx f(x)\sum_{k=0}^n {n\choose k}(1-x)^{n-k}x^k=f(x)$.
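The sum above is easy to try numerically. Here is a minimal sketch, assuming Python 3.8+ for `math.comb`; the names `bernstein` and `max_error` and the test function $|x-\frac12|$ are my own illustrative choices, not part of the argument. It evaluates the sum directly and measures the largest error on a grid for a few values of $n$.

```python
from math import comb

def bernstein(f, n, x):
    """Evaluate sum_k C(n,k) (1-x)^(n-k) x^k f(k/n) for x in [0, 1]."""
    return sum(comb(n, k) * (1 - x) ** (n - k) * x ** k * f(k / n)
               for k in range(n + 1))

def f(x):
    # Continuous on [0, 1] but not differentiable at 1/2 (illustrative choice).
    return abs(x - 0.5)

def max_error(n, grid=201):
    """Largest |bernstein(f, n, x) - f(x)| over an evenly spaced grid on [0, 1]."""
    points = [i / (grid - 1) for i in range(grid)]
    return max(abs(bernstein(f, n, x) - f(x)) for x in points)

# The error on the grid should shrink as n grows.
errors = [max_error(n) for n in (4, 16, 64, 256)]
for n, err in zip((4, 16, 64, 256), errors):
    print(n, err)
```

The grid only samples the error, so this illustrates the convergence rather than proving it.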
To elaborate on why, for large $n$, $\displaystyle{n\choose k}(1-x)^{n-k}x^k$ is largest when $\frac{k}{n}\approx x$: you may want to read about the random walk if you don't already know it, but I will explain. Let's say you are dizzy, or in some confused state (it's usually said that you are drunk), and you are taking steps left and right, with probability $x$ of going to the right and $1-x$ of going to the left at each step. If you take a total of $n$ steps, then the probability that exactly $k$ of them are to the right is $\displaystyle{n\choose k}(1-x)^{n-k}x^k$. Now $\frac{k}{n}$ is just the ratio of steps taken to the right to the total number of steps. The probability of taking $k$ steps to the right is highest when this ratio is about equal to the probability of taking a single step to the right, i.e. when $\frac{k}{n}\approx x$, because that is the fraction of steps you should expect to go to the right. When the ratio is far from the probability of going right, $\displaystyle{n\choose k}(1-x)^{n-k}x^k$ will be $\approx 0$. And as you increase $n$ to a very large number, it becomes increasingly unlikely that you deviate from what should happen. This is the content of the law of large numbers, so these approximations get better and better.
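The concentration described here can also be checked directly. The following is a small sketch under my own assumptions: the helper names `weights` and `tail_mass` are hypothetical, and $x=0.25$, $\delta=0.125$ are illustrative values chosen so that the comparisons are exact in floating point. It finds the $k$ with the largest weight and adds up the weights with $\frac{k}{n}$ farther than $\delta$ from $x$.

```python
from math import comb

def weights(n, x):
    """Binomial weights C(n,k) (1-x)^(n-k) x^k for k = 0..n."""
    return [comb(n, k) * (1 - x) ** (n - k) * x ** k for k in range(n + 1)]

def tail_mass(n, x, delta):
    """Total weight of the indices k with |k/n - x| > delta."""
    return sum(w for k, w in enumerate(weights(n, x)) if abs(k / n - x) > delta)

x, delta = 0.25, 0.125  # illustrative values, not from the answer above
rows = []
for n in (16, 64, 256):
    w = weights(n, x)
    mode = max(range(n + 1), key=w.__getitem__)  # index of the largest weight
    rows.append((n, mode / n, tail_mass(n, x, delta)))

for n, ratio, tail in rows:
    print(n, ratio, tail)
```

Chebyshev's inequality bounds the tail by $\frac{x(1-x)}{n\delta^2}$, which tends to $0$ as $n\to\infty$; this is one way to make the law of large numbers step quantitative.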
Solution 2:
Here's my humble opinion. We want the polynomials to be dense in $C([a,b])$, so we use the Bernstein polynomials to show this constructively. This surely isn't the only way, but it is one instance. We like this result because it gives us different information than Taylor's theorem, which states that a function with sufficiently many derivatives can be approximated locally by its Taylor polynomial. So Taylor's theorem gives an approximation near a single point; the Weierstrass approximation theorem, however, applies to a continuous function which may not even be differentiable, and states that we can get uniform convergence (i.e. there is a global polynomial approximation on the whole interval). It's especially nice for applied mathematicians because it allows us to say, "It's okay if my function is rough, I can approximate it as well as I wish on the entire interval with functions that have better properties, which I can analyze using a computer."
Solution 3:
Fix an interval $[a,b]$. A slightly famous problem asks for which Riemann integrable functions $f$ on $[a,b]$ the condition $$\int_a^b x^n f(x)\,dx=0$$ for each $n=0,1,2,\ldots$ implies that $f$ vanishes identically. The Weierstrass Approximation Theorem gives a positive answer for continuous functions: they belong to this collection. The Weierstrass Approximation Theorem is also used to prove the Stone--Kakutani theorem, which in turn gives the famous Stone-Weierstrass theorem, a generalization of Weierstrass' theorem.
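For the moment problem above, the argument for continuous $f$ can be sketched as follows (this is the standard argument as I understand it, written out for illustration). By linearity, the hypothesis gives $\int_a^b p(x)f(x)\,dx=0$ for every polynomial $p$. Given $\varepsilon>0$, the Weierstrass Approximation Theorem provides a polynomial $p$ with $\sup_{[a,b]}|f-p|<\varepsilon$, and then
$$\int_a^b f(x)^2\,dx=\int_a^b f(x)\big(f(x)-p(x)\big)\,dx+\int_a^b f(x)p(x)\,dx=\int_a^b f(x)\big(f(x)-p(x)\big)\,dx\le\varepsilon\int_a^b |f(x)|\,dx.$$
Since $\varepsilon$ is arbitrary, $\int_a^b f(x)^2\,dx=0$, and because $f$ is continuous this forces $f\equiv 0$ on $[a,b]$.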
As a personal comment, I don't think the proof using Bernstein polynomials is "artificial". On the contrary, I think it is quite sweet and interesting, and it is certainly motivated from a probabilistic point of view. The proof indeed has a probabilistic flavour, and it is relatively simple and more memorable in comparison to other proofs.