Solution 1:

Maybe this simple example will help. I use it when I teach conditional expectation.

(1) The first step is to think of ${\mathbb E}(X)$ in a new way: as the best estimate for the value of a random variable $X$ in the absence of any information. To minimize the mean squared error $${\mathbb E}[(X-e)^2]={\mathbb E}[X^2-2eX+e^2]={\mathbb E}(X^2)-2e\,{\mathbb E}(X)+e^2,$$ we differentiate with respect to $e$ to obtain $2e-2\,{\mathbb E}(X)$, which is zero at $e={\mathbb E}(X)$; since the second derivative is $2>0$, this critical point is indeed the minimizer.

For example, if I throw a fair die and you have to estimate its value $X$, according to the analysis above, your best bet is to guess ${\mathbb E}(X)=3.5$. On specific rolls of the die, this will be an over-estimate or an under-estimate, but in the long run it minimizes the mean square error.
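If you want to see this numerically, here is a minimal simulation sketch (not part of the original argument; the candidate guesses and sample size are arbitrary) estimating the mean squared error of a few guesses, with $3.5$ coming out best:

```python
import random

# Simulate many rolls of a fair die and compare the mean squared error
# of a few candidate guesses; the guess E(X) = 3.5 should be smallest.
random.seed(0)
rolls = [random.randint(1, 6) for _ in range(100_000)]

for guess in (3.0, 3.5, 4.0):
    mse = sum((x - guess) ** 2 for x in rolls) / len(rolls)
    print(f"guess {guess}: MSE ~ {mse:.3f}")

# Roughly 3.17, 2.92, 3.17 -- minimized at the mean 3.5.
```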

(2) What happens if you do have additional information? Suppose that I tell you that $X$ is an even number. How should you modify your estimate to take this new information into account?

The mental process may go something like this: "Hmmm, the possible values were $\lbrace 1,2,3,4,5,6\rbrace$ but we have eliminated $1,3$ and $5$, so the remaining possibilities are $\lbrace 2,4,6\rbrace$. Since I have no other information, they should be considered equally likely and hence the revised expectation is $(2+4+6)/3=4$".

Similarly, if I were to tell you that $X$ is odd, your revised (conditional) expectation is 3.

(3) Now imagine that I will roll the die and I will tell you the parity of $X$; that is, I will tell you whether the die comes up odd or even. You should now see that a single numerical response cannot cover both cases. You would respond "3" if I tell you "$X$ is odd", while you would respond "4" if I tell you "$X$ is even". A single numerical response is not enough because the particular piece of information that I will give you is itself random. In fact, your response is necessarily a function of this particular piece of information. Mathematically, this is reflected in the requirement that ${\mathbb E}(X\ |\ {\cal F})$ must be $\cal F$-measurable.
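To make the "your response is a function of the information" point concrete, here is a small sketch (my own illustration, not part of the original answer): the conditional expectation given the parity takes the value $3$ on odd outcomes and $4$ on even outcomes, and averaging it over the randomness of the parity recovers ${\mathbb E}(X)=3.5$ (the tower property).

```python
import random

# E(X | parity) is itself a random variable: a function of the (random)
# piece of information that will be revealed, namely the parity of the roll.
cond_exp = {"odd": 3.0, "even": 4.0}  # (1+3+5)/3 and (2+4+6)/3, as computed above

random.seed(1)
rolls = [random.randint(1, 6) for _ in range(100_000)]
values = [cond_exp["even" if x % 2 == 0 else "odd"] for x in rolls]

# Averaging E(X | parity) over the randomness of the parity gives back
# the unconditional expectation E(X) = 3.5.
print(sum(values) / len(values))  # ~ 3.5
```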

I think this covers point 1 in your question, and tells you why a single real number is not sufficient. Concerning point 2, you are correct: $\cal F$ in ${\mathbb E}(X\ |\ {\cal F})$ does not represent a single piece of information, but rather describes which specific pieces of (random) information may occur.

Solution 2:

I think a good way to answer question 2 is as follows.

I am performing an experiment, whose outcome can be described by an element $\omega$ of some set $\Omega$. I am not going to tell you the outcome, but I will allow you to ask certain yes/no questions about it. (This is like "20 questions", but infinite sequences of questions will be allowed, so it's really "$\aleph_0$ questions".) We can associate a yes/no question with the set $A \subset \Omega$ of outcomes for which the answer is "yes".

Now, one way to describe some collection of "information" is to consider all the questions which could be answered with that information. (For example, the 2010 Encyclopedia Britannica is a collection of information; it can answer the questions "Is the dodo extinct?" and "Is the elephant extinct?" but not the question "Did Justin Bieber win a 2011 Grammy?") This, then, would be a set $\mathcal{F} \subset 2^\Omega$.

If I know the answer to a question $A$, then I also know the answer to its negation, which corresponds to the set $A^c$ (e.g. "Is the dodo not-extinct?"). So any information that is enough to answer question $A$ is also enough to answer question $A^c$. Thus $\mathcal{F}$ should be closed under taking complements. Likewise, if I know the answers to questions $A$ and $B$, I also know the answer to their disjunction $A \cup B$ ("Are either the dodo or the elephant extinct?"), so $\mathcal{F}$ must also be closed under (finite) unions. Countable unions require more of a stretch, but imagine asking an infinite sequence of questions "converging" on a final question. ("Can elephants live to be 90? Can they live to be 99? Can they live to be 99.9?" In the end, I know whether elephants can live to be 100.)
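For what it's worth, here is a tiny sketch (my own, using $\Omega=\lbrace 1,\dots,6\rbrace$ and the parity information from Solution 1) checking these closure properties for the collection of questions answerable from the parity alone:

```python
# Omega = outcomes of a die; knowing the parity answers exactly those
# questions (subsets) that are unions of the two parity classes.
omega = frozenset(range(1, 7))
odd, even = frozenset({1, 3, 5}), frozenset({2, 4, 6})

# The sigma-algebra generated by the partition {odd, even}.
F = {frozenset(), odd, even, omega}

# Closed under complements ...
assert all(omega - A in F for A in F)
# ... and under (finite) unions.
assert all(A | B in F for A in F for B in F)

# A question such as "is the outcome at most 3?" cannot be answered
# from the parity alone, so it is not in F.
assert frozenset({1, 2, 3}) not in F
```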

I think this gives some insight into why a $\sigma$-field can be thought of as a collection of information.

Solution 3:

An example. Suppose that $X \sim {\rm binomial}(m,p)$ and $Y \sim {\rm binomial}(n,p)$ are independent ($0 < p < 1$). For any integer $0 \leq s \leq m+n$, we have $$ {\rm E}[X|X + Y = s] = \frac{m}{m + n}\,s. $$ This means that $$ {\rm E}[X|X + Y] = \frac{m}{m + n}(X+Y). $$ Note that ${\rm E}[X|X + Y]$ is a random variable which is a function of $X+Y$.
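A quick way to convince yourself of this identity is by simulation; the following sketch (my own, with arbitrary parameters $m=4$, $n=6$, $p=0.3$) compares the empirical conditional mean with $\frac{m}{m+n}s$ for a few values of $s$:

```python
import random

# Monte Carlo check of E[X | X + Y = s] = m/(m+n) * s for independent
# X ~ Binomial(m, p) and Y ~ Binomial(n, p).  Parameters are arbitrary.
m, n, p = 4, 6, 0.3
random.seed(2)

def binomial(k, p):
    """One draw from Binomial(k, p) as a sum of k Bernoulli(p) trials."""
    return sum(random.random() < p for _ in range(k))

samples = [(binomial(m, p), binomial(n, p)) for _ in range(200_000)]

for s in (2, 4, 6):
    xs = [x for x, y in samples if x + y == s]
    print(s, sum(xs) / len(xs), m / (m + n) * s)  # empirical vs. theoretical
```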

Note that, in general, the conditional expectation of $X$ given $Z$, denoted ${\rm E}[X|Z]$, is defined as ${\rm E}[X|\sigma(Z)]$, where $\sigma(Z)$ is the $\sigma$-algebra generated by $Z$.

EDIT. In response to the OP's request, I note that the binomial distribution (which is discrete) plays no special role in the above example. For completely analogous results for the normal and gamma distributions (both are continuous) see this and this, respectively; for a substantial generalization, see this.