Have you read about Maxwell's demon?

Maxwell realized that if it were possible to track the trajectories of moving particles, and to open and close a shutter at just the right times, it would be possible to locally raise a system's temperature without doing any work on the system. The idea would be to selectively allow the hottest particles to move into a chamber, leaving the cooler ones behind.
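
A purely illustrative toy of that sorting step (made-up units, no collisions, no actual dynamics -- just the demon's selection rule) is enough to show a hot chamber and a cold one appearing:

    import random, statistics

    random.seed(0)

    # Toy gas: speeds built from three Gaussian velocity components,
    # so the speed distribution is roughly Maxwell-Boltzmann shaped.
    def speed():
        return sum(random.gauss(0, 1) ** 2 for _ in range(3)) ** 0.5

    speeds = [speed() for _ in range(100_000)]
    threshold = statistics.median(speeds)

    # The demon opens the shutter only for fast particles: they collect
    # in chamber A, the slow ones stay behind in chamber B.
    chamber_a = [v for v in speeds if v > threshold]
    chamber_b = [v for v in speeds if v <= threshold]

    # Temperature is proportional to mean kinetic energy (unit mass here).
    def temperature(vs):
        return statistics.mean(0.5 * v * v for v in vs)

    print(f"T_A ~ {temperature(chamber_a):.2f}, T_B ~ {temperature(chamber_b):.2f}")

The point of the toy is only that sorting by speed creates a temperature difference for free, which is exactly what the second law forbids.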

This would obviously break the second law of thermodynamics.

The resolution (usually credited to Szilárd, Landauer, and Bennett rather than to Maxwell himself) is that handling the information takes work. Even though the demon never has to touch any of the particles, acquiring, storing, and eventually erasing the measurement results carries an unavoidable energy cost, so information is effectively another form of energy.

This has been experimentally verified; see http://en.wikipedia.org/wiki/Landauer%27s_principle, for example. Information is a conserved quantity. There is nothing hand-wavy about this. Indeed, it is of foundational physical importance.
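
For a sense of scale (my numbers, not from the link): Landauer's principle says erasing one bit must dissipate at least k_B * T * ln 2 of heat, which at room temperature works out to roughly 3e-21 joules:

    import math

    k_B = 1.380649e-23  # Boltzmann constant, J/K
    T = 300.0           # room temperature, K

    # Landauer bound: minimum heat dissipated to erase one bit of information.
    E_bit = k_B * T * math.log(2)
    print(f"{E_bit:.3e} J per bit")   # ~2.9e-21 J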

Now, having at least informally "established" that information is a physical quantity, consider that every computable[1] probability distribution can be encoded as a sequence of 0's and 1's in a minimal sense (with respect to its Kolmogorov complexity), and that this sequence can be transmitted. At that point, the abstract, non-physical probability distribution has been transformed into a physical entity (a message) with bona fide physical information content. Since this information content is a conserved quantity, it seems fair to say that it is a property of the abstract, non-physical probability distribution. After all, up to a scalar constant, the entropy is a property of the message, not the encoding.
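
A small sketch of that last sentence (my own toy example, nothing formal): relabeling the symbols of a distribution leaves its Shannon entropy unchanged, and switching the logarithm base only rescales it by the scalar constant mentioned above:

    import math

    def entropy(dist, base=2):
        """Shannon entropy of a distribution given as {symbol: probability}."""
        return -sum(p * math.log(p, base) for p in dist.values() if p > 0)

    p = {"a": 0.5, "b": 0.25, "c": 0.25}
    relabeled = {"x": 0.5, "y": 0.25, "z": 0.25}   # same distribution, new symbols

    print(entropy(p))                  # 1.5 bits
    print(entropy(relabeled))          # 1.5 bits -- the labels don't matter
    print(entropy(p, base=math.e))     # same quantity in nats, i.e. scaled by ln 2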

'Entropy' is overloaded in a way that makes it confusing, especially with respect to physical versus information entropy. So let's take a step back and note that "information entropy" is the average information content per sample, where a sample is a message, a random variable, or the result of an experiment. In each of these, there is an "unknown" quantity, and an observation that reveals aspects of it. The higher the unknown quantity's entropy, the more informative an observation is, insofar as the sample is 'representative' of the whole. We can call this the outside-the-box view: we are trying to estimate what the inside of the box is like by taking a limited number of samples and analyzing them under "steady state" assumptions.
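
Here is a minimal sketch of the outside-the-box view, with a made-up "box" (a loaded die we can only sample): the entropy estimate built from a handful of draws approaches the true value as the samples become more representative of the hidden structure:

    import math, random
    from collections import Counter

    random.seed(1)

    # The "inside of the box": a loaded die we never observe directly.
    hidden = {1: 0.4, 2: 0.3, 3: 0.1, 4: 0.1, 5: 0.05, 6: 0.05}

    def entropy(probs):
        return -sum(p * math.log2(p) for p in probs if p > 0)

    def draw(n):
        faces, weights = zip(*hidden.items())
        return random.choices(faces, weights=weights, k=n)

    # Outside the box: all we get are samples, from which we estimate
    # the structure and its entropy ("steady state" assumed).
    for n in (10, 100, 10_000):
        counts = Counter(draw(n))
        estimate = entropy(c / n for c in counts.values())
        print(f"n={n:>6}: estimated entropy = {estimate:.3f} bits")

    print(f"true entropy = {entropy(hidden.values()):.3f} bits")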

And this is partly why information entropy is a decent enough measure of aggregate information. For a channel, what we really want to know is how many messages we can send in parallel, and to figure that out we need to know how much information each message contains, on average. Throw in some operations research, and boom, you have the tools to manage a modern telecommunications infrastructure.
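
A back-of-the-envelope version of that channel calculation, with entirely made-up numbers: the source entropy gives the average bits per message, and dividing the channel's bit rate by it bounds the message throughput:

    import math

    def entropy_bits(probs):
        return -sum(p * math.log2(p) for p in probs if p > 0)

    # Hypothetical message source: four message types with these frequencies.
    message_probs = [0.5, 0.25, 0.125, 0.125]
    bits_per_message = entropy_bits(message_probs)     # 1.75 bits on average

    channel_rate = 1_000_000  # bits per second (made-up figure)

    # Shannon's source coding theorem: you can't compress below the entropy,
    # so this is the ceiling on how many messages the channel can carry.
    print(f"~{channel_rate / bits_per_message:,.0f} messages/second max")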

On the other hand, we have the in-the-box view -- the claim that entropy is a property of a system. I suggest we broaden this, in analogy with the outside-the-box view: in particular, to probability distributions, and ... well, I'm not quite sure how to phrase this ... experimental subjects. In any case, what I really mean is whatever yields the data observed by scientists, engineers, people on the other side of the internet, etc. These "subjects" are the unknown quantities people are trying to estimate.

These systems have an internal structure, and part of that structure substantively relates to the data they emit. The physical law inside the box is that entropy increases: energy and information are "lost" as heat and signal noise. Energy and information are still conserved quantities: this particular microstate could only have occurred because of the initial microstate, so, in principle (and ignoring quantum mechanics), if we observe the microstate in sufficient detail, we can work backwards to recover the initial condition. But if we want to estimate the initial microstate from outside the box, we are very much stuck.
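
To make the "recoverable in principle, stuck from outside" point concrete, here is a hedged toy (a random permutation of abstract microstates standing in for real dynamics): the microscopic evolution is exactly invertible, but a coarse-grained observation is compatible with many different initial microstates:

    import random

    random.seed(2)

    N = 16                      # number of microstates
    states = list(range(N))

    # "Physics" inside the box: a fixed, invertible map on microstates.
    forward = states[:]
    random.shuffle(forward)
    backward = {forward[s]: s for s in states}   # exact inverse exists

    def evolve(s, steps):
        for _ in range(steps):
            s = forward[s]
        return s

    def rewind(s, steps):
        for _ in range(steps):
            s = backward[s]
        return s

    initial = 5
    later = evolve(initial, steps=1000)

    # Knowing the full microstate and the dynamics, nothing is lost:
    assert rewind(later, steps=1000) == initial

    # But an outside observer only sees a coarse-grained macrostate,
    # and several microstates share it -- the initial condition is ambiguous.
    macro = later % 4
    compatible = [s for s in states if evolve(s, 1000) % 4 == macro]
    print(f"microstates consistent with the macro observation: {len(compatible)}")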

Probability distributions also fall under this rubric. (I am specifically ignoring stochastic processes insofar as they are "time-dependent", and treating distributions as time-less, i.e., eternal, formal objects.) Their structure does not change over time, and their entropy is constant (since the structure is fixed).

At a high level, the only difference between the physical and formal structures is time. In both cases, the structure is encoded by information, and that structure is revealed by sampling. The physical system's structure decays, and the more it decays, the more representative of the whole any given sample will be.

Now, one might notice that as entropy increases in the box, information is effectively lost inside the box, yet at the same time each sample becomes more informative. These facts are clearly related, but I'm not in a position to estimate bounds or make a strong claim other than "look, that's neat". Jensen's inequality comes to mind.
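
One place Jensen's inequality does turn up in this sample-versus-box picture (a tangent, not a claim about the above): Shannon entropy is concave, so the naive plug-in estimate of entropy from a finite sample is biased low on average, E[H(p_hat)] <= H(p). A quick numerical check, assuming a fair coin and a small number of flips per estimate:

    import math, random
    from statistics import mean

    random.seed(3)

    def entropy(probs):
        return -sum(p * math.log2(p) for p in probs if p > 0)

    p_heads, n_flips, n_trials = 0.5, 10, 20_000

    def plugin_estimate():
        heads = sum(random.random() < p_heads for _ in range(n_flips))
        return entropy([heads / n_flips, 1 - heads / n_flips])

    # Entropy is concave, so by Jensen's inequality the plug-in estimator
    # sits below the true entropy on average.
    print(f"true entropy:      {entropy([p_heads, 1 - p_heads]):.3f} bits")
    print(f"mean of estimates: {mean(plugin_estimate() for _ in range(n_trials)):.3f} bits")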

[1]: Every probability distribution can be encoded as a tree of 0's and 1's, even if the distribution isn't computable.