How to calculate the entropy of a file?

How to calculate the entropy of a file? (Or let's just say a bunch of bytes)
I have an idea, but I'm not sure that it's mathematically correct.

My idea is the following:

  • Create an array of 256 integers (all zeros).
  • Traverse through the file and for each of its bytes,
    increment the corresponding position in the array.
  • At the end: Calculate the "average" value for the array.
  • Initialize a counter with zero,
    and for each of the array's entries:
    add that entry's difference from the "average" to the counter
    (rough sketch below).
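
Roughly, here is what I mean as a Python sketch (the file name and the variable names are just my own placeholders):

counts = [0] * 256
with open("some_file.bin", "rb") as f:
    data = f.read()

for b in data:
    counts[b] += 1                            # histogram of byte values

avg = sum(counts) / 256.0                     # "average" count per byte value
deviation = sum(abs(c - avg) for c in counts)
# ...and this is where I'm stuck: how do I map `deviation` onto 0.0 .. 1.0?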

Well, now I'm stuck. How do I "project" the counter result so that all results lie between 0.0 and 1.0? But I'm sure the idea is inconsistent anyway...

I hope someone has a better and simpler solution.

Note: I need the whole thing to make assumptions about the file's contents
(plaintext, markup, compressed, or some binary format, ...).


Solution 1:

  • At the end: Calculate the "average" value for the array.
  • Initialize a counter with zero, and for each of the array's entries: add that entry's difference from the "average" to the counter.

With some modifications you can get Shannon's entropy:

rename "average" to "entropy"

(float) entropy = 0
for i from 0 to 255 do
    (float) p = Counts[i] / filesize
    if (p > 0) entropy = entropy - p * lg(p)   // lg is the base-2 logarithm

Edit: As Wesley mentioned, we must divide the entropy by 8 to bring it into the range 0..1 (or, alternatively, use logarithm base 256).
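
For reference, a runnable version of the above in Python (a minimal sketch; the function name and file path are placeholders of my own):

import math

def shannon_entropy(path):
    """Shannon entropy of a file, normalized to the range 0..1."""
    with open(path, "rb") as f:
        data = f.read()
    filesize = len(data)
    if filesize == 0:
        return 0.0

    counts = [0] * 256
    for byte in data:
        counts[byte] += 1

    entropy = 0.0
    for count in counts:
        if count > 0:
            p = count / filesize
            entropy -= p * math.log2(p)   # entropy in bits per byte, 0..8
    return entropy / 8                    # scale to 0..1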

Solution 2:

A simpler solution: gzip the file. Use the ratio of file sizes, (size-of-gzipped)/(size-of-original), as a measure of randomness (i.e. entropy).

This method doesn't give you the exact absolute value of entropy (because gzip is not an "ideal" compressor), but it's good enough if you need to compare the entropy of different sources.
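
A minimal sketch of that idea in Python (assuming the file fits in memory; note that gzip's fixed header/trailer overhead can push the ratio above 1 for very small files):

import gzip

def gzip_ratio(path):
    """Compressed-to-original size ratio as a rough entropy estimate (~0..1)."""
    with open(path, "rb") as f:
        data = f.read()
    if not data:
        return 0.0
    compressed = gzip.compress(data, compresslevel=9)
    return len(compressed) / len(data)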

Solution 3:

To calculate the information entropy of a collection of bytes, you'll need to do something similar to tydok's answer. (tydok's answer works on a collection of bits.)

The following variables are assumed to already exist:

  • byte_counts is a 256-element list of the number of bytes with each value in your file. For example, byte_counts[2] is the number of bytes that have the value 2.

  • total is the total number of bytes in your file.

I'll write the following code in Python, but it should be obvious what's going on.

import math

entropy = 0

for count in byte_counts:
    # If no bytes of this value were seen in the file, it doesn't affect
    # the entropy of the file.
    if count == 0:
        continue
    # p is the probability of seeing this byte in the file, as a floating-
    # point number
    p = 1.0 * count / total
    entropy -= p * math.log(p, 256)

There are several things that are important to note.

  • The check for count == 0 is not just an optimization. If count == 0, then p == 0, and log(p) will be undefined ("negative infinity"), causing an error.

  • The 256 in the call to math.log represents the number of discrete values that are possible. A byte composed of eight bits will have 256 possible values.

The resulting value will range from 0 (every single byte in the file is the same) up to 1 (the bytes are evenly divided among every possible value of a byte).
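
For completeness, here is one way to build byte_counts and total from a file (a sketch; the path is a placeholder):

with open("example.bin", "rb") as f:
    data = f.read()

total = len(data)
byte_counts = [0] * 256
for byte in data:            # iterating over bytes yields ints 0..255 in Python 3
    byte_counts[byte] += 1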


An explanation for the use of log base 256

It is true that this algorithm is usually applied using log base 2. This gives the resulting answer in bits. In that case, you have a maximum of 8 bits of entropy for any given file. Try it yourself: maximize the entropy of the input by setting every element of byte_counts to the same value (all 1s, all 2s, or all 100s). When the bytes of a file are evenly distributed, you'll find that there is an entropy of 8 bits.

It is possible to use other logarithm bases. Using b=2 gives a result in bits, as each bit can have 2 values. Using b=10 puts the result in dits (decimal digits), as there are 10 possible values for each dit. Using b=256 will give the result in bytes, as each byte can have one of 256 discrete values.

Interestingly, using log identities, you can work out how to convert the resulting entropy between units. Any result obtained in units of bits can be converted to units of bytes by dividing by 8. As an interesting, intentional side-effect, this gives the entropy as a value between 0 and 1.
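
A quick check of both points (an even distribution gives 8 bits, and dividing by 8 matches the base-256 result):

import math

byte_counts = [100] * 256                 # perfectly even distribution
total = sum(byte_counts)

bits = -sum((c / total) * math.log(c / total, 2) for c in byte_counts if c)
in_bytes = -sum((c / total) * math.log(c / total, 256) for c in byte_counts if c)

print(bits)        # ~8.0
print(bits / 8)    # ~1.0
print(in_bytes)    # ~1.0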

In summary:

  • You can use various units to express entropy
  • Most people express entropy in bits (b=2)
    • For a collection of bytes, this gives a maximum entropy of 8 bits
    • Since the asker wants a result between 0 and 1, divide this result by 8 for a meaningful value
  • The algorithm above calculates entropy in bytes (b=256)
    • This is equivalent to (entropy in bits) / 8
    • This already gives a value between 0 and 1