Automatically 'brute force' a few bytes to recover a corrupt file

Solution 1:

Here's a small Python program which does what you seem to be describing.

#!/usr/bin/env python3
from hashlib import sha1

with open('binaryfile', 'rb') as bin:
    binary = bin.read()

base = 0x0078
# ... is not valid Python; add more sequences, or take it out (or see below)
for seq in [[0xCA, 0xC5, 0xC5, 0x8A], [0xCB, 0xC5, 0xC5, 0x8A], ...]:
    copy = binary[0:base]
    copy += bytes(seq)
    copy += binary[base+len(seq):]
    if sha1(copy).hexdigest() == '9968733ce3ff0893bbb0a19e75faaf2fb0000e19':
        print('success with bytes {0}'.format(seq))
        break
else:
    print('no success')

UnOnly briefly tested; please ping me if you find typos.

The base specifies where to try to apply the four bytes, and the long string '996873... is the hex representation of the expected SHA1. The line for seq in... defines the bytes to try; and of course replace 'binaryfile' with the path to the file you want to attempt to salvage.

You can replace the literal list [[0xCA, 0xC5,...]] with something to actually loop over all possible values but it's basically just a placeholder for something more useful because I'm not really sure what exactly you want there.

Something like for seq in itertools.product(range(256), repeat=4)): will loop over all possible values from 0 to 232-1. (You will need to add import itertools near the top then.) Or perhaps you could simply add an offset; update the script to replace the current for seq in with the following (where again the import needs to go before the main program);

import struct

for n in range(2**32):
    val=(n+0x8AC5C5CA) % 2**32  # notice reverse order
    seq=list(reversed(struct.pack(">I", val)))
    copy = ...

I reversed the order of the bytes so that it naturally increments from 0x8AC5C5CA to 0x8AC5C5CB but then the next increment will be 0x8AC5C5CC etc. The struct magic is to convert this to a sequence of bytes (had to look it up from https://stackoverflow.com/a/26920983/874188). This will start at 0x8AC5C5CA and go to 0xFFFFFFFF, then wrap around to 0x00000000 and climb back up to 0x8AC5C5C9.

If you have multiple candidate ranges you would like to examine in a particular order, maybe something like

for rge in [(0x8AC5C5CA, 0x8AFFFFFF), (0x00C6C58A, 0x00FFFFFF),
        (0x00000000, 0x00C6C589), (0x01000000, 0x8AC5C5C9)]:
    for val in range(*rge):
        seq=list(reversed(struct.pack(">I", val)))
        copy = ...

but then you'll need to make sure yourself that the (start, end) pairs in rge cover all of the space between 0x00000000 and 0xFFFFFFFF if you really want to examine all of it. (And again, notice that the range increments the last byte and that seq applies the bytes of the value in reverse, in accordance with your stated requirements.)

If you wanted to use two different base addresses, you quickly run up against the limits of what's feasible to do in your lifetime with brute force; but you could, for example, split the 4-byte number into two 2-byte parts and apply those at different offsets.

base1 = 0x1234
base2 = 0x2345

for seq in range(whatever):
    copy = binary[0:base1]
    copy += bytes(seq[0:1])
    copy += binary[base1+2:base1+base2]
    copy += bytes(seq[2:3])
    copy += binary[base2+2:]

Solution 2:

No, no, no and again NO!

Seldom the answer you get is not what you expect.

Some questions for you:

  • Is it possible that an expert doesn't know that it is possible to brute force a string of for bytes and iteratively try the SHA-1 until it converges? No
  • Is it possible he forget it ? No
  • Is it possible that you cannot do it on a rar file? No
  • Is the other answer wrong? absolutely NO

So what? ... Time.

The point is that you have to change so few bytes... only 4!

What does it means? 2564 that is 256x256x256x256 possibilities, a really really big number.
If your computer were able to process 1 operation per second (substitution in the file + sha1)...
you should wait more then 136 years, or if you prefer more then 49710 days.

You're enough lucky, a 5MB pre-cached file (already loaded in ram and in the cache) asks only about 0.03 seconds (min 0.025s), on an old computer. That shrinks your expecting time down to 1242-1492 days (something more then 3 years).

It is true, BTW, that statistically you should have a positive answer in half of the time. Nonetheless you should wait till you will have tried all the possibilities to be sure that there is only 1 substitution that will give you the same SHA-1 checksum...

Now that IMPOSSIBLE sounds as "not possible in a WORTHWHILE amount of time".


How to proceed

A more proper answer to your technical question: when you speak about brute force it have not to be necessary blind brute force.

  • It is just stated in a comment in the other answer that you do not need to calculate the sha1 checksum on the part before the corruption. You do the 1st time and you save time for each successive iteration (maybe a factor 2 it depends from the position).

  • Something that can change the worthless of the effort is to write a parallel code that will run on the GPU. If you have a good graphic card you may have around 1000 cores that can compute for you in parallel (even more but they have a frequency lower than the cpu, but still they are a lot). If you are able to decrease the time from 1400 to 1.4 days maybe you can even do it.

  • A different approach can lead you to a faster solution.
    You said it is a rar file. The rar file structure is divided into blocks. If you take count of it you can see where the corruption falls. If it is on the part of the data, on the part of the headers or on both. Then you can act consequently. For sake of simplicity let's suppose it is over the data:
    you can do the brute force of your offset, check for each positive CRC of that block if it is even positive the SHA1 on the whole file. Again you can do a parallel code.

Final note

If they were 6 bytes instead of 4 you were out of the game with the present technology.