How do I prove two files are the same legally?

The technical issues are pretty straightforward. Using a combination of SHA and MD5 hashes is pretty typical in the forensics industry.
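For what it's worth, computing both digests is only a few lines. Here is a minimal sketch in Python, assuming the files are readable on disk; the function names `file_digests` and `same_content` are made up for illustration, and a real forensic workflow would use validated tooling and record the results somewhere auditable.

```python
import hashlib

def file_digests(path, chunk_size=65536):
    """Return (md5_hex, sha256_hex) for the file at `path`.

    The file is read in binary chunks so large files need not fit in memory.
    """
    md5 = hashlib.md5()
    sha256 = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            md5.update(chunk)
            sha256.update(chunk)
    return md5.hexdigest(), sha256.hexdigest()

def same_content(path_a, path_b):
    """True only if both the MD5 and the SHA-256 digests match."""
    return file_digests(path_a) == file_digests(path_b)
```

Matching digests only show the two files are byte-for-byte identical; a single changed character (a different copyright line, say) produces completely different hashes.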

If you're talking about text files that might have been modified (source code files, for example), then performing some type of structured "diff" would be pretty common. I can't cite cases, but there's definitely precedent out there regarding the "stolen" file being a derivative work of the "original".
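As a rough illustration of what a structured diff looks like, here is a sketch using Python's standard `difflib` module. The helper name `unified_diff_text` and the labels "original" and "suspect" are my own placeholders, not anything a court or forensic tool prescribes.

```python
import difflib

def unified_diff_text(original_text, suspect_text,
                      original_name="original", suspect_name="suspect"):
    """Return a unified diff of two text blobs as one string.

    An empty string means the two texts are line-for-line identical.
    """
    a = original_text.splitlines(keepends=True)
    b = suspect_text.splitlines(keepends=True)
    return "".join(
        difflib.unified_diff(a, b, fromfile=original_name, tofile=suspect_name)
    )
```

The output lists only the changed lines with a little context around them, which makes it easy for a non-technical reader to see how small the differences are.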

Chain-of-custody issues are a LOT more of a worry to you than proving that the files match. I'd talk to your attorney about what they're looking for, and I'd strongly consider getting in touch with an attorney experienced with this type of litigation, or a computer forensics professional, for advice on the best way to proceed so that you don't blow your case.

If you actually received a copy of the files, I hope you did a good job of maintaining a chain of custody. If I were the opposing counsel, I'd argue that you received the CD and used it as the source material to produce the "original" files that were "stolen". I'd have kept that CD of "copied" files far, far away from the "originals" and had an independent party perform "diffs" of the files.


Typically your attorney should already have a lot of this under control.

To prove the files are the same, compare cryptographic hashes such as MD5, ideally alongside a SHA-family hash, since MD5 has known collision weaknesses. But even more than that, you need to prove chain of custody using auditable trails. If someone else has had the files in their custody, then you will have a hard time proving in court that the evidence wasn't 'planted'.

There are electronic evidence and forensics companies that deal specifically with this issue. Depending on how serious your company is about this case, you need to hire a lawyer who has knowledge in this area and can refer you to a firm that can assist you through this process.


An important question is how you log access to your firm's files, and how you manage version control over your firm's files.

As for the files themselves, you want to use a tool like diff rather than a tool like md5, because you want to demonstrate that the files are the "same" except that one has one copyright notice at the start and the other has a different one.
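To make that concrete, here is a small Python sketch that ignores the leading comment block and compares everything after it. It assumes the copyright notice is a run of `#` or `//` comment lines (plus blank lines) at the very top of the file; the function names and the prefix list are hypothetical and would need adjusting for other comment styles.

```python
def strip_leading_header(text, comment_prefixes=("#", "//")):
    """Drop leading blank lines and comment lines; return the remaining lines.

    Assumption: the copyright notice is made up of lines that start with one
    of `comment_prefixes`. Anything after the first non-comment line is kept.
    """
    lines = text.splitlines()
    i = 0
    while i < len(lines) and (
        not lines[i].strip() or lines[i].lstrip().startswith(comment_prefixes)
    ):
        i += 1
    return lines[i:]

def same_apart_from_header(text_a, text_b, comment_prefixes=("#", "//")):
    """True if the two texts are identical once their leading headers are removed."""
    return (strip_leading_header(text_a, comment_prefixes)
            == strip_leading_header(text_b, comment_prefixes))
```

If this returns True while the whole-file hashes differ, the only differences are in the header, which is the point you'd be trying to make.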

Ideally you can demonstrate exactly where the files in question came from, when they would have been copied from your environment, who had access to those files at the time, and who made copies of them.