Does compiling a program twice produce a bit-for-bit identical binary?

If I were to compile a program into a single binary, make a checksum, and then recompile it on the same machine with the same compiler and compiler settings and checksum the recompiled program, would the checksum fail?

If so, why is this? If not, would having a different CPU result in a non-identical binary?


Solution 1:

  1. Compile same program with same settings on same machine:

    Although the definitive answer is "it depends", it is reasonable to expect that most compilers will be deterministic most of the time, and that the binaries produced should be identical; indeed, some version control systems depend on this. Still, there are always exceptions: it is quite possible that some compiler somewhere will insert a timestamp or similar (if I recall correctly, Delphi does, for example). Or the build process itself might do so; I've seen makefiles for C programs which set a preprocessor macro to the current timestamp. (I suppose that would count as a different compiler setting, though.) A quick way to check this on your own toolchain is sketched after this list.

    Also, be aware that if you statically link the binary, then you are effectively incorporating the state of all relevant libraries on your machine, and any change in any one of those will also affect your binary. So it is not just compiler settings which are relevant.

  2. Compile same program on a different machine with a different CPU:

    Here, all bets are off. Most modern compilers can perform target-specific optimizations; if such options are enabled, the binaries are likely to differ unless the CPUs are similar, and differences are possible even then. Also, see the note above about static linking: the configuration environment goes far beyond the compiler settings. Unless you have very strict configuration control, it is extremely likely that something differs between the two machines.
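
As a quick sanity check for point 1, here is a minimal sketch, assuming gcc and sha256sum are installed; hello.c stands in for your own source file:

    # Compile the same source twice with identical settings...
    gcc -O2 -o hello1 hello.c
    gcc -O2 -o hello2 hello.c

    # ...and compare: identical hashes mean bit-for-bit identical binaries.
    sha256sum hello1 hello2
    cmp hello1 hello2 && echo identical || echo different

On a typical Linux/GCC setup this reports identical binaries; a difference usually points at embedded timestamps, build paths, or randomized symbol names, such as those addressed in Solution 2.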

Solution 2:

  • -frandom-seed=123 controls some of GCC's internal randomness; a combined sketch using several of the flags in this list follows below. man gcc says:

    This option provides a seed that GCC uses in place of random numbers in generating certain symbol names that have to be different in every compiled file. It is also used to place unique stamps in coverage data files and the object files that produce them. You can use the -frandom-seed option to produce reproducibly identical object files.

  • __FILE__: this macro expands to the path of the source file, so put the source in a fixed folder (e.g. /tmp/build) to keep the expansion stable

  • for __DATE__, __TIME__, __TIMESTAMP__:
    • libfaketime : https://github.com/wolfcw/libfaketime
    • override those macros with -D
    • -Wdate-time or -Werror=date-time: warn or fail if any of __TIME__, __DATE__ or __TIMESTAMP__ is used. The Linux kernel 4.4 uses it by default.
  • use the D (deterministic) modifier with ar, or use https://github.com/nh2/ar-timestamp-wiper/tree/master to wipe the timestamps afterwards
  • -fno-guess-branch-probability: older versions of the manual listed it as a source of non-determinism, but newer ones no longer do. It is unclear whether -frandom-seed already covers this.
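
Putting several of the points above together, here is a hedged sketch of a more deterministic compile; main.c, libfoo.a and the pinned date string are placeholders, not a definitive recipe:

    cd /tmp/build    # fixed directory, so __FILE__ expands to a stable path

    # Option A: make any use of time-dependent macros a hard error.
    gcc -c -frandom-seed=123 -Werror=date-time main.c

    # Option B: pin the macros to fixed values instead (the inner quotes
    # matter; this may trigger a -Wbuiltin-macro-redefined warning).
    gcc -c -frandom-seed=123 -D__DATE__='"Jan  1 1970"' -D__TIME__='"00:00:00"' main.c

    # Archive with ar's D (deterministic) modifier to zero out
    # timestamps, UIDs and GIDs in the archive members.
    ar rcD libfoo.a main.o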

The Debian Reproducible Builds project works to make Debian packages build byte-for-byte reproducibly, and recently received a Linux Foundation grant. That covers more than just compilation, but it should be of interest.

Buildroot has a BR2_REPRODUCIBLE option which may give some ideas at the package level, but it is far from complete at this point.
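
If you want to experiment with it, enabling the option is just a matter of setting the config symbol before building; the defconfig name below is hypothetical:

    # Sketch: enable reproducible builds in a Buildroot defconfig
    # ("my_board_defconfig" is a placeholder for your board's defconfig).
    echo 'BR2_REPRODUCIBLE=y' >> configs/my_board_defconfig
    make my_board_defconfig
    make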

Related threads:

  • https://stackoverflow.com/questions/14653874/deterministic-binary-output-with-g
  • https://www.quora.com/What-can-be-the-possible-reasons-for-the-object-code-of-an-unchanged-C-file-to-change-on-recompilation