Git is moving to a new hashing algorithm, SHA-256, but why did the Git community settle on SHA-256?

I presented that move in "Why doesn't Git use more modern SHA?" in Aug. 2018.

The reasons were discussed here by brian m. carlson:

I've implemented and tested the following algorithms, all of which are 256-bit (in alphabetical order):

  • BLAKE2b (libb2)
  • BLAKE2bp (libb2)
  • KangarooTwelve (imported from the Keccak Code Package)
  • SHA-256 (OpenSSL)
  • SHA-512/256 (OpenSSL)
  • SHA3-256 (OpenSSL)
  • SHAKE128 (OpenSSL)

I also rejected some other candidates.
I couldn't find any reference or implementation of SHA256×16, so I didn't implement it.
I didn't consider SHAKE256 because it is nearly identical to SHA3-256 in almost all characteristics (including performance).

SHA-256 and SHA-512/256

These are the 32-bit and 64-bit SHA-2 algorithms that are 256 bits in size.

I noted the following benefits:

  • Both algorithms are well known and heavily analyzed.
  • Both algorithms provide 256-bit preimage resistance.

Summary

The algorithms with the greatest implementation availability are SHA-256, SHA3-256, BLAKE2b, and SHAKE128.

In terms of command-line availability, BLAKE2b, SHA-256, SHA-512/256, and SHA3-256 should be available in the near future on a reasonably small Debian, Ubuntu, or Fedora install.

As far as security, the most conservative choices appear to be SHA-256, SHA-512/256, and SHA3-256.

The performance winners are BLAKE2b unaccelerated and SHA-256 accelerated.

The suggested conclusion was based on:

Popularity

Other things being equal we should be biased towards whatever's in the widest use & recommended for new projects.

Hardware acceleration

The only widely deployed HW acceleration is for SHA-1 and SHA-256 (from the SHA-2 family), but notably nothing from the newer SHA-3 family (released in 2015).

Age

Similar to "popularity" it seems better to bias things towards a hash that's been out there for a while, i.e. it would be too early to pick SHA-3.

The hash transitioning plan, once implemented, also makes it easier to switch to something else in the future, so we shouldn't be in a rush to pick some newer hash because we'll need to keep it forever; we can always do another transition in another 10-15 years.

Result: commit 0ed8d8d, Git v2.19.0-rc0, Aug 4, 2018.

SHA-256 has a number of advantages:

  • It has been around for a while, is widely used, and is supported by just about every single crypto library (OpenSSL, mbedTLS, CryptoNG, SecureTransport, etc).

  • When you compare against SHA1DC, most vectorized SHA-256 implementations are indeed faster, even without acceleration.

  • If we're doing signatures with OpenPGP (or even, I suppose, CMS), we're going to be using SHA-2, so it doesn't make sense to have our security depend on two separate algorithms, either of which alone could break the security, when we could just depend on one.

So SHA-256 it is.

The idea remains: any notion of SHA-1 is being removed from the Git codebase and replaced by a generic "hash" variable.
Tomorrow, that hash will be SHA-2, but the code will support other hashes in the future.
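
To give a feel for what that abstraction looks like, here is a minimal C sketch with hypothetical names (this is not Git's hash.h, though Git's actual struct git_hash_algo follows a similar shape): object-handling code consults a table of digest sizes (and, in the real code, function pointers) instead of hard-coding SHA-1, so switching algorithms means switching one pointer.

    /* Minimal sketch with hypothetical names, not Git's actual hash.h. */
    #include <stdio.h>
    #include <stddef.h>

    struct hash_algo {
        const char *name;   /* "sha1", "sha256", ... */
        size_t rawsz;       /* raw digest size in bytes: 20 or 32 */
        size_t hexsz;       /* hex digest size in chars: 40 or 64 */
        /* The real structure also carries init/update/final function pointers. */
    };

    static const struct hash_algo sha1_algo   = { "sha1",   20, 40 };
    static const struct hash_algo sha256_algo = { "sha256", 32, 64 };

    /* "The" hash of a repository is just a pointer that can be repointed. */
    static const struct hash_algo *the_hash = &sha256_algo;

    int main(void)
    {
        printf("object names are %zu hex chars (%s)\n",
               the_hash->hexsz, the_hash->name);
        return 0;
    }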

As Linus Torvalds delicately puts it (emphasis mine):

Honestly, the number of particles in the observable universe is on the order of 2**256. It's a really really big number.

Don't make the code base more complex than it needs to be.
Make an informed technical decision, and say "256 bits is a lot".

The difference between engineering and theory is that engineering makes trade-offs.
Good software is well engineered, not theorized.

Also, I would suggest that git default to "abbrev-commit=40", so that nobody actually sees the new bits by default.
So the perl scripts etc that use "[0-9a-f]{40}" as a hash pattern would just silently continue to work.

Because backwards compatibility is important (*)

(*) And 2**160 is still a big big number, and hasn't really been a practical problem, and SHA1DC is likely a good hash for the next decade or longer.

(SHA1DC, for "Detection of Collisions", was discussed in early 2017, after the shattered.io collision attack: see commit 28dc98e, Git v2.13.0-rc0, March 2017, from Jeff King, and "Hash collision in git".)


See more in Documentation/technical/hash-function-transition.txt

The transition to SHA-256 can be done one local repository at a time.

a. Requiring no action by any other party.
b. A SHA-256 repository can communicate with SHA-1 Git servers (push/fetch).
c. Users can use SHA-1 and SHA-256 identifiers for objects interchangeably (see "Object names on the command line", below).
d. New signed objects make use of a stronger hash function than SHA-1 for their security guarantees.


That transition is facilitated with Git 2.27 (Q2 2020) and its git fast-import --rewrite-submodules-from/to=<name>:<file> options.

See commit 1bdca81, commit d9db599, commit 11d8ef3, commit abe0cc5, commit ddddf8d, commit 42d4e1d, commit e02a714, commit efa7ae3, commit 3c9331a, commit 8b8f718, commit cfe3917, commit bf154a8, commit 8dca7f3, commit 6946e52, commit 8bd5a29, commit 1f5f8f3, commit 192b517, commit 9412759, commit 61e2a70, commit dadacf1, commit 768e30e, commit 2078991 (22 Feb 2020) by brian m. carlson (bk2204).
(Merged by Junio C Hamano -- gitster -- in commit f8cb64e, 27 Mar 2020)

fast-import: add options for rewriting submodules

Signed-off-by: brian m. carlson

When converting a repository using submodules from one hash algorithm to another, it is necessary to rewrite the submodules from the old algorithm to the new algorithm, since only references to submodules, not their contents, are written to the fast-export stream.
Without rewriting the submodules, fast-import fails with an "Invalid dataref" error when encountering a submodule in another algorithm.

Add a pair of options, --rewrite-submodules-from and --rewrite-submodules-to, that take a list of marks produced by fast-export and fast-import, respectively, when processing the submodule.
Use these marks to map the submodule commits from the old algorithm to the new algorithm.

We read marks into two corresponding struct mark_set objects and then perform a mapping from the old to the new using a hash table. This lets us reuse the same mark parsing code that is used elsewhere and allows us to efficiently read and match marks based on their ID, since mark files need not be sorted.

Note that because we're using a khash table for the object IDs, and this table copies values of struct object_id instead of taking references to them, it's necessary to zero the struct object_id values that we use to insert and look up in the table. Otherwise, we would end up with SHA-1 values that don't match because of whatever stack garbage might be left in the unused area.
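
As an aside for readers unfamiliar with that pitfall, here is a standalone C sketch of why the zeroing matters (illustrative only, not the actual khash-based code): the key type reserves room for the largest hash, a SHA-1 value fills only the first 20 bytes, and because hashing and equality run over the whole buffer, uninitialized trailing bytes would make identical IDs look like different keys.

    /* Illustrative sketch of a by-value key, not Git's khash code. */
    #include <stdio.h>
    #include <string.h>

    #define MAX_RAWSZ 32   /* SHA-256 size; SHA-1 uses only the first 20 bytes */

    struct oid_key {
        unsigned char hash[MAX_RAWSZ];
    };

    /* Hash and equality look at every byte, as a table keyed on the
     * struct's value would. */
    static unsigned oid_hash(const struct oid_key *k)
    {
        unsigned h = 2166136261u;
        for (size_t i = 0; i < MAX_RAWSZ; i++)
            h = (h ^ k->hash[i]) * 16777619u;   /* FNV-1a over all bytes */
        return h;
    }

    static int oid_equal(const struct oid_key *a, const struct oid_key *b)
    {
        return !memcmp(a->hash, b->hash, MAX_RAWSZ);
    }

    int main(void)
    {
        struct oid_key a, b;
        unsigned char sha1[20] = { 0xab };  /* some 20-byte SHA-1 value */

        memset(&a, 0, sizeof(a));           /* the required zeroing step */
        memset(&b, 0, sizeof(b));
        memcpy(a.hash, sha1, sizeof(sha1));
        memcpy(b.hash, sha1, sizeof(sha1));

        /* With the memsets the two keys hash and compare equal; without
         * them, bytes 20..31 would hold stack garbage and lookups could miss. */
        printf("equal=%d hash_match=%d\n",
               oid_equal(&a, &b), oid_hash(&a) == oid_hash(&b));
        return 0;
    }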

The git fast-import documentation now includes:

Submodule Rewriting

--rewrite-submodules-from=<name>:<file>
--rewrite-submodules-to=<name>:<file>

Rewrite the object IDs for the submodule specified by <name> from the values used in the from <file> to those used in the to <file>.
The from marks should have been created by git fast-export, and the to marks should have been created by git fast-import when importing that same submodule.

<name> may be any arbitrary string not containing a colon character, but the same value must be used with both options when specifying corresponding marks.
Multiple submodules may be specified with different values for <name>. It is an error not to use these options in corresponding pairs.

These options are primarily useful when converting a repository from one hash algorithm to another; without them, fast-import will fail if it encounters a submodule because it has no way of writing the object ID into the new hash algorithm.

And:

commit: use expected signature header for SHA-256

Signed-off-by: brian m. carlson

The transition plan anticipates that we will allow signatures using multiple algorithms in a single commit.
In order to do so, we need to use a different header per algorithm so that it will be obvious over which data to compute the signature.

The transition plan specifies that we should use "gpgsig-sha256", so wire up the commit code such that it can write and parse the current algorithm, and it can remove the headers for any algorithm when creating a new commit.
Add tests to ensure that we write using the right header and that git fsck doesn't reject these commits.
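
As a rough illustration of "a different header per algorithm" (a hypothetical helper, not Git's commit.c), the selection itself is small; what matters is that a verifier can tell from the header name which serialization of the commit the signature covers.

    /* Tiny sketch with hypothetical names, not Git's actual commit code. */
    #include <stdio.h>

    enum hash_algo { ALGO_SHA1, ALGO_SHA256 };

    static const char *gpgsig_header(enum hash_algo algo)
    {
        /* "gpgsig" is the historical SHA-1 header; the transition plan
         * specifies "gpgsig-sha256" for SHA-256 commits. */
        return algo == ALGO_SHA256 ? "gpgsig-sha256" : "gpgsig";
    }

    int main(void)
    {
        printf("%s\n", gpgsig_header(ALGO_SHA256));  /* prints gpgsig-sha256 */
        return 0;
    }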


Note: that last fast-import evolution had a nasty side effect: "git fast-import"(man) wasted a lot of memory when many marks were in use.
That should be fixed with Git 2.30 (Q1 2021).

See commit 3f018ec (15 Oct 2020) by Jeff King (peff).
(Merged by Junio C Hamano -- gitster -- in commit cd47bbe, 02 Nov 2020)

fast-import: fix over-allocation of marks storage

Reported-by: Sergey Brester
Signed-off-by: Jeff King

Fast-import stores its marks in a trie-like structure made of mark_set structs.
(Trie: digital tree)
Each struct has a fixed size (1024). If our id number is too large to fit in the struct, then we allocate a new struct which shifts the id number by 10 bits. Our original struct becomes a child node of this new layer, and the new struct becomes the top level of the trie.

This scheme was broken by ddddf8d7e2 ("fast-import: permit reading multiple marks files", 2020-02-22, Git v2.27.0-rc0 -- merge listed in batch #2). Before then, we had a top-level "marks" pointer, and the push-down worked by assigning the new top-level struct to "marks". But after that commit, insert_mark() takes a pointer to the mark_set, rather than using the global "marks". It continued to assign to the global "marks" variable during the push down, which was wrong for two reasons:

  • we added a call in option_rewrite_submodules() which uses a separate mark set; pushing down on "marks" is outright wrong here. We'd corrupt the "marks" set, and we'd fail to correctly store any submodule mappings with an id over 1024.
  • the other callers passed "marks", but the push-down was still wrong. In read_mark_file(), we take the pointer to the mark_set as a parameter. So even though insert_mark() was updating the global "marks", the local pointer we had in read_mark_file() was not updated. As a result, we'd add a new level when needed, but then the next call to insert_mark() wouldn't see it! It would then allocate a new layer, which would also not be seen, and so on. Lookups for the lost layers obviously wouldn't work, but before we even hit any lookup stage, we'd generally run out of memory and die.

Our tests didn't notice either of these cases because they didn't have enough marks to trigger the push-down behavior. The new tests in t9304 cover both cases (and fail without this patch).

We can solve the problem by having insert_mark() take a pointer-to-pointer of the top-level of the set. Then our push down can assign to it in a way that the caller actually sees. Note the subtle reordering in option_rewrite_submodules(). Our call to read_mark_file() may modify our top-level set pointer, so we have to wait until after it returns to assign its value into the string_list.
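
For readers who want to see the shape of that fix, here is a condensed, self-contained C sketch using simplified types rather than fast-import's real ones: each trie level covers 10 bits of the mark id, pushing down allocates a new level above the current top, and passing struct mark_set ** lets the caller's own pointer follow that new top.

    /* Simplified sketch of the push-down and the pointer-to-pointer fix;
     * not fast-import's actual code. */
    #include <stdio.h>
    #include <stdlib.h>

    #define SLOTS 1024          /* ids consume 10 bits per trie level */

    struct mark_set {
        unsigned shift;         /* how many id bits this level skips */
        void *slot[SLOTS];      /* leaves, or child mark_set pointers */
    };

    static struct mark_set *new_level(unsigned shift)
    {
        struct mark_set *s = calloc(1, sizeof(*s));
        if (!s)
            exit(1);
        s->shift = shift;
        return s;
    }

    /* Insert a (mark id -> pointer) pair.  Note the pointer-to-pointer:
     * if a new level is pushed on top, *top is updated and the caller
     * keeps a valid view of the trie (the bug updated a global instead,
     * so callers holding their own pointer never saw the new layers). */
    static void insert_mark(struct mark_set **top, unsigned long id, void *value)
    {
        struct mark_set *s = *top;

        while (id >= ((unsigned long)SLOTS << s->shift)) {
            struct mark_set *up = new_level(s->shift + 10);
            up->slot[0] = s;    /* old top becomes a child of the new layer */
            s = up;
            *top = s;           /* push-down visible to the caller */
        }
        while (s->shift) {
            unsigned idx = (id >> s->shift) % SLOTS;
            if (!s->slot[idx])
                s->slot[idx] = new_level(s->shift - 10);
            s = s->slot[idx];
        }
        s->slot[id % SLOTS] = value;
    }

    int main(void)
    {
        struct mark_set *marks = new_level(0);
        static char blob[] = "some object";

        insert_mark(&marks, 5000, blob);    /* id > 1023 forces one push-down */
        printf("top shift is now %u\n", marks->shift);
        return 0;
    }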