A Regex to match a SHA1

I'm trying to match SHA1's in generic text with a regular expression.

Ideally I want to avoid matching words.

It's safe to say that full SHA1's have a distinctive pattern (they're long and a consistent length) - so I can match these reliably - but what about abbreviated SHA1's?

Can I rely on the presence of numbers?

Looking at the SHA1's in my commit log - numbers always appear in the first 3 characters. But is this too short? How many characters of SHA1 do I need to consider before I can assume a number would have appeared?

This does not have to be 100% accurate - I just need to match an abbreviated SHA1 99% of the time.

You can consider the SHA1 hashes to be completely random, so this reduces to a matter of probabilities. The probability that a given digit is not a number is 6/16, or 0.375. The probability that three SHA1 digits are all not numbers is 0.375 ** 3, or 0.0527 (5% ish). At six digits, this reduces again to 0.00278 (0.2%). At five digits, the probability of all letters drops below 1% (you said you wanted to match 99% of the time).

It's easy to craft a regular expression that always matches SHA1 values:

\b[0-9a-f]{5,40}\b

However, this may also match perfectly good five letter words, like "added" or "faded". In my /usr/share/dict/words file, there are several six letter words that would match: "accede", "beaded", "bedded", "decade", "deface", "efface", and "facade" are the most likely. At seven letters, there is only "deedeed" which is unlikely to appear in prose. It all depends on how many false positives you can tolerate, and what the likely words you will encounter actually are.

What exactly are you trying to do? You shouldn't need to parse anything git outputs with heuristics -- you can always request exactly the data you need.

If you want to match a full hex representation of an SHA1 sum, try:

/\b([a-f0-9]{40})\b/

That is, a word consisting of 40 characters which are either digits or the letters a through f.

If you only have a few characters and don't know where they are, then you are pretty much out of luck. Is "e78fd98" an abbreviated commit ID? Maybe, but what about "1234567"? Is that a commit ID? A problem ticket number? A number that makes a test fail?

Without context, you can't really know what the data means.

To answer your direct question, there is no property of SHA1 that would make the first three characters (in hex form) digits. You are just lucky, or perhaps unlucky, depending on how you look at it.

I'm going to assume you want to match against hexadecimal printed representation of a SHA1, and not against the equivalent 20 raw bytes. Furthermore, I'm going to assume that the SHA1's in question use only lower-case letters to represent hex digits. You'll have to adjust the regular expression if your requirements differ.

grep -o -E -e "[0-9a-f]{40}"

Will match such a SHA1. You'll need to translate the above regular expression from egrep's dialect to whatever tool you happen to be using. Since the match must be exactly 40 characters long I don't think you're in danger of accidentally matching words. I don't know of any 40-character words that consist only of the letters a through f.

edit:

Better yet: use A Regex to match a SHA1 as his solution includes checking for word boundaries at both ends. I overlooked that above.

A Regex to match a SHA1

Related

Recent Posts