Is there an Excel function to create a hash value?
I'm working with a number of data lists that are keyed by document name. The document names, while very descriptive, are quite cumbersome to view (up to 256 bytes is a lot of real estate), and I'd love to be able to create a smaller key field that's readily reproducible in case I need to do a VLOOKUP from another worksheet or workbook.
I'm thinking a hash from the title that'd be unique and reproducible for each title would be most appropriate. Is there a function available, or am I looking at developing my own algorithm?
Any thoughts or ideas on this or another strategy?
Solution 1:
You don't need to write your own function - others already did that for you.
For example, I collected and compared five VBA hash functions in this Stack Overflow answer.
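Personally, I use the VBA function below. Once you have copied the macro into a VBA module, call it in Excel with
=BASE64SHA1(A1)
It requires .NET, since it uses the classes System.Text.UTF8Encoding and System.Security.Cryptography.HMACSHA1; the Base64 step uses the "Microsoft MSXML" library. Everything is late-bound, so no references need to be set.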
Public Function BASE64SHA1(ByVal sTextToHash As String) As String
    ' Returns the first <cutoff> characters of the Base64-encoded HMAC-SHA1 of the text.
    ' Note: the text itself is also used as the HMAC key.
    Dim asc As Object
    Dim enc As Object
    Dim TextToHash() As Byte
    Dim SharedSecretKey() As Byte
    Dim bytes() As Byte
    Const cutoff As Integer = 5

    ' Late-bound .NET classes exposed through COM
    Set asc = CreateObject("System.Text.UTF8Encoding")
    Set enc = CreateObject("System.Security.Cryptography.HMACSHA1")

    TextToHash = asc.GetBytes_4(sTextToHash)
    SharedSecretKey = asc.GetBytes_4(sTextToHash)
    enc.Key = SharedSecretKey

    bytes = enc.ComputeHash_2((TextToHash))
    BASE64SHA1 = EncodeBase64(bytes)
    BASE64SHA1 = Left(BASE64SHA1, cutoff)

    Set asc = Nothing
    Set enc = Nothing
End Function

Private Function EncodeBase64(ByRef arrData() As Byte) As String
    ' Base64-encodes the bytes via an MSXML node with the bin.base64 data type.
    Dim objXML As Object
    Dim objNode As Object

    Set objXML = CreateObject("MSXML2.DOMDocument")
    Set objNode = objXML.createElement("b64")
    objNode.DataType = "bin.base64"
    objNode.nodeTypedValue = arrData
    EncodeBase64 = objNode.text

    Set objNode = Nothing
    Set objXML = Nothing
End Function
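If you want to reproduce the same keys outside Excel (say, to pre-compute them for an import), here is a minimal Python sketch of the steps the VBA performs: HMAC-SHA1 with the text used as both message and key, Base64-encode, then truncate. The function name base64_hmac_sha1 is made up for this illustration, and the claim that it yields the same string as the VBA is an assumption based on reading the code, not something checked against the workbook.

```python
import base64
import hashlib
import hmac


def base64_hmac_sha1(text: str, cutoff: int = 5) -> str:
    """Mirror the VBA steps: HMAC-SHA1 keyed with the text itself, Base64, truncate."""
    data = text.encode("utf-8")  # same bytes serve as message and key, as in the VBA
    digest = hmac.new(data, data, hashlib.sha1).digest()  # 20 raw bytes
    encoded = base64.b64encode(digest).decode("ascii")    # 28 Base64 characters
    return encoded[:cutoff]


if __name__ == "__main__":
    # Hypothetical document title, only for demonstration.
    print(base64_hmac_sha1("Quarterly report 2012 - final draft"))
```

Note that despite the name BASE64SHA1, the result is not a plain SHA-1 of the text, so it will not match output from tools that compute an unkeyed SHA-1.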
Customizing the hash length
- The full hash is a 28-character Base64 string (case-sensitive, and it can contain the special characters +, / and =).
- You set the hash length with this line:
Const cutoff As Integer = 5
- 4-character hash = 36 collisions in 6895 lines = 0.5 % collision rate
- 5-character hash = 0 collisions in 6895 lines = 0 % collision rate
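To pick a cutoff for your own list, you can count collisions before committing to a length. The following is a rough Python sketch, not the method used for the numbers above: short_hash re-implements the same HMAC-SHA1/Base64 steps, colliding_rows is a made-up helper, and "collision" is assumed here to mean a row whose truncated hash is shared with a different title. The ignore_case switch is there because VLOOKUP compares text case-insensitively, so two Base64 keys that differ only in case would match each other in a lookup.

```python
import base64
import hashlib
import hmac
from collections import Counter


def short_hash(text: str, cutoff: int) -> str:
    """HMAC-SHA1 keyed with the text itself, Base64-encoded, truncated."""
    data = text.encode("utf-8")
    digest = hmac.new(data, data, hashlib.sha1).digest()
    return base64.b64encode(digest).decode("ascii")[:cutoff]


def colliding_rows(titles, cutoff: int, ignore_case: bool = False) -> int:
    """Count distinct titles whose truncated hash is shared with another title."""
    unique_titles = set(titles)  # identical titles share a hash by design
    hashes = [short_hash(t, cutoff) for t in unique_titles]
    if ignore_case:
        hashes = [h.upper() for h in hashes]  # mimic case-insensitive lookups
    counts = Counter(hashes)
    return sum(n for n in counts.values() if n > 1)


titles = [f"Document {i}" for i in range(1000)]  # made-up sample data
for cutoff in (1, 4, 5):
    print(cutoff, colliding_rows(titles, cutoff), colliding_rows(titles, cutoff, True))
```

If the case-insensitive count is non-zero at your chosen length, either lengthen the cutoff or do the lookup with a case-sensitive comparison.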
There are also hash functions (all three CRC16 functions) which don't require .NET and don't use external libraries. But their hashes are longer and produce more collisions.
You could also just download this example workbook and play around with all 5 hash implementations. The first sheet has a good comparison of them.
Solution 2:
I don't care very much about collisions, but needed a weak pseudorandomizer of rows based on a variable-length string field. Here's one insane solution that worked well:
=MOD(MOD(MOD(MOD(MOD(IF(LEN(Z2)>=1,CODE(MID(Z2,1,1))+10,31),1009)*IF(LEN(Z2)>=3,CODE(MID(Z2,3,1))+10,41),1009)*IF(LEN(Z2)>=5,CODE(MID(Z2,5,1))+10,59),1009)*IF(LEN(Z2)>=7,CODE(MID(Z2,7,1))+10,26),1009)*IF(LEN(Z2)>=9,CODE(MID(Z2,9,1))+10,53),1009)
Where Z2 is the cell containing the string you want to hash.
The "MOD"s are there to prevent overflowing to scientific notation. 1009 is a prime; you could use any X such that X*265 < max_int_size (265 being the largest possible factor, 255 + 10). 10 is arbitrary; use anything. The "else" values are arbitrary (digits of pi here!); use anything. The character positions (1,3,5,7,9) are arbitrary; use anything.
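The nesting makes the formula hard to read, so here is the same computation unrolled as a loop in Python, as an illustrative sketch only. row_hash is a made-up name, and the sketch assumes ASCII-only text: Python's ord returns Unicode code points, which agree with Excel's CODE only in that range.

```python
def row_hash(text: str) -> int:
    """Unrolled version of the nested-MOD formula; assumes ASCII-only text."""
    positions = (1, 3, 5, 7, 9)       # 1-based character positions sampled
    fallbacks = (31, 41, 59, 26, 53)  # used when the string is too short
    result = 1
    for pos, fallback in zip(positions, fallbacks):
        # CODE(MID(Z2,pos,1))+10 if the string is long enough, else the fallback
        factor = ord(text[pos - 1]) + 10 if len(text) >= pos else fallback
        # Reduce after every multiplication so the value never grows large
        result = (result * factor) % 1009
    return result


print(row_hash(""))   # 125: only the fallback values are multiplied
print(row_hash("A"))  # 693: first factor is 65 + 10, the rest are fallbacks
```

Because only five characters are sampled, any two strings that agree at positions 1, 3, 5, 7 and 9 get the same value, which is fine for shuffling rows but not for a lookup key.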
Solution 3:
For a reasonably small list you can create a scrambler (poor man's hash function) using built-in Excel functions.
E.g.
=CODE(A2)*LEN(A2) + CODE(MID(A2,$A$1,$B$1))*LEN(MID(A2,$A$1,$B$1))
Here A1 and B1 hold a randomly chosen start position and substring length.
A little fiddling and checking and in most cases you can get a workable unique ID quite quickly.
How it Works: The formula uses the first letter of the string and a fixed letter taken from mid-string and uses LEN() as a 'fanning function' to reduce the chance of collisions.
CAVEAT: this is not a hash, but when you need to get something done quickly, and can inspect the results to see that there are no collisions, it works quite well.
Edit: If your strings have variable lengths (e.g. full names) but are pulled from a database record with fixed-width fields, you will want to TRIM them first, like this:
=CODE(TRIM(C8))*LEN(TRIM(C8))
+CODE(MID(TRIM(C8),$A$1,1))*LEN(MID(TRIM(C8),$A$1,$B$1))
so that the lengths are a meaningful scrambler.
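For checking a list for collisions outside the worksheet, the TRIM variant of the scrambler can be sketched in Python as follows. This is an illustration under assumptions: scramble and excel_trim are made-up names, excel_trim only approximates Excel's TRIM (strip leading and trailing spaces, collapse internal runs of spaces to one), and ord matches CODE only for ASCII text.

```python
def excel_trim(text: str) -> str:
    """Approximate Excel's TRIM: strip spaces and collapse internal runs to one."""
    return " ".join(part for part in text.split(" ") if part)


def scramble(text: str, start: int, length: int) -> int:
    """CODE(s)*LEN(s) + CODE(MID(s,start,length))*LEN(MID(s,start,length))."""
    s = excel_trim(text)
    middle = s[start - 1:start - 1 + length]  # MID uses 1-based positions
    if not s or not middle:
        # Excel's CODE returns #VALUE! for an empty string
        raise ValueError("string too short for the chosen start position")
    return ord(s[0]) * len(s) + ord(middle[0]) * len(middle)


print(scramble("Hello", 2, 3))     # 72*5 + 101*3 = 663
print(scramble("Hello   ", 2, 3))  # 663 again, since the padding is trimmed
```

The start and length arguments play the role of $A$1 and $B$1; if two of your strings produce the same number, change them and check again.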