In data vault 2.0 one hashes the business key and takes this hash as a primary key of the table. Also Link Tables use the hash primary key to create a relationship.

My problem is with hashes that are basically random, the query optimizer cannot apply any good estimation since the statistics - of course - are not usable for randomly distributed data.

So the query optimizer uses weird plans where it wants to sort often (because it thinks there are only 4 rows to sort). Since I am surely not the first one to deal with data vault in sql server, how is this fixable?.

When query optimizer uses an index seek or a join operator it completely misses the row estimation hence chooses ridiculous plans.

I have to pimp them with join hints and query hints such as (FORCE ORDER) to get anything out of it.

What's the common approach for this?


Solution 1:

I firmly agree with your conclusion that hashing will make all data that had a structure / order to it totally random, which will make any form of useful database statistics impossible.

I actually did some experimenting on SQL server myself and came to the same conclusion as you did, supported by the Explain Plans

That is why I firmly believe you/we should consider using the concatenated business key as a primary key INSTEAD of hashing it.

Arguments that are give for hashing are in the realm of:

  1. Joining in Char(32) (the character string of an MD5 hash) is more performant compared to joining on variable character fields
  2. Hashing reduces hotspots in your MPP cluster when writing data

But I have yet to see proof for argument 1: as you are mentioning you lose any useful statistics when joining! Furthermore: a lot of natural business keys I know are actually much SMALLER than 32 characters... I actually have asked a question related to this subject a few days ago...

Then to argument 2: In most MPP NoSQL databases (Key-value, Document, Column Family) the advise is to actually use a good NATURAL (concatenated) key as the sharding key, not a hash. Example: see this advise for Cassandra.

This is why I do not agree with the Data Vault 2 Hashing theory: I have not seen any proof supporting this... It is one of the reasons why I am proposing a new Ensemble modeling approach @ DMZone Berlin in October.