Data Vault 2.0 in SQL Server
In data vault 2.0 one hashes the business key and takes this hash as a primary key of the table. Also Link Tables use the hash primary key to create a relationship.
My problem is with hashes that are basically random, the query optimizer cannot apply any good estimation since the statistics - of course - are not usable for randomly distributed data.
So the query optimizer uses weird plans where it wants to sort often (because it thinks there are only 4 rows to sort). Since I am surely not the first one to deal with data vault in sql server, how is this fixable?.
When query optimizer uses an index seek or a join operator it completely misses the row estimation hence chooses ridiculous plans.
I have to pimp them with join hints and query hints such as (FORCE ORDER) to get anything out of it.
What's the common approach for this?
Solution 1:
I firmly agree with your conclusion that hashing will make all data that had a structure / order to it totally random, which will make any form of useful database statistics impossible.
I actually did some experimenting on SQL server myself and came to the same conclusion as you did, supported by the Explain Plans
That is why I firmly believe you/we should consider using the concatenated business key as a primary key INSTEAD of hashing it.
Arguments that are give for hashing are in the realm of:
- Joining in Char(32) (the character string of an MD5 hash) is more performant compared to joining on variable character fields
- Hashing reduces hotspots in your MPP cluster when writing data
But I have yet to see proof for argument 1: as you are mentioning you lose any useful statistics when joining! Furthermore: a lot of natural business keys I know are actually much SMALLER than 32 characters... I actually have asked a question related to this subject a few days ago...
Then to argument 2: In most MPP NoSQL databases (Key-value, Document, Column Family) the advise is to actually use a good NATURAL (concatenated) key as the sharding key, not a hash. Example: see this advise for Cassandra.
This is why I do not agree with the Data Vault 2 Hashing theory: I have not seen any proof supporting this... It is one of the reasons why I am proposing a new Ensemble modeling approach @ DMZone Berlin in October.