Avg of float inconsistency
The select returns right at 23,000 rows
The except will return between 60 to 200 rows (and not the same rows)
The except should return 0 as it is select a except select a
PK: [docSVenum1].[enumID], [docSVenum1].[valueID], [FTSindexWordOnce].[wordID]
[tf] is a float and and I get float is not exact
But I naively thought avg(float) would be repeatable
Avg(float) does appear to be repeatable
What is the solution?
TF is between 0 and 1 and I only need like 5 significant digits
I just need avg(TF) to be the same number run to run
Decimal(9,8) gives me enough precision and if I cast to decimal(9,8) the except properly returns 0
I can change [TF] to decimal(9,8) but it will be bit of work and lot of regression testing as some of the test that use [tf] take over a day to run
Is change [TF] to decimal(9,8) the best solution?
SELECT [docSVenum1].[enumID], [docSVenum1].[valueID], [FTSindexWordOnce].[wordID]
, avg([FTSindexWordOnce].[tf]) AS [avgTFraw]
FROM [docSVenum1]
JOIN [docFieldLock]
ON [docFieldLock].[sID] = [docSVenum1].[sID]
AND [docFieldLock].[fieldID] = [docSVenum1].[enumID]
AND [docFieldLock].[lockID] IN (4, 5) /* secLvl docAdm */
JOIN [FTSindexWordOnce]
ON [FTSindexWordOnce].[sID] = [docSVenum1].[sID]
GROUP BY [docSVenum1].[enumID], [docSVenum1].[valueID], [FTSindexWordOnce].[wordID]
except
SELECT [docSVenum1].[enumID], [docSVenum1].[valueID], [FTSindexWordOnce].[wordID]
, avg([FTSindexWordOnce].[tf]) AS [avgTFraw]
FROM [docSVenum1]
JOIN [docFieldLock]
ON [docFieldLock].[sID] = [docSVenum1].[sID]
AND [docFieldLock].[fieldID] = [docSVenum1].[enumID]
AND [docFieldLock].[lockID] IN (4, 5) /* secLvl docAdm */
JOIN [FTSindexWordOnce]
ON [FTSindexWordOnce].[sID] = [docSVenum1].[sID]
GROUP BY [docSVenum1].[enumID], [docSVenum1].[valueID], [FTSindexWordOnce].[wordID]
order by [docSVenum1].[enumID], [docSVenum1].[valueID], [FTSindexWordOnce].[wordID]
In this case tf is term frequency of tf-idf
tf normalization is subjective and does not require much precision
Avg(tf) needs to be consistent from select to select or the results are not consistent
In a single select with joins I need a consistent avg(tf)
Going with decimal and a low precision for tf got consistent results
Solution 1:
This is very similiar to: SELECT SUM(...) is non-deterministic when adding the column-values of datatype float
.
The problem is that with inaccurate datatype (FLOAT/REAL
) the order of of arithmetic operations on floating point matters. Demo from connect:
DECLARE @fl FLOAT = 100000000000000000000
DECLARE @i SMALLINT = 0
WHILE (@i < 100)
BEGIN
SET @fl = @fl + CONVERT(float, 5000)
SET @i = @i + 1
END
SET @fl = @fl - 100000000000000000000
SELECT CONVERT(NVARCHAR(40), @fl, 2)
-- 0.000000000000000e+000
DECLARE @fl FLOAT = 0
DECLARE @i SMALLINT = 0
WHILE (@i < 100)
BEGIN
SET @fl = @fl + CONVERT(float, 5000)
SET @i = @i + 1
END
SET @fl = @fl + 100000000000000000000
SET @fl = @fl - 100000000000000000000
SELECT @fl
-- 507904
LiveDemo
Possible solutions:
-
CAST
all arguments to accurate datatype likeDECIMAL/NUMERIC
- alter table and change
FLOAT
toDECIMAL
- you can try to force query optimizer to calculate the sum with the same order.
The good news is that when a stable query result matters to your application, you can force the order to be the same by preventing parallelism with OPTION (MAXDOP 1).
It looks like intial link is dead. WebArchive