浮动不一致的平均值 [英] Avg of float inconsistency

查看:18
本文介绍了浮动不一致的平均值的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

选择返回 23,000 行
除了将返回 60 到 200 行(而不是相同的行)
例外应该返回 0 因为它是选择一个除了选择一个

The select returns right at 23,000 rows
The except will return between 60 to 200 rows (and not the same rows)
The except should return 0 as it is select a except select a

PK: [docSVenum1].[enumID], [docSVenum1].[valueID], [FTSindexWordOnce].[wordID]

PK: [docSVenum1].[enumID], [docSVenum1].[valueID], [FTSindexWordOnce].[wordID]

[tf] 是一个浮点数,而我得到的浮点数不准确
但我天真地认为 avg(float) 是可重复的
Avg(float) 似乎是可重复的

[tf] is a float and and I get float is not exact
But I naively thought avg(float) would be repeatable
Avg(float) does appear to be repeatable

解决办法是什么?
TF 介于 0 和 1 之间,我只需要 5 个有效数字
我只需要 avg(TF) 是相同的运行数
Decimal(9,8) 给了我足够的精度,如果我转换为 decimal(9,8) ,except 正确返回 0
我可以将 [TF] 更改为十进制(9,8),但由于使用 [tf] 的某些测试需要花费一天的时间来运行,因此需要一些工作和大量回归测试
将 [TF] 更改为 decimal(9,8) 是最佳解决方案吗?

What is the solution?
TF is between 0 and 1 and I only need like 5 significant digits
I just need avg(TF) to be the same number run to run
Decimal(9,8) gives me enough precision and if I cast to decimal(9,8) the except properly returns 0
I can change [TF] to decimal(9,8) but it will be bit of work and lot of regression testing as some of the test that use [tf] take over a day to run
Is change [TF] to decimal(9,8) the best solution?

  SELECT [docSVenum1].[enumID], [docSVenum1].[valueID], [FTSindexWordOnce].[wordID]
       , avg([FTSindexWordOnce].[tf]) AS [avgTFraw]
    FROM [docSVenum1] 
    JOIN [docFieldLock] 
           ON [docFieldLock].[sID] = [docSVenum1].[sID] 
          AND [docFieldLock].[fieldID] = [docSVenum1].[enumID] 
          AND [docFieldLock].[lockID] IN (4, 5) /* secLvl docAdm */ 
    JOIN [FTSindexWordOnce] 
           ON [FTSindexWordOnce].[sID] = [docSVenum1].[sID]
GROUP BY [docSVenum1].[enumID], [docSVenum1].[valueID], [FTSindexWordOnce].[wordID]

except 

  SELECT [docSVenum1].[enumID], [docSVenum1].[valueID], [FTSindexWordOnce].[wordID]
       , avg([FTSindexWordOnce].[tf]) AS [avgTFraw]
    FROM [docSVenum1] 
    JOIN [docFieldLock] 
           ON [docFieldLock].[sID] = [docSVenum1].[sID] 
          AND [docFieldLock].[fieldID] = [docSVenum1].[enumID] 
          AND [docFieldLock].[lockID] IN (4, 5) /* secLvl docAdm */ 
    JOIN [FTSindexWordOnce] 
           ON [FTSindexWordOnce].[sID] = [docSVenum1].[sID]
GROUP BY [docSVenum1].[enumID], [docSVenum1].[valueID], [FTSindexWordOnce].[wordID] 

order by [docSVenum1].[enumID], [docSVenum1].[valueID], [FTSindexWordOnce].[wordID]

在这种情况下,tf 是 tf-idf
tf 归一化是主观的,不需要很高的精度
avg(tf) 需要从select到select保持一致否则结果不一致
在带有连接的单个选择中,我需要一致的 avg(tf)
使用十进制和低精度的 tf 得到一致的结果

In this case tf is term frequency of tf-idf
tf normalization is subjective and does not require much precision
Avg(tf) needs to be consistent from select to select or the results are not consistent
In a single select with joins I need a consistent avg(tf)
Going with decimal and a low precision for tf got consistent results

推荐答案

这非常类似于:SELECT SUM(...) 在添加列时是不确定的- 数据类型浮点值.

问题在于,由于数据类型不准确(FLOAT/REAL),浮点数算术运算的顺序很重要.来自连接的演示:

The problem is that with inaccurate datatype (FLOAT/REAL) the order of of arithmetic operations on floating point matters. Demo from connect:

DECLARE @fl FLOAT = 100000000000000000000
DECLARE @i SMALLINT = 0
WHILE (@i < 100)
BEGIN
    SET @fl = @fl + CONVERT(float, 5000)
    SET @i = @i + 1
END
SET @fl = @fl - 100000000000000000000
SELECT CONVERT(NVARCHAR(40), @fl, 2)
-- 0.000000000000000e+000


DECLARE @fl FLOAT = 0
DECLARE @i SMALLINT = 0
WHILE (@i < 100)
BEGIN
    SET @fl = @fl + CONVERT(float, 5000)
    SET @i = @i + 1
END
SET @fl = @fl + 100000000000000000000
SET @fl = @fl - 100000000000000000000
SELECT @fl
-- 507904

LiveDemo

可能的解决方案:

  • CAST 将所有参数转换为准确的数据类型,例如 DECIMAL/NUMERIC
  • 改变表并将FLOAT改为DECIMAL
  • 您可以尝试强制查询优化器以相同的顺序计算总和.
  • CAST all arguments to accurate datatype like DECIMAL/NUMERIC
  • alter table and change FLOAT to DECIMAL
  • you can try to force query optimizer to calculate the sum with the same order.

好消息是,当一个稳定的查询结果对您很重要时应用程序,您可以通过阻止来强制顺序相同与 OPTION (MAXDOP 1) 的并行性.

The good news is that when a stable query result matters to your application, you can force the order to be the same by preventing parallelism with OPTION (MAXDOP 1).

<小时>

看起来初始链接已失效.WebArchive

这篇关于浮动不一致的平均值的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆