Grouping very small numbers (e.g. 1e-28) and 0.0 in data.table v1.8.10 vs v1.9.2
I noticed that frequency tables created by data.table in R seem not to distinguish between very small numbers and zero. Can I change this behavior, or is this a bug?
Reproducible example:
library(data.table)
DT <- data.table(c(0.0000000000000000000000000001, 2, 9999, 0))
test1 <- as.data.frame(unique(DT[, V1]))
test2 <- DT[, .N, by = V1]
As you can see, the frequency table (test2) does not recognize the difference between 0.0000000000000000000000000001 and 0, and puts both observations in the same class.
data.table version: 1.8.10
R: 3.0.2
It is worth reading R FAQ 7.31 and thinking about the accuracy of floating point representations.
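As a quick illustration of what FAQ 7.31 covers (shown in Python for brevity; R behaves the same way, since both use IEEE 754 doubles):

```python
# Decimal fractions like 0.1 have no exact binary representation,
# so arithmetic on them accumulates tiny representation errors.
print(0.1 + 0.2 == 0.3)   # False: the sum is actually 0.30000000000000004
print(repr(0.1 + 0.2))    # '0.30000000000000004'

# The usual remedy is to compare within a tolerance rather than exactly,
# which is what data.table v1.8.10 did when grouping numeric columns.
tol = 1.490116e-08        # sqrt(.Machine$double.eps), the v1.8.10 tolerance
print(abs((0.1 + 0.2) - 0.3) < tol)  # True
```

Under that tolerance-based scheme, 1e-28 and 0 differ by far less than 1.490116e-08, which is exactly why v1.8.10 grouped them together.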
I can't reproduce this in the current CRAN version (1.9.2), using:
R version 3.0.3 (2014-03-06)
Platform: x86_64-w64-mingw32/x64 (64-bit)
My guess is that the change in behaviour is related to this NEWS item:
o Numeric data is still joined and grouped within tolerance as before but instead of tolerance being sqrt(.Machine$double.eps) == 1.490116e-08 (the same as base::all.equal's default) the significand is now rounded to the last 2 bytes, apx 11 s.f. This is more appropriate for large (1.23e20) and small (1.23e-20) numerics and is faster via a simple bit twiddle. A few functions provided a 'tolerance' argument but this wasn't being passed through so has been removed. We aim to add a global option (e.g. 2, 1 or 0 byte rounding) in a future release.
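To make the NEWS item concrete, here is a minimal sketch in Python of what rounding the significand "to the last 2 bytes" means for an IEEE 754 double. This is an illustration of the idea only, not data.table's actual C implementation, and the function name is my own:

```python
import struct

def round_last_bytes(x, nbytes=2):
    """Round a double to the nearest value whose last `nbytes`
    significand bytes are zero (a sketch of the idea, not data.table's code)."""
    bits = struct.unpack('<Q', struct.pack('<d', x))[0]   # raw 64-bit pattern
    shift = 8 * nbytes
    # Round half up at the cut point, then clear the low bytes.
    bits = (bits + (1 << (shift - 1))) & ~((1 << shift) - 1)
    return struct.unpack('<d', struct.pack('<Q', bits))[0]

# Two doubles differing only in the last bits of the significand
# round to the same value, so they group together...
print(round_last_bytes(1.0) == round_last_bytes(1.0 + 2**-52))  # True

# ...but 1e-28 keeps a nonzero significand and exponent, so it
# stays distinct from 0.0 regardless of its tiny magnitude:
print(round_last_bytes(1e-28) == round_last_bytes(0.0))         # False
```

Because the rounding acts on the bit pattern rather than on an absolute tolerance, it scales with the magnitude of the number, which is why it handles both 1.23e20 and 1.23e-20 sensibly.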
Update from Matt
Yes, this was a deliberate change in v1.9.2: data.table now distinguishes 0.0000000000000000000000000001 from 0 (as user3340145 rightly thought it should), due to the improved rounding method highlighted above from NEWS.
I've also added the for loop test from Rick's answer to the test suite.
Btw, #5369 is now implemented in v1.9.3 (although neither of these is needed for this question):
o bit64::integer64 now works in grouping and joins, #5369. Thanks to James Sams for highlighting UPCs.
o New function setNumericRounding() may be used to reduce to 1 byte or 0 byte rounding when joining to or grouping columns of type 'numeric', #5369. See example in ?setNumericRounding and NEWS item from v1.9.2. getNumericRounding() returns the current setting.
Notice that rounding is now (as from v1.9.2) about the accuracy of the significand; i.e. the number of significant figures. 0.0000000000000000000000000001 == 1.0e-28 is accurate to just 1 s.f., so the new rounding method doesn't group it together with 0.0.
In short, the answer to the question is : upgrade from v1.8.10 to v1.9.2 or greater.