Large integers in data.table. Grouping results different in 1.9.2 compared to 1.8.10


Question

I recently upgraded data.table from 1.8.10 to 1.9.2, and I found the following difference between the two versions when grouping across large integers.

Is there a setting that I need to change in 1.9.2 to have the first of the following two group statements work as it did in 1.8.10 (and I presume 1.8.10 is the correct behavior)?

Also, the results are the same in the two packages for the second of the following two group statements, but is that behavior expected?

1.8.10

>   library(data.table)
data.table 1.8.10  For help type: help("data.table")
>   foo = data.table(i = c(2884199399609098249, 2884199399608934409))
>   lapply(foo, class)
$i
[1] "numeric"

>   foo
                     i
1: 2884199399609098240
2: 2884199399608934400
>   foo[, .N, by=i]
                     i N
1: 2884199399609098240 1
2: 2884199399608934400 1
>   foo = data.table(i = c(9999999999999999999, 9999999999999999998))
>   foo[, .N, by=i]
                      i N
1: 10000000000000000000 2
> 

and 1.9.2

>   library(data.table)
data.table 1.9.2  For help type: help("data.table")
>   foo = data.table(i = c(2884199399609098249, 2884199399608934409))
>   lapply(foo, class)
$i
[1] "numeric"

>   foo
                     i
1: 2884199399609098240
2: 2884199399608934400
>   foo[, .N, by=i]
                     i N
1: 2884199399609098240 2
>   foo = data.table(i = c(9999999999999999999, 9999999999999999998))
>   foo[, .N, by=i]
                      i N
1: 10000000000000000000 2
> 

The numbers used for the first test (shows difference between the data.table versions) are the numbers from my actual dataset, and the ones that caused a few of my regression tests to fail after upgrading data.table.

I'm curious about the second test, after I increase the numbers by another order of magnitude, if it is expected in both versions of the data.table package to ignore minor differences in the last significant digit.

I'm assuming this all has to do with floating-point representation. Maybe the correct way for me to handle this is to represent these large integers either as integer64 or character? I'm hesitant to do integer64 as I'm not sure if data.table and the R environment fully support them, e.g., I've had to add this in previous data.table code:

options(datatable.integer64="character") # Until integer64 setkey is implemented

Maybe that has been implemented, but regardless changing that setting does not change the results of these tests at least in my environment. I suppose that that makes sense given that these values are stored as numeric in the foo data table.

Recommended answer

Yes, the result in v1.8.10 was the correct behaviour. We improved the method of rounding in v1.9.2. That's best explained here:

Grouping very small numbers (e.g. 1e-28) and 0.0 in data.table v1.8.10 vs v1.9.2

That meant we went backwards on supporting integers > 2^31 stored in type numeric. That's now addressed in v1.9.3 (available from R-Forge), see NEWS:


o bit64::integer64 now works in grouping and joins, #5369. Thanks to James Sams for highlighting UPCs and Clayton Stanley.
Reminder: fread() has been able to detect and read integer64 for a while.

o New function setNumericRounding() may be used to reduce to 1 byte or 0 byte rounding when joining to or grouping columns of type numeric, #5369. See example in ?setNumericRounding and NEWS item from v1.9.2. getNumericRounding() returns the current setting.

So you can either call setNumericRounding(0) to switch off rounding globally for all numeric columns, or better, use the more appropriate type for the column: bit64::integer64 now that it's supported.
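
Both options can be sketched as follows, assuming data.table v1.9.3 or later and the bit64 package are installed. The ids are the ones from the question; the variable names are illustrative. In the second option the ids are written as strings, because a numeric literal would already be rounded to a double before the conversion to integer64.

```r
library(data.table)
library(bit64)

# Option 1: keep type numeric, but switch off rounding while grouping.
old <- getNumericRounding()          # remember the current setting
setNumericRounding(0)
foo <- data.table(i = c(2884199399609098249, 2884199399608934409))
res_numeric <- foo[, .N, by = i]     # the two ids stay in separate groups
setNumericRounding(old)              # restore the previous setting

# Option 2 (preferred in the answer): store the ids as integer64.
# Parse from character so that no digits are lost on the way in.
ids <- as.integer64(c("2884199399609098249", "2884199399608934409"))
bar <- data.table(i = ids)
res_int64 <- bar[, .N, by = i]

print(res_numeric)
print(res_int64)
```

Saving and restoring the rounding setting keeps the change local to this one grouping, since setNumericRounding() otherwise applies globally.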

The change in v1.9.2 was:


o Numeric data is still joined and grouped within tolerance as before but instead of tolerance being sqrt(.Machine$double.eps) == 1.490116e-08 (the same as base::all.equal's default) the significand is now rounded to the last 2 bytes, apx 11 s.f. This is more appropriate for large (1.23e20) and small (1.23e-20) numerics and is faster via a simple bit twiddle. A few functions provided a 'tolerance' argument but this wasn't being passed through so has been removed. We aim to add a global option (e.g. 2, 1 or 0 byte rounding) in a future release [DONE].
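
To see why the two ids in the question collapse into one group under the 2-byte default, it can help to compare the gap between them with the spacing of doubles at that magnitude. The following is a rough base-R sketch, not data.table's actual implementation. The variable names are illustrative, and it assumes that ignoring the last 16 bits of the 52-bit significand coarsens the spacing by a factor of 2^16.

```r
# Two ids from the question, stored as doubles (type numeric).
x <- c(2884199399609098249, 2884199399608934409)

# Doubles in [2^e, 2^(e+1)) are spaced 2^(e - 52) apart.
e <- floor(log2(x[1]))   # 61 for values near 2.9e18
ulp <- 2^(e - 52)        # 512: distance between neighbouring doubles here

# Assumption: rounding away the last 2 bytes (16 bits) of the significand
# leaves 36 bits, so the effective spacing grows by a factor of 2^16.
coarse <- ulp * 2^16     # 33554432
gap <- abs(x[1] - x[2])

print(c(ulp = ulp, coarse = coarse, gap = gap))
print(gap > ulp)         # TRUE: the ids are distinct doubles
print(gap < coarse)      # TRUE: yet closer together than the coarse spacing
print(36 * log10(2))     # about 10.8 decimal digits, i.e. "apx 11 s.f."
```

Values closer together than the coarse spacing may be rounded to the same key, which is consistent with the single group that 1.9.2 reports for these ids.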

The example in ?setNumericRounding is:

> DT = data.table(a=seq(0,1,by=0.2),b=1:2, key="a")
> DT
     a b
1: 0.0 1
2: 0.2 2
3: 0.4 1
4: 0.6 2
5: 0.8 1
6: 1.0 2
> setNumericRounding(0)   # turn off rounding; i.e. if we didn't round
> DT[.(0.4)]   # works
     a b
1: 0.4 1
> DT[.(0.6)]   # no match!, confusing to users
     a  b      # 0.6 is clearly there in DT, and 0.4 worked ok!
1: 0.6 NA
>     
> setNumericRounding(2)   # restore default
> DT[.(0.6)]   # now works as user expects
     a b
1: 0.6 2
>     
> # using type 'numeric' for integers > 2^31 (typically ids)
> DT = data.table(id = c(1234567890123, 1234567890124, 1234567890125), val=1:3)
> DT[,.N,by=id]   # 1 row (the last digit has been rounded)
             id N
1: 1.234568e+12 3
> setNumericRounding(0)  # turn off rounding
> DT[,.N,by=id]   # 3 rows (the last digit wasn't rounded)
             id N
1: 1.234568e+12 1
2: 1.234568e+12 1
3: 1.234568e+12 1
>  # but, better to use bit64::integer64 for such ids instead of numeric
>  setNumericRounding(2)  # restore default, preferred
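
The closing comment in the example recommends bit64::integer64 for ids, but the second pair of ids in the question (9999999999999999999 and 9999999999999999998) exceeds the integer64 maximum of 9223372036854775807. For ids of that size, the character option raised in the question is one way to keep every digit. A minimal sketch, assuming data.table is installed:

```r
library(data.table)

# Ids beyond the integer64 range, kept as strings so every digit is preserved.
foo <- data.table(i = c("9999999999999999999", "9999999999999999998"))
res <- foo[, .N, by = i]
print(res)   # two rows, one per id
```

Character keys are compared exactly, so no rounding setting affects this grouping.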
