Large integers in data.table. Grouping results different in 1.9.2 compared to 1.8.10


Question

I recently upgraded data.table from 1.8.10 to 1.9.2, and I found the following difference between the two versions when grouping by large integers.

Is there a setting that I need to change in 1.9.2 so that the first of the following two grouping statements works as it did in 1.8.10 (I presume 1.8.10 shows the correct behavior)?

Also, the results for the second of the two grouping statements are the same in both versions, but is that behavior expected?

1.8.10

>   library(data.table)
data.table 1.8.10  For help type: help("data.table")
>   foo = data.table(i = c(2884199399609098249, 2884199399608934409))
>   lapply(foo, class)
$i
[1] "numeric"

>   foo
                     i
1: 2884199399609098240
2: 2884199399608934400
>   foo[, .N, by=i]
                     i N
1: 2884199399609098240 1
2: 2884199399608934400 1
>   foo = data.table(i = c(9999999999999999999, 9999999999999999998))
>   foo[, .N, by=i]
                      i N
1: 10000000000000000000 2
> 

And 1.9.2

>   library(data.table)
data.table 1.9.2  For help type: help("data.table")
>   foo = data.table(i = c(2884199399609098249, 2884199399608934409))
>   lapply(foo, class)
$i
[1] "numeric"

>   foo
                     i
1: 2884199399609098240
2: 2884199399608934400
>   foo[, .N, by=i]
                     i N
1: 2884199399609098240 2
>   foo = data.table(i = c(9999999999999999999, 9999999999999999998))
>   foo[, .N, by=i]
                      i N
1: 10000000000000000000 2
> 

The numbers used for the first test (the one that shows a difference between the data.table versions) come from my actual dataset, and they are the ones that caused a few of my regression tests to fail after upgrading data.table.

I'm curious about the second test, where I increased the numbers by another order of magnitude: is it expected in both versions of the data.table package that minor differences in the last significant digit are ignored?

I'm assuming this all has to do with floating-point representation. Maybe the correct way for me to handle this is to represent these large integers either as integer64 or character? I'm hesitant to use integer64 as I'm not sure if data.table and the R environment fully support it, e.g., I've had to add this in previous data.table code:

options(datatable.integer64="character") # Until integer64 setkey is implemented

Maybe that has been implemented, but regardless, changing that setting does not change the results of these tests, at least in my environment. I suppose that makes sense given that these values are stored as numeric in the foo data table.
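The floating-point assumption can be checked in base R, without data.table. The sketch below is only an illustration and is not part of the original post. It relies on type numeric being an IEEE 754 double with a 53-bit significand, which means that between 2^61 and 2^62 adjacent doubles are 512 apart. The two ids from the first test differ by 163840 in decimal (320 × 512), so they are stored as two distinct doubles, while the two literals from the second test are closer together than the spacing at their magnitude and parse to the same double. The variable names are chosen here for illustration.

```r
# Base R only; no data.table needed.
x <- c(2884199399609098249, 2884199399608934409)   # first test
y <- c(9999999999999999999, 9999999999999999998)   # second test

# Spacing between adjacent doubles at the magnitude of x[1]
# (53-bit significand, so 52 bits below the leading one).
spacing <- 2^(floor(log2(x[1])) - 52)
print(spacing)                    # 512

print(x[1] == x[2])               # FALSE: two distinct doubles
print((x[1] - x[2]) / spacing)    # how many spacings apart they are

print(y[1] == y[2])               # TRUE: both literals parse to the same double
```

So the second test returns one group in any version because the column already holds two identical values before grouping starts, whereas the first test depends on how the grouping code compares two values that really are different.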

Answer

Yes, the result in v1.8.10 was the correct behaviour. We improved the method of rounding in v1.9.2. That's best explained here:

Grouping very small numbers (e.g. 1e-28) and 0.0 in data.table v1.8.10 vs v1.9.2

That meant we went backwards on supporting integers > 2^31 stored in type numeric. That's now addressed in v1.9.3 (available from R-Forge), see NEWS:

o bit64::integer64 now works in grouping and joins, #5369. Thanks to James Sams for highlighting UPCs and Clayton Stanley.
Reminder: fread() has been able to detect and read integer64 for a while.

o New function setNumericRounding() may be used to reduce to 1 byte or 0 byte rounding when joining to or grouping columns of type numeric, #5369. See example in ?setNumericRounding and NEWS item from v1.9.2. getNumericRounding() returns the current setting.

So you can either call setNumericRounding(0) to switch off rounding globally for all numeric columns, or better, use the more appropriate type for the column: bit64::integer64, now that it's supported.
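Applied to the two ids from the question, the two options might look like the sketch below. It assumes a data.table version in which integer64 grouping is supported (v1.9.3 or later, per the NEWS items above) and that the bit64 package is installed. The names ids, foo_num, foo_i64, by_num and by_i64 are illustrative only. The code saves and restores whatever getNumericRounding() reports rather than assuming a particular default.

```r
library(data.table)
library(bit64)

# Keep the ids as strings until they are converted, so no digits are lost.
ids <- c("2884199399609098249", "2884199399608934409")

# Option 1: keep type numeric, but switch rounding off while grouping.
old <- getNumericRounding()
setNumericRounding(0)
foo_num <- data.table(i = as.numeric(ids))
by_num <- foo_num[, .N, by = i]
setNumericRounding(old)           # restore the previous setting

# Option 2: store the ids as bit64::integer64 (exact 64-bit integers).
foo_i64 <- data.table(i = as.integer64(ids))
by_i64 <- foo_i64[, .N, by = i]

print(nrow(by_num))   # 2
print(nrow(by_i64))   # 2
```

Option 2 also keeps the last digits of each id intact, which type numeric cannot do at this magnitude, so it is the safer choice when the ids are printed or written back out.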

The change in v1.9.2 was:

o Numeric data is still joined and grouped within tolerance as before but instead of tolerance being sqrt(.Machine$double.eps) == 1.490116e-08 (the same as base::all.equal's default) the significand is now rounded to the last 2 bytes, apx 11 s.f. This is more appropriate for large (1.23e20) and small (1.23e-20) numerics and is faster via a simple bit twiddle. A few functions provided a 'tolerance' argument but this wasn't being passed through so has been removed. We aim to add a global option (e.g. 2, 1 or 0 byte rounding) in a future release [DONE].
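To see why ignoring the last 2 bytes of the significand merges the ids from the question, the effect can be imitated in base R. This is a rough sketch only and not data.table's actual code: drop_low_bytes() is a hypothetical helper written for this illustration, and it truncates (zeroes) the low bytes where the NEWS item describes rounding. It assumes doubles are 8-byte IEEE 754 values.

```r
# Hypothetical helper, not part of data.table: zero the lowest `bytes`
# bytes of each double's significand (truncation, a simplified stand-in
# for the rounding described in the NEWS item).
drop_low_bytes <- function(x, bytes = 2L) {
  vapply(x, function(v) {
    b <- writeBin(v, raw(), endian = "little")   # 8 bytes, least significant first
    if (bytes > 0L) b[seq_len(bytes)] <- as.raw(0)
    readBin(b, what = "double", endian = "little")
  }, numeric(1))
}

# The ids from the question are distinct doubles ...
x <- c(2884199399609098249, 2884199399608934409)
print(x[1] == x[2])                          # FALSE

# ... but they coincide once the last 2 bytes are ignored.
r <- drop_low_bytes(x, 2L)
print(r[1] == r[2])                          # TRUE

# The ids from ?setNumericRounding: merged at 2 bytes, distinct at 1 byte.
ids <- c(1234567890123, 1234567890124, 1234567890125)
print(length(unique(drop_low_bytes(ids, 2L))))   # 1
print(length(unique(drop_low_bytes(ids, 1L))))   # 3
```

Two bytes cover 65536 adjacent doubles, and the ids from the question are only a few hundred doubles apart, which is why they fall into the same group under the 2-byte setting.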

The example in ?setNumericRounding is:

> DT = data.table(a=seq(0,1,by=0.2),b=1:2, key="a")
> DT
     a b
1: 0.0 1
2: 0.2 2
3: 0.4 1
4: 0.6 2
5: 0.8 1
6: 1.0 2
> setNumericRounding(0)   # turn off rounding; i.e. if we didn't round
> DT[.(0.4)]   # works
     a b
1: 0.4 1
> DT[.(0.6)]   # no match!, confusing to users
     a  b      # 0.6 is clearly there in DT, and 0.4 worked ok!
1: 0.6 NA
>     
> setNumericRounding(2)   # restore default
> DT[.(0.6)]   # now works as user expects
     a b
1: 0.6 2
>     
> # using type 'numeric' for integers > 2^31 (typically ids)
> DT = data.table(id = c(1234567890123, 1234567890124, 1234567890125), val=1:3)
> DT[,.N,by=id]   # 1 row (the last digit has been rounded)
             id N
1: 1.234568e+12 3
> setNumericRounding(0)  # turn off rounding
> DT[,.N,by=id]   # 3 rows (the last digit wasn't rounded)
             id N
1: 1.234568e+12 1
2: 1.234568e+12 1
3: 1.234568e+12 1
>  # but, better to use bit64::integer64 for such ids instead of numeric
>  setNumericRounding(2)  # restore default, preferred
