R's data.table Truncating Bits?

Question

So I'm a huge data.table fan in R. I use it almost all the time, but I have come across a situation in which it won't work for me at all. I have a package (internal to my company) that uses R's double to store the value of an unsigned 64-bit integer whose bit sequence corresponds to some fancy encoding. This package works very nicely everywhere except in data.table. I found that if I aggregate on a column of this data, I lose a large number of my unique values. My only guess here is that data.table is truncating bits in some kind of weird double optimization.

Can anyone confirm that this is the case? Is this simply a bug?
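
For reference, here is a minimal sketch of the kind of encoding involved; this is a reconstruction for illustration, not the actual internal package. It reinterprets the 8 bytes of a 64-bit integer as an IEEE 754 double; integers below 2^52 land in the subnormal range, which is why all the values in temp below sit near 6.95e-309.

library(bit64)

# Sketch only: reinterpret an integer64's bit pattern as a double.
# integer64 stores its 64 bits inside a double's 8 bytes, so unclass()
# exposes the raw bit pattern, writeBin() serialises those bytes, and
# readBin() reads the same bytes back as an IEEE 754 double.
bits_as_double <- function(x) {
  readBin(writeBin(unclass(as.integer64(x)), raw()), "double")
}

bits_as_double("1407654000000000")  # ~6.95e-309, the same scale as the data below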

Below is a reproduction of the issue and a comparison to the package I currently must use but want to avoid with a passion (dplyr).

library(data.table)

temp <- structure(list(obscure_math = c(6.95476896592629e-309, 6.95476863436446e-309, 
6.95476743245288e-309, 6.95476942182375e-309, 6.95477149408563e-309, 
6.95477132830476e-309, 6.95477132830476e-309, 6.95477149408562e-309, 
6.95477174275702e-309, 6.95476880014538e-309, 6.95476896592647e-309, 
6.95476896592647e-309, 6.95476900737172e-309, 6.95476900737172e-309, 
6.95476946326899e-309, 6.95476958760468e-309, 6.95476958760468e-309, 
6.95477020928318e-309, 6.95477124541406e-309, 6.95476859291965e-309, 
6.95476875870014e-309, 6.95476904881676e-309, 6.95476904881676e-309, 
6.95476904881676e-309, 6.95476909026199e-309, 6.95476909026199e-309, 
6.95476909026199e-309, 6.95476909026199e-309, 6.9547691317072e-309, 
6.9547691317072e-309, 6.9547691317072e-309, 6.9547691317072e-309, 
6.9547691317072e-309, 6.9547691317072e-309, 6.9547691317072e-309, 
6.9547691317072e-309, 6.9547691317072e-309, 6.9547691317072e-309, 
6.9547691317072e-309, 6.9547691317072e-309, 6.95477211576406e-309, 
6.95476880014538e-309, 6.95476880014538e-309, 6.95476880014538e-309, 
6.95476892448104e-309, 6.95476880014538e-309, 6.95476892448105e-309, 
6.9547689659263e-309, 6.95476913170719e-309, 6.95476933893334e-309
)), .Names = "obscure_math", class = c("data.table", "data.frame"), row.names = c(NA, 
-50L))

dt_collapsed <- temp[, .(count=.N), by=obscure_math]
nrow(dt_collapsed) == length(unique(temp$obscure_math))
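# FALSE with data.table v1.9.6 and earlier: the default rounding of the
# mantissa's last 2 bytes collapses distinct bit patterns into one group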

setDF(temp)
library(dplyr)
dplyr_collapsed <- temp %>% group_by(obscure_math) %>% summarise(count = n())
nrow(dplyr_collapsed) == length(unique(temp$obscure_math))
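# TRUE: dplyr compares the full bit patterns, so no unique values are lost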


Answer

Update: the default rounding feature has been removed in the current development version of data.table (v1.9.7). See installation instructions for the devel version here.

This also means that you're responsible for understanding the limitations of floating point representation and for dealing with them.
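
As a quick illustration of those limitations (a standard example, nothing specific to data.table):

0.1 + 0.2 == 0.3               # FALSE: neither side is exactly representable
sprintf("%.20f", 0.1 + 0.2)    # "0.30000000000000004441"
sprintf("%.20f", 0.3)          # "0.29999999999999998890"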

data.table has been around for a long time. We used to deal with the limitations of floating point representation by using a threshold (like base R does, e.g., in all.equal). However, that simply does not work, since the threshold needs to adapt to how big the compared numbers are. This series of articles is an excellent read on this topic and other potential issues.
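
Here is a quick sketch of why a single fixed threshold cannot work: the gap between adjacent representable doubles grows with their magnitude, so any absolute epsilon is too coarse for small values and too fine for large ones.

eps <- 1e-8
abs(1e-9 - 2e-9) < eps        # TRUE: two genuinely different small values would merge
abs(1e9 - (1e9 + 0.1)) < eps  # FALSE: even though these agree to 10 significant digits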

This was a recurring issue, because a) people don't realise the limitations, or b) thresholding did not really help their problem, and as a result people kept asking here or posting on the project page.

While we reimplemented data.table's ordering to use fast radix ordering, we took the opportunity to provide an alternative way of fixing the issue, along with a way out if it proves undesirable: the exported setNumericRounding(). With issue #1642, ordering probably doesn't need the rounding of doubles (but it's not that simple, since ordering directly affects binary search based subsets).
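
The knob itself looks like this (2-byte rounding was the default up to v1.9.6; as noted above, the v1.9.7 devel version turns it off):

library(data.table)
getNumericRounding()    # how many of the mantissa's low bytes get rounded off
setNumericRounding(2L)  # old default: round the last 2 bytes before ordering/grouping
setNumericRounding(0L)  # no rounding: order/group on exact bit patterns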

The actual problem here is grouping on floating point numbers, and it is even worse with numbers like those in your case. That is just a bad choice, IMHO.

I can think of two ways forward:

  1. When grouping on a column that is really of type double (in R, 1 is a double as opposed to 1L; the latter cases have no issues), issue a warning that the last 2 bytes are rounded off and that people should read ?setNumericRounding, and also suggest using bit64::integer64.

  2. Remove the functionality that allows grouping operations on really double values, or force users to fix the precision to a certain number of digits before continuing (see the sketch just after this list). I can't think of a valid reason why one would really want to group by floating point numbers (would love to hear from people who do).
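
As a sketch of what option 2 would look like in user code, rounding to an explicit precision before grouping (plain round(), not a data.table-specific API):

library(data.table)
dt <- data.table(x = c(1/3, 0.3333333333, 0.1 + 0.2, 0.3))
dt[, .N, by = .(x6 = round(x, 6))]  # 2 groups: 0.333333 (N = 2) and 0.3 (N = 2)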

What is very unlikely to happen is going back to threshold-based checks for identifying which doubles should belong to the same group.

Just so that the Q remains answered, use setNumericRounding(0L).
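
Applied to the example from the question (converting temp back to a data.table after the setDF() call above):

library(data.table)
setNumericRounding(0L)  # group on exact bit patterns, no mantissa rounding
setDT(temp)
dt_collapsed <- temp[, .(count = .N), by = obscure_math]
nrow(dt_collapsed) == length(unique(temp$obscure_math))  # TRUE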
