data.table inner/outer join with NA in join column of type double: bug?
Problem description
Following the Wikipedia article on SQL joins, I wanted a clear view of how to do joins with data.table. In the process we might have uncovered a bug when joining on NAs. Taking the wiki example:
R) X = data.table(name=c("Raf","Jon","Ste","Rob","Smi","Joh"),depID=c(31,33,33,34,34,NA),key="depID")
R) Y = data.table(depID=c(31,33,34,35),depName=c("Sal","Eng","Cle","Mar"),key="depID")
R) X
name depID
1: Joh NA
2: Raf 31
3: Jon 33
4: Ste 33
5: Rob 34
6: Smi 34
R) Y
depID depName
1: 31 Sal
2: 33 Eng
3: 34 Cle
4: 35 Mar
Left outer join
R) merge.data.frame(X,Y,all.x=TRUE)
depID name depName
1 31 Raf Sal
2 33 Jon Eng
3 33 Ste Eng
4 34 Rob Cle
5 34 Smi Cle
6 NA Joh <NA>
merge.data.table does not output the same result and shows what I think is a bug on line 2.
R) merge(X,Y,all.x=TRUE)
depID name depName
1: NA Joh Eng
2: 31 Raf NA
3: 33 Jon Eng
4: 33 Ste Eng
5: 34 Rob Cle
6: 34 Smi Cle
R) Y[X] #same -> :(
depID depName name
1: NA Eng Joh
2: 31 NA Raf
3: 33 Eng Jon
4: 33 Eng Ste
5: 34 Cle Rob
6: 34 Cle Smi
Right outer join: the same problem, it seems
R) merge.data.frame(X,Y,all.y=TRUE)
depID name depName
1 31 Raf Sal
2 33 Jon Eng
3 33 Ste Eng
4 34 Rob Cle
5 34 Smi Cle
6 35 <NA> Mar
R) merge(X,Y,all.y=TRUE)
depID name depName
1: NA Joh Eng
2: 31 NA Sal
3: 33 Jon Eng
4: 33 Ste Eng
5: 34 Rob Cle
6: 34 Smi Cle
7: 35 NA Mar
Inner (natural) join
R) merge.data.frame(X,Y)
depID name depName
1 31 Raf Sal
2 33 Jon Eng
3 33 Ste Eng
4 34 Rob Cle
5 34 Smi Cle
R) merge(X,Y)
depID name depName
1: NA Joh Eng
2: 33 Jon Eng
3: 33 Ste Eng
4: 34 Rob Cle
5: 34 Smi Cle
Recommended answer
Yes, it looks like an (embarrassing) new bug related to the NA in key. There have been other discussions about NA in key not being possible, but I didn't realise it could mess up in that way. Will investigate. Thanks ...
#2453 NA in a double key column messes up joins (NA in integer and character are ok)
Now fixed in 1.8.7 (commit 780), from NEWS:
NA in a join column of type double could cause both X[Y] and merge(X,Y) to return incorrect results, #2453. Due to an errant x==NA_REAL in the C source which should have been ISNA(x). Support for double in keyed joins is a relatively recent addition to data.table, but embarrassing all the same. Fixed and tests added. Many thanks to statquant for the thorough and reproducible report.
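The reason an equality test against NA cannot work can be seen at the R level too. This is only an analogue, not the data.table source: in C, NA_REAL is a NaN value, so `x == NA_REAL` is always false, while in R the same comparison returns NA rather than TRUE. In both cases a dedicated missingness test is needed (`ISNA(x)` in C, `is.na(x)` in R). The variable names below are illustrative.

```r
x <- c(31, 33, NA)       # double vector with one missing value

eq   <- x == NA_real_    # comparing with NA never yields TRUE
isna <- is.na(x)         # explicit missingness test

print(eq)                # every element is NA, including the last one
print(isna)              # only the last element is flagged as missing
```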