data.table inner/outer join with NA in join column of type double bug?
Question
Following the Wikipedia article on SQL joins, I wanted a clear view of how joins work with data.table. In the process we might have uncovered a bug when joining with NAs. Taking the wiki example:
R) X = data.table(name=c("Raf","Jon","Ste","Rob","Smi","Joh"),depID=c(31,33,33,34,34,NA),key="depID")
R) Y = data.table(depID=c(31,33,34,35),depName=c("Sal","Eng","Cle","Mar"),key="depID")
R) X
name depID
1: Joh NA
2: Raf 31
3: Jon 33
4: Ste 33
5: Rob 34
6: Smi 34
R) Y
depID depName
1: 31 Sal
2: 33 Eng
3: 34 Cle
4: 35 Mar
LEFT OUTER JOIN
R) merge.data.frame(X,Y,all.x=TRUE)
depID name depName
1 31 Raf Sal
2 33 Jon Eng
3 33 Ste Eng
4 34 Rob Cle
5 34 Smi Cle
6 NA Joh <NA>
merge.data.table does not output the same result and shows what I think is a bug on line 2.
R) merge(X,Y,all.x=TRUE)
depID name depName
1: NA Joh Eng
2: 31 Raf NA
3: 33 Jon Eng
4: 33 Ste Eng
5: 34 Rob Cle
6: 34 Smi Cle
R) Y[X] #same -> :(
depID depName name
1: NA Eng Joh
2: 31 NA Raf
3: 33 Eng Jon
4: 33 Eng Ste
5: 34 Cle Rob
6: 34 Cle Smi
RIGHT OUTER JOIN
It looks like the same issue.
R) merge.data.frame(X,Y,all.y=TRUE)
depID name depName
1 31 Raf Sal
2 33 Jon Eng
3 33 Ste Eng
4 34 Rob Cle
5 34 Smi Cle
6 35 <NA> Mar
R) merge(X,Y,all.y=TRUE)
depID name depName
1: NA Joh Eng
2: 31 NA Sal
3: 33 Jon Eng
4: 33 Ste Eng
5: 34 Rob Cle
6: 34 Smi Cle
7: 35 NA Mar
INNER (NATURAL) JOIN
R) merge.data.frame(X,Y)
depID name depName
1 31 Raf Sal
2 33 Jon Eng
3 33 Ste Eng
4 34 Rob Cle
5 34 Smi Cle
R) merge(X,Y)
depID name depName
1: NA Joh Eng
2: 33 Jon Eng
3: 33 Ste Eng
4: 34 Rob Cle
5: 34 Smi Cle
Recommended answer
Yes, it looks like an (embarrassing) new bug related to NA in the key. There have been other discussions about NA in a key not being possible, but I didn't realise it could mess up in that way. Will investigate. Thanks ...
Now fixed in 1.8.7 (commit 780), from NEWS:
NA in a join column of type double could cause both X[Y] and merge(X,Y) to return incorrect results, #2453. Due to an errant x==NA_REAL in the C source which should have been ISNA(x). Support for double in keyed joins is a relatively recent addition to data.table, but embarrassing all the same. Fixed and tests added. Many thanks to statquant for the thorough and reproducible report.
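The distinction the NEWS entry draws has an R-level analogue, sketched here. This is not the data.table C source. In C, NA_REAL is a NaN, so x == NA_REAL is false for every x. In R, the same comparison returns NA rather than TRUE. Either way, an equality test cannot detect a missing double, and a dedicated check is needed (ISNA in C, is.na in R). The vector depID is illustrative only.

```r
# Illustrative values, including one missing double.
depID <- c(31, 33, NA)

# Equality against NA never yields TRUE: every element compares to NA.
eq <- depID == NA_real_
print(eq)    # NA NA NA

# is.na() is the reliable test for a missing value.
isna <- is.na(depID)
print(isna)  # FALSE FALSE TRUE
```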