data.table inner/outer join with NA in join column of type double: bug?


Question

Following the Wikipedia article on SQL joins, I wanted a clear view of how to do the same joins with data.table. In the process we may have uncovered a bug when joining on a column that contains NAs. Taking the wiki example:

R) X = data.table(name=c("Raf","Jon","Ste","Rob","Smi","Joh"),depID=c(31,33,33,34,34,NA),key="depID")
R) Y = data.table(depID=c(31,33,34,35),depName=c("Sal","Eng","Cle","Mar"),key="depID")
R) X
   name depID
1:  Joh    NA
2:  Raf    31
3:  Jon    33
4:  Ste    33
5:  Rob    34
6:  Smi    34
R) Y
   depID depName
1:    31     Sal
2:    33     Eng
3:    34     Cle
4:    35     Mar

Left outer join

R) merge.data.frame(X,Y,all.x=TRUE)
  depID name depName
1    31  Raf     Sal
2    33  Jon     Eng
3    33  Ste     Eng
4    34  Rob     Cle
5    34  Smi     Cle
6    NA  Joh    <NA>

merge.data.table does not output the same result and shows what I think is a bug on line 2.

R) merge(X,Y,all.x=TRUE)
   depID name depName
1:    NA  Joh     Eng
2:    31  Raf      NA
3:    33  Jon     Eng
4:    33  Ste     Eng
5:    34  Rob     Cle
6:    34  Smi     Cle
R) Y[X] #same -> :(
   depID depName name
1:    NA     Eng  Joh
2:    31      NA  Raf
3:    33     Eng  Jon
4:    33     Eng  Ste
5:    34     Cle  Rob
6:    34     Cle  Smi

Right outer join seems the same

R) merge.data.frame(X,Y,all.y=TRUE)
  depID name depName
1    31  Raf     Sal
2    33  Jon     Eng
3    33  Ste     Eng
4    34  Rob     Cle
5    34  Smi     Cle
6    35 <NA>     Mar

R) merge(X,Y,all.y=TRUE)
   depID name depName
1:    NA  Joh     Eng
2:    31   NA     Sal
3:    33  Jon     Eng
4:    33  Ste     Eng
5:    34  Rob     Cle 
6:    34  Smi     Cle
7:    35   NA     Mar

Inner (natural) join

R) merge.data.frame(X,Y)
  depID name depName
1    31  Raf     Sal
2    33  Jon     Eng
3    33  Ste     Eng
4    34  Rob     Cle
5    34  Smi     Cle
R) merge(X,Y)
   depID name depName
1:    NA  Joh     Eng
2:    33  Jon     Eng
3:    33  Ste     Eng
4:    34  Rob     Cle
5:    34  Smi     Cle

Recommended answer

Yes, it looks like an (embarrassing) new bug related to the NA in the key. There have been other discussions about NA in a key not being possible, but I didn't realise it could mess things up in that way. Will investigate. Thanks ...

#2453 NA in double key column messes up joins (NA in integer and character ok)

Now fixed in 1.8.7 (commit 780), from NEWS:

NA in a join column of type double could cause both X[Y] and merge(X,Y) to return incorrect results, #2453. Due to an errant x==NA_REAL in the C source which should have been ISNA(x). Support for double in keyed joins is a relatively recent addition to data.table, but embarrassing all the same. Fixed and tests added. Many thanks to statquant for the thorough and reproducible report.
