Non-joins with data.tables
Question
I have a question about the data.table idiom for "non-joins", inspired by Iterator's question. Here is an example:
library(data.table)
dt1 <- data.table(A1=letters[1:10], B1=sample(1:5,10, replace=TRUE))
dt2 <- data.table(A2=letters[c(1:5, 11:15)], B2=sample(1:5,10, replace=TRUE))
setkey(dt1, A1)
setkey(dt2, A2)
The two data.tables look like this:
> dt1              > dt2
      A1 B1              A2 B2
 [1,]  a  1         [1,]  a  2
 [2,]  b  4         [2,]  b  5
 [3,]  c  2         [3,]  c  2
 [4,]  d  5         [4,]  d  1
 [5,]  e  1         [5,]  e  1
 [6,]  f  2         [6,]  k  5
 [7,]  g  3         [7,]  l  2
 [8,]  h  3         [8,]  m  4
 [9,]  i  2         [9,]  n  1
[10,]  j  4        [10,]  o  1
To find which rows in dt2 have the same key in dt1, set the which option to TRUE:
> dt1[dt2, which=TRUE]
[1] 1 2 3 4 5 NA NA NA NA NA
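The NA entries mark the rows of dt2 whose key has no match in dt1. As a rough base R analogue (a sketch only, not how data.table implements the join), match() reports unmatched elements the same way and also accepts a nomatch argument. The vectors keys1 and keys2 are illustrative names that mirror the key columns above.

```r
# Keys mirroring dt1$A1 and dt2$A2 from the example above
keys1 <- letters[1:10]
keys2 <- letters[c(1:5, 11:15)]

# Position in keys1 of each element of keys2; NA where there is no match
idx_na <- match(keys2, keys1)

# Same lookup, but unmatched elements are reported as 0 instead of NA
idx_zero <- match(keys2, keys1, nomatch = 0L)

print(idx_na)
print(idx_zero)
```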
Matthew suggested in this answer the following "non-join" idiom
dt1[-dt1[dt2, which=TRUE]]
to subset dt1 to those rows whose key does not appear in dt2. On my machine, with data.table v1.7.1, I get an error:
Error in `[.default`(x[[s]], irows): only 0's may be mixed with negative subscripts
With the option nomatch=0, however, the "non-join" works:
> dt1[-dt1[dt2, which=TRUE, nomatch=0]]
     A1 B1
[1,]  f  2
[2,]  g  3
[3,]  h  3
[4,]  i  2
[5,]  j  4
Is this the intended behavior?
Recommended answer
As far as I know, this behavior comes from base R rather than from data.table.
# This works
(1:4)[c(-2,-3)]
# But this gives you the same error you described above
(1:4)[c(-2, -3, NA)]
# Error in (1:4)[c(-2, -3, NA)] :
# only 0's may be mixed with negative subscripts
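The wording of the message also hints at why nomatch=0 helps: zeros are allowed alongside negative subscripts and are simply ignored. The sketch below shows that in base R, together with dropping the NA by hand; the variable names are illustrative only.

```r
x <- 1:4

# Zeros are ignored in an index vector, so mixing them with negatives is fine
with_zero <- x[c(-2, -3, 0)]

# Alternatively, drop the NA explicitly before indexing
idx <- c(-2, -3, NA)
without_na <- x[idx[!is.na(idx)]]

# The unfiltered index still fails; capture the message instead of stopping
failure <- tryCatch(x[idx], error = function(e) conditionMessage(e))

print(with_zero)
print(without_na)
print(failure)
```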
The text of the error message suggests that this is intended behavior.
Here's my best guess as to why that is the intended behavior:
From the way they treat NAs elsewhere (e.g. typically defaulting to na.rm=FALSE), it seems that R's designers view NAs as carrying important information, and are loath to drop that information without an explicit instruction to do so. (Fortunately, setting nomatch=0 gives you a clean way to pass that instruction along!)
In this context, the designers' preference probably explains why NAs are accepted for positive indexing, but not for negative indexing:
# Positive indexing: works, because the return value retains info about NA's
(1:4)[c(2,3,NA)]
# Negative indexing: doesn't work, because it can't easily retain such info
(1:4)[c(-2,-3,NA)]
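For the original data.table problem, later releases (1.8.3 onward, as far as I am aware, so not the v1.7.1 used in the question) added a dedicated not-join syntax, X[!Y], which avoids the negative index altogether. The sketch below assumes such a version is installed and compares it with the nomatch=0 idiom; set.seed() is added here only to make the sampled columns reproducible.

```r
library(data.table)

set.seed(1)  # not in the original example; only for reproducibility
dt1 <- data.table(A1 = letters[1:10], B1 = sample(1:5, 10, replace = TRUE))
dt2 <- data.table(A2 = letters[c(1:5, 11:15)], B2 = sample(1:5, 10, replace = TRUE))
setkey(dt1, A1)
setkey(dt2, A2)

# Idiom from the question: unmatched rows of dt2 are dropped from the index
old_idiom <- dt1[-dt1[dt2, which = TRUE, nomatch = 0]]

# Not-join syntax: rows of dt1 whose key has no match in dt2
not_join <- dt1[!dt2]

print(old_idiom)
print(not_join)
```

One caveat about the negative-index form: if dt2 matched no rows at all, the inner call would yield an empty index, and in base R x[-integer(0)] selects nothing rather than everything, so the not-join form is probably the safer choice where it is available.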