Non-joins with data.tables

Question

I have a question about the data.table idiom for "non-joins", inspired by Iterator's question. Here is an example:

library(data.table)

dt1 <- data.table(A1=letters[1:10], B1=sample(1:5,10, replace=TRUE))
dt2 <- data.table(A2=letters[c(1:5, 11:15)], B2=sample(1:5,10, replace=TRUE))

setkey(dt1, A1)
setkey(dt2, A2)

The data.tables look like this:

> dt1               > dt2
      A1 B1               A2 B2
 [1,]  a  1          [1,]  a  2
 [2,]  b  4          [2,]  b  5
 [3,]  c  2          [3,]  c  2
 [4,]  d  5          [4,]  d  1
 [5,]  e  1          [5,]  e  1
 [6,]  f  2          [6,]  k  5
 [7,]  g  3          [7,]  l  2
 [8,]  h  3          [8,]  m  4
 [9,]  i  2          [9,]  n  1
[10,]  j  4         [10,]  o  1

To find, for each row of dt2, the row of dt1 that has the same key (NA where there is no match), set the which option to TRUE:

> dt1[dt2, which=TRUE]
[1]  1  2  3  4  5 NA NA NA NA NA

Matthew suggested in this answer that the "non-join" idiom

dt1[-dt1[dt2, which=TRUE]]

subsets dt1 to those rows whose key values don't appear in dt2. On my machine, with data.table v1.7.1, I get an error instead:

Error in `[.default`(x[[s]], irows): only 0's may be mixed with negative subscripts

With the option nomatch=0, however, the "non-join" works:

> dt1[-dt1[dt2, which=TRUE, nomatch=0]]
     A1 B1
[1,]  f  2
[2,]  g  3
[3,]  h  3
[4,]  i  2
[5,]  j  4

Is this intended behavior?

Recommended answer

As far as I know, this behavior comes from base R rather than from data.table.

# This works
(1:4)[c(-2,-3)]

# But this gives you the same error you described above
(1:4)[c(-2, -3, NA)]
# Error in (1:4)[c(-2, -3, NA)] : 
#   only 0's may be mixed with negative subscripts

The wording of the error message indicates that this is intended behavior.
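The message's wording can be checked directly in base R: a zero mixed with negative subscripts is silently ignored, while an NA is rejected. A minimal sketch (the variable names are only for illustration):

```r
x <- 1:4

# A zero alongside negative subscripts is ignored, so this drops elements 2 and 3
with_zero <- x[c(-2, -3, 0)]
print(with_zero)

# An NA alongside negative subscripts raises an error; capture its message
with_na <- tryCatch(x[c(-2, -3, NA)], error = function(e) conditionMessage(e))
print(with_na)
```

This is consistent with nomatch=0 making the idiom from the question work: unmatched rows contribute 0 instead of NA to the index vector.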

Here's my best guess as to why that is the intended behavior:

From the way they treat NA's elsewhere (e.g. typically defaulting to na.rm=FALSE), it seems that R's designers view NA's as carrying important information, and are loath to drop that without some explicit instruction to do so. (Fortunately, setting nomatch=0 gives you a clean way to pass that instruction along!)

In this context, the designers' preference probably explains why NA's are accepted for positive indexing, but not for negative indexing:

# Positive indexing: works, because the return value retains info about NA's
(1:4)[c(2,3,NA)]
# [1]  2  3 NA

# Negative indexing: doesn't work, because it can't easily retain such info
(1:4)[c(-2,-3,NA)]
# Error: only 0's may be mixed with negative subscripts
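The same contrast can be reproduced without data.table. The sketch below uses base R's match() as a stand-in for the keyed lookup dt1[dt2, which=TRUE]; this is an assumption made for illustration (match() also has a nomatch argument), and keys1 and keys2 are hypothetical vectors mirroring the key columns in the question:

```r
keys1 <- letters[1:10]            # stand-in for dt1's key column A1
keys2 <- letters[c(1:5, 11:15)]   # stand-in for dt2's key column A2

# Default nomatch is NA: unmatched keys yield NA positions
idx_na <- match(keys2, keys1)
# Negating keeps the NAs, so the subset fails; record that it errored
res_na <- tryCatch(keys1[-idx_na], error = function(e) "error")

# nomatch = 0: unmatched keys yield 0, which negative indexing ignores
idx_zero <- match(keys2, keys1, nomatch = 0)
res_zero <- keys1[-idx_zero]
print(res_zero)
```

One caveat with this style of indexing in base R: if no keys match at all, the index vector contains only zeros and the subset comes back empty rather than complete, so a logical filter such as keys1[!(keys1 %in% keys2)] may be the safer choice.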
