Why does X[Y] join of data.tables not allow a full outer join, or a left join?
Question
This is a bit of a philosophical question about data.table join syntax. I am finding more and more uses for data.tables, but I am still learning...
The join format X[Y] for data.tables is very concise, handy and efficient, but as far as I can tell, it only supports inner joins and right outer joins. To get a left or full outer join, I need to use merge:
- X[Y, nomatch = NA] -- all rows in Y -- right outer join (default)
- X[Y, nomatch = 0] -- only rows with matches in both X and Y -- inner join
- merge(X, Y, all = TRUE) -- all rows from both X and Y -- full outer join
- merge(X, Y, all.x = TRUE) -- all rows in X -- left outer join
It seems to me that it would be handy if the X[Y] join format supported all 4 types of joins. Is there a reason only two types of joins are supported?
For me, the nomatch = 0 and nomatch = NA parameter values are not very intuitive for the actions being performed. It is easier for me to understand and remember the merge syntax: all = TRUE, all.x = TRUE and all.y = TRUE. Since the X[Y] operation resembles merge much more than match, why not use the merge syntax for joins rather than the match function's nomatch parameter?
Here are code examples of the 4 join types:
# sample X and Y data.tables
library(data.table)
X <- data.table(t = 1:4, a = (1:4)^2)
setkey(X, t)
X
# t a
# 1: 1 1
# 2: 2 4
# 3: 3 9
# 4: 4 16
Y <- data.table(t = 3:6, b = (3:6)^2)
setkey(Y, t)
Y
# t b
# 1: 3 9
# 2: 4 16
# 3: 5 25
# 4: 6 36
# all rows from Y - right outer join
X[Y] # default
# t a b
# 1: 3 9 9
# 2: 4 16 16
# 3: 5 NA 25
# 4: 6 NA 36
X[Y, nomatch = NA] # same as above
# t a b
# 1: 3 9 9
# 2: 4 16 16
# 3: 5 NA 25
# 4: 6 NA 36
merge(X, Y, by = "t", all.y = TRUE) # same as above
# t a b
# 1: 3 9 9
# 2: 4 16 16
# 3: 5 NA 25
# 4: 6 NA 36
identical(X[Y], merge(X, Y, by = "t", all.y = TRUE))
# [1] TRUE
# only rows in both X and Y - inner join
X[Y, nomatch = 0]
# t a b
# 1: 3 9 9
# 2: 4 16 16
merge(X, Y, by = "t") # same as above
# t a b
# 1: 3 9 9
# 2: 4 16 16
merge(X, Y, by = "t", all = FALSE) # same as above
# t a b
# 1: 3 9 9
# 2: 4 16 16
identical( X[Y, nomatch = 0], merge(X, Y, by = "t", all = FALSE) )
# [1] TRUE
# all rows from X - left outer join
merge(X, Y, by = "t", all.x = TRUE)
# t a b
# 1: 1 1 NA
# 2: 2 4 NA
# 3: 3 9 9
# 4: 4 16 16
# all rows from both X and Y - full outer join
merge(X, Y, by = "t", all = TRUE)
# t a b
# 1: 1 1 NA
# 2: 2 4 NA
# 3: 3 9 9
# 4: 4 16 16
# 5: 5 NA 25
# 6: 6 NA 36
Update: data.table v1.9.6 introduced the on= syntax, which allows ad hoc joins on fields other than the primary key. jangorecki's answer to the question "How to join (merge) data frames (inner, outer, left, right)?" provides some examples of additional join types that data.table can handle.
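As a rough illustration of the on= syntax mentioned in the update, the sketch below repeats three of the joins on unkeyed copies of the sample tables. It assumes data.table v1.9.6 or later; the result names (right_join, inner_join, left_join) are made up for this example.

```r
library(data.table)

# Unkeyed tables: no setkey() call, the join column is named in on= instead
X <- data.table(t = 1:4, a = (1:4)^2)
Y <- data.table(t = 3:6, b = (3:6)^2)

right_join <- X[Y, on = "t"]               # all rows of Y
inner_join <- X[Y, on = "t", nomatch = 0]  # only rows matched in both
left_join  <- Y[X, on = "t"]               # all rows of X; columns come out as t, b, a
```

A full outer join still has no direct X[Y] form here, so merge(X, Y, by = "t", all = TRUE) remains the simplest route for that case.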
Recommended answer
1.12 What is the difference between X[Y] and merge(X,Y)?
- X[Y] is a join, looking up X's rows using Y (or Y's key if it has one) as an index.
- Y[X] is a join, looking up Y's rows using X (or X's key if it has one) as an index.
- merge(X,Y) does both ways at the same time.
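To make the asymmetry concrete, here is a small sketch with two hypothetical keyed tables, A and B, of different sizes. It assumes unique key values and the default nomatch, so each join returns one row per row of the table used as the index.

```r
library(data.table)

# Hypothetical tables: A has 2 rows, B has 5; only id = 2 appears in both
A <- data.table(id = 1:2, x = c("a", "b"), key = "id")
B <- data.table(id = 2:6, y = 2:6 * 10, key = "id")

n_ab <- nrow(A[B])         # look up A's rows using B: one row per row of B
n_ba <- nrow(B[A])         # look up B's rows using A: one row per row of A
n_m1 <- nrow(merge(A, B))  # inner join on id
n_m2 <- nrow(merge(B, A))  # same rows, different column order
```

Under these assumptions n_ab and n_ba differ, while the two merge() calls return the same number of rows.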
The number of rows of X[Y] and Y[X] usually differ, whereas the number of rows returned by merge(X,Y) and merge(Y,X) is the same. BUT that misses the main point. Most tasks require something to be done on the data after a join or merge. Why merge all the columns of data, only to use a small subset of them afterwards? You may suggest merge(X[,ColsNeeded1],Y[,ColsNeeded2]), but that takes copies of the subsets of data, and it requires the programmer to work out which columns are needed. X[Y,j] in data.table does all that in one step for you. When you write X[Y,sum(foo*bar)], data.table automatically inspects the j expression to see which columns it uses. It will subset only those columns; the others are ignored. Memory is only created for the columns that j uses, and Y columns enjoy standard R recycling rules within the context of each group. Let's say foo is in X, and bar is in Y (along with 20 other columns in Y). Isn't X[Y,sum(foo*bar)] quicker to program and quicker to run than a merge followed by a subset?
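The comparison in that paragraph can be sketched as follows. The tables, the unused column and the result names are invented for illustration, and nomatch = 0 is used so that the unmatched row of Y does not turn the sum into NA. The by = .EACHI form assumes a data.table version recent enough to support it.

```r
library(data.table)

# foo lives in X, bar lives in Y; 'unused' stands in for Y's other columns
X <- data.table(id = 1:3, foo = c(1, 2, 3), key = "id")
Y <- data.table(id = 2:4, bar = c(10, 20, 30), unused = letters[1:3], key = "id")

# Join and aggregate in one step; j only touches foo and bar
one_step <- X[Y, sum(foo * bar), nomatch = 0]

# Merge every column first, then compute on the merged result
merged   <- merge(X, Y, by = "id")
two_step <- sum(merged$foo * merged$bar)

# Evaluate j once per row of Y instead of once overall
per_row <- X[Y, .(total = sum(foo * bar)), by = .EACHI, nomatch = 0]
```

Both one_step and two_step arrive at the same total, but the first form never materialises the unused column in a merged table.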
If you want a left outer join of X[Y]:
le <- Y[X]
mallx <- merge(X, Y, all.x = TRUE)
# the column order is different so change to be the same as `merge`
setcolorder(le, names(mallx))
identical(le, mallx)
# [1] TRUE
If you want a full outer join:
# the unique values for the keys over both data sets
unique_keys <- unique(c(X[,t], Y[,t]))
Y[X[J(unique_keys)]]
## t b a
## 1: 1 NA 1
## 2: 2 NA 4
## 3: 3 9 9
## 4: 4 16 16
## 5: 5 25 NA
## 6: 6 36 NA
# The following will give the same with the column order X,Y
X[Y[J(unique_keys)]]
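As a sanity check, the full outer join built from unique_keys can be compared column by column with merge(..., all = TRUE). This is only a sketch: it rebuilds the sample tables so that it runs on its own, and the names full_xy, full_merge and same_values are introduced here for illustration.

```r
library(data.table)

# Same sample tables as in the question, keyed on t
X <- data.table(t = 1:4, a = (1:4)^2, key = "t")
Y <- data.table(t = 3:6, b = (3:6)^2, key = "t")

# Every key value that occurs in either table
unique_keys <- unique(c(X[, t], Y[, t]))

full_xy    <- X[Y[J(unique_keys)]]               # column order t, a, b
full_merge <- merge(X, Y, by = "t", all = TRUE)  # column order t, a, b

# Compare the column values only, ignoring key and other table attributes
same_values <- all(mapply(identical, as.list(full_xy), as.list(full_merge)))
```

Comparing columns rather than calling identical() on the whole tables sidesteps any differences in key attributes between the two results.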