为什么 data.tables 的 X[Y] 连接不允许完全外部连接或左连接? [英] Why does X[Y] join of data.tables not allow a full outer join, or a left join?

查看:22
本文介绍了为什么 data.tables 的 X[Y] 连接不允许完全外部连接或左连接?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

这是一个关于 data.table 连接语法的哲学问题.我发现 data.tables 的用途越来越多,但仍在学习...

This is a bit of a philosophical question about data.table join syntax. I am finding more and more uses for data.tables, but still learning...

data.tables 的连接格式X[Y] 非常简洁、方便和高效,但据我所知,它只支持内连接和右外连接.要获得左外连接或全外连接,我需要使用 merge:

The join format X[Y] for data.tables is very concise, handy and efficient, but as far as I can tell, it only supports inner joins and right outer joins. To get a left or full outer join, I need to use merge:

  • X[Y, nomatch = NA] -- Y 中的所有行 -- 右外连接(默认)
  • X[Y, nomatch = 0] -- 只有在 X 和 Y 都匹配的行 -- 内连接
  • merge(X, Y, all = TRUE) -- 来自 X 和 Y 的所有行 -- 全外连接
  • merge(X, Y, all.x = TRUE) -- X 中的所有行 -- 左外连接
  • X[Y, nomatch = NA] -- all rows in Y -- right outer join (default)
  • X[Y, nomatch = 0] -- only rows with matches in both X and Y -- inner join
  • merge(X, Y, all = TRUE) -- all rows from both X and Y -- full outer join
  • merge(X, Y, all.x = TRUE) -- all rows in X -- left outer join

在我看来,如果 X[Y] 连接格式支持所有 4 种连接类型会很方便.是否有仅支持两种类型的连接的原因?

It seems to me that it would be handy if the X[Y] join format supported all 4 types of joins. Is there a reason only two types of joins are supported?

对我来说,nomatch = 0nomatch = NA 参数值对于正在执行的操作不是很直观.我更容易理解和记住 merge 语法:all = TRUEall.x = TRUEall.y= 真.由于 X[Y] 操作比 match 更类似于 merge,为什么不使用 merge 语法进行连接而不是 match 函数的 nomatch 参数?

For me, the nomatch = 0 and nomatch = NA parameter values are not very intuitive for the actions being performed. It is easier for me to understand and remember the merge syntax: all = TRUE, all.x = TRUE and all.y = TRUE. Since the X[Y] operation resembles merge much more than match, why not use the merge syntax for joins rather than the match function's nomatch parameter?

以下是 4 种连接类型的代码示例:

Here are code examples of the 4 join types:

# sample X and Y data.tables
library(data.table)
X <- data.table(t = 1:4, a = (1:4)^2)
setkey(X, t)
X
#    t  a
# 1: 1  1
# 2: 2  4
# 3: 3  9
# 4: 4 16

Y <- data.table(t = 3:6, b = (3:6)^2)
setkey(Y, t)
Y
#    t  b
# 1: 3  9
# 2: 4 16
# 3: 5 25
# 4: 6 36

# all rows from Y - right outer join
X[Y]  # default
#  t  a  b
# 1: 3  9  9
# 2: 4 16 16
# 3: 5 NA 25
# 4: 6 NA 36

X[Y, nomatch = NA]  # same as above
#    t  a  b
# 1: 3  9  9
# 2: 4 16 16
# 3: 5 NA 25
# 4: 6 NA 36

merge(X, Y, by = "t", all.y = TRUE)  # same as above
#    t  a  b
# 1: 3  9  9
# 2: 4 16 16
# 3: 5 NA 25
# 4: 6 NA 36

identical(X[Y], merge(X, Y, by = "t", all.y = TRUE))
# [1] TRUE

# only rows in both X and Y - inner join
X[Y, nomatch = 0]  
#    t  a  b
# 1: 3  9  9
# 2: 4 16 16

merge(X, Y, by = "t")  # same as above
#    t  a  b
# 1: 3  9  9
# 2: 4 16 16

merge(X, Y, by = "t", all = FALSE)  # same as above
#    t  a  b
# 1: 3  9  9
# 2: 4 16 16

identical( X[Y, nomatch = 0], merge(X, Y, by = "t", all = FALSE) )
# [1] TRUE

# all rows from X - left outer join
merge(X, Y, by = "t", all.x = TRUE)
#    t  a  b
# 1: 1  1 NA
# 2: 2  4 NA
# 3: 3  9  9
# 4: 4 16 16

# all rows from both X and Y - full outer join
merge(X, Y, by = "t", all = TRUE)
#    t  a  b
# 1: 1  1 NA
# 2: 2  4 NA
# 3: 3  9  9
# 4: 4 16 16
# 5: 5 NA 25
# 6: 6 NA 36

更新:data.table v1.9.6 引入了 on= 语法,它允许对主键以外的字段进行临时连接.jangorecki 对问题的回答如何连接(合并)数据框(内部、外部、左、右)? 提供了一些 data.table 可以处理的附加连接类型的示例.

Update: data.table v1.9.6 introduced the on= syntax, which allows ad hoc joins on fields other than the primary key. jangorecki's answer to the question How to join (merge) data frames (inner, outer, left, right)? provides some examples of additional join types that data.table can handle.

推荐答案

引自 data.table FAQ 1.11 X[Y]merge(X, Y) 有什么区别?

To quote from the data.table FAQ 1.11 What is the difference between X[Y] and merge(X, Y)?

X[Y] 是一个连接,使用 Y(或 Y 的键,如果有的话)作为索引查找 X 的行.

X[Y] is a join, looking up X's rows using Y (or Y's key if it has one) as an index.

Y[X] 是一个连接,使用 X 查找 Y 的行(或者 X 的键,如果有的话)

Y[X] is a join, looking up Y's rows using X (or X's key if it has one)

merge(X,Y) 同时执行两种方式.X[Y]Y[X] 的行数通常不同,而 merge(X,Y) 返回的行数> 和 merge(Y,X) 是一样的.

merge(X,Y) does both ways at the same time. The number of rows of X[Y] and Y[X] usually differ, whereas the number of rows returned by merge(X,Y) and merge(Y,X) is the same.

但是没有抓住要点.大多数任务需要在加入或合并后的数据.为什么合并所有列数据,只为了之后使用其中的一小部分?你可以建议merge(X[,ColsNeeded1],Y[,ColsNeeded2]),但这需要程序员确定需要哪些列.data.table 中的 X[Y,j] 一步完成所有这些你.当您编写 X[Y,sum(foo*bar)] 时,data.table 会自动检查 j 表达式以查看它使用哪些列.它只会对这些列进行子集;其他人被忽略.仅为 j 使用的列创建内存,并且 Y 列在每个组的上下文中享受标准的 R 回收规则.假设 fooX 中,bar 在 Y 中(以及 Y 中的 20 个其他列).X[Y,sum(foo*bar)] 难道不是比将所有内容合并后浪费地合并一个子集来编程和运行更快吗?

BUT that misses the main point. Most tasks require something to be done on the data after a join or merge. Why merge all the columns of data, only to use a small subset of them afterwards? You may suggest merge(X[,ColsNeeded1],Y[,ColsNeeded2]), but that requires the programmer to work out which columns are needed. X[Y,j] in data.table does all that in one step for you. When you write X[Y,sum(foo*bar)], data.table automatically inspects the j expression to see which columns it uses. It will only subset those columns only; the others are ignored. Memory is only created for the columns the j uses, and Y columns enjoy standard R recycling rules within the context of each group. Let's say foo is in X, and bar is in Y (along with 20 other columns in Y). Isn't X[Y,sum(foo*bar)] quicker to program and quicker to run than a merge of everything wastefully followed by a subset?

<小时>

如果你想要一个 X[Y] 的左外连接

le <- Y[X]
mallx <- merge(X, Y, all.x = T)
# the column order is different so change to be the same as `merge`
setcolorder(le, names(mallx))
identical(le, mallx)
# [1] TRUE

如果你想要一个完整的外连接

If you want a full outer join

# the unique values for the keys over both data sets
unique_keys <- unique(c(X[,t], Y[,t]))
Y[X[J(unique_keys)]]
##   t  b  a
## 1: 1 NA  1
## 2: 2 NA  4
## 3: 3  9  9
## 4: 4 16 16
## 5: 5 25 NA
## 6: 6 36 NA

# The following will give the same with the column order X,Y
X[Y[J(unique_keys)]]

这篇关于为什么 data.tables 的 X[Y] 连接不允许完全外部连接或左连接?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆