Which data.table syntax for left join (one column) to prefer


Question

How should I start thinking about which syntax I prefer?

My criteria are efficiency (this is number one) and also readability/maintainability.

This:

A <- B[A, on = .(id)] # very concise!

or that:

A[B, on = .(id), comment := i.comment]

Or even (as PoGibas suggests):

A <- merge(A, B, all.x = TRUE)

For completeness then a more basic way is to use match():

A[, comment := B[chmatch(A[["id"]], id), comment]]

Example data:

library(data.table)
A <- data.table(id = letters[1:10], amount = rnorm(10)^2)
B <- data.table(id = c("c", "d", "e"), comment = c("big", "slow", "nice"))

Recommended answer

I prefer the "update join" idiom for efficiency and maintainability:**

DT[WHERE, v := FROM[.SD, on=, x.v]]

It's an extension of what is shown in vignette("datatable-reference-semantics") under "Update some rows of columns by reference - sub-assign by reference". Once there is a vignette available on joins, that should also be a good reference.

This is efficient since it only uses the rows selected by WHERE and modifies or adds the column in-place, instead of making a new table like the more concise left join FROM[DT, on=].
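
As a minimal sketch of that difference, using a deterministic variant of the OP's tables (the fixed `amount` values and the names `A_joined` and `addr_before` are assumptions made for illustration), the update join leaves `A` at the same memory address while the concise left join returns a separate table:

```r
library(data.table)

# Deterministic stand-ins for the OP's example tables (assumed values).
A <- data.table(id = letters[1:10], amount = (1:10)^2)
B <- data.table(id = c("c", "d", "e"), comment = c("big", "slow", "nice"))

# Concise left join: builds and returns a new table, A itself is untouched.
A_joined <- B[A, on = .(id)]

addr_before <- address(A)

# Update join: look up each row of A (.SD) in B and assign the matched
# comment back into A by reference.
A[, comment := B[.SD, on = .(id), x.comment]]

# A is still the same object in memory, now with one extra column.
identical(address(A), addr_before)

# The left-join result lives at a different address.
identical(address(A_joined), addr_before)
```

Both forms yield the same `comment` values here; the point is only where the result ends up.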

It makes my code more readable since I can easily see that the point of the join is to add column v; and I don't have to think through "left"/"right" jargon from SQL or whether the number of rows is preserved after the join.

It is useful for code maintenance since if I later want to find out how DT got a column named v, I can search my code for v :=, while FROM[DT, on=] obscures which new columns are being added. Also, it allows the WHERE condition, while the left join does not. This may be useful, for example, if using FROM to "fill" NAs in an existing column v.
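
The NA-filling case might look like the sketch below. The tables `DT` and `FROM` and their contents are made up for illustration; only the shape of the call follows the idiom above, with `is.na(v)` playing the role of WHERE.

```r
library(data.table)

# Hypothetical data: v is partly missing in DT, and FROM holds candidates.
DT <- data.table(
  id = c("a", "b", "c", "d"),
  v  = c("keep", NA, NA, "keep too")
)
FROM <- data.table(
  id = c("b", "c", "d"),
  v  = c("filled b", "filled c", "ignored")
)

# WHERE = is.na(v): only the rows with a missing v are looked up in FROM
# and assigned. Row "d" already has a value, so it is never touched.
DT[is.na(v), v := FROM[.SD, on = .(id), x.v]]

DT
```

Because `.SD` holds only the rows selected by `is.na(v)`, the lookup in `FROM` runs over two rows instead of the whole table.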

Compared with the other update join approach DT[FROM, on=, v := i.v], I can think of two advantages. First is the option of using the WHERE clause, and second is transparency through warnings when there are problems with the join, like duplicate matches in FROM conditional on the on= rules. Here's an illustration extending the OP's example:

library(data.table)
A <- data.table(id = letters[1:10], amount = rnorm(10)^2)
B2 <- data.table(
  id = c("c", "d", "e", "e"), 
  ord = 1:4, 
  comment = c("big", "slow", "nice", "nooice")
)

# left-joiny update
A[B2, on=.(id), comment := i.comment, verbose=TRUE]
# Calculated ad hoc index in 0.000s elapsed (0.000s cpu) 
# Starting bmerge ...done in 0.000s elapsed (0.000s cpu) 
# Detected that j uses these columns: comment,i.comment 
# Assigning to 4 row subset of 10 rows

# my preferred update
A[, comment2 := B2[A, on=.(id), x.comment]]
# Warning message:
# In `[.data.table`(A, , `:=`(comment2, B2[A, on = .(id), x.comment])) :
#   Supplied 11 items to be assigned to 10 items of column 'comment2' (1 unused)

    id     amount comment comment2
 1:  a 0.20000990    <NA>     <NA>
 2:  b 1.42146573    <NA>     <NA>
 3:  c 0.73047544     big      big
 4:  d 0.04128676    slow     slow
 5:  e 0.82195377  nooice     nice
 6:  f 0.39013550    <NA>   nooice
 7:  g 0.27019768    <NA>     <NA>
 8:  h 0.36017876    <NA>     <NA>
 9:  i 1.81865721    <NA>     <NA>
10:  j 4.86711754    <NA>     <NA>

In the left-join-flavored update, you silently get the final value of comment even though there are two matches for id == "e"; while in the other update, you get a helpful warning message (upgraded to an error in a future release). Even turning on verbose=TRUE with the left-joiny approach is not informative -- it says there are four rows being updated but doesn't say that one row is being updated twice.
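
If you want to catch this before running either update, one option (a sketch, not part of the original answer; `dup_keys` is a name chosen here) is to check the lookup table for repeated join keys first:

```r
library(data.table)

B2 <- data.table(
  id = c("c", "d", "e", "e"),
  ord = 1:4,
  comment = c("big", "slow", "nice", "nooice")
)

# Count rows per join key; any N > 1 means the lookup is ambiguous.
dup_keys <- B2[, .N, by = id][N > 1]
dup_keys

# Alternative quick check: row index of the first repeated key, 0 if none.
anyDuplicated(B2, by = "id")
```

An empty `dup_keys` means each `id` maps to at most one `comment`, so both update styles would give the same result.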

I find that this approach works best when my data is arranged into a set of tidy/relational tables. A good reference on that is Hadley Wickham's paper.

** In this idiom, the on= part should be filled in with the join column names and rules, like on=.(id) or on=.(from_date >= dt_date). Further join rules can be passed with roll=, mult= and nomatch=. See ?data.table for details. Thanks to @RYoda for noting this point in the comments.
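
Two of those extra rules, sketched on small made-up tables (`rates`, `tx` and the fixed `amount` values are assumptions for illustration): `mult = "first"` keeps one match per row when the lookup table has duplicate keys, and `roll = TRUE` carries the most recent earlier value forward.

```r
library(data.table)

A <- data.table(id = letters[1:10], amount = (1:10)^2)
B2 <- data.table(
  id = c("c", "d", "e", "e"),
  ord = 1:4,
  comment = c("big", "slow", "nice", "nooice")
)

# mult = "first": for id "e", take only the first matching row of B2,
# so the lookup returns exactly one value per row of A.
A[, comment := B2[.SD, on = .(id), mult = "first", x.comment]]

# Hypothetical rolling lookup: each transaction gets the latest rate
# dated on or before it. Dates before the first rate get NA.
rates <- data.table(
  date = as.IDate(c("2024-01-01", "2024-02-01")),
  rate = c(1.0, 1.5)
)
tx <- data.table(date = as.IDate(c("2024-01-15", "2024-02-10", "2023-12-31")))
tx[, rate := rates[.SD, on = .(date), roll = TRUE, x.rate]]
```

Note that `mult = "first"` depends on the row order of the lookup table, so it only makes sense when that order is meaningful.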

Here is a more complicated example from Matt Dowle explaining roll=: Find time to nearest occurrence of particular value for each row

Another related example: Left join using data.table
