Which data.table syntax for left join (one column) to prefer

Question

How should I start thinking about which syntax to prefer?

My criteria are efficiency (this is number one) and also readability/maintainability.

A <- B[A, on = .(id)] # wow such. concision

Or this:

A[B, on = .(id), comment := i.comment]

Or even (as PoGibas suggests):

A <- merge(A, B, all.x = TRUE)

For completeness, a more basic way is to use match() (here via chmatch(), data.table's counterpart for character vectors):

A[, comment := B[chmatch(A[["id"]], id), comment]]

Example data:

library(data.table)
A <- data.table(id = letters[1:10], amount = rnorm(10)^2)
B <- data.table(id = c("c", "d", "e"), comment = c("big", "slow", "nice"))
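
As a quick sanity check, the four candidate syntaxes can be run side by side on the example data. This is only a sketch: the set.seed() call, the copy() calls (used so that A itself is left untouched) and the res_* names are additions for illustration and are not part of the question.

```r
library(data.table)

set.seed(1)  # assumption: seed added only so the sketch is reproducible
A <- data.table(id = letters[1:10], amount = rnorm(10)^2)
B <- data.table(id = c("c", "d", "e"), comment = c("big", "slow", "nice"))

# 1. Left join that builds a new table (rows follow A's order)
res_join <- B[A, on = .(id)]

# 2. Update join: adds `comment` by reference (to a copy of A here)
res_update <- copy(A)[B, on = .(id), comment := i.comment]

# 3. merge(): returns a new table, sorted by the join column
res_merge <- merge(A, B, by = "id", all.x = TRUE)

# 4. chmatch(): look up each id of A in B$id and pull the matching comment
res_match <- copy(A)[, comment := B$comment[chmatch(id, B$id)]]

# All four approaches produce the same comment column on this data
all_same <- identical(res_join$comment, res_update$comment) &&
  identical(res_update$comment, res_merge$comment) &&
  identical(res_merge$comment, res_match$comment)
print(all_same)
```

The results differ in column order, and merge() sorts by id, which happens to match A's order in this example.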


Recommended answer

I prefer the "update join" idiom for efficiency and maintainability:**

DT[WHERE, v := FROM[.SD, on=, x.v]]

It's an extension of what is shown in vignette("datatable-reference-semantics") under "Update some rows of columns by reference - sub-assign by reference". Once there is a vignette available on joins, that should also be a good reference.

This is efficient because it only uses the rows selected by WHERE and modifies or adds the column in place, instead of creating a new table as the more concise left join FROM[DT, on=] does.
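
One way to see the in-place behaviour is to compare memory addresses with data.table's address(). The sketch below uses small made-up tables; the names DT, FROM and v simply mirror the placeholders in the idiom.

```r
library(data.table)

# Hypothetical tables standing in for DT and FROM in the idiom
DT   <- data.table(id = c("a", "b", "c"), amount = c(1, 2, 3))
FROM <- data.table(id = c("b", "c"), v = c("x", "y"))

addr_before <- address(DT)

# Plain left join: the result is a separate, newly allocated table
joined <- FROM[DT, on = .(id)]
new_object <- !identical(address(joined), addr_before)

# Update join: column `v` is added to DT by reference
DT[, v := FROM[.SD, on = .(id), x.v]]
same_object <- identical(address(DT), addr_before)

print(c(new_object = new_object, same_object = same_object))
```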

It makes my code more readable because I can easily see that the point of the join is to add column v, and I don't have to think through the "left"/"right" jargon from SQL or whether the number of rows is preserved after the join.

It is useful for code maintenance because, if I later want to find out how DT got a column named v, I can search my code for v :=, while FROM[DT, on=] obscures which new columns are being added. It also allows the WHERE condition, while the left join does not. This may be useful, for example, when using FROM to "fill" NAs in an existing column v.
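
A minimal sketch of that NA-filling case, with made-up tables (DT, FROM and their values are illustrative only). Only the rows selected by is.na(v) are looked up and overwritten:

```r
library(data.table)

# Hypothetical data: DT already has a column v with some gaps
DT <- data.table(
  id = c("a", "b", "c", "d"),
  v  = c("keep", NA, NA, "keep too")
)
FROM <- data.table(
  id = c("a", "b", "c"),
  v  = c("ignored", "filled b", "filled c")
)

# WHERE = is.na(v): .SD holds only the rows with a missing v,
# so existing values (row "a" here) are never touched
DT[is.na(v), v := FROM[.SD, on = .(id), x.v]]

print(DT)
```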

Compared with the other update join approach, DT[FROM, on=, v := i.v], I can think of two advantages. The first is the option of using the WHERE clause. The second is transparency through warnings when there are problems with the join, such as duplicate matches in FROM under the on= rules. Here's an illustration extending the OP's example:

library(data.table)
A <- data.table(id = letters[1:10], amount = rnorm(10)^2)
B2 <- data.table(
  id = c("c", "d", "e", "e"), 
  ord = 1:4, 
  comment = c("big", "slow", "nice", "nooice")
)

# left-joiny update
A[B2, on=.(id), comment := i.comment, verbose=TRUE]
# Calculated ad hoc index in 0.000s elapsed (0.000s cpu) 
# Starting bmerge ...done in 0.000s elapsed (0.000s cpu) 
# Detected that j uses these columns: comment,i.comment 
# Assigning to 4 row subset of 10 rows

# my preferred update
A[, comment2 := B2[A, on=.(id), x.comment]]
# Warning message:
# In `[.data.table`(A, , `:=`(comment2, B2[A, on = .(id), x.comment])) :
#   Supplied 11 items to be assigned to 10 items of column 'comment2' (1 unused)

    id     amount comment comment2
 1:  a 0.20000990    <NA>     <NA>
 2:  b 1.42146573    <NA>     <NA>
 3:  c 0.73047544     big      big
 4:  d 0.04128676    slow     slow
 5:  e 0.82195377  nooice     nice
 6:  f 0.39013550    <NA>   nooice
 7:  g 0.27019768    <NA>     <NA>
 8:  h 0.36017876    <NA>     <NA>
 9:  i 1.81865721    <NA>     <NA>
10:  j 4.86711754    <NA>     <NA>

In the left-join-flavored update, you silently get the final value of comment even though there are two matches for id == "e", while in the other update you get a helpful warning message (upgraded to an error in a future release). The surplus value also shifts later rows out of alignment: in the output above, "nooice" lands on id "f" in comment2. Even turning on verbose=TRUE with the left-joiny approach is not informative: it says four rows are being updated but doesn't say that one row is being updated twice.
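
Once the warning has pointed at duplicates, one possible follow-up is to list the offending ids and then state explicitly which match to keep. The sketch below is an assumption about how one might proceed, not something the answer prescribes. It uses mult = "first", and dup_ids is a name introduced here.

```r
library(data.table)

A <- data.table(id = letters[1:10])
B2 <- data.table(
  id = c("c", "d", "e", "e"),
  comment = c("big", "slow", "nice", "nooice")
)

# Which ids occur more than once in the lookup table?
dup_ids <- B2[, .N, by = id][N > 1, id]
print(dup_ids)

# Explicit choice: take the first match per id, so exactly one value
# comes back for every row of A and nothing is shifted
A[, comment := B2[.SD, on = .(id), x.comment, mult = "first"]]

print(A)
```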

I find that this approach works best when my data is arranged into a set of tidy/relational tables. A good reference on that is Hadley Wickham's paper.

** In this idiom, the on= part should be filled in with the join column names and rules, like on=.(id) or on=.(from_date >= dt_date). Further join rules can be passed with roll=, mult= and nomatch=. See ?data.table for details. Thanks to @RYoda for noting this point in the comments.
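
As an illustration of passing a further join rule, here is a small sketch of the same idiom with roll = TRUE, which carries the last earlier observation forward. The events and prices tables and their columns are hypothetical.

```r
library(data.table)

# Hypothetical data: look up the most recent price on or before each event day
events <- data.table(id = c("a", "a", "b"), day = c(3L, 8L, 5L))
prices <- data.table(
  id    = c("a", "a", "b"),
  day   = c(1L, 6L, 7L),
  price = c(10, 12, 20)
)

# Exact match on id; on the last on= column (day), roll = TRUE falls back
# to the nearest earlier row in prices when there is no exact match
events[, price := prices[.SD, on = .(id, day), roll = TRUE, x.price]]

print(events)
```

The event for id "b" on day 5 has no earlier price, so it stays NA.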

Here is a more complicated example from Matt Dowle explaining roll=: "Find time to nearest occurrence of particular value for each row".

Another related example: "Left join using data.table".
