Rolling joins with data.table in R

Question

I am trying to understand how rolling joins work and am somewhat confused; I was hoping somebody could clarify this for me. To take a concrete example:

library(data.table)
dt1 <- data.table(id=rep(1:5, 10), t=1:50, val1=1:50, key="id,t")
dt2 <- data.table(id=rep(1:5, 2), t=1:10, val2=1:10, key="id,t")

I expected this to produce a long data.table in which the values in dt2 are rolled:

dt1[dt2,roll=TRUE]

Instead, the correct way to do this seems to be:

dt2[dt1,roll=TRUE]

Could someone explain in more detail how joining in data.table works, as I am clearly not understanding it correctly? I thought that dt1[dt2,roll=TRUE] corresponded to the SQL query select * from dt1 right join dt2 on (dt1.id = dt2.id and dt1.t = dt2.t), with LOCF (last observation carried forward) added on top.

Also, the documentation says:

X[Y] is a join, looking up X's rows using Y (or Y's key if it has one) 
as an index.

This makes it seem that only things in X should be returned and that the join being done is an inner join, not an outer join. What about the case when roll=T but that particular id does not exist in dt1? Playing around a bit more, I can't understand what value is being placed into the column.

Recommended answer

That quote from the documentation appears to be from FAQ 1.12, "What is the difference between X[Y] and merge(X,Y)". Did you find the following in ?data.table, and does it help?

roll: Applies to the last join column, generally a date but can be any ordered variable, irregular and including gaps. If roll=TRUE and i's row matches to all but the last x join column, and its value in the last i join column falls in a gap (including after the last observation in x for that group), then the prevailing value in x is rolled forward. This operation is particularly fast using a modified binary search. The operation is also known as last observation carried forward (LOCF). Usually, there should be no duplicates in x's key, the last key column is a date (or time, or datetime) and all the columns of x's key are joined to. A common idiom is to select a contemporaneous regular time series (dts) across a set of identifiers (ids): DT[CJ(ids,dts),roll=TRUE] where DT has a 2-column key (id,date) and CJ stands for cross join.
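
To make the quoted description concrete, here is a minimal sketch of roll=TRUE on a tiny table. The tables x and i and their values are made up for illustration, and the join uses the on= argument (my choice, available in reasonably recent data.table versions) rather than relying on x's key:

```r
library(data.table)

# x: irregular observations per id (hypothetical data)
x <- data.table(id = c(1L, 1L, 2L), t = c(1L, 5L, 2L), val = c(10, 50, 20))

# i: the (id, t) points to look up in x
i <- data.table(id = c(1L, 1L, 1L, 2L, 2L), t = c(1L, 3L, 9L, 1L, 4L))

# One result row per row of i; the last join column (t) is the one that rolls
res <- x[i, on = c("id", "t"), roll = TRUE]
print(res)
# (1,1) matches exactly; (1,3) falls in a gap, so the value at t=1 is carried forward;
# (1,9) lies after the last observation for id 1, so the value at t=5 is carried forward;
# (2,1) lies before the first observation for id 2, so there is nothing to carry: NA;
# (2,4) lies after the last observation for id 2, so the value at t=2 is carried forward.
```

The point to notice is that the result always has as many rows as i: x supplies the values, and i decides which rows are returned.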

rolltolast: Like roll but the data is not rolled forward past the last observation within each group defined by the join columns. The value of i must fall in a gap in x but not after the end of the data, for that group defined by all but the last join column. roll and rolltolast may not both be TRUE.
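
The rolltolast argument quoted above comes from the documentation current when this answer was written. As far as I know, later data.table releases express the same behaviour through the rollends argument instead, so the sketch below uses roll=TRUE with rollends=FALSE; treat that mapping as an assumption and check ?data.table for your installed version. The data are hypothetical.

```r
library(data.table)

# Two observations for a single id (hypothetical data)
x <- data.table(id = c(1L, 1L), t = c(1L, 5L), val = c(10, 50))

# One lookup inside the gap (t = 3) and one after the end of the data (t = 9)
i <- data.table(id = 1L, t = c(3L, 9L))

# Plain roll: both lookups receive a carried-forward value
rolled <- x[i, on = c("id", "t"), roll = TRUE]

# Assumed modern equivalent of rolltolast: do not roll past the last observation
no_ends <- x[i, on = c("id", "t"), roll = TRUE, rollends = FALSE]

print(rolled)
print(no_ends)
```

In the second result the lookup inside the gap still gets the value from t = 1, while the lookup after the end of the data comes back as NA.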

In terms of left/right analogies to SQL joins, I prefer to think about that in the context of FAQ 2.14, "Can you explain further why data.table is inspired by A[B] syntax in base". That is quite a long answer, so I won't paste it here.
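
As a rough illustration of the direction of the join, the sketch below reruns the two tables from the question. It is only a sketch: i_new is a hypothetical lookup for an id that does not exist in dt1, and nomatch = NULL assumes a data.table version that accepts it (older versions used nomatch = 0).

```r
library(data.table)

dt1 <- data.table(id = rep(1:5, 10), t = 1:50, val1 = 1:50, key = c("id", "t"))
dt2 <- data.table(id = rep(1:5, 2),  t = 1:10, val2 = 1:10, key = c("id", "t"))

# X[Y] looks up the rows of Y in X, so the result has one row per row of Y
a <- dt1[dt2, roll = TRUE]   # rows come from dt2
b <- dt2[dt1, roll = TRUE]   # rows come from dt1, with val2 rolled forward
print(c(nrow(a), nrow(b)))

# An id that is absent from dt1 (hypothetical lookup row)
i_new <- data.table(id = 6L, t = 3L)

# By default the unmatched row is kept and dt1's columns are filled with NA
miss <- dt1[i_new, on = c("id", "t"), roll = TRUE]

# Dropping unmatched rows gives inner-join behaviour instead
inner <- dt1[i_new, on = c("id", "t"), roll = TRUE, nomatch = NULL]
```

Read this way, dt2[dt1, roll=TRUE] is the form that returns the long table the question expected, because dt1 supplies the rows and dt2 supplies the values that are carried forward.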
