Rolling joins with data.table in R
Question
I am trying to understand more about how rolling joins work and am somewhat confused; I was hoping somebody could clarify this for me. To take a concrete example:
library(data.table)
dt1 <- data.table(id=rep(1:5, 10), t=1:50, val1=1:50, key="id,t")
dt2 <- data.table(id=rep(1:5, 2), t=1:10, val2=1:10, key="id,t")
I expected this to produce a long data.table where the values in dt2 are rolled:
dt1[dt2,roll=TRUE]
Instead, the correct way to do this seems to be:
dt2[dt1,roll=TRUE]
Could someone explain more about how joining in data.table works, as I am clearly not understanding it correctly? I thought that dt1[dt2,roll=TRUE] corresponded to the SQL query select * from dt1 right join dt2 on (dt1.id = dt2.id and dt1.t = dt2.t), with the added functionality of LOCF.
Also, the documentation says:
X[Y] is a join, looking up X's rows using Y (or Y's key if it has one) as an index.
This makes it seem that only things in X should be returned and that the join being done is an inner join, not an outer one. What about the case when roll=T but that particular id does not exist in dt1? Playing around a bit more, I can't understand what value is being placed into the column.
Recommended answer
That quote from the documentation appears to be from FAQ 1.12, "What is the difference between X[Y] and merge(X,Y)". Did you find the following in ?data.table, and does it help?
roll
Applies to the last join column, generally a date but can be any ordered variable, irregular and including gaps. If roll=TRUE and i's row matches to all but the last x join column, and its value in the last i join column falls in a gap (including after the last observation in x for that group), then the prevailing value in x is rolled forward. This operation is particularly fast using a modified binary search. The operation is also known as last observation carried forward (LOCF). Usually, there should be no duplicates in x's key, the last key column is a date (or time, or datetime) and all the columns of x's key are joined to. A common idiom is to select a contemporaneous regular time series (dts) across a set of identifiers (ids): DT[CJ(ids,dts),roll=TRUE] where DT has a 2-column key (id,date) and CJ stands for cross join.
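To connect that description to the tables in the question, here is a minimal sketch. It assumes only that the data.table package is installed; the names a, b and missing_id are introduced here for illustration. In X[Y], each row of Y is looked up in X, so the result has one row per row of Y, which is why dt2[dt1, roll=TRUE] is the long one.

```r
library(data.table)

# Tables from the question; each is keyed on (id, t).
dt1 <- data.table(id = rep(1:5, 10), t = 1:50, val1 = 1:50, key = "id,t")
dt2 <- data.table(id = rep(1:5, 2), t = 1:10, val2 = 1:10, key = "id,t")

# X[Y] looks up each row of Y in X: one result row per row of Y.
a <- dt1[dt2, roll = TRUE]  # every (id, t) in dt2 matches dt1 exactly
b <- dt2[dt1, roll = TRUE]  # dt2's last val2 per id is carried forward

nrow(a)
nrow(b)

# For id 1, dt2 only has t = 1 and t = 6, so val2 = 6 prevails from t = 6 on.
b[id == 1]

# An id that is absent from x has no group to roll within, so val2 is NA.
missing_id <- dt2[data.table(id = 6L, t = 3L), roll = TRUE]
missing_id
```

Rolling only happens within the group defined by the leading join columns (here id), which is why an unknown id yields NA rather than a value borrowed from another id.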
rolltolast
Like roll but the data is not rolled forward past the last observation within each group defined by the join columns. The value of i must fall in a gap in x but not after the end of the data, for that group defined by all but the last join column. roll and rolltolast may not both be TRUE.
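As far as I know, later data.table releases deprecated rolltolast in favour of the rollends argument, so the sketch below uses roll=TRUE with rollends=FALSE to get the behaviour described above. Treat that equivalence as an assumption to check against your installed version's ?data.table. The tables x and i, and the names r_default and r_noend, are made up for illustration.

```r
library(data.table)

# Hypothetical data: one group (id = 1) observed at t = 1 and t = 5.
x <- data.table(id = 1L, t = c(1L, 5L), val = c("a", "b"), key = "id,t")
# Lookups: t = 3 falls in a gap, t = 7 is after the last observation.
i <- data.table(id = 1L, t = c(3L, 7L))

# Default roll: the gap is filled and the last value is also rolled past the end.
r_default <- x[i, roll = TRUE]

# rollends = FALSE: the gap is still filled, but nothing rolls past the last row.
r_noend <- x[i, roll = TRUE, rollends = FALSE]

r_default
r_noend
```

With these inputs, the only difference between the two results should be the row for t = 7, which keeps the last value in the first case and is NA in the second.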
In terms of left/right analogies to SQL joins, I prefer to think about that in the context of FAQ 2.14, "Can you explain further why data.table is inspired by A[B] syntax in base". That's quite a long answer, so I won't paste it here.
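For a rough feel of the analogy without reading the whole FAQ entry, here is a small sketch with made-up tables X and Y. It is not taken from the FAQ; the names xy, m and inner are assumptions for illustration. It shows that X[Y] keeps every row of Y (much like merge(X, Y, all.y=TRUE)), and that nomatch=0L turns it into an inner join.

```r
library(data.table)

# Hypothetical tables sharing ids 2 and 3 only.
X <- data.table(id = 1:3, x = c("a", "b", "c"), key = "id")
Y <- data.table(id = 2:4, y = c("B", "C", "D"), key = "id")

# X[Y]: one row per row of Y; id 4 is absent from X, so x is NA there.
xy <- X[Y]

# Roughly the same rows via merge(), keeping every row of Y.
m <- merge(X, Y, all.y = TRUE)

# nomatch = 0L drops rows of Y that have no match in X (an inner join).
inner <- X[Y, nomatch = 0L]

xy
m
inner
```

So the quoted sentence about "looking up X's rows using Y" describes which table drives the lookup, not which rows are dropped: unmatched rows of Y are kept with NA unless nomatch says otherwise.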