Rolling joins with data.table in R

Question

I am trying to understand more about how rolling joins work and am somewhat confused. I was hoping somebody could clarify this for me. To take a concrete example:

library(data.table)
dt1 <- data.table(id=rep(1:5, 10), t=1:50, val1=1:50, key="id,t")
dt2 <- data.table(id=rep(1:5, 2), t=1:10, val2=1:10, key="id,t")

I expected this to produce a long data.table where the values in dt2 are rolled:

dt1[dt2,roll=TRUE]

Instead, the correct way to do this seems to be:

dt2[dt1,roll=TRUE]

Could someone explain more about how joins in data.table work, as I am clearly not understanding them correctly? I thought that dt1[dt2,roll=TRUE] corresponded to the SQL statement select * from dt1 right join dt2 on (dt1.id = dt2.id and dt1.t = dt2.t), with LOCF behaviour added on top.

The documentation also says:

X[Y] is a join, looking up X's rows using Y (or Y's key if it has one) 
as an index.

This makes it seem that only things in X should be returned and that the join being done is an inner join, not an outer one. What about the case when roll=T but that particular id does not exist in dt1? After playing around a bit more, I still can't understand what value is being placed into the column.

Recommended answer

That quote from the documentation appears to be from FAQ 1.12, "What is the difference between X[Y] and merge(X,Y)?". Did you find the following in ?data.table, and does it help?

roll Applies to the last join column, generally a date but can be any ordered variable, irregular and including gaps. If roll=TRUE and i's row matches to all but the last x join column, and its value in the last i join column falls in a gap (including after the last observation in x for that group), then the prevailing value in x is rolled forward. This operation is particularly fast using a modified binary search. The operation is also known as last observation carried forward (LOCF). Usually, there should be no duplicates in x's key, the last key column is a date (or time, or datetime) and all the columns of x's key are joined to. A common idiom is to select a contemporaneous regular time series (dts) across a set of identifiers (ids): DT[CJ(ids,dts),roll=TRUE] where DT has a 2-column key (id,date) and CJ stands for cross join.
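To make that description concrete, here is a minimal sketch on toy data. The tables x and i and their values are made up for illustration and are not from the question. It shows an exact match, a value falling in a gap, a value after the last observation, and a value before the first observation (which roll=TRUE does not fill by default), followed by the CJ idiom from the quoted help text.

```r
library(data.table)

# Hypothetical toy data: irregular observations for two ids, keyed on (id, t).
x <- data.table(id = c(1L, 1L, 2L), t = c(1L, 6L, 2L), val = c(10, 60, 20),
                key = c("id", "t"))

# Lookup rows; the result of x[i] has one row per row of i.
i <- data.table(id = c(1L, 1L, 1L, 2L, 2L), t = c(1L, 4L, 9L, 1L, 5L))

res <- x[i, roll = TRUE]
print(res)
# (1,1): exact match         -> val 10
# (1,4): gap after t = 1     -> val 10 rolled forward
# (1,9): after last obs      -> val 60 rolled forward
# (2,1): before first obs    -> val NA (nothing earlier to carry forward)
# (2,5): after last obs      -> val 20 rolled forward

# The CJ idiom: a regular grid of every id crossed with every time point.
grid <- x[CJ(id = 1:2, t = 1:6), roll = TRUE]
print(grid)
```

The t column in the result carries the values from i, not the times of the x rows that were matched, which is easy to overlook when reading the output.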

rolltolast Like roll but the data is not rolled forward past the last observation within each group defined by the join columns. The value of i must fall in a gap in x but not after the end of the data, for that group defined by all but the last join column. roll and rolltolast may not both be TRUE.
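As far as I know, later data.table releases dropped the rolltolast argument in favour of rollends, so the sketch below assumes a version that has rollends and uses roll=TRUE with rollends=FALSE to get the behaviour described above. The toy table is hypothetical.

```r
library(data.table)

# Hypothetical toy data: one id observed at t = 1 and t = 6.
x <- data.table(id = c(1L, 1L), t = c(1L, 6L), val = c(10, 60),
                key = c("id", "t"))

# t = 4 falls in a gap; t = 9 falls after the last observation.
i <- data.table(id = c(1L, 1L), t = c(4L, 9L))

# Plain roll: both rows are filled (LOCF also rolls past the end of the data).
full <- x[i, roll = TRUE]

# Assumed replacement for rolltolast=TRUE: roll inside gaps only,
# but not before the first or after the last observation of the group.
capped <- x[i, roll = TRUE, rollends = FALSE]

print(full$val)    # 10 60
print(capped$val)  # 10 NA
```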

In terms of left/right analogies to SQL joins, I prefer to think about that in the context of FAQ 2.14, "Can you explain further why data.table is inspired by A[B] syntax in base?". That's quite a long answer, so I won't paste it here.
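A short sketch with the tables from the question may still help. It illustrates that X[Y] returns one row per row of Y (the table inside the brackets), which is why dt2[dt1,roll=TRUE] gives the long result. The lookup for id 6 and the use of nomatch=0 are my own additions for illustration.

```r
library(data.table)

# The two tables from the question.
dt1 <- data.table(id = rep(1:5, 10), t = 1:50, val1 = 1:50, key = c("id", "t"))
dt2 <- data.table(id = rep(1:5, 2),  t = 1:10, val2 = 1:10, key = c("id", "t"))

# One row per row of dt2: every (id, t) in dt2 matches dt1 exactly,
# so nothing needs to be rolled.
a <- dt1[dt2, roll = TRUE]
nrow(a)  # 10

# One row per row of dt1: val2 is carried forward from dt2 within each id.
b <- dt2[dt1, roll = TRUE]
nrow(b)  # 50

# An id that does not exist in x: the row from i is kept
# and x's non-join columns are NA.
miss <- dt2[.(6L, 3L), roll = TRUE]
print(miss)

# nomatch = 0 drops such rows, giving inner-join behaviour instead.
inner <- dt2[.(6L, 3L), roll = TRUE, nomatch = 0]
nrow(inner)  # 0
```

So the default behaves like an outer join that keeps every row of the table used as i, and the value placed in the column for an unmatched group is NA.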
