如何自我加入一个条件的data.table [英] How to self join a data.table on a condition

查看:76
本文介绍了如何自我加入一个条件的data.table的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我想为我的data.table添加一个新列。此列应包含满足特定条件的所有行的另一列的总和。例如:我的data.table看起来像这样:

I want to add a new column to my data.table. This column should contain the sum of another column of all rows that satisfy a certain condition. An example: My data.table looks like this:

require(data.table)
DT <- data.table(n=c("a", "a", "a", "a", "a", "a", "b", "b", "b"),
             t=c(10, 20, 33, 40, 50, 22, 25, 34, 11),
             v=c(20, 15, 16, 17, 11, 12, 20, 22, 10)
             )
DT
   n  t  v
1: a 10 20
2: a 20 15
3: a 33 16
4: a 40 17
5: a 50 11
6: a 22 12
7: b 25 20
8: b 34 22
9: b 11 10

对于每一行x和每一行i,其中abs(t [i] -t [x])<= 10,

For every row x and every row i, where abs(t[i] - t[x]) <= 10, I want to calculate

foo = sum( v[i] * abs(t[i] - t[x]) )

在SQL中,我将使用自连接解决这个问题。在RI中能够使用for循环:

In SQL I would solve this using a self join. In R I was able to do this using a for loop:

for (i in 1:nrow(DT))
    DT[i, foo:=DT[n==DT[i]$n & abs(t-DT[i]$t)<=10, sum(v * abs(t-DT[i]$t) )]]

DT
   n  t  v foo
1: a 10 20 150
2: a 20 15 224
3: a 33 16 119
4: a 40 17 222
5: a 50 11 170
6: a 22 12  30
7: b 25 20 198
8: b 34 22 180
9: b 11 10   0

不幸的是,我必须经常这样做,我使用的表是更大。 for-loop方法可行,但太慢了。我玩了sqldf包,没有真正的突破。我会喜欢这样做使用一些data.table魔术,有我需要你的帮助:-)。我认为需要的是某种自联接,条件是t值的差小于阈值。

Unfortunately I have to do this quite often and the table I work with is rather larger. The for-loop approach works but is too slow. I played around with the sqldf package, with no real breakthrough. I would love to do this using some data.table magic and there I need your help :-). I think what is needed is some kind of self join on the condition that the difference of the t values is smaller then the threshold.

后续:
我有一个跟进问题:在我的申请中,此加入一遍又一遍。 v的变化,但t和n总是相同的。所以我想着以某种方式存储哪些行属于一起。

Follow up: I have a follow up question: In my application this join is done over and over again. The v's change, but the t's and the n's are always the same. So I am thinking about somehow storing which rows belong together. Any ideas how to do this in a clever way?

推荐答案

尝试以下操作:

unique(merge(DT, DT, by="n")[abs(t.x - t.y) <= 10, list(n, sum(v.x * abs(t.x - t.y))), by=list(t.x, v.x)])






上述行的细分

您可以将表格与自身合并,输出也将是数据。表。请注意,列名将带有后缀 .x .y

You can merge a table with itself, the output will also be a data.table. Notice that the column names will be given a suffix of .x and .y

merge(DT, DT, by="n")

...你可以像任何DT一样过滤和计算

... you can just filter and calculate as with any DT

# this will give you your desired rows
[abs(t.x - t.y), ]

# this is the expression you outlined
[ ... , sum(v.x * abs(t.x - t.y)) ]

# summing by t.x and v.x
[ ... , ... , by=list(t.x, v.x)]) ]

然后最后将它全部包含在 unique 中以删除所有重复的行。

Then finally wrapping it all in unique to remove any duplicated rows.

UPDATE:这应该是注释,但太长了

下面的行符合你的输出。这个和这个答案顶部之间的唯一区别是 vy sum(vy * ...)然而通过语句仍然使用 vx 。这是故意吗?

The line below is what matches your output. The only difference between this and the one at the top of this answer is the term v.y in sum(v.y * ...) however the by statement still uses v.x. Is that intentional?

unique(merge(DT, DT, by="n")[abs(t.x - t.y) <= 10, list(n, sum(v.y * abs(t.x - t.y))), by=list(t.x, v.x)])

这篇关于如何自我加入一个条件的data.table的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆