How to self join a data.table on a condition


Question

I want to add a new column to my data.table. This column should contain the sum of another column over all rows that satisfy a certain condition. For example, my data.table looks like this:

require(data.table)
DT <- data.table(n=c("a", "a", "a", "a", "a", "a", "b", "b", "b"),
             t=c(10, 20, 33, 40, 50, 22, 25, 34, 11),
             v=c(20, 15, 16, 17, 11, 12, 20, 22, 10)
             )
DT
   n  t  v
1: a 10 20
2: a 20 15
3: a 33 16
4: a 40 17
5: a 50 11
6: a 22 12
7: b 25 20
8: b 34 22
9: b 11 10

For every row x, taking every row i in the same group n where abs(t[i] - t[x]) <= 10, I want to calculate

foo = sum( v[i] * abs(t[i] - t[x]) )

In SQL I would solve this with a self join. In R I was able to do it with a for loop:

for (i in 1:nrow(DT))
    DT[i, foo:=DT[n==DT[i]$n & abs(t-DT[i]$t)<=10, sum(v * abs(t-DT[i]$t) )]]

DT
   n  t  v foo
1: a 10 20 150
2: a 20 15 224
3: a 33 16 119
4: a 40 17 222
5: a 50 11 170
6: a 22 12  30
7: b 25 20 198
8: b 34 22 180
9: b 11 10   0

Unfortunately I have to do this quite often, and the table I work with is much larger. The for-loop approach works but is too slow. I played around with the sqldf package, with no real breakthrough. I would love to do this using some data.table magic, and that is where I need your help :-). I think what is needed is some kind of self join on the condition that the difference of the t values is no larger than the threshold.

Follow-up: In my application this join is done over and over again. The v values change, but the t and n values are always the same. So I am thinking about somehow storing which rows belong together. Any ideas how to do this in a clever way?

Recommended answer

Try the following:

unique(merge(DT, DT, by="n")[abs(t.x - t.y) <= 10, list(n, sum(v.x * abs(t.x - t.y))), by=list(t.x, v.x)])


Breakdown of the line above:

You can merge a table with itself, and the output will also be a data.table. Notice that the non-key column names are given the suffixes .x and .y:

merge(DT, DT, by="n")
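
For a quick look at what the self merge produces, here is a minimal sketch using the DT from the question. The allow.cartesian = TRUE argument is an assumption on my part: recent data.table versions may refuse a many-to-many merge like this one without it.

```r
library(data.table)

# Sample data copied from the question
DT <- data.table(n = c("a", "a", "a", "a", "a", "a", "b", "b", "b"),
                 t = c(10, 20, 33, 40, 50, 22, 25, 34, 11),
                 v = c(20, 15, 16, 17, 11, 12, 20, 22, 10))

# Pair every row with every row of the same group n.
# allow.cartesian = TRUE is an assumption, added so the many-to-many
# merge is not rejected by newer data.table versions.
m <- merge(DT, DT, by = "n", allow.cartesian = TRUE)

names(m)  # "n" "t.x" "v.x" "t.y" "v.y"
nrow(m)   # 45: 6 * 6 pairs for "a" plus 3 * 3 pairs for "b"
```

Because the merged table grows with the square of each group's size, this approach can get memory hungry on large groups.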

... then you can filter and calculate as with any data.table:

# this keeps only the row pairs within the threshold
[abs(t.x - t.y) <= 10, ]

# this is the expression you outlined
[ ... , sum(v.x * abs(t.x - t.y)) ]

# summing by t.x and v.x
[ ... , ... , by=list(t.x, v.x)]

Finally, wrap it all in unique() to remove any duplicated rows.
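
Putting the steps together, a hedged variant of the same merge approach might look like the sketch below. It is not the answer's exact line: the id column, the sums table, and allow.cartesian = TRUE are assumptions added here for illustration. Grouping by a row id instead of by (t.x, v.x) avoids lumping together two rows that happen to share the same t and v, and the sum uses v.y (the neighbour's value), which is what reproduces the foo column shown in the question.

```r
library(data.table)

# Sample data copied from the question
DT <- data.table(n = c("a", "a", "a", "a", "a", "a", "b", "b", "b"),
                 t = c(10, 20, 33, 40, 50, 22, 25, 34, 11),
                 v = c(20, 15, 16, 17, 11, 12, 20, 22, 10))

# Hypothetical helper column: a row id so each original row is one group
DT[, id := .I]

# Self merge within each group n (allow.cartesian = TRUE is an assumption,
# see the lead-in), then keep the pairs within the threshold and sum per row
pairs <- merge(DT, DT, by = "n", allow.cartesian = TRUE)
sums <- pairs[abs(t.x - t.y) <= 10,
              .(foo = sum(v.y * abs(t.x - t.y))),
              by = .(id = id.x)]

# Write the result back onto the original table by reference
DT[sums, on = "id", foo := i.foo]
DT
```

Every row pairs with itself at distance 0, so each id appears in sums and rows without neighbours get foo = 0 rather than NA.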

UPDATE: The line below is what matches your output. The only difference from the line at the top of this answer is the term v.y in sum(v.y * ...); the by clause still uses v.x. Is that intentional?

unique(merge(DT, DT, by="n")[abs(t.x - t.y) <= 10, list(n, sum(v.y * abs(t.x - t.y))), by=list(t.x, v.x)])
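
The follow-up in the question (t and n stay fixed while v changes) is not addressed above. One possible approach, offered only as a sketch, is to build the table of neighbouring row pairs once and reuse it for every new v. The names id, idx, w, and compute_foo are hypothetical, and allow.cartesian = TRUE is again an assumption about newer data.table versions.

```r
library(data.table)

# Sample data copied from the question
DT <- data.table(n = c("a", "a", "a", "a", "a", "a", "b", "b", "b"),
                 t = c(10, 20, 33, 40, 50, 22, 25, 34, 11),
                 v = c(20, 15, 16, 17, 11, 12, 20, 22, 10))
DT[, id := .I]  # hypothetical row id

# Built once: which rows are neighbours, and the weight abs(t.x - t.y).
# Self pairs are dropped because their weight is 0.
key_cols <- DT[, .(n, id, t)]
idx <- merge(key_cols, key_cols, by = "n", allow.cartesian = TRUE)[
  abs(t.x - t.y) <= 10 & id.x != id.y,
  .(id.x, id.y, w = abs(t.x - t.y))]

# Hypothetical helper: recompute foo for any new v without redoing the join
compute_foo <- function(v, idx, n_rows) {
  foo <- numeric(n_rows)  # rows without neighbours stay at 0
  s <- idx[, .(s = sum(v[id.y] * w)), by = id.x]
  foo[s$id.x] <- s$s
  foo
}

foo <- compute_foo(DT$v, idx, nrow(DT))
foo
```

Only the cheap grouped sum is repeated when v changes; the quadratic merge runs a single time.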
