R data.table多个条件联接 [英] R data.table Multiple Conditions Join

查看:65
本文介绍了R data.table多个条件联接的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我设计了一种解决方案,可以从两个单独的数据表的多个列中查找值,并添加一个基于新列的值计算(多次条件比较).下面的代码.它涉及使用data.table并在从两个表中计算值时进行联接,但是,这些表未联接在我正在比较的列上,因此我怀疑我可能无法获得data.tables固有的速度优势.我已经读了很多,并为能进入而感到兴奋.换句话说,我正在加入虚拟"列,所以我认为我没有适当地"加入.

I’ve devised a solution to lookup values from multiple columns of two separate data tables and add a new column based calculations of their values (multiple conditional comparisons). Code below. It involves using a data.table and join while calculating values from both tables, however, the tables aren’t joined on the columns I’m comparing, and therefore I suspect I may not be getting the speed advantages inherent to data.tables that I’ve read so much about and am excited about tapping into. Said another way, I’m joining on a ‘dummy’ column, so I don’t think I’m joining "properly."

给定X乘X的网格dtGrid和网格内X ^ 2个随机事件dtEvents的列表,以确定在每个网格点的1个单位半径内发生了多少个事件.代码如下.我选择了100 X 100的网格大小,这需要约1.5秒才能在我的计算机上运行联接.但是,如果不引入巨大的性能提升(200 X 200需要大约22秒),我就无法做得更大.

The exercise is, given an X by X grid dtGrid and a list of X^2 random Events dtEvents within the grid, to determine how many Events occur within a 1 unit radius of each grid point. The code is below. I picked a grid size of 100 X 100, which takes ~1.5 sec to run the join on my machine. But I can’t go much bigger without introducing an enormous performance hit (200 X 200 takes ~22 sec).

我真的很喜欢能够在我的val语句中添加多个条件的灵活性(例如,如果我想添加一堆AND和OR组合,我可以做到这一点),所以我想保留它功能.

I really like the flexibility of being able to add multiple conditions to my val statement (e.g., if I wanted to add a bunch of AND and OR combinations I could do that), so I'd like to retain that functionality.

有没有一种方法可以使用data.table正确地联接(或任何其他data.table解决方案)以达到更快/更有效的结果?

Is there a way to use data.table joins ‘properly’ (or any other data.table solution) to achieve a much speedier / efficient outcome?

非常感谢!

#Initialization stuff
library(data.table)
set.seed(77L)

#Set grid size constant
#Increasing this number to a value much larger than 100 will result in significantly longer run times
cstGridSize = 100L

#Create Grid
vecXYSquare <- seq(0, cstGridSize, 1)
dtGrid <- data.table(expand.grid(vecXYSquare, vecXYSquare))
setnames(dtGrid, 'Var1', 'x')
setnames(dtGrid, 'Var2', 'y')
dtGrid[, DummyJoin:='A']
setkey(dtGrid, DummyJoin)

#Create Events
xrand <- runif(cstGridSize^2, 0, cstGridSize + 1)
yrand <- runif(cstGridSize^2, 0, cstGridSize + 1)
dtEvents <- data.table(x=xrand, y=yrand)
dtEvents[, DummyJoin:='A']
dtEvents[, Counter:=1L]
setkey(dtEvents, DummyJoin)

#Return # of events within 1 unit radius of each grid point
system.time(
    dtEventsWithinRadius <- dtEvents[dtGrid, {
        val = Counter[(x - i.x)^2 + (y - i.y)^2 < 1^2];  #basic circle fomula: x^2 + y^2 = radius^2
        list(col_i.x=i.x, col_i.y=i.y, EventsWithinRadius=sum(val))
    }, by=.EACHI]
)

推荐答案

一个非常有趣的问题..以及by = .EACHI的大量使用!这是使用 NEW non-equi 从当前的开发版本v1.9.7 加入.

Very interesting question.. and great use of by = .EACHI! Here's another approach using the NEW non-equi joins from the current development version, v1.9.7.

问题:使用by=.EACHI是完全合理的,因为另一种选择是执行交叉连接(将dtGrid的每一行都连接到dtEvents的所有行),但这是太详尽,势必会迅速爆炸.

Issue: Your use of by=.EACHI is completely justified because the other alternative is to perform a cross join (each row of dtGrid joined to all rows of dtEvents) but that's too exhaustive and is bound to explode very quickly.

但是by = .EACHI equi-join 一起使用 dummy列执行,导致计算所有距离(但一次只能计算一次,因此内存效率高).也就是说,在您的代码中,对于每个dtGrid,仍然使用dtEvents计算所有可能的距离;因此它的扩展性不及预期.

However by = .EACHI is performed along with an equi-join using a dummy column, which results in computing all distances (except that it does one at a time, therefore memory efficient). That is, in your code, for each dtGrid, all possible distances are still computed with dtEvents; hence it doesn't scale as well as expected.

策略:然后,您同意可以接受的改进是限制将dtGrid的每一行连接到dtEvents 所导致的行数.

Strategy: Then you'd agree that an acceptable improvement is to restrict the number of rows that would result from joining each row of dtGrid to dtEvents.

(x_i, y_i)来自dtGrid(a_j, b_j)来自dtEvents,例如,其中1 <= i <= nrow(dtGrid)1 <= j <= nrow(dtEvents).然后,i = 1表示,需要提取所有满足(x1 - a_j)^2 + (y1 - b_j)^2 < 1j.只有在以下情况下才会发生这种情况:

Let (x_i, y_i) come from dtGrid and (a_j, b_j) come from from dtEvents, say, where 1 <= i <= nrow(dtGrid) and 1 <= j <= nrow(dtEvents). Then, i = 1 implies, all j that satisfies (x1 - a_j)^2 + (y1 - b_j)^2 < 1 needs to be extracted. That can only happen when:

(x1 - a_j)^2 < 1 AND (y1 - b_j)^2 < 1

这有助于极大地减少搜索空间,因为我们不必为dtGrid中的每一行查看dtEvents中的 all 行,而只需提取其中的行,

This helps reduce the search space drastically because, instead of looking at all rows in dtEvents for each row in dtGrid, we just have to extract those rows where,

a_j - 1 <= x1 <= a_j + 1 AND b_j - 1 <= y1 <= b_j + 1
# where '1' is the radius

此约束可以直接转换为 non-equi 连接,并可以像以前一样与by = .EACHI组合.唯一需要执行的附加步骤是构造列a_j-1, a_j+1, b_j-1, b_j+1,如下所示:

This constraint can be directly translated to a non-equi join, and combined with by = .EACHI as before. The only additional step required is to construct the columns a_j-1, a_j+1, b_j-1, b_j+1 as follows:

foo1 <- function(dt1, dt2) {
    dt2[, `:=`(xm=x-1, xp=x+1, ym=y-1, yp=y+1)]                   ## (1) 
    tmp = dt2[dt1, on=.(xm<=x, xp>=x, ym<=y, yp>=y), 
              .(sum((i.x-x)^2+(i.y-y)^2<1)), by=.EACHI, 
              allow=TRUE, nomatch=0L
          ][, c("xp", "yp") := NULL]                              ## (2)
    tmp[]
}

## (1)构造非等额联接所需的所有列(因为on=的公式中尚不允许使用表达式.

## (1) constructs all columns necessary for non-equi joins (since expressions are not allowed in the formula for on= yet.

## (2)执行非等距联接,该联接计算距离并检查dtGrid中每一行的受限组合上所有< 1距离-因此应为很多更快.

## (2) performs a non-equi join that computes distances and checks for all distances that are < 1 on the restricted set of combinations for each row in dtGrid -- hence should be much faster.

基准:

# Here's your code (modified to ensure identical column names etc..):
foo2 <- function(dt1, dt2) {
    ans = dt2[dt1, 
                {
                 val = Counter[(x - i.x)^2 + (y - i.y)^2 < 1^2];
                 .(xm=i.x, ym=i.y, V1=sum(val))
                }, 
            by=.EACHI][, "DummyJoin" := NULL]
    ans[]
}

# on grid size of 100:
system.time(ans1 <- foo1(dtGrid, dtEvents)) # 0.166s
system.time(ans2 <- foo2(dtGrid, dtEvents)) # 1.626s

# on grid size of 200:
system.time(ans1 <- foo1(dtGrid, dtEvents)) # 0.983s
system.time(ans2 <- foo2(dtGrid, dtEvents)) # 31.038s

# on grid size of 300:
system.time(ans1 <- foo1(dtGrid, dtEvents)) # 2.847s
system.time(ans2 <- foo2(dtGrid, dtEvents)) # 151.32s

identical(ans1[V1 != 0]L, ans2[V1 != 0L]) # TRUE for all of them

提速分别为〜10倍,32倍和53倍.

The speedups are ~10x, 32x and 53x respectively.

请注意,即使对于dtEvents中的单行,dtGrid中的行也无法满足条件(由于nomatch=0L).如果要这些行,还必须添加xm/xp/ym/yp列之一.并检查它们是否为NA(=不匹配).

Note that the rows in dtGrid for which the condition is not satisfied even for a single row in dtEvents will not be present in the result (due to nomatch=0L). If you want those rows, you'll have to also add one of the xm/xp/ym/yp cols.. and check them for NA (= no matches).

这就是我们必须删除所有全部 0个计数以获得相同= TRUE的原因.

This is the reason we had to remove all 0 counts to get identical = TRUE.

HTH

PS:请参见历史记录,了解另一个变体,在该变体中实现了整个连接,然后计算了距离并生成了计数.

PS: See history for another variation where the entire join is materialised and then the distance is computed and counts generated.

这篇关于R data.table多个条件联接的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆