R data.table多个条件联接 [英] R data.table Multiple Conditions Join
问题描述
我设计了一种解决方案,可以从两个单独的数据表的多个列中查找值,并添加一个基于新列的值计算(多次条件比较).下面的代码.它涉及使用data.table并在从两个表中计算值时进行联接,但是,这些表未联接在我正在比较的列上,因此我怀疑我可能无法获得data.tables固有的速度优势.我已经读了很多,并为能进入而感到兴奋.换句话说,我正在加入虚拟"列,所以我认为我没有适当地"加入.
I’ve devised a solution to lookup values from multiple columns of two separate data tables and add a new column based calculations of their values (multiple conditional comparisons). Code below. It involves using a data.table and join while calculating values from both tables, however, the tables aren’t joined on the columns I’m comparing, and therefore I suspect I may not be getting the speed advantages inherent to data.tables that I’ve read so much about and am excited about tapping into. Said another way, I’m joining on a ‘dummy’ column, so I don’t think I’m joining "properly."
给定X乘X的网格dtGrid
和网格内X ^ 2个随机事件dtEvents
的列表,以确定在每个网格点的1个单位半径内发生了多少个事件.代码如下.我选择了100 X 100的网格大小,这需要约1.5秒才能在我的计算机上运行联接.但是,如果不引入巨大的性能提升(200 X 200需要大约22秒),我就无法做得更大.
The exercise is, given an X by X grid dtGrid
and a list of X^2 random Events dtEvents
within the grid, to determine how many Events occur within a 1 unit radius of each grid point. The code is below. I picked a grid size of 100 X 100, which takes ~1.5 sec to run the join on my machine. But I can’t go much bigger without introducing an enormous performance hit (200 X 200 takes ~22 sec).
我真的很喜欢能够在我的val
语句中添加多个条件的灵活性(例如,如果我想添加一堆AND和OR组合,我可以做到这一点),所以我想保留它功能.
I really like the flexibility of being able to add multiple conditions to my val
statement (e.g., if I wanted to add a bunch of AND and OR combinations I could do that), so I'd like to retain that functionality.
有没有一种方法可以使用data.table正确地联接(或任何其他data.table解决方案)以达到更快/更有效的结果?
Is there a way to use data.table joins ‘properly’ (or any other data.table solution) to achieve a much speedier / efficient outcome?
非常感谢!
#Initialization stuff
library(data.table)
set.seed(77L)
#Set grid size constant
#Increasing this number to a value much larger than 100 will result in significantly longer run times
cstGridSize = 100L
#Create Grid
vecXYSquare <- seq(0, cstGridSize, 1)
dtGrid <- data.table(expand.grid(vecXYSquare, vecXYSquare))
setnames(dtGrid, 'Var1', 'x')
setnames(dtGrid, 'Var2', 'y')
dtGrid[, DummyJoin:='A']
setkey(dtGrid, DummyJoin)
#Create Events
xrand <- runif(cstGridSize^2, 0, cstGridSize + 1)
yrand <- runif(cstGridSize^2, 0, cstGridSize + 1)
dtEvents <- data.table(x=xrand, y=yrand)
dtEvents[, DummyJoin:='A']
dtEvents[, Counter:=1L]
setkey(dtEvents, DummyJoin)
#Return # of events within 1 unit radius of each grid point
system.time(
dtEventsWithinRadius <- dtEvents[dtGrid, {
val = Counter[(x - i.x)^2 + (y - i.y)^2 < 1^2]; #basic circle fomula: x^2 + y^2 = radius^2
list(col_i.x=i.x, col_i.y=i.y, EventsWithinRadius=sum(val))
}, by=.EACHI]
)
推荐答案
一个非常有趣的问题..以及by = .EACHI
的大量使用!这是使用 NEW non-equi 从当前的开发版本v1.9.7 加入.
Very interesting question.. and great use of by = .EACHI
! Here's another approach using the NEW non-equi joins from the current development version, v1.9.7.
问题:使用by=.EACHI
是完全合理的,因为另一种选择是执行交叉连接(将dtGrid
的每一行都连接到dtEvents
的所有行),但这是太详尽,势必会迅速爆炸.
Issue: Your use of by=.EACHI
is completely justified because the other alternative is to perform a cross join (each row of dtGrid
joined to all rows of dtEvents
) but that's too exhaustive and is bound to explode very quickly.
但是by = .EACHI
与 equi-join 一起使用 dummy列执行,导致计算所有距离(但一次只能计算一次,因此内存效率高).也就是说,在您的代码中,对于每个dtGrid
,仍然使用dtEvents
计算所有可能的距离;因此它的扩展性不及预期.
However by = .EACHI
is performed along with an equi-join using a dummy column, which results in computing all distances (except that it does one at a time, therefore memory efficient). That is, in your code, for each dtGrid
, all possible distances are still computed with dtEvents
; hence it doesn't scale as well as expected.
策略:然后,您同意可以接受的改进是限制将dtGrid
的每一行连接到dtEvents
所导致的行数.
Strategy: Then you'd agree that an acceptable improvement is to restrict the number of rows that would result from joining each row of dtGrid
to dtEvents
.
让(x_i, y_i)
来自dtGrid
和(a_j, b_j)
来自dtEvents
,例如,其中1 <= i <= nrow(dtGrid)
和1 <= j <= nrow(dtEvents)
.然后,i = 1
表示,需要提取所有满足(x1 - a_j)^2 + (y1 - b_j)^2 < 1
的j
.只有在以下情况下才会发生这种情况:
Let (x_i, y_i)
come from dtGrid
and (a_j, b_j)
come from from dtEvents
, say, where 1 <= i <= nrow(dtGrid)
and 1 <= j <= nrow(dtEvents)
. Then, i = 1
implies, all j
that satisfies (x1 - a_j)^2 + (y1 - b_j)^2 < 1
needs to be extracted. That can only happen when:
(x1 - a_j)^2 < 1 AND (y1 - b_j)^2 < 1
这有助于极大地减少搜索空间,因为我们不必为dtGrid
中的每一行查看dtEvents
中的 all 行,而只需提取其中的行,
This helps reduce the search space drastically because, instead of looking at all rows in dtEvents
for each row in dtGrid
, we just have to extract those rows where,
a_j - 1 <= x1 <= a_j + 1 AND b_j - 1 <= y1 <= b_j + 1
# where '1' is the radius
此约束可以直接转换为 non-equi 连接,并可以像以前一样与by = .EACHI
组合.唯一需要执行的附加步骤是构造列a_j-1, a_j+1, b_j-1, b_j+1
,如下所示:
This constraint can be directly translated to a non-equi join, and combined with by = .EACHI
as before. The only additional step required is to construct the columns a_j-1, a_j+1, b_j-1, b_j+1
as follows:
foo1 <- function(dt1, dt2) {
dt2[, `:=`(xm=x-1, xp=x+1, ym=y-1, yp=y+1)] ## (1)
tmp = dt2[dt1, on=.(xm<=x, xp>=x, ym<=y, yp>=y),
.(sum((i.x-x)^2+(i.y-y)^2<1)), by=.EACHI,
allow=TRUE, nomatch=0L
][, c("xp", "yp") := NULL] ## (2)
tmp[]
}
## (1)
构造非等额联接所需的所有列(因为on=
的公式中尚不允许使用表达式.
## (1)
constructs all columns necessary for non-equi joins (since expressions are not allowed in the formula for on=
yet.
## (2)
执行非等距联接,该联接计算距离并检查dtGrid
中每一行的受限组合上所有< 1
距离-因此应为很多更快.
## (2)
performs a non-equi join that computes distances and checks for all distances that are < 1
on the restricted set of combinations for each row in dtGrid
-- hence should be much faster.
基准:
# Here's your code (modified to ensure identical column names etc..):
foo2 <- function(dt1, dt2) {
ans = dt2[dt1,
{
val = Counter[(x - i.x)^2 + (y - i.y)^2 < 1^2];
.(xm=i.x, ym=i.y, V1=sum(val))
},
by=.EACHI][, "DummyJoin" := NULL]
ans[]
}
# on grid size of 100:
system.time(ans1 <- foo1(dtGrid, dtEvents)) # 0.166s
system.time(ans2 <- foo2(dtGrid, dtEvents)) # 1.626s
# on grid size of 200:
system.time(ans1 <- foo1(dtGrid, dtEvents)) # 0.983s
system.time(ans2 <- foo2(dtGrid, dtEvents)) # 31.038s
# on grid size of 300:
system.time(ans1 <- foo1(dtGrid, dtEvents)) # 2.847s
system.time(ans2 <- foo2(dtGrid, dtEvents)) # 151.32s
identical(ans1[V1 != 0]L, ans2[V1 != 0L]) # TRUE for all of them
提速分别为〜10倍,32倍和53倍.
The speedups are ~10x, 32x and 53x respectively.
请注意,即使对于dtEvents
中的单行,dtGrid
中的行也无法满足条件(由于nomatch=0L
).如果要这些行,还必须添加xm/xp/ym/yp
列之一.并检查它们是否为NA
(=不匹配).
Note that the rows in dtGrid
for which the condition is not satisfied even for a single row in dtEvents
will not be present in the result (due to nomatch=0L
). If you want those rows, you'll have to also add one of the xm/xp/ym/yp
cols.. and check them for NA
(= no matches).
这就是我们必须删除所有全部 0个计数以获得相同= TRUE
的原因.
This is the reason we had to remove all 0 counts to get identical = TRUE
.
HTH
PS:请参见历史记录,了解另一个变体,在该变体中实现了整个连接,然后计算了距离并生成了计数.
PS: See history for another variation where the entire join is materialised and then the distance is computed and counts generated.
这篇关于R data.table多个条件联接的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!