R: efficiently computing summaries of value-subsets whose contents are determined by the relation between two variables
Problem description
I have two tables, `A` and `B`. For each row of table `A`, I want to get some summary statistics for `B$value` where the value of `B$location` is within `100` of `A$location`. I've accomplished this using the for-loop below. It works well when the tables are small, but it is slow, and I would like to scale up to a table `A` with thousands of rows and a table `B` with nearly a million rows. Any ideas of how to achieve this? Thanks in advance!
The for-loop:
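for (i in 1:nrow(A)) {
  # rows of B whose location is within 100 of the i-th location in A
  temp <- subset(B, abs(A$location[i] - B$location) <= 100)
  A$n[i] <- nrow(temp)          # number of matching rows
  A$sum[i] <- sum(temp$value)   # sum of their values
  A$avg[i] <- mean(temp$value)  # mean of their values
}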
For example:
A
loc
150
250
400
B
loc  value
25   7
77   19
170  10
320  15
would become:
A
loc  n  sum  avg
150  2  29   14.5
250  2  25   12.5
400  1  15   15
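To make the example easy to reproduce, here is a minimal sketch in base R that builds the two sample tables and runs the same loop logic. It assumes the columns are named `loc` and `value`, as in the tables above (the loop in the question refers to `location`), and it preallocates the result columns, which the original loop does not do.

```r
# Sample data from the question; the column names `loc` and `value` are assumed
A <- data.frame(loc = c(150, 250, 400))
B <- data.frame(loc = c(25, 77, 170, 320), value = c(7, 19, 10, 15))

# Preallocate the result columns so each iteration only fills one cell
A$n <- NA_integer_
A$sum <- NA_real_
A$avg <- NA_real_

for (i in seq_len(nrow(A))) {
  # values of B whose loc is within 100 of the i-th loc in A
  temp <- B$value[abs(A$loc[i] - B$loc) <= 100]
  A$n[i] <- length(temp)
  A$sum[i] <- sum(temp)
  A$avg[i] <- mean(temp)
}

print(A)
```

The loop scans all of `B` once per row of `A`, so the work grows with `nrow(A) * nrow(B)`, which is why it slows down at the sizes mentioned in the question.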
Recommended answer
Similar to Matt Summersgill's answer, you could do a non-equi join to update `A`:
A[, up := loc + 100]
A[, dn := loc - 100]
A[, c("n", "s", "m") :=
    B[copy(.SD), on=.(loc >= dn, loc <= up),
      .(.N, sum(value), mean(value)), by=.EACHI][, .(N, V2, V3)]
]
or in one chained command:
A[, up := loc + 100][, dn := loc - 100][, c("n", "s", "m") :=
    B[copy(.SD), on=.(loc >= dn, loc <= up),
      .(.N, sum(value), mean(value)), by=.EACHI][,
      .(N, V2, V3)]
]
This should be fairly efficient, I guess.
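As a quick check, the join can be run end to end on the sample data from the question. This is only a sketch: it assumes the `data.table` package is installed and that the tables are `data.table` objects with columns `loc` and `value`.

```r
library(data.table)

# Sample data from the question, built as data.tables (assumed column names)
A <- data.table(loc = c(150, 250, 400))
B <- data.table(loc = c(25, 77, 170, 320), value = c(7, 19, 10, 15))

# Bounds of the window around each location in A
A[, up := loc + 100]
A[, dn := loc - 100]

# Non-equi join: for each row of A, summarise the rows of B inside [dn, up]
A[, c("n", "s", "m") :=
    B[copy(.SD), on = .(loc >= dn, loc <= up),
      .(.N, sum(value), mean(value)), by = .EACHI][, .(N, V2, V3)]]

print(A)
```

The columns `n`, `s` and `m` should match the `n`, `sum` and `avg` columns of the expected result in the question.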
How it works
Inside `j` of `x[i, j]`, `.SD` refers to the subset of data from `x` (in this case it's all of `A`).
`x[i, on=, j, by=.EACHI]` is a join, using each row of `i` (in this case `copy(.SD)` == `A`) to look up matching rows of `x` (in this case `B`) using the conditions in `on=`. For each row of `i`, `j` is calculated (which is what `by=.EACHI` means).
When the elements of `j` don't have names, they are assigned automatically: `V1`, `V2`, and so on. `.N` by default gets named `N`.
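The naming behaviour can be seen by running the join on its own. The sketch below, again assuming `data.table` and the sample data from the question, compares an unnamed `j` with a named one; `res_auto` and `res_named` are names made up for this illustration.

```r
library(data.table)

# Sample data from the question (assumed column names)
A <- data.table(loc = c(150, 250, 400))
B <- data.table(loc = c(25, 77, 170, 320), value = c(7, 19, 10, 15))
A[, `:=`(dn = loc - 100, up = loc + 100)]

# Unnamed j: the summary columns are auto-named N, V2, V3
res_auto <- B[A, on = .(loc >= dn, loc <= up),
              .(.N, sum(value), mean(value)), by = .EACHI]
print(names(res_auto))

# Named j: the summary columns carry the names given in .()
res_named <- B[A, on = .(loc >= dn, loc <= up),
               .(n = .N, s = sum(value), m = mean(value)), by = .EACHI]
print(names(res_named))
```

Naming the elements of `j` is a matter of taste here; it avoids having to remember which of `V2` and `V3` is the sum and which is the mean when selecting them afterwards.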