R: efficiently computing summaries of value-subsets whose contents are determined by the relation between two variables


Question

I have two tables, A and B. For each row of table A, I want to get some summary statistics for B$value where B$location is within 100 of A$location. I've accomplished this with the for-loop below, but it is slow: it works well when the tables are small, but I would like to scale up to a table A with thousands of rows and a table B with nearly a million rows. Any ideas on how to achieve this? Thanks in advance!

The for-loop:

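for (i in seq_len(nrow(A))) {
   # rows of B whose location lies within 100 of the current row of A
   temp <- subset(B, abs(A$location[i] - B$location) <= 100)
   A$n[i] <- nrow(temp)          # number of matching rows
   A$sum[i] <- sum(temp$value)   # total of the matching values
   A$avg[i] <- mean(temp$value)  # NaN when nothing matches
}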

For example:

A
loc
150
250
400


B
loc value
25 7
77 19
170 10
320 15

would become:

A
loc n sum avg
150 2 29 14.5
250 2 25 12.5
400 1 15 15
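
To check the expected output, the loop can be run on the sample tables. This is a minimal sketch that assumes both tables are plain data frames and that the location column is called loc in both, as in the example. The result columns are pre-allocated here and the matching rows are picked with a logical mask, which the original loop does not do.

```r
# Sample tables from the example (column name `loc` assumed for both)
A <- data.frame(loc = c(150, 250, 400))
B <- data.frame(loc = c(25, 77, 170, 320), value = c(7, 19, 10, 15))

# Pre-allocate the result columns
A$n   <- integer(nrow(A))
A$sum <- numeric(nrow(A))
A$avg <- numeric(nrow(A))

for (i in seq_len(nrow(A))) {
  hit <- abs(A$loc[i] - B$loc) <= 100   # TRUE for rows of B inside the window
  A$n[i]   <- sum(hit)                  # count of matching rows
  A$sum[i] <- sum(B$value[hit])         # total of matching values
  A$avg[i] <- mean(B$value[hit])        # mean of matching values
}

print(A)
```

The work done is proportional to nrow(A) * nrow(B), which is why this approach slows down as the tables grow.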

Recommended answer

Similar to Matt Summersgill's answer, you could do a non-equi join (with A and B as data.tables) to update A:

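# add the window bounds as columns of A
A[, up := loc + 100]
A[, dn := loc - 100]
# non-equi join: for each row of A, summarise the rows of B with dn <= loc <= up
A[, c("n", "s", "m") :=
  B[copy(.SD), on=.(loc >= dn, loc <= up), .(.N, sum(value), mean(value)), by=.EACHI][, .(N, V2, V3)]
]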

Or in a single chained command:

A[, up := loc + 100][, dn := loc - 100][, c("n", "s", "m") := 
  B[copy(.SD), on=.(loc >= dn, loc <= up), 
    .(.N, sum(value), mean(value)), by=.EACHI][, 
    .(N, V2, V3)]
]

This should be fairly efficient, I guess.
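
Put together with the sample tables from the question, the update might look like the following. It is a sketch that assumes the data.table package is installed and that A and B are built as data.tables with the column names from the example. Dropping the helper columns up and dn at the end is an addition here, not part of the answer.

```r
library(data.table)

# Sample tables from the question, built as data.tables
A <- data.table(loc = c(150, 250, 400))
B <- data.table(loc = c(25, 77, 170, 320), value = c(7, 19, 10, 15))

# Window bounds for each row of A
A[, up := loc + 100][, dn := loc - 100]

# Non-equi join of B against the rows of A, summarised once per row of A
A[, c("n", "s", "m") :=
  B[copy(.SD), on = .(loc >= dn, loc <= up),
    .(.N, sum(value), mean(value)), by = .EACHI][, .(N, V2, V3)]]

# Assumption: the bounds are no longer needed once the summaries exist
A[, c("up", "dn") := NULL]

print(A)
```

The columns n, s and m correspond to n, sum and avg in the question's loop.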

How it works

Inside j of x[i, j], .SD refers to the subset of data from x (in this case, all of A).
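
A small illustration of what .SD holds may help. The table DT and its columns g and x are hypothetical and only serve this sketch, which assumes the data.table package is installed.

```r
library(data.table)

# Hypothetical toy table, not from the question
DT <- data.table(g = c("a", "a", "b"), x = 1:3)

# Without i or by, .SD covers every column and row of DT
all_cols <- DT[, names(.SD)]

# With by, .SD is the current group's rows (the grouping column is left out)
rows_per_group <- DT[, nrow(.SD), by = g]

print(all_cols)
print(rows_per_group)
```

In the answer, .SD is wrapped in copy() before being passed on as the i argument of the join, so the join works on an ordinary, independent data.table.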

x[i, on=, j, by=.EACHI] is a join that uses each row of i (in this case copy(.SD), which is A) to look up the matching rows of x (in this case B) under the conditions given in on=. j is then calculated once for each row of i, which is what by=.EACHI means.
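
The join step can also be run on its own to see what it returns before anything is assigned back to A. In this sketch the names bounds and per_row are hypothetical, the sample tables are those from the question, and the data.table package is assumed to be installed.

```r
library(data.table)

A <- data.table(loc = c(150, 250, 400))
B <- data.table(loc = c(25, 77, 170, 320), value = c(7, 19, 10, 15))

# One row of window bounds per row of A (plays the role of i in the join)
bounds <- A[, .(dn = loc - 100, up = loc + 100)]

# For every row of bounds, find rows of B with dn <= loc <= up and summarise them
per_row <- B[bounds, on = .(loc >= dn, loc <= up),
             .(n = .N, total = sum(value)), by = .EACHI]

print(per_row)
```

The result has one row per row of bounds, in the same order, which is what makes it safe to assign the summary columns straight back to A. As far as I know, the join columns come back under B's column name (loc), which is why the answer selects only the summary columns afterwards.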

When the expressions in j aren't named, names are assigned automatically: V1, V2, and so on. .N is named N by default.
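
The naming rule can be seen on a small grouped summary. The table DT and its columns are hypothetical, and the data.table package is assumed to be installed. Naming the expressions in j is shown as an alternative that avoids relying on the automatic names.

```r
library(data.table)

# Hypothetical toy table, not from the question
DT <- data.table(g = c("a", "a", "b"), value = c(1, 2, 3))

# Unnamed expressions: .N becomes N, the others are named by position
auto <- DT[, .(.N, sum(value), mean(value)), by = g]

# Named expressions: the names given in j are used as they are
named <- DT[, .(n = .N, s = sum(value), m = mean(value)), by = g]

print(names(auto))
print(names(named))
```

With explicit names, the final selection in the answer could be written as .(n, s, m) instead of .(N, V2, V3).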
