R: efficiently computing summaries of value-subsets whose contents are determined by the relation between two variables

Published: 2020/10/15 20:16:43. Tags: r, for-loop, dataframe, data.table, coding-efficiency


Question

I have two tables, A and B. For each row of table A, I want to get some summary statistics for B$value over the rows where B$location is within 100 of A$location. I've accomplished this using the for-loop below. It works well when the tables are small, but it is slow, and I would like to scale up to a table A with thousands of rows and a table B with nearly a million rows. Any ideas on how to achieve this? Thanks in advance!

The for-loop:

for (i in seq_len(nrow(A))) {
   # rows of B whose location is within 100 of the i-th location in A
   temp <- subset(B, abs(A$location[i] - B$location) <= 100)
   A$n[i] <- nrow(temp)
   A$sum[i] <- sum(temp$value)
   A$avg[i] <- mean(temp$value)
}

For example:

A
loc
150
250
400

B
loc   value
25    7
77    19
170   10
320   15

would become:

A
loc   n   sum   avg
150   2   29    14.5
250   2   25    12.5
400   1   15    15

Recommended answer

Similar to Matt Summersgill's answer, you could do a non-equi join to update A:

A[, up := loc + 100]
A[, dn := loc - 100]
A[, c("n", "s", "m") := 
  B[copy(.SD), on=.(loc >= dn, loc <= up), .(.N, sum(value), mean(value)), by=.EACHI][, .(N, V2, V3)]
]

Or in a single chained command:

A[, up := loc + 100][, dn := loc - 100][, c("n", "s", "m") := 
  B[copy(.SD), on=.(loc >= dn, loc <= up), 
    .(.N, sum(value), mean(value)), by=.EACHI][, 
    .(N, V2, V3)]
]

This should be fairly efficient, I guess.
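As a quick check, the join can be run on the small example from the question. This is only a sketch: it assumes the data.table package is installed, that both tables use the column name loc (as in the example tables, rather than location as in the loop), and that the helper columns up and dn can be dropped afterwards.

```r
library(data.table)

# Example tables from the question (column names assumed to be loc / value).
A <- data.table(loc = c(150, 250, 400))
B <- data.table(loc = c(25, 77, 170, 320), value = c(7, 19, 10, 15))

# Bounds of the window around each location in A.
A[, up := loc + 100]
A[, dn := loc - 100]

# Non-equi join: for each row of A, summarise the rows of B inside [dn, up].
A[, c("n", "s", "m") :=
  B[copy(.SD), on = .(loc >= dn, loc <= up),
    .(.N, sum(value), mean(value)), by = .EACHI][, .(N, V2, V3)]
]

# Drop the helper columns (an optional tidy-up, not part of the answer).
A[, c("up", "dn") := NULL]
print(A)
```

The columns n, s and m should line up with the n, sum and avg columns of the expected table in the question.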

How it works

Inside j of x[i, j], .SD refers to the subset of data from x (in this case, all of A).
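A small sketch of what .SD holds here, assuming the data.table package and the example table from the question with the two bound columns already added. The variable names sd_names and sd_rows are made up for illustration.

```r
library(data.table)

# Example table from the question, plus the bound columns used by the answer.
A <- data.table(loc = c(150, 250, 400))
A[, up := loc + 100]
A[, dn := loc - 100]

# With no grouping and no .SDcols, .SD inside j holds every column and row of A.
sd_names <- A[, names(.SD)]
sd_rows  <- A[, nrow(.SD)]
print(sd_names)
print(sd_rows)
```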

x[i, on=, j, by=.EACHI] is a join: each row of i (in this case copy(.SD), which is a copy of A) is used to look up matching rows of x (in this case B) under the conditions in on=. j is then calculated once for each row of i, which is what by=.EACHI means.
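To see what by=.EACHI changes, the same non-equi join can be run with and without it. The sketch below assumes the data.table package and the example data from the question; it joins on A directly instead of copy(.SD), and the names matched and per_row are illustrative only.

```r
library(data.table)

A <- data.table(loc = c(150, 250, 400))
A[, up := loc + 100]
A[, dn := loc - 100]
B <- data.table(loc = c(25, 77, 170, 320), value = c(7, 19, 10, 15))

# Without j: one output row per matching (row of A, row of B) pair.
matched <- B[A, on = .(loc >= dn, loc <= up)]

# With j and by = .EACHI: j is evaluated once per row of A,
# over the rows of B that matched that row.
per_row <- B[A, on = .(loc >= dn, loc <= up), .(n = .N), by = .EACHI]

print(nrow(matched))
print(per_row$n)
```

On this data the plain join should return five rows (2 + 2 + 1 matches), while the by=.EACHI version returns one row per row of A.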

When the expressions in j are unnamed, column names are assigned automatically: V1, V2, and so on. .N is named N by default. That is why the result is selected with .(N, V2, V3).
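One way to avoid depending on the automatic names is to name the expressions in j. The following is a sketch of that variation, not the answer's own code; it assumes the data.table package and the example data from the question, and the names unnamed and named are illustrative.

```r
library(data.table)

A <- data.table(loc = c(150, 250, 400))
A[, up := loc + 100]
A[, dn := loc - 100]
B <- data.table(loc = c(25, 77, 170, 320), value = c(7, 19, 10, 15))

# Unnamed expressions in j: the summary columns get automatic names.
unnamed <- B[A, on = .(loc >= dn, loc <= up),
             .(.N, sum(value), mean(value)), by = .EACHI]

# Named expressions in j: the summary columns keep the names given.
named <- B[A, on = .(loc >= dn, loc <= up),
           .(n = .N, s = sum(value), m = mean(value)), by = .EACHI]

# The last three columns are the summaries; the first two are the join columns.
print(tail(names(unnamed), 3))
print(tail(names(named), 3))
```

With named columns, the final selection step could be written as .(n, s, m) instead of .(N, V2, V3).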
