Idiomatic R code for partitioning a vector by an index and performing an operation on that partition

Question

I'm trying to find the idiomatic way in R to partition a numeric vector by some index vector, find the sum of all numbers in each partition, and then divide each individual entry by its partition sum. In other words, if I start with this:

df <- data.frame(x = c(1,2,3,4,5,6), index = c('a', 'a', 'b', 'b', 'c', 'c'))

I want the output to be a vector (let's call it z):

c(1/(1+2), 2/(1+2), 3/(3+4), 4/(3+4), 5/(5+6), 6/(5+6))

If I were doing this in SQL and could use window functions, I would do this:

select 
 x / sum(x) over (partition by index) as z 
from df

And if I were using plyr, I would do something like this:

ddply(df, .(index), transform, z = x / sum(x))

But I'd like to know how to do it using the standard R functional programming tools such as mapply/aggregate.

Recommended answer

Yet another option is ave. For good measure, I've collected the other answers, tried my best to make their output equivalent (a vector), and provided timings over 1000 runs using your example data as input. First, my answer using ave: ave(df$x, df$index, FUN = function(z) z/sum(z)). I also show an example using the data.table package since it is usually pretty quick, but I know you're looking for base solutions, so you can ignore that if you want.
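As a quick check of the ave() call on the question's sample data, here is a minimal sketch. The names `df$z` and `group_totals` are illustrative, not part of the original answer. It stores the result as a new column and confirms that the shares within each group sum to 1.

```r
# Sample data from the question
df <- data.frame(x = c(1, 2, 3, 4, 5, 6),
                 index = c("a", "a", "b", "b", "c", "c"))

# ave() applies FUN within each index group and returns a vector
# aligned with the original rows, so it can be assigned as a column
df$z <- ave(df$x, df$index, FUN = function(v) v / sum(v))

# Sanity check: the shares within each group should sum to 1
group_totals <- tapply(df$z, df$index, sum)

print(df)
print(group_totals)
```

Because ave() returns one value per input row, no reshaping step is needed before attaching the result to the data frame.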

Now for a bunch of timings:

library(data.table)
library(plyr)
dt <- data.table(df)

# Each wrapper computes x / sum(x) within each index group.
# plyr(), agg() and d.t() return tabular results; the others return plain vectors.
plyr <- function() ddply(df, .(index), transform, z = x / sum(x))
av <- function() ave(df$x, df$index, FUN = function(z) z/sum(z))
t.apply <- function() unlist(tapply(df$x, df$index, function(x) x/sum(x)))
l.apply <- function() unlist(lapply(split(df$x, df$index), function(x){x/sum(x)}))
b.y <- function() unlist(by(df$x, df$index, function(x){x/sum(x)}))
agg <- function() aggregate(df$x, list(df$index), function(x){x/sum(x)})
d.t <- function() dt[, x/sum(x), by = index]

library(rbenchmark)
benchmark(plyr(), av(), t.apply(), l.apply(), b.y(), agg(), d.t(), 
           replications = 1000, 
           columns = c("test", "elapsed", "relative"),
           order = "elapsed")
#-----

       test elapsed  relative
4 l.apply()   0.052  1.000000
2      av()   0.168  3.230769
3 t.apply()   0.257  4.942308
5     b.y()   0.694 13.346154
6     agg()   1.020 19.615385
7     d.t()   2.380 45.769231
1    plyr()   5.119 98.442308

The lapply() solution seems to win in this case, and data.table() is surprisingly slow. Let's see how this scales to a bigger aggregation problem:

df <- data.frame(x = sample(1:100, 1e5, TRUE), index = gl(1000, 100))
dt <- data.table(df)

#Replication code omitted for brevity, used 100 replications and dropped plyr() since I know it 
#will be slow by comparison:
       test elapsed  relative
6     d.t()   2.052  1.000000
1      av()   2.401  1.170078
3 l.apply()   4.660  2.270955
2 t.apply()   9.500  4.629630
4     b.y()  16.329  7.957602
5     agg()  20.541 10.010234

That seems more consistent with what I'd expect.
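One caveat when comparing these approaches: the benchmark data is already sorted by index, so every method returns values in the same order. With interleaved groups, that no longer holds. The sketch below uses made-up data and a hypothetical helper `share` to show that ave() keeps the original row order, while split() + lapply() + unlist() returns values grouped by index. unsplit() restores the original order.

```r
# Hypothetical data with interleaved groups, to expose ordering differences
x <- c(1, 5, 2, 3, 6, 4)
index <- factor(c("a", "c", "a", "b", "c", "b"))

# Helper (illustrative name): each element's share of its group total
share <- function(v) v / sum(v)

# ave() keeps the original row order
by_ave <- ave(x, index, FUN = share)

# split() + lapply() + unlist() returns results grouped by index level
# (all of "a", then "b", then "c"), not in the original order
pieces <- lapply(split(x, index), share)
by_group <- unlist(pieces, use.names = FALSE)

# unsplit() reverses split(), putting the pieces back in the original order
restored <- unsplit(pieces, index)

print(by_ave)
print(by_group)
print(restored)
```

If the result has to line up with the rows of the original data frame, ave() or the unsplit() step is the safer choice. The plain unlist() versions are only safe when the data is already sorted by the grouping variable.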

In summary, you've got plenty of good options. Find one or two methods that fit your mental model of how aggregation tasks should work and master those functions. There are many ways to skin a cat.

Probably not large enough for Matt, but as big as my laptop can handle without crashing:

df <- data.frame(x = sample(1:100, 1e7, TRUE), index = gl(10000, 1000))
dt <- data.table(df)
#-----
       test elapsed  relative
6     d.t()    0.61  1.000000
1      av()    1.45  2.377049
3 l.apply()    4.61  7.557377
2 t.apply()    8.80 14.426230
4     b.y()    8.92 14.622951
5     agg()   18.20 29.836066
