Idiomatic R code for partitioning a vector by an index and performing an operation on that partition

Problem description

I'm trying to find the idiomatic way in R to partition a numerical vector by some index vector, find the sum of all numbers in that partition and then divide each individual entry by that partition sum. In other words, if I start with this:

df <- data.frame(x = c(1,2,3,4,5,6), index = c('a', 'a', 'b', 'b', 'c', 'c'))

I want the output to create a vector (let's call it z):

c(1/(1+2), 2/(1+2), 3/(3+4), 4/(3+4), 5/(5+6), 6/(5+6))

If I were doing this in SQL and could use window functions, I would do this:

select 
 x / sum(x) over (partition by index) as z 
from df

and if I were using plyr, I would do something like this:

ddply(df, .(index), transform, z = x / sum(x))

but I'd like to know how to do it using the standard R functional programming tools like mapply/aggregate etc.

Solution

Yet another option is ave. For good measure, I've collected the answers above, tried my best to make their output equivalent (a vector), and provided timings over 1000 runs using your example data as input. First, my answer using ave: ave(df$x, df$index, FUN = function(z) z/sum(z)). I also show an example using the data.table package since it is usually pretty quick, but I know you're looking for base solutions, so you can ignore that if you want.
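
As a minimal sketch of what ave is doing here, the snippet below runs it on the example data from the question and compares the result with the desired vector. The names z and expected are introduced only for this illustration.

```r
df <- data.frame(x = c(1, 2, 3, 4, 5, 6),
                 index = c("a", "a", "b", "b", "c", "c"))

# ave() applies FUN within each level of the grouping vector and returns
# a vector with the same length and the same order as df$x
z <- ave(df$x, df$index, FUN = function(v) v / sum(v))

# hypothetical check value, written out by hand from the question
expected <- c(1/3, 2/3, 3/7, 4/7, 5/11, 6/11)
isTRUE(all.equal(z, expected))
```

Because the result lines up with the rows of df, it can be assigned straight back as a column, for example df$z <- z.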

And now a bunch of timings:

library(data.table)
library(plyr)
dt <- data.table(df)

plyr <- function() ddply(df, .(index), transform, z = x / sum(x))
av <- function() ave(df$x, df$index, FUN = function(z) z/sum(z))
t.apply <- function() unlist(tapply(df$x, df$index, function(x) x/sum(x)))
l.apply <- function() unlist(lapply(split(df$x, df$index), function(x){x/sum(x)}))
b.y <- function() unlist(by(df$x, df$index, function(x){x/sum(x)}))
agg <- function() aggregate(df$x, list(df$index), function(x){x/sum(x)})
d.t <- function() dt[, x/sum(x), by = index]

library(rbenchmark)
benchmark(plyr(), av(), t.apply(), l.apply(), b.y(), agg(), d.t(), 
           replications = 1000, 
           columns = c("test", "elapsed", "relative"),
           order = "elapsed")
#-----

       test elapsed  relative
4 l.apply()   0.052  1.000000
2      av()   0.168  3.230769
3 t.apply()   0.257  4.942308
5     b.y()   0.694 13.346154
6     agg()   1.020 19.615385
7     d.t()   2.380 45.769231
1    plyr()   5.119 98.442308

The lapply() solution seems to win in this case, and data.table is surprisingly slow. Let's see how this scales to a bigger aggregation problem:

df <- data.frame(x = sample(1:100, 1e5, TRUE), index = gl(1000, 100))
dt <- data.table(df)

# Replication code omitted for brevity; used 100 replications and dropped plyr() since I know it
# will be slow by comparison:
       test elapsed  relative
6     d.t()   2.052  1.000000
1      av()   2.401  1.170078
3 l.apply()   4.660  2.270955
2 t.apply()   9.500  4.629630
4     b.y()  16.329  7.957602
5     agg()  20.541 10.010234

That seems more consistent with what I'd expect.
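
The replication code for the larger run is not shown in the answer. One way to get comparable relative timings without rbenchmark is sketched below using only base R's system.time. The seed, the choice of just two candidates, and the name timings are assumptions made for this illustration, and the numbers will differ from the table above depending on machine and R version.

```r
set.seed(1)  # assumption: the original does not set a seed
df <- data.frame(x = sample(1:100, 1e5, TRUE), index = gl(1000, 100))

av <- function() ave(df$x, df$index, FUN = function(z) z / sum(z))
l.apply <- function() unlist(lapply(split(df$x, df$index),
                                    function(x) x / sum(x)))

# elapsed seconds for 100 replications of each candidate
timings <- sapply(list(av = av, l.apply = l.apply),
                  function(f) system.time(for (i in 1:100) f())[["elapsed"]])

# relative timings, scaled so the fastest candidate is 1
timings / min(timings)
```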

In summary, you've got plenty of good options. Find one or two methods that work with your mental model of how aggregation tasks should work and master that function. Many ways to skin a cat.
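
One thing worth checking when picking among these options is the order of the result. In the example data the groups are contiguous, so every approach returns values in row order. When the index is interleaved, unlist(lapply(split(...))) returns values grouped by level, while ave keeps the original row order. The sketch below uses a small made-up vector (x and g are illustrative names, not from the question) and shows base R's unsplit as one way to restore row order.

```r
x <- c(1, 2, 3, 4)
g <- c("a", "b", "a", "b")   # interleaved groups

parts <- lapply(split(x, g), function(v) v / sum(v))

by_group <- unlist(parts)       # ordered by group: both a values, then both b values
by_row   <- unsplit(parts, g)   # put back into the original row order
via_ave  <- ave(x, g, FUN = function(v) v / sum(v))

isTRUE(all.equal(by_row, via_ave))            # same values, same order
isTRUE(all.equal(unname(by_group), via_ave))  # differs when groups interleave
```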

Edit - and an example with 1e7 rows

Probably not large enough for Matt, but as big as my laptop can handle without crashing:

df <- data.frame(x = sample(1:100, 1e7, TRUE), index = gl(10000, 1000))
dt <- data.table(df)
#-----
       test elapsed  relative
6     d.t()    0.61  1.000000
1      av()    1.45  2.377049
3 l.apply()    4.61  7.557377
2 t.apply()    8.80 14.426230
4     b.y()    8.92 14.622951
5     agg()   18.20 29.83606
