R: speeding up "group by" operations


Question

I have a simulation with a huge aggregate-and-combine step right in the middle. I prototyped this process using plyr's ddply() function, which works great for a huge percentage of my needs. But I need this aggregation step to be faster, since I have to run 10K simulations. I'm already scaling the simulations in parallel, but if this one step were faster I could greatly decrease the number of nodes I need.

Here's a reasonable simplification of what I am trying to do:

library(plyr)   # for ddply()
library(Hmisc)  # for wtd.mean()

# Set up some example data
year <-    sample(1970:2008, 1e6, rep=T)
state <-   sample(1:50, 1e6, rep=T)
group1 <-  sample(1:6, 1e6, rep=T)
group2 <-  sample(1:3, 1e6, rep=T)
myFact <-  rnorm(1e6, 15, 1e6)  # length 1e6 so it lines up with the other columns
weights <- rnorm(1e6)
myDF <- data.frame(year, state, group1, group2, myFact, weights)

# This is the step I want to make faster: a weighted mean of myFact
# within each year/state/group1/group2 cell
system.time(aggregateDF <- ddply(myDF, c("year", "state", "group1", "group2"),
                     function(df) wtd.mean(df$myFact, weights=df$weights)
                                 )
           )

All tips or suggestions are appreciated!

Answer

Instead of a normal R data frame, you can use an immutable data frame, which returns pointers to the original when you subset it and can be much faster:

# idata.frame() (from plyr) wraps myDF without copying it
idf <- idata.frame(myDF)
system.time(aggregateDF <- ddply(idf, c("year", "state", "group1", "group2"),
   function(df) wtd.mean(df$myFact, weights=df$weights)))

#    user  system elapsed 
# 18.032   0.416  19.250 

If I were to write a plyr function customised exactly to this situation, I'd do something like this:

system.time({
  # One integer id per unique year/state/group1/group2 combination
  ids <- id(myDF[c("year", "state", "group1", "group2")], drop = TRUE)
  # Work on a numeric matrix: matrix subsetting is much cheaper than data frame subsetting
  data <- as.matrix(myDF[c("myFact", "weights")])
  # Row indices belonging to each group
  # (a plyr internal; its signature has changed across plyr versions)
  indices <- plyr:::split_indices(seq_len(nrow(data)), ids, n = attr(ids, "n"))

  # Weighted mean of myFact (column 1) with weights (column 2) for one group's rows
  fun <- function(rows) {
    weighted.mean(data[rows, 1], data[rows, 2])
  }
  values <- vapply(indices, fun, numeric(1))

  # Recover the grouping labels for each id and bind them to the results
  labels <- myDF[match(seq_len(attr(ids, "n")), ids),
    c("year", "state", "group1", "group2")]
  aggregateDF <- cbind(labels, values)
})

# user  system elapsed 
# 2.04    0.29    2.33 

It's so much faster because it avoids copying the data, extracting only the subset needed for each computation at the time it's computed. Switching the data to matrix form gives another speed boost because matrix subsetting is much faster than data frame subsetting.
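
As a rough illustration of that last point (not from the original answer, and assuming the myDF example data built above), the snippet below times the same repeated row extraction on a matrix and on a data frame; the absolute numbers will vary by machine, but the matrix version is typically far faster:

# Sketch: compare matrix vs. data frame row subsetting on the example data
m    <- as.matrix(myDF[c("myFact", "weights")])
df   <- myDF[c("myFact", "weights")]
rows <- sample(nrow(m), 100)

system.time(for (i in 1:10000) m[rows, ])    # matrix subsetting
system.time(for (i in 1:10000) df[rows, ])   # data frame subsetting: much slower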
