How to speed up summarise and ddply?
Question
I have a data frame with 2 million rows, and 15 columns. I want to group by 3 of these columns with ddply (all 3 are factors, and there are 780,000 unique combinations of these factors), and get the weighted mean of 3 columns (with weights defined by my data set). The following is reasonably quick:
system.time(a2 <- aggregate(cbind(col1,col2,col3) ~ fac1 + fac2 + fac3, data=aggdf, FUN=mean))
#    user  system elapsed
#  91.358   4.747 115.727
The problem is that I want to use weighted.mean instead of mean to calculate my aggregate columns.
If I try the following ddply call on the same data frame (note that I wrap it in idata.frame to make it immutable), it does not finish within 20 minutes:
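x <- ddply(idata.frame(aggdf),
           c("fac1","fac2","fac3"),
           summarise,
           col1=weighted.mean(col1, w),
           col2=weighted.mean(col2, w),
           col3=weighted.mean(col3, w),
           # summarise evaluates its arguments in order, so w is summed last;
           # otherwise the weighted.mean calls would see the scalar sum.
           w=sum(w))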
This operation seems to be CPU hungry, but not very RAM-intensive.
So I ended up writing this little function, which "cheats" a bit by taking advantage of the fact that a weighted mean is just sum(x*w)/sum(w): it does one multiplication and one division on the whole object, rather than on each slice.
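weighted_mean_cols <- function(df, bycols, aggcols, weightcol) {
  # Multiply every value column by the weights, once, for all rows.
  df[,aggcols] <- df[,aggcols]*df[,weightcol]
  # Sum the weights and the weighted values within each group.
  df <- aggregate(df[,c(weightcol, aggcols)], by=as.list(df[,bycols]), sum)
  # Divide the weighted sums by the summed weights to get weighted means.
  df[,aggcols] <- df[,aggcols]/df[,weightcol]
  df
}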
When I run it as:
a2 <- weighted_mean_cols(aggdf, c("fac1","fac2","fac3"), c("col1","col2","col3"),"w")
I get good performance, and somewhat reusable, elegant code.
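The property the function relies on can be checked on a small example. The sketch below uses a hypothetical toy data frame (the names toy, g, x and w are made up for illustration) and compares calling weighted.mean once per group with the multiply, sum, divide approach:

```r
# Toy data (hypothetical): one grouping column, one value column, one weight column.
toy <- data.frame(g = c("a", "a", "b", "b", "b"),
                  x = c(1, 3, 2, 4, 6),
                  w = c(1, 3, 1, 1, 2))

# Slice by slice: call weighted.mean once per group.
per_group <- sapply(split(toy, toy$g), function(d) weighted.mean(d$x, d$w))

# Whole object: multiply once, sum within groups, divide once.
toy$xw <- toy$x * toy$w
sums <- aggregate(toy[, c("xw", "w")], by = list(g = toy$g), FUN = sum)
whole <- setNames(sums$xw / sums$w, as.character(sums$g))

# Both routes should give the same weighted mean for every group.
all.equal(per_group, whole)
```

The second route only ever calls sum inside the grouping step, which is why it avoids the per-slice overhead of weighted.mean.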
Recommended answer
If you're going to use your edit, why not use rowsum and save yourself a few minutes of execution time?
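# Simulated data: 3 value columns, 3 grouping columns (100 levels each), 1 weight column.
nr <- 2e6
nc <- 3
aggdf <- data.frame(matrix(rnorm(nr*nc),nr,nc),
                    matrix(sample(100,nr*nc,TRUE),nr,nc), rnorm(nr))
colnames(aggdf) <- c("col1","col2","col3","fac1","fac2","fac3","w")
system.time({
  # Sum the weighted values and the weights within each factor combination.
  aggsums <- rowsum(data.frame(aggdf[,c("col1","col2","col3")]*aggdf$w,w=aggdf$w),
                    interaction(aggdf[,c("fac1","fac2","fac3")]))
  # Weighted mean = weighted sum / sum of weights.
  agg_wtd_mean <- aggsums[,1:3]/aggsums[,4]
})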
# user system elapsed
# 16.21 0.77 16.99
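One thing to watch with rowsum is that the grouping columns survive only as row names, joined by interaction's default "." separator. Below is a minimal sketch of splitting them back into columns and checking the result against weighted.mean. It is run at a much smaller scale, the names small, grp, keys and result are hypothetical, and it assumes no factor level itself contains a ".":

```r
set.seed(1)  # arbitrary seed, only so the sketch is reproducible
n <- 1000
small <- data.frame(col1 = rnorm(n),
                    fac1 = sample(3, n, TRUE),
                    fac2 = sample(3, n, TRUE),
                    w    = runif(n))

# Weighted sums and weight totals per group, following the rowsum pattern.
grp  <- interaction(small[, c("fac1", "fac2")], drop = TRUE)
sums <- rowsum(data.frame(col1 = small$col1 * small$w, w = small$w), grp)

# Split the "fac1.fac2" row names back into one column per factor.
# Assumption: no factor level contains a "." character.
keys <- do.call(rbind, strsplit(rownames(sums), ".", fixed = TRUE))
result <- data.frame(fac1 = keys[, 1], fac2 = keys[, 2],
                     col1 = sums$col1 / sums$w, w = sums$w)

# Reference: weighted.mean applied to each group separately.
slow <- sapply(split(small, grp), function(d) weighted.mean(d$col1, d$w))
all.equal(result$col1, unname(slow[rownames(sums)]))
```

If the factor levels may contain ".", passing a different sep to interaction and splitting on that separator instead is one way around it.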