aggregate less efficient than loops?


Problem description

I was trying to do this operation on a big table, to count rows with different combinations of a and b in a data.table X:

Y <- aggregate(c ~ a + b, X, length)   # one row per (a, b) combination; column c holds the counts

And it was taking forever (I stopped it after 30 minutes), though RAM usage stayed steady.

Then I tried to loop manually through the values of b and aggregate only on a (technically still aggregating on b, but with a single value of b each time):

sub_agg <- list()
unique_bs <- unique(X$b)
for (b_it in unique_bs){
  sub_agg[[length(sub_agg) + 1]] <- aggregate(c ~ a + b, subset(X, b == b_it), length)
}
Y <- do.call(rbind, sub_agg)

And it was done in 3 minutes.

I may as well go further and get rid of aggregate completely and only do operations on subsets.

Is aggregate less efficient than nested loops and operations on subsets, or is this a special case?

Aggregations are often the parts of code that take the most time, so I'm now thinking of always trying loops instead; I'd like to understand better what's happening here.

Additional information:

X has 20 million rows

50 distinct values for b

15,000 distinct values for a

Answer

Yes, aggregate is less efficient than the loops you use there, because:

  • aggregate becomes disproportionately slower as the number of data points increases. Your second solution uses aggregate on small subsets. One of the reasons is that aggregate depends on sorting, and sorting is not done in O(n) time (see the sketch after this list).
  • aggregate also uses expand.grid internally, which creates a data frame with all possible combinations of all unique values in the variables a and b. You can see this in the internal code of aggregate.data.frame. This function, too, becomes disproportionately slower as the number of observations rises.
  • edit: my last point didn't really make sense, as you do combine everything in one data frame.
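To make the first point concrete, here is a minimal sketch (my addition, not from the original answer; the sizes and the helper name timing_for are arbitrary) that times aggregate at two row counts. Quadrupling the rows typically far more than quadruples the elapsed time:

# Hypothetical helper: time aggregate() on n random rows shaped like X.
timing_for <- function(n){
  d <- data.frame(a = sample(1500, n, replace = TRUE),
                  b = sample(50, n, replace = TRUE),
                  c = 1)
  system.time(aggregate(c ~ a + b, d, length))["elapsed"]
}

timing_for(1e5)
timing_for(4e5)   # usually well over 4x the elapsed time of the 1e5 run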

That said, there is absolutely no reason to use aggregate here. You get to the data frame Y by simply using table:

thecounts <- with(X, table(a, b))   # contingency table with counts for every (a, b) combination
Y <- as.data.frame(thecounts)       # long format: columns a, b and Freq

This solution is a whole lot faster than the solution you came up with using aggregate. 68 times faster on my machine, to be precise...
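As an aside beyond the original answer: the question mentions that X is a data.table, and data.table can count rows per group directly with .N. Unlike table, this returns only the (a, b) combinations that actually occur in the data:

library(data.table)

# Count rows per (a, b) group; .N is the number of rows in each group.
Y_dt <- as.data.table(X)[, .N, by = .(a, b)]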

Benchmark:

        test replications elapsed relative 
1  aggloop()            1   15.03   68.318 
2 tableway()            1    0.22    1.000 

Code for benchmarking (note that I made everything a bit smaller so as not to block my R session for too long):

nrows <- 20e5   # 2 million rows (scaled down from the 20 million in the question)

X <- data.frame(
  a = factor(sample(seq_len(15e2), nrows, replace = TRUE)),   # 1,500 distinct values (down from 15,000)
  b = factor(sample(seq_len(50), nrows, replace = TRUE)),     # 50 distinct values
  c = 1
)

aggloop <- function(){
  sub_agg <- list()
  unique_bs <- unique(X$b)
  for (b_it in unique_bs){
    sub_agg[[length(sub_agg) + 1]] <- aggregate(c ~ a + b, subset(X, b == b_it), length)
  }
  Y <- do.call(rbind, sub_agg)
}

tableway <- function(){
  thecounts <- with(X, table(a,b))
  Y <- as.data.frame(thecounts)
}

library(rbenchmark)

benchmark(aggloop(),
          tableway(),
          replications = 1
          )
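As a quick sanity check (my addition, not part of the original answer), one can verify that both functions produce the same counts. Note that table also reports combinations with zero occurrences, so those are dropped before comparing:

agg_res <- aggloop()
tab_res <- tableway()
tab_nonzero <- tab_res[tab_res$Freq > 0, ]   # drop (a, b) pairs that never occur

cmp <- merge(agg_res, tab_nonzero, by = c("a", "b"))
all(cmp$c == cmp$Freq)               # should be TRUE
nrow(agg_res) == nrow(tab_nonzero)   # should also be TRUE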

