Why does sapply scale slower than for loop with sample size?

Question

So let's say I want to take the vector X = 2*1:N and raise e to the power of each element. (Yes, I recognize the best way to do that is simply the vectorized exp(X), but the point here is to compare a for loop with sapply.) I tested three methods (one with a for loop, two with sapply applied in different ways) over a range of sample sizes, measuring the elapsed time for each. I then plotted sample size N against time t for each method.

Each method is indicated by "#####".

k <- 20 
t1 <- rep(0,k) 
t2 <- rep(0,k)
t3 <- rep(0,k)
L <- round(10^seq(4,7,length=k))


for (i in 1:k) {
  X <- 2*1:L[i]
  Y1 <- rep(0,L[i])
  t <- system.time(for (j in 1:L[i]) Y1[j] <- exp(X[j]))[3] #####
  t1[i] <- t
}

for (i in 1:k) {
  X <- 2*1:L[i]
  t <- system.time( Y2 <- sapply(1:L[i], function(q) exp(X[q])) )[3] #####
  t2[i] <- t
}

for (i in 1:k) {
  X <- 2*1:L[i]
  t <- system.time( Y3 <- sapply(X, function(x) exp(x)) )[3] #####
  t3[i] <- t
}

plot(L, t3, type='l', col='green')
lines(L, t2,col='red')
lines(L, t1,col='blue')

plot(log(L), log(t1), type='l', col='blue')
lines(log(L), log(t2),col='red')
lines(log(L), log(t3), col='green')

We get the following results. Plot of N vs t:

Plot of log(N) vs log(t)

The blue plot is the for loop method, and the red and green plots are the sapply methods. In the regular plot, you can see that, as sample size gets larger, the for loop method is heavily favoured over the sapply methods, which is not what I would have expected at all. If you look at the log-log plot (in order to more easily distinguish the smaller N results) we see the expected result of sapply being more efficient than for loop for small N.

Does anybody know why sapply scales more slowly than for loop with sample size? Thanks.

Recommended answer

You're not accounting for the time it takes to allocate space for the resulting vector Y1. As the sample size increases, the time it takes to allocate Y1 becomes a larger share of the execution time, and the time it takes to do the replacement becomes a smaller share.
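To make that concrete, here is a rough sketch that times the allocation, the replacement loop, and both together. The size N and the names t_alloc, t_loop, t_both and Y1b are chosen only for this illustration and do not come from the question; the actual timings will vary by machine and R version.

```r
# Illustrative size only; N is an assumption, not a value from the question.
N <- 1e5
X <- 2 * 1:N

# Time the allocation of the result vector on its own.
t_alloc <- system.time(Y1 <- numeric(N))[["elapsed"]]

# Time only the element-by-element replacement, as the question's benchmark did.
t_loop <- system.time(for (j in seq_len(N)) Y1[j] <- exp(X[j]))[["elapsed"]]

# Time allocation and loop together, which is closer to the work sapply is timed on.
t_both <- system.time({
  Y1b <- numeric(N)
  for (j in seq_len(N)) Y1b[j] <- exp(X[j])
})[["elapsed"]]

# Compare the three elapsed times side by side.
c(alloc = t_alloc, loop = t_loop, both = t_both)
```

Timing the allocation inside the measured expression puts the for loop on the same footing as sapply, which has no way to exclude it.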

sapply always allocates memory for the result, so that's one reason it would be less efficient as sample size increases. gagolews also has a very good point about sapply calling simplify2array. That (likely) adds another copy.
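As a small illustration of that extra step, the sketch below splits sapply into its two stages by hand: lapply builds a list, then simplify2array flattens it. The variable names are made up for this example, and it only shows that the results match; it does not measure how many copies are made.

```r
X <- 2 * 1:10

# Stage 1: lapply returns a list of length-one numeric vectors.
as_list <- lapply(X, exp)

# Stage 2: the simplification sapply applies to that list.
flattened <- simplify2array(as_list, higher = FALSE)

# sapply does both stages in one call.
direct <- sapply(X, exp)

identical(flattened, direct)
identical(direct, unlist(as_list))
```

So for this use case sapply has to build the full list first and then build the final vector from it, on top of whatever the function calls themselves cost.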

After some more testing, it looks like lapply is still about the same or slower than a byte-compiled function containing a for loop, as the objects get larger. I'm not sure how to explain this, other than possibly this line in do_lapply:

if (MAYBE_REFERENCED(tmp)) tmp = lazy_duplicate(tmp);

Or possibly something with how lapply constructs the function call... but I'm mostly speculating.

Here's the code I used to test:

k <- 20 
t1 <- rep(0,k) 
t2 <- rep(0,k)
t3 <- rep(0,k)
L <- round(10^seq(4,6,length=k))

# put the loop in a function
fun <- function(X, L) {
  Y1 <- rep(0,L)
  for (j in 1:L)
    Y1[j] <- exp(X[j])
  Y1
}
# for loops often benefit from compiling
library(compiler)
cfun <- cmpfun(fun)

for (i in 1:k) {
  X <- 2*1:L[i]
  t1[i] <- system.time( Y1 <- fun(X, L[i]) )[3]
}
for (i in 1:k) {
  X <- 2*1:L[i]
  t2[i] <- system.time( Y2 <- cfun(X, L[i]) )[3]
}
for (i in 1:k) {
  X <- 2*1:L[i]
  t3[i] <- system.time( Y3 <- lapply(X, exp) )[3]
}
identical(Y1, Y2)          # TRUE
identical(Y1, unlist(Y3))  # TRUE
plot(L, t1, type='l', col='blue', log="xy", ylim=range(t1,t2,t3))
lines(L, t2, col='red')
lines(L, t3, col='green')
