lapply vs for loop - Performance R
Question
It is often said that one should prefer lapply over for loops. There are some exceptions, as for example Hadley Wickham points out in his Advanced R book (http://adv-r.had.co.nz/Functionals.html): modifying in place, recursion, etc. The following is one of those cases.
Just for the sake of learning, I tried to rewrite a perceptron algorithm in a functional form in order to benchmark relative performance. Source: https://rpubs.com/FaiHas/197581
Here is the code.
# prepare input
data(iris)
irissubdf <- iris[1:100, c(1, 3, 5)]
names(irissubdf) <- c("sepal", "petal", "species")
head(irissubdf)
irissubdf$y <- 1
irissubdf[irissubdf[, 3] == "setosa", 4] <- -1
x <- irissubdf[, c(1, 2)]
y <- irissubdf[, 4]

# perceptron function with for
perceptron <- function(x, y, eta, niter) {
  # initialize weight vector
  weight <- rep(0, dim(x)[2] + 1)
  errors <- rep(0, niter)
  # loop over number of epochs niter
  for (jj in 1:niter) {
    # loop through training data set
    for (ii in 1:length(y)) {
      # Predict binary label using Heaviside activation function
      z <- sum(weight[2:length(weight)] * as.numeric(x[ii, ])) + weight[1]
      if (z < 0) {
        ypred <- -1
      } else {
        ypred <- 1
      }
      # Change weight - the formula doesn't do anything
      # if the predicted value is correct
      weightdiff <- eta * (y[ii] - ypred) * c(1, as.numeric(x[ii, ]))
      weight <- weight + weightdiff
      # Update error function
      if ((y[ii] - ypred) != 0) {
        errors[jj] <- errors[jj] + 1
      }
    }
  }
  # weight to decide between the two species
  return(errors)
}

err <- perceptron(x, y, 1, 10)
### my rewriting in functional form: auxiliary function
faux <- function(x, weight, y, eta) {
  err <- 0
  z <- sum(weight[2:length(weight)] * as.numeric(x)) + weight[1]
  if (z < 0) {
    ypred <- -1
  } else {
    ypred <- 1
  }
  # Change weight - the formula doesn't do anything
  # if the predicted value is correct
  weightdiff <- eta * (y - ypred) * c(1, as.numeric(x))
  weight <<- weight + weightdiff
  # Update error function
  if ((y - ypred) != 0) {
    err <- 1
  }
  err
}

weight <- rep(0, 3)
weightdiff <- rep(0, 3)

f <- function() {
  t <- replicate(10, sum(unlist(lapply(seq_along(irissubdf$y), function(i) {
    faux(irissubdf[i, 1:2], weight, irissubdf$y[i], 1)
  }))))
  weight <<- rep(0, 3)
  t
}
I did not expect any consistent improvement due to the aforementioned issues. But nevertheless I was really surprised when I saw the sharp worsening using lapply and replicate.
I obtained these results using the microbenchmark function from the microbenchmark library:
What could possibly be the reasons? Could it be some memory leak?
                                                     expr       min         lq       mean     median         uq        max neval
                                                      f() 48670.878 50600.7200 52767.6871 51746.2530 53541.2440 109715.673   100
perceptron(as.matrix(irissubdf[1:2]), irissubdf$y, 1, 10)  4184.131  4437.2990  4686.7506  4532.6655  4751.4795   6513.684   100
perceptronC(as.matrix(irissubdf[1:2]), irissubdf$y, 1, 10)   95.793   104.2045   123.7735   116.6065   140.5545    264.858   100
The first is the lapply/replicate function.
The second is the function with for loops.
The third is implemented in C++ using Rcpp.
Here, according to Roland, is the profiling of the function. I am not sure I can interpret it the right way. It looks to me like most of the time is spent in subsetting. [Function profiling]
Answer
First of all, it is an already long-debunked myth that for loops are any slower than lapply. The for loops in R have been made a lot more performant and are currently at least as fast as lapply.
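A quick way to check this claim yourself is to time the same elementwise task both ways. This is a minimal sketch, not the benchmark from the question; names like time_for are made up here, and exact timings depend on your machine and R version:

```r
# Minimal sketch: time the same squaring task with a for loop and with lapply.
# The point is only that both land in the same order of magnitude.
x <- as.numeric(1:100000)

time_for <- system.time({
  out_for <- numeric(length(x))
  for (i in seq_along(x)) out_for[i] <- x[i]^2
})

time_lapply <- system.time({
  out_lapply <- unlist(lapply(x, function(v) v^2))
})

identical(out_for, out_lapply)  # both produce the same result
print(rbind(for_loop = time_for, lapply = time_lapply))
```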
That said, you have to rethink your use of lapply here. Your implementation demands assigning to the global environment, because your code requires you to update the weight during the loop. And that is a valid reason not to consider lapply.
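If one really wanted a functional version without the global assignment, the weight would have to be threaded through explicitly, for example with Reduce folding over the observation indices. The following is a hypothetical sketch in the spirit of the question's faux update; update_step and the state list are my own names, not from the original code:

```r
# Sketch: carry the weight through a fold with Reduce instead of
# assigning to the global environment with <<-.
# update_step is a hypothetical helper, not part of the original code.
update_step <- function(state, i, x, y, eta) {
  z <- sum(state$weight[-1] * as.numeric(x[i, ])) + state$weight[1]
  ypred <- if (z < 0) -1 else 1
  state$weight <- state$weight + eta * (y[i] - ypred) * c(1, as.numeric(x[i, ]))
  state$errors <- state$errors + (y[i] != ypred)  # logical coerced to 0/1
  state
}

data(iris)
x <- iris[1:100, c(1, 3)]                            # sepal and petal length
y <- ifelse(iris$Species[1:100] == "setosa", -1, 1)  # binary labels

init  <- list(weight = rep(0, 3), errors = 0)
final <- Reduce(function(s, i) update_step(s, i, x, y, eta = 1),
                seq_along(y), init)
final$errors  # misclassifications accumulated over one epoch
```

This keeps the update pure at the cost of building an intermediate state list per step, so it is a readability exercise rather than a performance fix.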
lapply is a function you should use for its side effects (or lack of side effects). The function lapply combines the results in a list automatically and doesn't mess with the environment you work in, contrary to a for loop. The same goes for replicate. See also this question:
The reason your lapply solution is far slower is that your way of using it creates a lot more overhead.
- replicate is nothing else but sapply internally, so you actually combine sapply and lapply to implement your double loop. sapply creates extra overhead because it has to test whether or not the result can be simplified. So a for loop will actually be faster than using replicate.
- Inside your lapply anonymous function, you have to access the data frame for both x and y for every observation. This means that, contrary to your for loop, e.g. the function $ has to be called every time.
- Because you use these high-end functions, your lapply solution calls 49 functions, compared to your for solution that only calls 26. These extra functions for the lapply solution include calls to functions like match, structure, [[, names, %in%, sys.call, duplicated, ... all functions not needed by your for loop, as that one doesn't do any of these checks.
If you want to see where this extra overhead comes from, look at the internal code of replicate, unlist, sapply, and simplify2array.
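For the first of these you don't even need to read sources on disk: printing replicate at the console shows that its body is essentially a single sapply call (the exact deparsed form may vary slightly between R versions):

```r
# replicate in base R is a thin wrapper around sapply; print it to confirm.
print(replicate)
# The body is essentially:
#   sapply(integer(n), eval.parent(substitute(function(...) expr)),
#          simplify = simplify)
```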
You can use the following code to get a better idea of where you lose performance with lapply. Run it line by line!
Rprof(interval = 0.0001)
f()
Rprof(NULL)
fprof <- summaryRprof()$by.self
Rprof(interval = 0.0001)
perceptron(as.matrix(irissubdf[1:2]), irissubdf$y, 1, 10)
Rprof(NULL)
perprof <- summaryRprof()$by.self
fprof$Fun <- rownames(fprof)
perprof$Fun <- rownames(perprof)
Selftime <- merge(fprof, perprof,
all = TRUE,
by = 'Fun',
suffixes = c(".lapply",".for"))
sum(!is.na(Selftime$self.time.lapply))
sum(!is.na(Selftime$self.time.for))
Selftime[order(Selftime$self.time.lapply, decreasing = TRUE),
c("Fun","self.time.lapply","self.time.for")]
Selftime[is.na(Selftime$self.time.for),]