lapply vs for loop - Performance R
Question
It is often said that one should prefer lapply over for loops. There are some exceptions, as for example Hadley Wickham points out in his Advanced R book (http://adv-r.had.co.nz/Functionals.html) (modifying in place, recursion etc.). The following is one of those cases.
Just for the sake of learning, I tried to rewrite a perceptron algorithm in functional form in order to benchmark relative performance. Source: https://rpubs.com/FaiHas/197581.
This is the code:
# prepare input
data(iris)
irissubdf <- iris[1:100, c(1, 3, 5)]
names(irissubdf) <- c("sepal", "petal", "species")
head(irissubdf)
irissubdf$y <- 1
irissubdf[irissubdf[, 3] == "setosa", 4] <- -1
x <- irissubdf[, c(1, 2)]
y <- irissubdf[, 4]

# perceptron function with for
perceptron <- function(x, y, eta, niter) {
  # initialize weight vector
  weight <- rep(0, dim(x)[2] + 1)
  errors <- rep(0, niter)
  # loop over number of epochs niter
  for (jj in 1:niter) {
    # loop through training data set
    for (ii in 1:length(y)) {
      # Predict binary label using Heaviside activation function
      z <- sum(weight[2:length(weight)] * as.numeric(x[ii, ])) + weight[1]
      if (z < 0) {
        ypred <- -1
      } else {
        ypred <- 1
      }
      # Change weight - the formula doesn't do anything
      # if the predicted value is correct
      weightdiff <- eta * (y[ii] - ypred) * c(1, as.numeric(x[ii, ]))
      weight <- weight + weightdiff
      # Update error function
      if ((y[ii] - ypred) != 0) {
        errors[jj] <- errors[jj] + 1
      }
    }
  }
  # weight to decide between the two species
  return(errors)
}
err <- perceptron(x, y, 1, 10)
### my rewriting in functional form: auxiliary function
faux <- function(x, weight, y, eta) {
  err <- 0
  z <- sum(weight[2:length(weight)] * as.numeric(x)) + weight[1]
  if (z < 0) {
    ypred <- -1
  } else {
    ypred <- 1
  }
  # Change weight - the formula doesn't do anything
  # if the predicted value is correct
  weightdiff <- eta * (y - ypred) * c(1, as.numeric(x))
  weight <<- weight + weightdiff
  # Update error function
  if ((y - ypred) != 0) {
    err <- 1
  }
  err
}

weight <- rep(0, 3)
weightdiff <- rep(0, 3)

f <- function() {
  t <- replicate(10, sum(unlist(lapply(seq_along(irissubdf$y),
    function(i) {
      faux(irissubdf[i, 1:2], weight, irissubdf$y[i], 1)
    }))))
  weight <<- rep(0, 3)
  t
}
I did not expect any consistent improvement due to the aforementioned issues. Nevertheless, I was really surprised when I saw the sharp worsening using lapply and replicate.
I obtained these results using the microbenchmark function from the microbenchmark library.
What could possibly be the reasons? Could it be some memory leak?
                                                      expr       min         lq       mean     median         uq        max neval
                                                       f() 48670.878 50600.7200 52767.6871 51746.2530 53541.2440 109715.673   100
 perceptron(as.matrix(irissubdf[1:2]), irissubdf$y, 1, 10)  4184.131  4437.2990  4686.7506  4532.6655  4751.4795   6513.684   100
perceptronC(as.matrix(irissubdf[1:2]), irissubdf$y, 1, 10)    95.793   104.2045   123.7735   116.6065   140.5545    264.858   100
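For reference, a table like the one above comes from a call along these lines (a sketch; it assumes the microbenchmark package is installed and that perceptronC, the Rcpp version, has been sourced):

```r
library(microbenchmark)  # not in base R; install.packages("microbenchmark")

microbenchmark(
  f(),
  perceptron(as.matrix(irissubdf[1:2]), irissubdf$y, 1, 10),
  perceptronC(as.matrix(irissubdf[1:2]), irissubdf$y, 1, 10),
  times = 100  # each expression is evaluated 100 times (the neval column)
)
```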
The first function is the lapply/replicate function, the second is the function with for loops, and the third is the C++ function using Rcpp.
Here, according to Roland, is the profiling of the function. I am not sure I can interpret it in the right way. It looks to me like most of the time is spent in subsetting (Function profiling).
Answer
First of all, it is an already long-debunked myth that for loops are any slower than lapply. The for loops in R have been made a lot more performant and are currently at least as fast as lapply.
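As a quick, self-contained illustration (base R only; exact timings are machine-dependent), a preallocated for loop and lapply over the same task land in the same ballpark:

```r
# Square the numbers 1..1e5 once with a preallocated for loop and once
# with lapply; both produce the same result.
x <- 1:1e5

with_for <- function(x) {
  out <- numeric(length(x))  # preallocate: crucial for for-loop speed
  for (i in seq_along(x)) out[i] <- x[i]^2
  out
}

with_lapply <- function(x) {
  unlist(lapply(x, function(v) v^2))
}

identical(with_for(x), with_lapply(x))
system.time(for (r in 1:20) with_for(x))
system.time(for (r in 1:20) with_lapply(x))
```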
That said, you have to rethink your use of lapply here. Your implementation demands assigning to the global environment, because your code requires you to update the weight during the loop. And that is a valid reason to not consider lapply.
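If the sequential weight update is the only thing forcing <<-, one base-R alternative is to thread the state through Reduce instead of assigning to the global environment. This is a generic sketch (each element of updates stands in for one weightdiff), not the perceptron itself:

```r
# Reduce feeds the running weight into the next step, so no global
# assignment is needed to express the sequential dependency.
updates <- list(c(1, 0, 0), c(0, 2, 0), c(0, 0, 3))

weight_reduce <- Reduce(function(w, u) w + u, updates, init = rep(0, 3))

# A plain for loop expresses the same sequential dependency:
weight_for <- rep(0, 3)
for (u in updates) weight_for <- weight_for + u

identical(weight_reduce, weight_for)  # both give c(1, 2, 3)
```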
lapply is a function you should use for its side effects (or lack of side effects). The function lapply combines the results in a list automatically and doesn't mess with the environment you work in, contrary to a for loop. The same goes for replicate. See also this question:
The reason your lapply solution is far slower is that your way of using it creates a lot more overhead.
- replicate is nothing else but sapply internally, so you actually combine sapply and lapply to implement your double loop. sapply creates extra overhead because it has to test whether or not the result can be simplified. So a for loop will actually be faster than using replicate.
- Inside your lapply anonymous function, you have to access the dataframe for both x and y for every observation. This means that, contrary to your for loop, e.g. the function $ has to be called every time.
- Because you use these high-end functions, your lapply solution calls 49 functions, compared to your for solution that only calls 26. These extra functions for the lapply solution include calls to functions like match, structure, [[, names, %in%, sys.call, duplicated, ... All functions not needed by your for loop, as that one doesn't do any of these checks.
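The second point is easy to see in isolation: indexing a data.frame row dispatches `[.data.frame`, which triggers exactly the match/names/duplicated machinery listed above, while indexing a matrix row does not. A rough sketch (absolute timings vary):

```r
# Same values either way, but very different work per access.
df <- data.frame(a = runif(100), b = runif(100))
m  <- as.matrix(df)

system.time(for (r in 1:500) for (i in 1:100) df[i, ])  # `[.data.frame` each time
system.time(for (r in 1:500) for (i in 1:100) m[i, ])   # cheap internal indexing
```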
If you want to see where this extra overhead comes from, look at the internal code of replicate, unlist, sapply and simplify2array.
You can use the following code to get a better idea of where you lose your performance with lapply. Run this line by line!
Rprof(interval = 0.0001)
f()
Rprof(NULL)
fprof <- summaryRprof()$by.self

Rprof(interval = 0.0001)
perceptron(as.matrix(irissubdf[1:2]), irissubdf$y, 1, 10)
Rprof(NULL)
perprof <- summaryRprof()$by.self

fprof$Fun <- rownames(fprof)
perprof$Fun <- rownames(perprof)
Selftime <- merge(fprof, perprof,
                  all = TRUE,
                  by = 'Fun',
                  suffixes = c(".lapply", ".for"))

sum(!is.na(Selftime$self.time.lapply))
sum(!is.na(Selftime$self.time.for))
Selftime[order(Selftime$self.time.lapply, decreasing = TRUE),
         c("Fun", "self.time.lapply", "self.time.for")]
Selftime[is.na(Selftime$self.time.for), ]