lapply vs for loop - Performance R


Question

It is often said that one should prefer lapply over for loops. There are some exceptions, as Hadley Wickham points out in his Advanced R book (http://adv-r.had.co.nz/Functionals.html): modifying in place, recursion, etc. The following is one such case.

Just for the sake of learning, I tried to rewrite a perceptron algorithm in a functional form in order to benchmark relative performance. Source: https://rpubs.com/FaiHas/197581.

Here is the code.

# prepare input
data(iris)
irissubdf <- iris[1:100, c(1, 3, 5)]
names(irissubdf) <- c("sepal", "petal", "species")
head(irissubdf)
irissubdf$y <- 1
irissubdf[irissubdf[, 3] == "setosa", 4] <- -1
x <- irissubdf[, c(1, 2)]
y <- irissubdf[, 4]

# perceptron function with for
perceptron <- function(x, y, eta, niter) {

  # initialize weight vector
  weight <- rep(0, dim(x)[2] + 1)
  errors <- rep(0, niter)


  # loop over number of epochs niter
  for (jj in 1:niter) {

    # loop through training data set
    for (ii in 1:length(y)) {

      # Predict binary label using Heaviside activation
      # function
      z <- sum(weight[2:length(weight)] * as.numeric(x[ii, 
        ])) + weight[1]
      if (z < 0) {
        ypred <- -1
      } else {
        ypred <- 1
      }

      # Change weight - the formula doesn't do anything
      # if the predicted value is correct
      weightdiff <- eta * (y[ii] - ypred) * c(1, 
        as.numeric(x[ii, ]))
      weight <- weight + weightdiff

      # Update error function
      if ((y[ii] - ypred) != 0) {
        errors[jj] <- errors[jj] + 1
      }

    }
  }

  # weight to decide between the two species

  return(errors)
}

err <- perceptron(x, y, 1, 10)

### my rewriting in functional form auxiliary
### function
faux <- function(x, weight, y, eta) {
  err <- 0
  z <- sum(weight[2:length(weight)] * as.numeric(x)) + 
    weight[1]
  if (z < 0) {
    ypred <- -1
  } else {
    ypred <- 1
  }

  # Change weight - the formula doesn't do anything
  # if the predicted value is correct
  weightdiff <- eta * (y - ypred) * c(1, as.numeric(x))
  weight <<- weight + weightdiff

  # Update error function
  if ((y - ypred) != 0) {
    err <- 1
  }
  err
}

weight <- rep(0, 3)
weightdiff <- rep(0, 3)

f <- function() {
  t <- replicate(10, sum(unlist(lapply(seq_along(irissubdf$y), 
    function(i) {
      faux(irissubdf[i, 1:2], weight, irissubdf$y[i], 
        1)
    }))))
  weight <<- rep(0, 3)
  t
}

I did not expect any consistent improvement, because of the issues mentioned above. Nevertheless, I was really surprised when I saw how sharply performance worsened with lapply and replicate.

I obtained these results using the microbenchmark function from the microbenchmark package.

What could possibly be the reasons? Could it be some memory leak?

                                                       expr       min         lq       mean     median         uq        max neval
                                                        f() 48670.878 50600.7200 52767.6871 51746.2530 53541.2440 109715.673   100
  perceptron(as.matrix(irissubdf[1:2]), irissubdf$y, 1, 10)  4184.131  4437.2990  4686.7506  4532.6655  4751.4795   6513.684   100
 perceptronC(as.matrix(irissubdf[1:2]), irissubdf$y, 1, 10)    95.793   104.2045   123.7735   116.6065   140.5545    264.858   100
The first is the lapply/replicate function.

The second is the function with for loops.

The third is the same function in C++ using Rcpp.

Here, following Roland's suggestion, is the profiling of the function. I am not sure I can interpret it in the right way. It looks to me like most of the time is spent in subsetting.

[Figure: Function profiling]

Recommended answer

First of all, it is a long-debunked myth that for loops are any slower than lapply. The for loops in R have been made a lot more performant and are currently at least as fast as lapply.

That said, you have to rethink your use of lapply here. Your implementation demands assigning to the global environment, because your code requires you to update the weight during the loop. And that is a valid reason not to consider lapply.
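To illustrate the point about updating the weight, here is a minimal sketch of one way to keep a functional style without assigning to the global environment: the running state is passed from one observation to the next with base R's Reduce. This is a swapped-in technique, not the code from the question, and it is not claimed to be faster than the for loop. The names perceptron_step and epoch are hypothetical, introduced only for this example.

# Hypothetical helper: one perceptron update that returns the new state
# (weight vector and error count) instead of assigning with <<-.
perceptron_step <- function(state, i, x, y, eta) {
  xi <- x[i, ]
  z <- sum(state$weight[-1] * xi) + state$weight[1]
  ypred <- if (z < 0) -1 else 1
  list(weight = state$weight + eta * (y[i] - ypred) * c(1, xi),
       errors = state$errors + (y[i] != ypred))
}

# Same data as in the question, stored as a matrix and a numeric vector.
data(iris)
x <- as.matrix(iris[1:100, c(1, 3)])
y <- ifelse(iris$Species[1:100] == "setosa", -1, 1)

# Hypothetical helper: one pass over the data. Reduce threads the state
# through all observations, starting from the given weight.
epoch <- function(weight) {
  Reduce(function(state, i) perceptron_step(state, i, x, y, eta = 1),
         seq_along(y),
         list(weight = weight, errors = 0))
}

weight <- rep(0, 3)
errors <- numeric(10)
for (jj in 1:10) {
  res <- epoch(weight)
  weight <- res$weight
  errors[jj] <- res$errors
}
errors

The state still has to be carried along explicitly, which is exactly why a plain for loop is the more natural fit for this algorithm.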

lapply is a function you should use for its side effects (or rather, its lack of side effects). The function lapply combines the results in a list automatically and doesn't mess with the environment you work in, contrary to a for loop. The same goes for replicate. See also this question:

Is R's apply family more than syntactic sugar?
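As a small illustration of the "doesn't mess with the environment" point, the following sketch runs the same computation both ways and then checks which temporaries are left behind. The variable names (tmp, tmp2, squares_for, squares_lapply) are made up for this example, and it assumes the code is run at top level in a fresh session.

# A for loop runs in the calling environment, so its temporaries stay there.
squares_for <- numeric(3)
for (i in 1:3) {
  tmp <- i^2
  squares_for[i] <- tmp
}
exists("tmp", inherits = FALSE)    # tmp is still defined after the loop

# lapply evaluates the function body in its own environment and
# collects the return values in a list.
squares_lapply <- lapply(1:3, function(k) {
  tmp2 <- k^2
  tmp2
})
exists("tmp2", inherits = FALSE)   # tmp2 never reached the calling environment

identical(unlist(squares_lapply), squares_for)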

The reason your lapply solution is far slower is that the way you use it creates a lot more overhead.

  • replicate is nothing but sapply internally, so you actually combine sapply and lapply to implement your double loop. sapply creates extra overhead because it has to test whether or not the result can be simplified. So a for loop will actually be faster than using replicate.
  • Inside your lapply anonymous function, you have to access the data frame for both x and y for every observation. This means that, contrary to your for loop, the function $ (for example) has to be called every time.
  • Because you use these high-level functions, your lapply solution calls 49 functions, compared to your for solution, which only calls 26. These extra functions for the lapply solution include calls to functions like match, structure, [[, names, %in%, sys.call, duplicated, ... None of these are needed by your for loop, as it doesn't do any of these checks.
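The cost of per-observation data frame access can be seen in isolation with a rough sketch like the one below. It extracts the same row from a data frame and from the equivalent matrix and times both with base R's system.time. The names df, m and n are chosen for this example only, and the timings depend on the machine, so no particular numbers are implied.

data(iris)
df <- iris[1:100, c(1, 3)]
m  <- as.matrix(df)

# Both access paths yield the same numeric values.
row_df <- as.numeric(df[50, 1:2])
row_m  <- as.numeric(m[50, ])
identical(row_df, row_m)

# Repeat each access many times; data frame indexing goes through
# `[.data.frame`, which does far more checking than matrix indexing.
n <- 2000
t_df <- system.time(for (k in seq_len(n)) as.numeric(df[50, 1:2]))
t_m  <- system.time(for (k in seq_len(n)) as.numeric(m[50, ]))
c(data.frame = t_df[["elapsed"]], matrix = t_m[["elapsed"]])

Note that the benchmark in the question passes as.matrix(irissubdf[1:2]) to the for-loop version, while f() indexes the data frame row by row.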

If you want to see where this extra overhead comes from, look at the internal code of replicate, unlist, sapply and simplify2array.
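One way to do that is simply to print the functions, since printing a closure shows its R-level source. The sketch below also lists the names used in each body; calls_in is a hypothetical helper written for this example, and the results reflect the base R versions I am aware of and could differ in others.

# Print the R-level source of the functions involved.
print(replicate)
print(sapply)

# Hypothetical helper: all names (including function names) used in a body.
calls_in <- function(f) all.names(body(f), functions = TRUE, unique = TRUE)

"sapply" %in% calls_in(replicate)                    # replicate delegates to sapply
c("lapply", "simplify2array") %in% calls_in(sapply)  # sapply wraps lapply, then simplifies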

You can use the following code to get a better idea of where you lose performance with lapply. Run this line by line!

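# profile the lapply/replicate version
Rprof(interval = 0.0001)
f()
Rprof(NULL)
fprof <- summaryRprof()$by.self

# profile the for-loop version
Rprof(interval = 0.0001)
perceptron(as.matrix(irissubdf[1:2]), irissubdf$y, 1, 10) 
Rprof(NULL)
perprof <- summaryRprof()$by.self

# keep the function names as a column so the two profiles can be merged
fprof$Fun <- rownames(fprof)
perprof$Fun <- rownames(perprof)

Selftime <- merge(fprof, perprof,
                  all = TRUE,
                  by = 'Fun',
                  suffixes = c(".lapply",".for"))

# number of distinct functions seen in each profile
sum(!is.na(Selftime$self.time.lapply))
sum(!is.na(Selftime$self.time.for))

# self time per function, slowest in the lapply version first
Selftime[order(Selftime$self.time.lapply, decreasing = TRUE),
         c("Fun","self.time.lapply","self.time.for")]

# functions that only show up in the lapply version
Selftime[is.na(Selftime$self.time.for),]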
