lapply vs for loop - Performance R

Question

It is often said that one should prefer lapply over for loops. There are some exceptions, as Hadley Wickham points out in his Advanced R book (http://adv-r.had.co.nz/Functionals.html): modifying in place, recursion, etc. The following is one such case.

Just for the sake of learning, I tried to rewrite a perceptron algorithm in a functional form in order to benchmark the relative performance. Source: https://rpubs.com/FaiHas/197581

Here is the code.

# prepare input
data(iris)
irissubdf <- iris[1:100, c(1, 3, 5)]
names(irissubdf) <- c("sepal", "petal", "species")
head(irissubdf)
irissubdf$y <- 1
irissubdf[irissubdf[, 3] == "setosa", 4] <- -1
x <- irissubdf[, c(1, 2)]
y <- irissubdf[, 4]

# perceptron function with for
perceptron <- function(x, y, eta, niter) {

  # initialize weight vector
  weight <- rep(0, dim(x)[2] + 1)
  errors <- rep(0, niter)


  # loop over number of epochs niter
  for (jj in 1:niter) {

    # loop through training data set
    for (ii in 1:length(y)) {

      # Predict binary label using Heaviside activation
      # function
      z <- sum(weight[2:length(weight)] * as.numeric(x[ii, ])) + weight[1]
      if (z < 0) {
        ypred <- -1
      } else {
        ypred <- 1
      }

      # Change weight - the formula doesn't do anything
      # if the predicted value is correct
      weightdiff <- eta * (y[ii] - ypred) * c(1, as.numeric(x[ii, ]))
      weight <- weight + weightdiff

      # Update error function
      if ((y[ii] - ypred) != 0) {
        errors[jj] <- errors[jj] + 1
      }

    }
  }

  # weight to decide between the two species

  return(errors)
}

err <- perceptron(x, y, 1, 10)

### my rewrite in functional form: auxiliary function
faux <- function(x, weight, y, eta) {
  err <- 0
  z <- sum(weight[2:length(weight)] * as.numeric(x)) + 
    weight[1]
  if (z < 0) {
    ypred <- -1
  } else {
    ypred <- 1
  }

  # Change weight - the formula doesn't do anything
  # if the predicted value is correct
  weightdiff <- eta * (y - ypred) * c(1, as.numeric(x))
  weight <<- weight + weightdiff

  # Update error function
  if ((y - ypred) != 0) {
    err <- 1
  }
  err
}

weight <- rep(0, 3)
weightdiff <- rep(0, 3)

f <- function() {
  t <- replicate(10, sum(unlist(lapply(seq_along(irissubdf$y), 
    function(i) {
      faux(irissubdf[i, 1:2], weight, irissubdf$y[i], 
        1)
    }))))
  weight <<- rep(0, 3)
  t
}

I did not expect any consistent improvement because of the issues mentioned above. Nevertheless, I was really surprised by the sharp slowdown when using lapply and replicate.

I obtained these results using the microbenchmark function from the microbenchmark package.

What could the reasons be? Could it be a memory leak?

                                                       expr       min         lq       mean     median         uq        max neval
                                                        f() 48670.878 50600.7200 52767.6871 51746.2530 53541.2440 109715.673   100
  perceptron(as.matrix(irissubdf[1:2]), irissubdf$y, 1, 10)  4184.131  4437.2990  4686.7506  4532.6655  4751.4795   6513.684   100
 perceptronC(as.matrix(irissubdf[1:2]), irissubdf$y, 1, 10)    95.793   104.2045   123.7735   116.6065   140.5545    264.858   100
The first is the lapply/replicate function.

The second is the function with for loops.

The third is the same function in C++ using Rcpp.

Here, as suggested by Roland, is the profiling of the function. I am not sure I am interpreting it correctly. It looks to me as though most of the time is spent in subsetting.

(Figure: Function profiling)

Recommended answer

First of all, it is a long-debunked myth that for loops are any slower than lapply. for loops in R have become much faster over time and are currently at least as fast as lapply.

That said, you have to rethink your use of lapply here. Your implementation requires assigning to the global environment (with <<-), because your code has to update the weight during the loop. That is a valid reason not to use lapply.
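If a functional form is still wanted, one possible way to carry the weight from one observation to the next without <<- is a fold with Reduce. The following is only a sketch under stated assumptions: the names step and perceptron_reduce are hypothetical and do not appear in the question, the data are rebuilt from iris as in the question, and nothing here is claimed to be faster than the for loop.

```r
# Sketch only: `step` and `perceptron_reduce` are hypothetical names.
data(iris)
xmat <- unname(as.matrix(iris[1:100, c(1, 3)]))         # sepal and petal length
yvec <- ifelse(iris$Species[1:100] == "setosa", -1, 1)  # labels -1 / 1

# One observation per list element: c(label, features)
obs <- lapply(seq_along(yvec), function(i) c(yvec[i], xmat[i, ]))

# One perceptron update: takes the current state and returns the new one,
# so no assignment outside the function is needed
step <- function(state, ob, eta = 1) {
  y <- ob[1]
  x <- ob[-1]
  z <- sum(state$weight[-1] * x) + state$weight[1]
  ypred <- if (z < 0) -1 else 1
  list(weight = state$weight + eta * (y - ypred) * c(1, x),
       errors = state$errors + as.numeric(y != ypred))
}

perceptron_reduce <- function(obs, niter) {
  state <- list(weight = rep(0, length(obs[[1]])), errors = 0)
  errors <- numeric(niter)
  for (jj in seq_len(niter)) {
    state$errors <- 0                   # restart the error count each epoch
    state <- Reduce(step, obs, state)   # thread the state through all rows
    errors[jj] <- state$errors
  }
  errors
}

err_reduce <- perceptron_reduce(obs, 10)
```

The state is passed in and returned explicitly, which is what lapply cannot do on its own, since each call to its function is independent of the previous one.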

lapply is a function you should use for its lack of side effects. It combines the results into a list automatically and, unlike a for loop, doesn't touch the environment you work in. The same goes for replicate. See also this question:

Is R's apply family more than syntactic sugar?
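As a small illustration of the point about side effects (a sketch with made-up variable names, not code from the question): a for loop leaves its index and temporaries behind in the environment it runs in, while the temporaries of an lapply call exist only inside the anonymous function.

```r
# for loop: the index and any temporaries remain in the calling environment
squares_loop <- vector("list", 3)
for (i in 1:3) {
  tmp_loop <- i^2
  squares_loop[[i]] <- tmp_loop
}
loop_leftovers <- exists("i", inherits = FALSE) &&
  exists("tmp_loop", inherits = FALSE)

# lapply: temporaries live only inside the function; results come back as a list
squares_apply <- lapply(1:3, function(k) {
  tmp_apply <- k^2
  tmp_apply
})
apply_leftovers <- exists("tmp_apply", inherits = FALSE)

print(c(loop = loop_leftovers, lapply = apply_leftovers))
```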

The reason your lapply solution is far slower is that the way you use it creates a lot more overhead.

  • replicate is nothing but sapply internally, so you actually combine sapply and lapply to implement your double loop. sapply creates extra overhead because it has to test whether the result can be simplified. So a for loop will actually be faster than using replicate.
  • Inside your lapply anonymous function, you have to access the data frame for both x and y for every observation. This means that, unlike in your for loop, the function $ (for example) has to be called every time.
  • Because you use these higher-level functions, your lapply solution calls 49 functions, compared with only 26 for your for solution. The extra functions in the lapply solution include match, structure, [[, names, %in%, sys.call, duplicated, ... None of these are needed by your for loop, as it doesn't do any of these checks.
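The data frame access mentioned above can be looked at in isolation. The sketch below is an illustration, not part of the original benchmark: it compares extracting each row from a data frame, as f() does, with extracting it from a matrix, as in the benchmarked perceptron() call. The helper names rows_df and rows_m are made up, and the timings will vary by machine.

```r
data(iris)
df <- iris[1:100, c(1, 3)]   # data frame, as used inside f()
m  <- as.matrix(df)          # matrix, as passed to perceptron() in the benchmark

class(df[1, ])   # a row of a data frame is still a data frame
class(m[1, ])    # a row of a matrix is a plain numeric vector

# Hypothetical helpers: touch every row once
rows_df <- function() for (i in seq_len(nrow(df))) as.numeric(df[i, ])
rows_m  <- function() for (i in seq_len(nrow(m))) m[i, ]

t_df <- system.time(for (k in 1:20) rows_df())
t_m  <- system.time(for (k in 1:20) rows_m())
print(c(data.frame = t_df[["elapsed"]], matrix = t_m[["elapsed"]]))
```

Converting to a matrix once, outside the loop, is one way to avoid paying the data frame subsetting cost on every observation.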

If you want to see where this extra overhead comes from, look at the internal code of replicate, unlist, sapply and simplify2array.
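Typing a function's name at the console prints its source. A minimal sketch of checking the chain programmatically (the variable names are illustrative, and the exact function bodies may differ between R versions):

```r
# Print the definition of replicate to see what it calls
print(replicate)

# Check the call chain: replicate -> sapply -> lapply + simplify2array
replicate_calls_sapply <- any(grepl("sapply", deparse(body(replicate)), fixed = TRUE))
sapply_calls_lapply    <- any(grepl("lapply", deparse(body(sapply)), fixed = TRUE))
sapply_calls_simplify  <- any(grepl("simplify2array", deparse(body(sapply)), fixed = TRUE))

print(c(replicate_calls_sapply, sapply_calls_lapply, sapply_calls_simplify))
```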

You can use the following code to get a better idea of where you lose performance with lapply. Run it line by line!

Rprof(interval = 0.0001)
f()
Rprof(NULL)
fprof <- summaryRprof()$by.self

Rprof(interval = 0.0001)
perceptron(as.matrix(irissubdf[1:2]), irissubdf$y, 1, 10) 
Rprof(NULL)
perprof <- summaryRprof()$by.self

fprof$Fun <- rownames(fprof)
perprof$Fun <- rownames(perprof)

Selftime <- merge(fprof, perprof,
                  all = TRUE,
                  by = 'Fun',
                  suffixes = c(".lapply",".for"))

sum(!is.na(Selftime$self.time.lapply))
sum(!is.na(Selftime$self.time.for))
Selftime[order(Selftime$self.time.lapply, decreasing = TRUE),
         c("Fun","self.time.lapply","self.time.for")]

Selftime[is.na(Selftime$self.time.for),]
