是"* apply"吗?家庭真的没有向量化吗? [英] Is the "*apply" family really not vectorized?

查看:66
本文介绍了是"* apply"吗?家庭真的没有向量化吗?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

因此,我们习惯于向每个R新用户说" apply没有向量化",请查看Patrick Burns R地狱第4圈"(我引用):

So we are used to say to every R new user that "apply isn't vectorized, check out the Patrick Burns R Inferno Circle 4" which says (I quote):

常见的反射是在apply系列中使用功能. 不是 向量化,它是循环隐藏. apply函数有一个for循环 它的定义. lapply函数掩盖了循环,但是执行了 时间通常大致等于显式的for循环.

A common reflex is to use a function in the apply family. This is not vectorization, it is loop-hiding. The apply function has a for loop in its definition. The lapply function buries the loop, but execution times tend to be roughly equal to an explicit for loop.

实际上,快速浏览apply源代码可以揭示该循环:

Indeed, a quick look on the apply source code reveals the loop:

grep("for", capture.output(getAnywhere("apply")), value = TRUE)
## [1] "        for (i in 1L:d2) {"  "    else for (i in 1L:d2) {"

到目前为止,还不错,但是看看lapplyvapply实际上可以看到一张完全不同的图片:

Ok so far, but a look at lapply or vapply actually reveals a completely different picture:

lapply
## function (X, FUN, ...) 
## {
##     FUN <- match.fun(FUN)
##     if (!is.vector(X) || is.object(X)) 
##        X <- as.list(X)
##     .Internal(lapply(X, FUN))
## }
## <bytecode: 0x000000000284b618>
## <environment: namespace:base>

因此,显然没有隐藏的R for循环,而是它们正在调用内部C编写的函数.

So apparently there is no R for loop hiding there, rather they are calling internal C written function.

兔子 <一个href ="https://github.com/wch/r-source/blob/trunk/src/main/apply.c#L34">孔可以看到几乎相同的图片

A quick look in the rabbit hole reveals pretty much the same picture

此外,让我们以colMeans函数为例,该函数从未被指控没有向量化

Moreover, let's take the colMeans function for example, which was never accused in not being vectorised

colMeans
# function (x, na.rm = FALSE, dims = 1L) 
# {
#   if (is.data.frame(x)) 
#     x <- as.matrix(x)
#   if (!is.array(x) || length(dn <- dim(x)) < 2L) 
#     stop("'x' must be an array of at least two dimensions")
#   if (dims < 1L || dims > length(dn) - 1L) 
#     stop("invalid 'dims'")
#   n <- prod(dn[1L:dims])
#   dn <- dn[-(1L:dims)]
#   z <- if (is.complex(x)) 
#     .Internal(colMeans(Re(x), n, prod(dn), na.rm)) + (0+1i) * 
#     .Internal(colMeans(Im(x), n, prod(dn), na.rm))
#   else .Internal(colMeans(x, n, prod(dn), na.rm))
#   if (length(dn) > 1L) {
#     dim(z) <- dn
#     dimnames(z) <- dimnames(x)[-(1L:dims)]
#   }
#   else names(z) <- dimnames(x)[[dims + 1]]
#   z
# }
# <bytecode: 0x0000000008f89d20>
#   <environment: namespace:base>

嗯?它也只调用.Internal(colMeans(...,我们也可以在兔子洞中找到.那么,这与.Internal(lapply(..有何不同?

Huh? It also just calls .Internal(colMeans(... which we can also find in the rabbit hole. So how is this different from .Internal(lapply(..?

实际上,一个快速基准测试表明,对于大数据集,sapply的性能不比colMeans差,并且比for循环好得多

Actually a quick benchmark reveals that sapply performs no worse than colMeans and much better than a for loop for a big data set

m <- as.data.frame(matrix(1:1e7, ncol = 1e5))
system.time(colMeans(m))
# user  system elapsed 
# 1.69    0.03    1.73 
system.time(sapply(m, mean))
# user  system elapsed 
# 1.50    0.03    1.60 
system.time(apply(m, 2, mean))
# user  system elapsed 
# 3.84    0.03    3.90 
system.time(for(i in 1:ncol(m)) mean(m[, i]))
# user  system elapsed 
# 13.78    0.01   13.93 

换句话说,说lapplyvapply 实际上是矢量化的是正确的(与apply相比,这是for循环,也称为lapply)帕特里克·伯恩斯(Patrick Burns)真正想说什么?

In other words, is it correct to say that lapply and vapply are actually vectorised (compared to apply which is a for loop that also calls lapply) and what did Patrick Burns really mean to say?

推荐答案

首先,在您的示例中,对"data.frame"进行测试,这对colMeansapply"[.data.frame"不公平因为他们有开销:

First of all, in your example you make tests on a "data.frame" which is not fair for colMeans, apply and "[.data.frame" since they have an overhead:

system.time(as.matrix(m))  #called by `colMeans` and `apply`
#   user  system elapsed 
#   1.03    0.00    1.05
system.time(for(i in 1:ncol(m)) m[, i])  #in the `for` loop
#   user  system elapsed 
#  12.93    0.01   13.07

在矩阵上,图片有些不同:

On a matrix, the picture is a bit different:

mm = as.matrix(m)
system.time(colMeans(mm))
#   user  system elapsed 
#   0.01    0.00    0.01 
system.time(apply(mm, 2, mean))
#   user  system elapsed 
#   1.48    0.03    1.53 
system.time(for(i in 1:ncol(mm)) mean(mm[, i]))
#   user  system elapsed 
#   1.22    0.00    1.21

关于问题的主要部分,lapply/mapply/etc和直接的R循环之间的主要区别是进行循环的地方.正如Roland指出的那样,C和R循环都需要在每次迭代中评估R函数,这是最昂贵的.真正快速的C函数是可以在C中完成所有功能的函数,所以,我想,这应该是向量化"的含义吗?

Regading the main part of the question, the main difference between lapply/mapply/etc and straightforward R-loops is where the looping is done. As Roland notes, both C and R loops need to evaluate an R function in each iteration which is the most costly. The really fast C functions are those that do everything in C, so, I guess, this should be what "vectorised" is about?

在每个列表"元素中找到均值的示例:

An example where we find the mean in each of a "list"s elements:

((编辑16年11月11日:我相信找到均值"的示例对于迭代评估R函数和编译后的代码之间的差异不是一个很好的设置,(1 ),因为R的均值算法在简单的sum(x) / length(x)上对数字"的特殊性,并且(2)用length(x) >> lengths(x)在列表"上进行测试应该更有意义.因此,移动了均值"示例到最后,然后换成另一个.)

(EDIT May 11 '16 : I believe the example with finding the "mean" is not a good setup for the differences between evaluating an R function iteratively and compiled code, (1) because of the particularity of R's mean algorithm on "numeric"s over a simple sum(x) / length(x) and (2) it should make more sense to test on "list"s with length(x) >> lengths(x). So, the "mean" example is moved to the end and replaced with another.)

作为一个简单的示例,我们可以考虑找到列表"中每个length == 1元素相反的地方:

As a simple example we could consider the finding of the opposite of each length == 1 element of a "list":

tmp.c文件中:

#include <R.h>
#define USE_RINTERNALS 
#include <Rinternals.h>
#include <Rdefines.h>

/* call a C function inside another */
double oppC(double x) { return(ISNAN(x) ? NA_REAL : -x); }
SEXP sapply_oppC(SEXP x)
{
    SEXP ans = PROTECT(allocVector(REALSXP, LENGTH(x)));
    for(int i = 0; i < LENGTH(x); i++) 
        REAL(ans)[i] = oppC(REAL(VECTOR_ELT(x, i))[0]);

    UNPROTECT(1);
    return(ans);
}

/* call an R function inside a C function;
 * will be used with 'f' as a closure and as a builtin */    
SEXP sapply_oppR(SEXP x, SEXP f)
{
    SEXP call = PROTECT(allocVector(LANGSXP, 2));
    SETCAR(call, install(CHAR(STRING_ELT(f, 0))));

    SEXP ans = PROTECT(allocVector(REALSXP, LENGTH(x)));     
    for(int i = 0; i < LENGTH(x); i++) { 
        SETCADR(call, VECTOR_ELT(x, i));
        REAL(ans)[i] = REAL(eval(call, R_GlobalEnv))[0];
    }

    UNPROTECT(2);
    return(ans);
}

在R侧:

system("R CMD SHLIB /home/~/tmp.c")
dyn.load("/home/~/tmp.so")

带有数据:

set.seed(007)
myls = rep_len(as.list(c(NA, runif(3))), 1e7)

#a closure wrapper of `-`
oppR = function(x) -x

for_oppR = compiler::cmpfun(function(x, f)
{
    f = match.fun(f)  
    ans = numeric(length(x))
    for(i in seq_along(x)) ans[[i]] = f(x[[i]])
    return(ans)
})

基准化:

#call a C function iteratively
system.time({ sapplyC =  .Call("sapply_oppC", myls) }) 
#   user  system elapsed 
#  0.048   0.000   0.047 

#evaluate an R closure iteratively
system.time({ sapplyRC =  .Call("sapply_oppR", myls, "oppR") }) 
#   user  system elapsed 
#  3.348   0.000   3.358 

#evaluate an R builtin iteratively
system.time({ sapplyRCprim =  .Call("sapply_oppR", myls, "-") }) 
#   user  system elapsed 
#  0.652   0.000   0.653 

#loop with a R closure
system.time({ forR = for_oppR(myls, "oppR") })
#   user  system elapsed 
#  4.396   0.000   4.409 

#loop with an R builtin
system.time({ forRprim = for_oppR(myls, "-") })
#   user  system elapsed 
#  1.908   0.000   1.913 

#for reference and testing 
system.time({ sapplyR = unlist(lapply(myls, oppR)) })
#   user  system elapsed 
#  7.080   0.068   7.170 
system.time({ sapplyRprim = unlist(lapply(myls, `-`)) }) 
#   user  system elapsed 
#  3.524   0.064   3.598 

all.equal(sapplyR, sapplyRprim)
#[1] TRUE 
all.equal(sapplyR, sapplyC)
#[1] TRUE
all.equal(sapplyR, sapplyRC)
#[1] TRUE
all.equal(sapplyR, sapplyRCprim)
#[1] TRUE
all.equal(sapplyR, forR)
#[1] TRUE
all.equal(sapplyR, forRprim)
#[1] TRUE

(遵循原始均值查找示例):

#all computations in C
all_C = inline::cfunction(sig = c(R_ls = "list"), body = '
    SEXP tmp, ans;
    PROTECT(ans = allocVector(REALSXP, LENGTH(R_ls)));

    double *ptmp, *pans = REAL(ans);

    for(int i = 0; i < LENGTH(R_ls); i++) {
        pans[i] = 0.0;

        PROTECT(tmp = coerceVector(VECTOR_ELT(R_ls, i), REALSXP));
        ptmp = REAL(tmp);

        for(int j = 0; j < LENGTH(tmp); j++) pans[i] += ptmp[j];

        pans[i] /= LENGTH(tmp);

        UNPROTECT(1);
    }

    UNPROTECT(1);
    return(ans);
')

#a very simple `lapply(x, mean)`
C_and_R = inline::cfunction(sig = c(R_ls = "list"), body = '
    SEXP call, ans, ret;

    PROTECT(call = allocList(2));
    SET_TYPEOF(call, LANGSXP);
    SETCAR(call, install("mean"));

    PROTECT(ans = allocVector(VECSXP, LENGTH(R_ls)));
    PROTECT(ret = allocVector(REALSXP, LENGTH(ans)));

    for(int i = 0; i < LENGTH(R_ls); i++) {
        SETCADR(call, VECTOR_ELT(R_ls, i));
        SET_VECTOR_ELT(ans, i, eval(call, R_GlobalEnv));
    }

    double *pret = REAL(ret);
    for(int i = 0; i < LENGTH(ans); i++) pret[i] = REAL(VECTOR_ELT(ans, i))[0];

    UNPROTECT(3);
    return(ret);
')                    

R_lapply = function(x) unlist(lapply(x, mean))                       

R_loop = function(x) 
{
    ans = numeric(length(x))
    for(i in seq_along(x)) ans[i] = mean(x[[i]])
    return(ans)
} 

R_loopcmp = compiler::cmpfun(R_loop)


set.seed(007); myls = replicate(1e4, runif(1e3), simplify = FALSE)
all.equal(all_C(myls), C_and_R(myls))
#[1] TRUE
all.equal(all_C(myls), R_lapply(myls))
#[1] TRUE
all.equal(all_C(myls), R_loop(myls))
#[1] TRUE
all.equal(all_C(myls), R_loopcmp(myls))
#[1] TRUE

microbenchmark::microbenchmark(all_C(myls), 
                               C_and_R(myls), 
                               R_lapply(myls), 
                               R_loop(myls), 
                               R_loopcmp(myls), 
                               times = 15)
#Unit: milliseconds
#            expr       min        lq    median        uq      max neval
#     all_C(myls)  37.29183  38.19107  38.69359  39.58083  41.3861    15
#   C_and_R(myls) 117.21457 123.22044 124.58148 130.85513 169.6822    15
#  R_lapply(myls)  98.48009 103.80717 106.55519 109.54890 116.3150    15
#    R_loop(myls) 122.40367 130.85061 132.61378 138.53664 178.5128    15
# R_loopcmp(myls) 105.63228 111.38340 112.16781 115.68909 128.1976    15

这篇关于是"* apply"吗?家庭真的没有向量化吗?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆