Operations on multi-dimensional arrays in R: apply vs data.table vs plyr (parallel)


Question

In my research work, I normally deal with big 4D arrays (20-200 million elements). I'm trying to improve the computational speed of my calculations, looking for an optimal trade-off between speed and simplicity. I've already made some progress thanks to SO (see here and here).

Now I'm trying to exploit the latest packages, such as data.table and plyr.

Let's start with something like:

D = c(100, 1000, 8) #x,y,t
d = array(rnorm(prod(D)), dim = D)

For each x (first dimension) and y (second dimension), I'd like to get the values of t that are above the 90th percentile. Let's do that with base R:

system.time(
    q1 <- apply(d, c(1, 2), function(x) {
        x >= quantile(x, .9, names = FALSE)
    })
)

On my Macbook this takes about ten seconds, and I get back an array with these dimensions:

> dim(q1)
[1]    8  100 1000
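
The t dimension comes first in this output because apply places the values returned by the function in the leading dimension, followed by the margins. If the original (x, y, t) order is needed, aperm can restore it. A minimal sketch on a small stand-in array (d_small, q_small and q_xyt are illustrative names, not part of the original code):

```r
# Small array so the sketch runs instantly; same layout as d (x, y, t)
set.seed(42)
d_small <- array(rnorm(4 * 5 * 8), dim = c(4, 5, 8))

# Same per-(x, y) test as in the question
q_small <- apply(d_small, c(1, 2), function(v) {
    v >= quantile(v, 0.9, names = FALSE)
})
dim(q_small)   # t is now the first dimension: 8 4 5

# Move t back to the last position so the result lines up with d_small
q_xyt <- aperm(q_small, c(2, 3, 1))
dim(q_xyt)     # 4 5 8
```

After the permutation, q_xyt[x, y, ] holds the same flags as q_small[, x, y].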

(apply strangely changes the order of the dimensions; anyway, I don't care about that for now.) Now I can melt my array (reshape2 package) and load it into a data.table:

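library(reshape2)
library(data.table)
# call reshape2's melt explicitly: data.table also exports a melt generic
d_m = reshape2::melt(d)
colnames(d_m) = c('x', 'y', 't', 'value')
d_t = data.table(d_m)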

Then I do some data.table "magic":

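system.time({
    # := adds the columns by reference, so d_t itself is modified
    # and q2 refers to the same table
    q2 = d_t[, q := quantile(value, .9, names = FALSE), by = "x,y"][, ev := value > q]
})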


Now the computation takes slightly less than 10 seconds. Next I want to try plyr with ddply:

library(plyr)
system.time({
    q3 <- ddply(d_m, .(x, y), summarise, q = quantile(value, .9, names = FALSE))
})

Now it takes 60 seconds. If I move to dplyr, I can do the same calculation in ten seconds again.
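
The dplyr code is not shown in the question. A plausible equivalent of the ddply call, sketched on a small stand-in table, might look like the following. The names d_small and q4 are illustrative, and the .groups argument assumes dplyr 1.0.0 or later:

```r
library(dplyr)

# Small long-format table standing in for d_m (columns x, y, t, value)
set.seed(1)
d_small <- expand.grid(x = 1:4, y = 1:5, t = 1:8)
d_small$value <- rnorm(nrow(d_small))

# One 90th-percentile value per (x, y) pair, mirroring the ddply summarise
q4 <- d_small %>%
    group_by(x, y) %>%
    summarise(q = quantile(value, 0.9, names = FALSE), .groups = "drop")
```

Replacing summarise with mutate(ev = value > quantile(value, 0.9, names = FALSE)) would instead keep every row and add the per-row flag, as in the data.table version.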

However, my question is the following: how would you do the same calculation faster? If I consider a larger matrix (say, 20 times bigger), data.table is faster than the apply function, but the timings remain in the same order of magnitude (14 minutes vs 10 minutes). Any comment is really appreciated...

EDIT

I've implemented the quantile function in C++ using Rcpp, which made the computation eight times faster.

Recommended answer

As suggested by @roland, one possible way to speed up the code was to implement a faster version of the quantile function. I spent an hour learning how to do that with Rcpp, and the running time dropped by a factor of eight. I've implemented the type 7 version of the quantile algorithm (the default choice). We are still far from MATLAB performance (discussed here), but in my case this is an impressive step forward. I am not proud of the Rcpp code I have written so far, and I haven't had time to polish it. Anyway, it works (I checked the results against the R function), so if you are interested you can download it from here.
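
The Rcpp code itself is not reproduced here. As a rough illustration of the type 7 rule it implements, the following plain R sketch interpolates linearly between the two order statistics around position 1 + (n - 1) * p. The function name quantile7 is hypothetical, and the sketch assumes a single probability p in [0, 1] and a non-empty input without NA values; it is not the author's implementation:

```r
# Type 7 quantile for a single probability p (sketch, not the author's code)
quantile7 <- function(x, p) {
    n  <- length(x)
    h  <- (n - 1) * p + 1        # fractional 1-based position in the sorted data
    lo <- floor(h)
    hi <- ceiling(h)
    # Partial sort: only positions lo and hi need to be in their final place
    xs <- sort(x, partial = unique(c(lo, hi)))
    xs[lo] + (h - lo) * (xs[hi] - xs[lo])
}

# Compare against the built-in on one t-vector of length 8
set.seed(123)
v <- rnorm(8)
all.equal(quantile7(v, 0.9), quantile(v, 0.9, names = FALSE))
```

Skipping the argument checks and general handling that quantile performs is one reason a specialised version can be faster when it is called once per (x, y) cell.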
