Faster i, j matrix cell fill


Problem description

I want to take the columns of a data.frame/matrix and apply a function to every pair of columns, storing the result in cell [i, j] of a matrix, where i and j run along the columns of the data.frame. Basically, I want to fill a matrix cell by cell in the same way that the cor function works with a data.frame.

This is a related question: "Create a matrix from a function and two numeric data frames". However, I use this in randomization tests and repeat the operation many times (making many matrices), so I'm looking for the fastest way to do it. I have sped things up a bit using parallel processing, but I'm still not happy with the speed. Nor can it be assumed that the output matrix is symmetric, the way the matrix produced by cor is (my example reflects this).

I saw on the data.table web page today (http://datatable.r-forge.r-project.org/) the following:

500+ times faster than DF[i,j]<-value

This got me thinking that perhaps data.table, dplyr or some other means might speed things up a bit. My brain has been fixed on filling cells, but maybe there's a better way that involves reshaping, applying the function, and reshaping back to a matrix, or something along those lines. I can achieve this in base R using outer or a for loop, as follows.

## Arbitrary function
FUN <- function(x, y) round(sqrt(sum(x)) - sum(y), digits=1)

## outer approach
outer(
  names(mtcars), 
  names(mtcars), 
  Vectorize(function(i,j) FUN(mtcars[,i],mtcars[,j]))
)

## for approach
mat <- matrix(rep(NA, ncol(mtcars)^2), ncol(mtcars))
for (i in 1:ncol(mtcars)) {
    for (j in 1:ncol(mtcars)) {
        mat[i, j] <- FUN(mtcars[, i], mtcars[, j])
    }
}
mat

Here are the microbenchmark timings, with the for loop getting a slight edge.

Unit: milliseconds
    expr      min       lq   median       uq      max neval
 OUTER() 4.450410 4.691124 4.774394 4.877724 55.77333  1000
   FOR() 4.309527 4.521785 4.588728 4.694156  7.04275  1000

What is the fastest approach to this in R (add-on packages welcome)?

Solution

Still sticking to a base R solution, I got a 1.6-1.7x speedup in the for-based approach by:

  • replacing [, i] with [[i]] (significant time impact; perhaps FUN just receives C pointers here instead of freshly allocated vectors);
  • byte-code compiling FUN (small time impact);
  • wrapping the for loop in a function and byte-code compiling that function (small time impact).
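
The first point can be checked with a small sketch like the one below. It assumes only base R and the built-in mtcars data; the variable names and the repetition count are made up for illustration, and the timings it produces will differ from machine to machine.

```r
# Both forms return the same numeric vector for a data.frame column.
x_bracket <- mtcars[, 1]   # matrix-style indexing via `[.data.frame`
x_double  <- mtcars[[1]]   # list-style extraction of a single column
same_column <- identical(x_bracket, x_double)

# Rough timing with base R only; n_reps is an arbitrary choice.
n_reps <- 10000
t_bracket <- system.time(for (k in seq_len(n_reps)) mtcars[, 1])[["elapsed"]]
t_double  <- system.time(for (k in seq_len(n_reps)) mtcars[[1]])[["elapsed"]]

print(same_column)
print(c(bracket = t_bracket, double_bracket = t_double))
```

Since both calls yield identical data, the switch changes only how the column is extracted, not what FUN receives.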

BTW, swapping the indices (i, j) -> (j, i) in the two loops didn't result in significant differences (theoretically, column-wise matrix access should be faster, since R stores matrices in column-major order).
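
For reference, a minimal sketch of the two loop orders, assuming the FUN from the question. With j in the outer loop the matrix is filled one column at a time, which matches R's column-major storage; the names mat_ij and mat_ji are illustrative only.

```r
# FUN as defined in the question.
FUN <- function(x, y) round(sqrt(sum(x)) - sum(y), digits = 1)
n <- ncol(mtcars)

# Original order: outer loop over i (rows), inner loop over j (columns).
mat_ij <- matrix(NA_real_, n, n)
for (i in seq_len(n)) {
    for (j in seq_len(n)) {
        mat_ij[i, j] <- FUN(mtcars[[i]], mtcars[[j]])
    }
}

# Swapped order: outer loop over j, inner loop over i, so each column
# of the result is filled from top to bottom before moving on.
mat_ji <- matrix(NA_real_, n, n)
for (j in seq_len(n)) {
    for (i in seq_len(n)) {
        mat_ji[i, j] <- FUN(mtcars[[i]], mtcars[[j]])
    }
}

# Same cells, same values; only the visiting order differs.
same_result <- identical(mat_ij, mat_ji)
print(same_result)
```

On a matrix this small (11 x 11 for mtcars) the memory access pattern is unlikely to matter, which is consistent with the lack of a measurable difference.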

Code:

library(compiler)

## Byte-compiled version of FUN
FUN2 <- cmpfun(FUN)

## Loop wrapped in a byte-compiled function, using [[i]] instead of [, i]
for2 <- cmpfun(function(mtcars, FUN) {
    mat <- matrix(rep(NA, ncol(mtcars)^2), ncol(mtcars))
    for (i in 1:ncol(mtcars)) {
        for (j in 1:ncol(mtcars)) {
            mat[i, j] <- FUN(mtcars[[i]], mtcars[[j]])
        }
    }
    mat
})
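
A usage sketch, repeating the definitions so it runs on its own. The argument names df and f and the reference computation via outer are illustrative choices, not part of the answer's code.

```r
library(compiler)

# FUN as defined in the question, plus its byte-compiled version.
FUN  <- function(x, y) round(sqrt(sum(x)) - sum(y), digits = 1)
FUN2 <- cmpfun(FUN)

# Same structure as for2 above, with neutral argument names.
for2 <- cmpfun(function(df, f) {
    n <- ncol(df)
    mat <- matrix(NA_real_, n, n)
    for (i in seq_len(n)) {
        for (j in seq_len(n)) {
            mat[i, j] <- f(df[[i]], df[[j]])
        }
    }
    mat
})

res <- for2(mtcars, FUN2)

# Reference result from the outer-based approach, for comparison.
ref <- outer(
    seq_along(mtcars),
    seq_along(mtcars),
    Vectorize(function(i, j) FUN(mtcars[[i]], mtcars[[j]]))
)

matches_outer <- isTRUE(all.equal(res, ref))
print(matches_outer)
```

Note that R 3.4.0 and later byte-compile functions just-in-time by default, so on a current installation the explicit cmpfun calls may contribute little beyond what the [[i]] change already gives.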

Benchmarks:

 Unit: milliseconds
                min       lq   median       uq      max neval
 outer     7.791739 7.991474 8.245869 8.538163 16.24460   100
 for       8.143679 8.463249 8.588230 9.912008 16.30842   100
 for-mods  4.713837 4.875972 5.006202 5.246584 15.66491   100

In my opinion, it will be difficult to find a much faster approach (but I may be wrong). The overhead of the for loop itself is quite small (ca. 0.25 ms) compared to the time needed to evaluate FUN for every cell.
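
One rough way to see how the time splits between the loop and FUN is to run the same loop with a function that does no work. This is only a sketch using base R's system.time; the helper names fill and noop and the repetition count are hypothetical, and the numbers depend on the machine.

```r
# FUN as defined in the question.
FUN <- function(x, y) round(sqrt(sum(x)) - sum(y), digits = 1)

# Hypothetical helper: fills an n x n matrix by calling f on each column pair.
fill <- function(df, f) {
    n <- ncol(df)
    mat <- matrix(NA_real_, n, n)
    for (i in seq_len(n)) {
        for (j in seq_len(n)) {
            mat[i, j] <- f(df[[i]], df[[j]])
        }
    }
    mat
}

# A function that ignores its inputs, so the loop cost dominates.
noop <- function(x, y) 0

n_reps <- 200  # arbitrary repetition count
t_noop <- system.time(for (k in seq_len(n_reps)) fill(mtcars, noop))[["elapsed"]]
t_fun  <- system.time(for (k in seq_len(n_reps)) fill(mtcars, FUN))[["elapsed"]]

# The gap between the two gives a rough idea of the time spent inside FUN.
print(c(loop_only = t_noop, with_FUN = t_fun))
```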
