Faster i, j matrix cell fill
Problem description
I want to take the columns of a data.frame/matrix and apply a function to each pair of columns, storing the result in cell [i, j] of a matrix, where i and j run along the columns of the data.frame. Basically, I want to fill a matrix of individual cells in the same way that the cor function works with a data.frame.

This is a related question: Create a matrix from a function and two numeric data frames. However, I use this in randomization tests and repeat the operation many times (making many matrices), so I'm looking for the fastest way to do it. I have sped things up a bit using parallel processing, but I'm still not happy with the speed. It cannot be assumed that the output matrix is symmetric either, that is, in the way cor produces a symmetric matrix (my example will reflect this).

I saw the following on the data.table web page today (http://datatable.r-forge.r-project.org/):

500+ times faster than DF[i,j]<-value

This got me thinking that perhaps data.table or dplyr or other means may speed things up a bit. My brain has been fixed on filling cells, but maybe there's a better way involving reshaping, applying the function, and reshaping to a matrix, or something along those lines. I can achieve this in base R using outer or a for loop as follows.

## Arbitrary function
FUN <- function(x, y) round(sqrt(sum(x)) - sum(y), digits=1)

## outer approach
outer(
  names(mtcars),
  names(mtcars),
  Vectorize(function(i,j) FUN(mtcars[,i],mtcars[,j]))
)

## for approach
mat <- matrix(rep(NA, ncol(mtcars)^2), ncol(mtcars))
for (i in 1:ncol(mtcars)) {
  for (j in 1:ncol(mtcars)) {
    mat[i, j] <- FUN(mtcars[, i], mtcars[, j])
  }
}
mat
Here are the microbenchmark timings, with for getting a slight edge.

Unit: milliseconds
    expr      min       lq   median       uq      max neval
 OUTER() 4.450410 4.691124 4.774394 4.877724 55.77333  1000
   FOR() 4.309527 4.521785 4.588728 4.694156  7.04275  1000
What is the fastest approach to this in R (add on packages welcomed)?
Solution

Still sticking to a base R solution, I got a 1.6-1.7x speedup in the for-based approach by:

- replacing [,i] with [[i]] (significant time impact; perhaps FUN just receives C pointers here instead of freshly allocated vectors);
- byte-code compiling FUN (small time impact);
- wrapping the for code in a function and byte-code compiling it (small time impact).

BTW, swapping indices (i,j) -> (j,i) in the 2 loops didn't result in significant differences (theoretically, walking down each column in the inner loop should be faster, since R stores matrices in column-major order).
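To make the first point concrete, here is a minimal sketch (not part of the original answer) that checks that the two extraction forms return the same values and then times them roughly with base R's system.time. The repetition count n_reps is an arbitrary illustrative choice, and the timings will vary by machine and R version, so no particular numbers should be expected.

```r
# Both forms extract the same column values from a data.frame.
col_bracket <- mtcars[, 1]   # matrix-style indexing via `[.data.frame`
col_double  <- mtcars[[1]]   # list-style extraction of a single column

same_values <- identical(col_bracket, col_double)
same_values

# Rough timing of many repeated extractions (illustrative only).
n_reps <- 10000L  # assumed repetition count, chosen arbitrarily
t_bracket <- system.time(for (k in seq_len(n_reps)) mtcars[, 1])[["elapsed"]]
t_double  <- system.time(for (k in seq_len(n_reps)) mtcars[[1]])[["elapsed"]]
c(bracket = t_bracket, double = t_double)
```

Because both forms yield identical vectors, the swap changes only how the column is fetched, not what FUN computes.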
Code:
library(compiler)
FUN2 <- cmpfun(FUN)
for2 <- cmpfun(function(mtcars, FUN) {
  mat <- matrix(rep(NA, ncol(mtcars)^2), ncol(mtcars))
  for (i in 1:ncol(mtcars)) {
    for (j in 1:ncol(mtcars)) {
      mat[i, j] <- FUN(mtcars[[i]], mtcars[[j]])
    }
  }
  mat
})
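As a sanity check before trusting any timings, the following sketch repeats the definitions from the question and the answer so that it runs on its own, then confirms that the compiled loop produces the same matrix as the outer approach. The names res_outer and res_for2 are introduced here for illustration only.

```r
library(compiler)  # ships with base R

# Definitions repeated from the question and answer so the block is self-contained.
FUN  <- function(x, y) round(sqrt(sum(x)) - sum(y), digits = 1)
FUN2 <- cmpfun(FUN)
for2 <- cmpfun(function(mtcars, FUN) {
  mat <- matrix(rep(NA, ncol(mtcars)^2), ncol(mtcars))
  for (i in 1:ncol(mtcars)) {
    for (j in 1:ncol(mtcars)) {
      mat[i, j] <- FUN(mtcars[[i]], mtcars[[j]])
    }
  }
  mat
})

# Reference result from the outer approach in the question.
res_outer <- outer(
  names(mtcars),
  names(mtcars),
  Vectorize(function(i, j) FUN(mtcars[, i], mtcars[, j]))
)

# Result from the modified for approach.
res_for2 <- for2(mtcars, FUN2)

# Compare values only, ignoring any attribute differences.
all.equal(res_outer, res_for2, check.attributes = FALSE)

# The result is not symmetric, as the question points out.
isSymmetric(res_for2)
```

Comparing with check.attributes = FALSE keeps the check focused on the computed values, which is what matters for the randomization tests described in the question.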
Benchmarks:
Unit: milliseconds
     expr      min       lq   median       uq      max neval
    outer 7.791739 7.991474 8.245869 8.538163 16.24460   100
      for 8.143679 8.463249 8.588230 9.912008 16.30842   100
 for-mods 4.713837 4.875972 5.006202 5.246584 15.66491   100
In my opinion, it will be difficult to find a much faster approach (but I may be wrong). The overhead of the for loop itself is quite small (ca. 0.25 ms) compared to the time needed to compute FUN multiple times.
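One rough way to see how the time splits between the loop and FUN is to run the same loop with a stand-in function that does almost no work. The sketch below is an illustration under stated assumptions, not the answerer's measurement: fill and NOOP are hypothetical names, n_reps is arbitrary, and the "loop only" figure still includes column extraction and function-call overhead.

```r
FUN  <- function(x, y) round(sqrt(sum(x)) - sum(y), digits = 1)
NOOP <- function(x, y) 0  # hypothetical stand-in that does almost no work

# Hypothetical helper: the same double loop, parameterised by the function applied.
fill <- function(df, f) {
  n <- ncol(df)
  mat <- matrix(NA_real_, n, n)
  for (i in seq_len(n)) {
    for (j in seq_len(n)) {
      mat[i, j] <- f(df[[i]], df[[j]])
    }
  }
  mat
}

n_reps <- 200L  # assumed repetition count, chosen arbitrarily
t_full <- system.time(for (k in seq_len(n_reps)) fill(mtcars, FUN))[["elapsed"]]
t_loop <- system.time(for (k in seq_len(n_reps)) fill(mtcars, NOOP))[["elapsed"]]

# Approximate milliseconds per matrix fill; values depend on the machine.
c(with_FUN = t_full, loop_only = t_loop) / n_reps * 1000
```

If the loop-only figure comes out as a small fraction of the total, further gains would have to come from making FUN itself cheaper rather than from restructuring the loop.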