Faster i, j matrix cell fill
Question
I want to take the columns of a data.frame/matrix and apply a function to every pair of columns, storing the result in cell [i, j] of an output matrix, where i and j run along the columns of the data.frame. Basically, I want to fill a matrix cell by cell in the same way that the cor function works with a data.frame.
This is a related question: Create a matrix from a function and two numeric data frames. However, I use this in randomization tests and repeat the operation many times (making many matrices), so I'm looking for the fastest way to do it. I have sped things up a bit using parallel processing, but I'm still not happy with the speed. It also cannot be assumed that the output matrix is symmetric in the way that the matrix produced by cor is (my example will reflect this).
I saw the following on the data.table web page today (http://datatable.r-forge.r-project.org/):
... than DF[i,j]<-value
This got me thinking that perhaps data.table, dplyr, or some other means may speed things up a bit. My brain has been fixed on filling cells, but maybe there's a better way that involves reshaping the data, applying the function, and reshaping the result back into a matrix, or something along those lines. I can achieve this in base R using outer or a for loop as follows.
## Arbitrary function
FUN <- function(x, y) round(sqrt(sum(x)) - sum(y), digits=1)

## outer approach
outer(
    names(mtcars),
    names(mtcars),
    Vectorize(function(i,j) FUN(mtcars[,i],mtcars[,j]))
)

## for approach
mat <- matrix(rep(NA, ncol(mtcars)^2), ncol(mtcars))
for (i in 1:ncol(mtcars)) {
    for (j in 1:ncol(mtcars)) {
        mat[i, j] <- FUN(mtcars[, i], mtcars[, j])
    }
}
mat
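As a quick check of the asymmetry mentioned above, the sketch below evaluates FUN on one pair of columns in both orders. The names a and b are only illustrative; the point is that cell [i, j] and cell [j, i] generally differ, so neither triangle of the matrix can be skipped.

```r
# FUN as defined in the question
FUN <- function(x, y) round(sqrt(sum(x)) - sum(y), digits = 1)

a <- FUN(mtcars$mpg, mtcars$cyl)  # value for cell [mpg, cyl]
b <- FUN(mtcars$cyl, mtcars$mpg)  # value for cell [cyl, mpg]

a == b  # FALSE for these two columns: argument order matters
```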
Here are the microbenchmark timings, with the for loop getting a slight edge.
Unit: milliseconds
    expr      min       lq   median       uq      max neval
 OUTER() 4.450410 4.691124 4.774394 4.877724 55.77333  1000
   FOR() 4.309527 4.521785 4.588728 4.694156  7.04275  1000
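The question does not show how OUTER() and FOR() were defined, so the wrappers below are an assumption about what was timed: each one simply wraps the corresponding approach in a function. The timing call is skipped when the microbenchmark add-on package is not installed.

```r
FUN <- function(x, y) round(sqrt(sum(x)) - sum(y), digits = 1)

# Assumed wrapper around the outer approach
OUTER <- function() {
  outer(
    names(mtcars),
    names(mtcars),
    Vectorize(function(i, j) FUN(mtcars[, i], mtcars[, j]))
  )
}

# Assumed wrapper around the for approach
FOR <- function() {
  n <- ncol(mtcars)
  mat <- matrix(NA_real_, n, n)
  for (i in seq_len(n)) {
    for (j in seq_len(n)) {
      mat[i, j] <- FUN(mtcars[, i], mtcars[, j])
    }
  }
  mat
}

# Check that both approaches fill the same cells before comparing speed
same_result <- isTRUE(all.equal(OUTER(), FOR()))

# Time both wrappers only if microbenchmark is available
if (requireNamespace("microbenchmark", quietly = TRUE)) {
  print(microbenchmark::microbenchmark(OUTER(), FOR(), times = 100L))
}
```

Absolute timings will differ from the table above depending on the machine and R version.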
What is the fastest approach to this in R (add-on packages welcome)?
Recommended answer
Still sticking to a base R solution, I got a 1.6-1.7x speedup in the for-based approach by:
- replacing [,i] with [[i]] (significant time impact; perhaps FUN just receives C pointers here instead of freshly allocated vectors);
- byte-code compiling FUN (small time impact);
- wrapping the for code in a function + byte-code compilation (small time impact).
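The first item in the list above can be checked with a rough sketch like the following, which confirms that both extraction styles return the same vector and then times repeated FUN calls with each. The variable names and the repetition count are arbitrary choices for illustration, and system.time is much coarser than microbenchmark.

```r
FUN <- function(x, y) round(sqrt(sum(x)) - sum(y), digits = 1)

# Both forms return the same plain numeric vector for a data.frame column
same_column <- identical(mtcars[, 1], mtcars[[1]])

reps <- 20000L  # arbitrary repetition count

# Elapsed seconds with single-bracket (matrix-style) extraction
t_single <- system.time(
  for (k in seq_len(reps)) FUN(mtcars[, 1], mtcars[, 2])
)[["elapsed"]]

# Elapsed seconds with double-bracket (list-style) extraction
t_double <- system.time(
  for (k in seq_len(reps)) FUN(mtcars[[1]], mtcars[[2]])
)[["elapsed"]]

c(single_bracket = t_single, double_bracket = t_double)
```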
BTW, swapping the indices (i,j) -> (j,i) in the two loops didn't result in significant differences (in theory, since R stores matrices in column-major order, filling the matrix column by column should be faster).
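The two loop orders can be compared with a sketch like this one. The helper fill_matrix and its column_first argument are hypothetical names, not part of the original answer; the helper fills the same cells in either order, so only the visiting order changes.

```r
FUN <- function(x, y) round(sqrt(sum(x)) - sum(y), digits = 1)

# Hypothetical helper: fill the result matrix with either loop nesting
fill_matrix <- function(df, FUN, column_first = FALSE) {
  n <- ncol(df)
  mat <- matrix(NA_real_, n, n)
  if (column_first) {
    # j in the outer loop: cells are visited down each column
    for (j in seq_len(n)) {
      for (i in seq_len(n)) mat[i, j] <- FUN(df[[i]], df[[j]])
    }
  } else {
    # i in the outer loop: cells are visited along each row
    for (i in seq_len(n)) {
      for (j in seq_len(n)) mat[i, j] <- FUN(df[[i]], df[[j]])
    }
  }
  mat
}

row_first <- fill_matrix(mtcars, FUN)
col_first <- fill_matrix(mtcars, FUN, column_first = TRUE)
identical(row_first, col_first)  # same matrix, different fill order
```

Timing both variants on a matrix this small is unlikely to show a clear difference, which matches the observation above.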
Code:
library(compiler)

# Byte-compiled version of FUN
FUN2 <- cmpfun(FUN)

# for loop wrapped in a byte-compiled function; columns are extracted with [[i]]
for2 <- cmpfun(function(mtcars, FUN) {
    mat <- matrix(rep(NA, ncol(mtcars)^2), ncol(mtcars))
    for (i in 1:ncol(mtcars)) {
        for (j in 1:ncol(mtcars)) {
            mat[i, j] <- FUN(mtcars[[i]], mtcars[[j]])
        }
    }
    mat
})
Benchmark:
Unit: milliseconds
     expr      min       lq   median       uq      max neval
    outer 7.791739 7.991474 8.245869 8.538163 16.24460   100
      for 8.143679 8.463249 8.588230 9.912008 16.30842   100
 for-mods 4.713837 4.875972 5.006202 5.246584 15.66491   100
In my opinion, it will be difficult to find a much faster approach (but I may be wrong). The overhead of the for loop itself is quite small (ca. 0.25 ms) compared with the time needed to evaluate FUN multiple times.
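One rough way to separate loop overhead from the cost of FUN is to run the same loop with a stand-in that does no real work. This is only a sketch: NOOP, fill, and the repetition count are hypothetical names and values, and the numbers it prints depend on the machine.

```r
FUN <- function(x, y) round(sqrt(sum(x)) - sum(y), digits = 1)

# Hypothetical stand-in: forces both arguments (so the columns are still
# extracted) but performs no computation on them
NOOP <- function(x, y) {
  force(x)
  force(y)
  0
}

# Same double loop as in the answer, parameterised by the function to call
fill <- function(df, f) {
  n <- ncol(df)
  mat <- matrix(NA_real_, n, n)
  for (i in seq_len(n)) {
    for (j in seq_len(n)) mat[i, j] <- f(df[[i]], df[[j]])
  }
  mat
}

reps <- 200L  # arbitrary number of matrices to build

t_noop <- system.time(for (r in seq_len(reps)) fill(mtcars, NOOP))[["elapsed"]]
t_fun  <- system.time(for (r in seq_len(reps)) fill(mtcars, FUN))[["elapsed"]]

# Approximate milliseconds per matrix: loop overhead alone vs. loop plus FUN
c(overhead_ms = 1000 * t_noop / reps, total_ms = 1000 * t_fun / reps)
```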