How to optimize Read and Write to subsections of a matrix in R (possibly using data.table)


Problem Description



TL;DR

What is the fastest method in R for reading and writing a subset of columns from a very large matrix? I attempted a solution with data.table but need a fast way to extract a sequence of columns.

Short Answer: The expensive part of the operation is assignment. Thus the solution is to stick with a matrix and use Rcpp and C++ to modify the matrix in place. There are two excellent answers below with examples. (For those applying this to other problems, be sure to read the disclaimers in the solutions!) Scroll to the bottom of the question for some more lessons learned.


This is my first Stack Overflow question. I greatly appreciate your time in taking a look, and I apologize if I've left anything out. I'm working on an R package where I have a performance bottleneck from subsetting and writing to portions of a matrix (NB for statisticians: the application is updating sufficient statistics after processing each data point). The individual operations are incredibly fast, but the sheer number of them requires each one to be as fast as possible. The simplest version of the idea is a matrix of dimension K by V, where K is generally between 5 and 1000 and V can be between 1000 and 1,000,000.

set.seed(94253)
K <- 100
V <- 100000
mat <-  matrix(runif(K*V),nrow=K,ncol=V)

We then end up performing a calculation on a subset of the columns and adding the result into the full matrix. Naively, it looks like this:

Vsub <- sample(1:V, 20)
toinsert <- matrix(runif(K*length(Vsub)), nrow=K, ncol=length(Vsub))
mat[,Vsub] <- mat[,Vsub] + toinsert
library(microbenchmark)
microbenchmark(mat[,Vsub] <- mat[,Vsub] + toinsert)

Because this is done so many times, it can be quite slow as a result of R's copy-on-change semantics (but see the lessons learned below: modification can actually happen in place in some circumstances).

For my problem the object need not be a matrix (and I'm sensitive to the difference, as outlined in the question "Assign a matrix to a subset of a data.table"). I always want the full column, so the list structure of a data frame is fine. My solution was to use Matthew Dowle's awesome data.table package. The write can be done extraordinarily quickly using set(). Unfortunately, getting the value is somewhat more complicated: we have to select the columns with with=FALSE, which dramatically slows things down.

library(data.table)
DT <- as.data.table(mat)  
set(DT, i=NULL, j=Vsub,DT[,Vsub,with=FALSE] + as.numeric(toinsert))

Within the set() function, using i=NULL to reference all rows is incredibly fast, but (presumably due to the way things are stored under the hood) there is no comparable option for j. @Roland notes in the comments that one option would be to convert to a triple representation (row number, col number, value) and use data.table's binary search to speed up retrieval. I tested this manually, and while it is quick, it approximately triples the memory requirements for the matrix. I would like to avoid this if possible.

Following the question "Time in getting single elements from data.table and data.frame objects", Hadley Wickham gave an incredibly fast solution for a single index:
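For readers unfamiliar with the triple representation, here is a minimal sketch of what it could look like. It is not the code that was actually tested for the question: the table name `long`, the small dimensions, and the use of `which = TRUE` followed by `set()` are assumptions made for illustration.

```r
library(data.table)

set.seed(94253)
K <- 5    # small sizes, chosen only to keep the illustration quick
V <- 50
mat <- matrix(runif(K * V), nrow = K, ncol = V)
Vsub <- sample(1:V, 4)
toinsert <- matrix(runif(K * length(Vsub)), nrow = K, ncol = length(Vsub))

# Triple ("long") form: one row per matrix cell, stored column by column.
long <- data.table(row   = rep(seq_len(K), times = V),
                   col   = rep(seq_len(V), each = K),
                   value = as.vector(mat))
setkey(long, col, row)  # keyed on col first, so lookups by column use binary search

# Row numbers of `long` matching the selected columns, in the order of Vsub.
idx <- long[.(Vsub), which = TRUE]

# Read the selected columns back as a K x length(Vsub) matrix.
read_back <- matrix(long$value[idx], nrow = K)

# Add toinsert to those cells by reference.
set(long, i = idx, j = "value", value = long$value[idx] + as.vector(toinsert))
```

The extra `row` and `col` columns are where the additional memory goes, which is the cost described above.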

Vone <- Vsub[1]
toinsert.one <- toinsert[,1]
set(DT, i=NULL, j=Vone,(.subset2(DT, Vone) + toinsert.one))

However, since .subset2(DT,i) is just DT[[i]] without the method dispatch, there is no way (to my knowledge) to grab several columns at once, although it certainly seems like it should be possible. As in the previous question, it seems that since we can overwrite the values quickly, we should be able to read them quickly.

Any suggestions? Also, please let me know if there is a better solution than data.table for this problem. I realize it's not really the intended use case in many respects, but I'm trying to avoid porting the whole series of operations to C.

Here is a sequence of timings for the approaches discussed: the first two operate on all of the selected columns, the last two on just one column.
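One way to stay within the single-column fast path is to loop over the selected columns and call set() once per column. The following is an untimed sketch with small, assumed dimensions; it has not been benchmarked against the matrix version, so it may or may not be faster in practice.

```r
library(data.table)

set.seed(94253)
K <- 5    # small sizes, chosen only to keep the illustration quick
V <- 50
mat <- matrix(runif(K * V), nrow = K, ncol = V)
DT <- as.data.table(mat)
Vsub <- sample(1:V, 4)
toinsert <- matrix(runif(K * length(Vsub)), nrow = K, ncol = length(Vsub))

# Update one column at a time: read with .subset2(), write by reference with set().
for (k in seq_along(Vsub)) {
  j <- Vsub[k]
  set(DT, i = NULL, j = j, value = .subset2(DT, j) + toinsert[, k])
}
```

Each iteration still pays for an R-level function call, so the loop overhead grows with the number of selected columns.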

microbenchmark(mat[,Vsub] <- mat[,Vsub] + toinsert,
              set(DT, i=NULL, j=Vsub,DT[,Vsub,with=FALSE] + as.numeric(toinsert)),
              mat[,Vone] <- mat[,Vone] + toinsert.one,
              set(DT, i=NULL, j=Vone,(.subset2(DT, Vone) + toinsert.one)),
              times=1000L)

Unit: microseconds
                  expr      min       lq   median       uq       max neval
                Matrix   51.970   53.895   61.754   77.313   135.698  1000
            Data.Table 4751.982 4962.426 5087.376 5256.597 23710.826  1000
     Matrix Single Col    8.021    9.304   10.427   19.570 55303.659  1000
 Data.Table Single Col    6.737    7.700    9.304   11.549    89.824  1000


Answer and Lessons Learned:

Comments identified the most expensive part of the operation as the assignment process. Both solutions give answers based on C code that modifies the matrix in place. This breaks the R convention of not modifying the arguments to a function, but provides a much faster result.

Hadley Wickham stopped by in the comments to note that the matrix modification is actually done in place as long as the object mat is not referenced elsewhere (see http://adv-r.had.co.nz/memory.html#modification-in-place). This raises an interesting and subtle point. I was performing my evaluations in RStudio. RStudio, as Hadley notes in his book, creates an additional reference for every object not within a function. Thus, while in the performance-critical case inside a function the modification would happen in place, at the command line it was producing a copy-on-change effect. Hadley's package pryr has some nice functions for tracking references and memory addresses.
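To make the distinction concrete, here is a minimal sketch of the "inside a function" case. The helper name `update_columns` and the small dimensions are hypothetical. The first assignment to `m` has to copy once, because `m` is still shared with the caller's `mat`; after that, `m` is referenced only locally, which is the situation in which R may modify it in place. Whether copies are actually avoided depends on the R version and the front end, and base R's `tracemem()` is one way to check.

```r
set.seed(94253)
K <- 5    # small sizes, chosen only to keep the illustration quick
V <- 50
mat <- matrix(runif(K * V), nrow = K, ncol = V)
Vsub <- sample(1:V, 4)
toinsert <- matrix(runif(K * length(Vsub)), nrow = K, ncol = length(Vsub))

# Hypothetical helper: keeps the whole update loop inside one function call,
# so the matrix being updated is a local object.
update_columns <- function(m, cols, delta, n_iter) {
  for (iter in seq_len(n_iter)) {
    m[, cols] <- m[, cols] + delta
  }
  m
}

result <- update_columns(mat, Vsub, toinsert, n_iter = 10)
```

Unlike the C++ solutions, this keeps R's usual semantics: the caller's `mat` is left untouched and the updated matrix is returned.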

Solution

Fun with Rcpp:

You can use Eigen's Map class to modify an R object in place.

library(RcppEigen)
library(inline)

incl <- '
using  Eigen::Map;
using  Eigen::MatrixXd;
using  Eigen::VectorXi;
typedef  Map<MatrixXd>  MapMatd;
typedef  Map<VectorXi>  MapVeci;
'

body <- '
MapMatd              A(as<MapMatd>(AA));
const MapMatd        B(as<MapMatd>(BB));
const MapVeci        ix(as<MapVeci>(ind));
const int            mB(B.cols());
for (int i = 0; i < mB; ++i) 
{
A.col(ix.coeff(i)-1) += B.col(i);
}
'

funRcpp <- cxxfunction(signature(AA = "matrix", BB ="matrix", ind = "integer"), 
                       body, "RcppEigen", incl)

set.seed(94253)
K <- 100
V <- 100000
mat2 <-  mat <-  matrix(runif(K*V),nrow=K,ncol=V)

Vsub <- sample(1:V, 20)
toinsert <- matrix(runif(K*length(Vsub)), nrow=K, ncol=length(Vsub))
mat[,Vsub] <- mat[,Vsub] + toinsert

invisible(funRcpp(mat2, toinsert, Vsub))
all.equal(mat, mat2)
#[1] TRUE

library(microbenchmark)
microbenchmark(mat[,Vsub] <- mat[,Vsub] + toinsert,
               funRcpp(mat2, toinsert, Vsub))
# Unit: microseconds
#                                  expr    min     lq  median      uq       max neval
# mat[, Vsub] <- mat[, Vsub] + toinsert 49.273 49.628 50.3250 50.8075 20020.400   100
#         funRcpp(mat2, toinsert, Vsub)  6.450  6.805  7.6605  7.9215    25.914   100

I think this is basically what @Joshua Ulrich proposed. His warnings regarding breaking R's functional paradigm apply.

I do the addition in C++, but it is trivial to change the function to only do assignment.

Obviously, if you can implement your whole loop in Rcpp, you avoid repeated function calls at the R level and will gain performance.
