如何优化 R 中矩阵子部分的读写(可能使用 data.table) [英] How to optimize Read and Write to subsections of a matrix in R (possibly using data.table)

查看:8
本文介绍了如何优化 R 中矩阵子部分的读写(可能使用 data.table)的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

TL;DR

<块引用>

在 R 中读取和写入子集的最快方法是什么来自一个非常大的矩阵的列.我尝试使用 data.table 解决方案但需要一种快速的方法来提取列序列?

简短回答:操作中代价高昂的部分是赋值.因此,解决方案是坚持使用矩阵并使用 Rcpp 和 C++ 来修改矩阵.下面有两个很好的示例答案.[对于那些适用于其他问题的人,请务必阅读解决方案中的免责声明!].滚动到问题的底部,了解更多经验教训.

<小时>

这是我的第一个 Stack Overflow 问题 - 非常感谢您抽出时间查看,如果我遗漏了任何内容,我深表歉意.我正在研究一个 R 包,我在子集和写入矩阵部分时遇到了性能瓶颈(对于统计学家来说,应用程序在处理每个数据点后更新足够的统计数据).单个操作的速度非常快,但它们的绝对数量要求它尽可能快.这个想法的最简单版本是一个维度为 K 乘以 V 的矩阵,其中 K 通常介于 5 和 1000 之间,而 V 可以介于 1000 和 1,000,000 之间.

set.seed(94253)K <- 100V <- 100000mat <- 矩阵(runif(K*V),nrow=K,ncol=V)

然后我们最终对列的子集执行计算并将其添加到完整矩阵中.因此天真地看起来像

Vsub <- sample(1:V, 20)toinsert <- 矩阵(runif(K*length(Vsub)), nrow=K, ncol=length(Vsub))mat[,Vsub] <- mat[,Vsub] + toinsert库(微基准)微基准(mat[,Vsub] <- mat[,Vsub] + toinsert)

因为这样做了很多次,由于 R 的更改时复制语义,它可能会非常慢(但请参阅下面的经验教训,修改实际上可以在某些情况下发生).

对于我的问题,对象不必是矩阵(我对此处概述的差异很敏感 将矩阵分配给 data.table 的子集).我总是想要完整的列,因此数据框的列表结构很好.我的解决方案是使用 Matthew Dowle 很棒的 data.table 包.使用 set() 可以非常快速地完成写入.不幸的是,获得价值有点复杂.我们必须使用=FALSE 调用变量设置,这会大大减慢速度.

库(data.table)DT <- as.data.table(mat)set(DT, i=NULL, j=Vsub,DT[,Vsub,with=FALSE] + as.numeric(toinsert))

在 set() 函数中,使用 i=NULL 来引用所有行的速度非常快,但是(可能是由于事物在后台存储的方式)j 没有可比的选项.@Roland 在评论中指出,一种选择是转换为三重表示(行号、列号、值)并使用 data.tables 二进制搜索来加快检索速度.我手动测试了它,虽然它很快,但它确实使矩阵的内存需求大约增加了三倍.如果可能的话,我想避免这种情况.

按照这里的问题:从 data.table 和 data.frame 对象获取单个元素的时间.Hadley Wickham 为单个索引提供了令人难以置信的快速解决方案

Vone <- Vsub[1]toinsert.one <- toinsert[,1]设置(DT,i = NULL,j = Vone,(.subset2(DT,Vone)+ toinsert.one))

但是,由于 .subset2(DT,i) 只是 DT[[i]] 没有方法分派,因此(据我所知)无法一次抓取几列,尽管它看起来确实应该是可能的.与上一个问题一样,似乎既然我们可以快速覆盖这些值,我们应该能够快速读取它们.

有什么建议吗?另外请让我知道是否有比 data.table 更好的解决方案来解决这个问题.我意识到它在很多方面并不是真正的预期用例,但我试图避免将整个系列的操作移植到 C 中.

这里是讨论的元素的时序序列 - 前两个都是列,后两个只是一列.

microbenchmark(mat[,Vsub] <- mat[,Vsub] + toinsert,set(DT, i=NULL, j=Vsub,DT[,Vsub,with=FALSE] + as.numeric(toinsert)),mat[,Vone] <- mat[,Vone] + toinsert.one,set(DT, i=NULL, j=Vone,(.subset2(DT, Vone) + toinsert.one)),次=1000L)单位:微秒expr min lq 中值 uq max neval矩阵 51.970 53.895 61.754 77.313 135.698 1000数据表 4751.982 4962.426 5087.376 5256.597 23710.826 1000矩阵单列 8.021 9.304 10.427 19.570 55303.659 1000Data.Table 单列 6.737 7.700 9.304 11.549 89.824 1000

<小时>

答案和经验教训:

<块引用>

评论将操作中最昂贵的部分确定为分配过程.两种解决方案都基于 C 代码给出答案,这些代码修改了矩阵,打破了 R 约定,即不修改函数的参数,但提供更快的结果.

Hadley Wickham 在评论中停下来注意到,只要对象垫没有在其他地方引用,矩阵修改实际上就完成了(参见 http://adv-r.had.co.nz/memory.html#modification-in-place).这指出了一个有趣而微妙的观点.我在 RStudio 中执行我的评估.正如 Hadley 在他的书中提到的,RStudio 为不在函数内的每个对象创建了一个额外的参考.因此,虽然在函数的性能情况下,修改会发生在适当的位置,但在命令行它会产生更改时复制的效果.Hadley 的包 pryr 有一些很好的函数来跟踪内存的引用和地址.

解决方案

Rcpp 的乐趣:

您可以使用 Eigen 的 Map 类 来修改 R 对象.p>

库(RcppEigen)库(内联)包括 <- '使用 Eigen::Map;使用 Eigen::MatrixXd;使用 Eigen::VectorXi;typedef 映射<MatrixXd>地图;typedef 映射<VectorXi>地图维西;'正文 <- 'MapMatd A(as<MapMatd>(AA));const MapMatd B(as<MapMatd>(BB));const MapVeci ix(as<MapVeci>(ind));常量 int mB(B.cols());对于 (int i = 0; i < mB; ++i){A.col(ix.coeff(i)-1) += B.col(i);}'funRcpp <- cxx函数(签名(AA =矩阵",BB =矩阵",ind =整数"),身体,RcppEigen",包括)set.seed(94253)K <- 100V <- 100000mat2 <- mat <- 矩阵(runif(K*V),nrow=K,ncol=V)Vsub <- 样本(1:V, 20)toinsert <- 矩阵(runif(K*length(Vsub)), nrow=K, ncol=length(Vsub))mat[,Vsub] <- mat[,Vsub] + toinsert不可见(funRcpp(mat2,toinsert,Vsub))all.equal(垫子,垫子2)#[1] 是的库(微基准)microbenchmark(mat[,Vsub] <- mat[,Vsub] + toinsert,funRcpp(mat2, toinsert, Vsub))# 单位:微秒# expr min lq 中值 uq max neval# mat[, Vsub] <- mat[, Vsub] + toinsert 49.273 49.628 50.3250 50.8075 20020.400 100# funRcpp(mat2, toinsert, Vsub) 6.450 6.805 7.6605 7.9215 25.914 100

我认为这基本上是@Joshua Ulrich 提出的.他关于打破 R 函数范式的警告适用.

我在 C++ 中进行加法,但将函数更改为仅进行赋值是微不足道的.

显然,如果您可以在 Rcpp 中实现整个循环,您可以避免在 R 级别重复调用函数,并且会提高性能.

TL;DR

What is the fastest method in R for reading and writing a subset of columns from a very large matrix. I attempt a solution with data.table but need a fast way to extract a sequence of columns?

Short Answer: The expensive part of the operation is assignment. Thus the solution is to stick with a matrix and use Rcpp and C++ to modify the matrix in place. There are two excellent answers below with examples.[for those applying to other problems be sure to read the disclaimers in the solutions!]. Scroll to the bottom of the question for some more lessons learned.


This is my first Stack Overflow question- I greatly appreciate your time in taking a look and I apologize if I've left anything out. I'm working on an R package where I have a performance bottleneck from subsetting and writing to portions of a matrix (NB for statisticians the application is updating sufficient statistics after processing each data point). The individual operations are incredibly fast but the sheer number of them requires it to be as fast as possible. The simplest version of the idea is a matrix of dimension K by V where K is generally between 5 and 1000 and V can be between 1000 and 1,000,000.

set.seed(94253)
K <- 100
V <- 100000
mat <-  matrix(runif(K*V),nrow=K,ncol=V)

we then end up performing a calculation on a subset of the columns and adding this into the full matrix. thus naively it looks like

Vsub <- sample(1:V, 20)
toinsert <- matrix(runif(K*length(Vsub)), nrow=K, ncol=length(Vsub))
mat[,Vsub] <- mat[,Vsub] + toinsert
library(microbenchmark)
microbenchmark(mat[,Vsub] <- mat[,Vsub] + toinsert)

because this is done so many times it can be quite slow as a result of R's copy-on-change semantics (but see the lessons learned below, modification can actually happen in place in some cricumstances).

For my problem the object need not be a matrix (and I'm sensitive to the difference as outlined here Assign a matrix to a subset of a data.table). I always want the full column and so the list structure of a data frame is fine. My solution was to use Matthew Dowle's awesome data.table package. The write can be done extraordinarily quickly using set(). Unfortunately getting the value is somewhat more complicated. We have to call the variables setting with=FALSE which dramatically slows things down.

library(data.table)
DT <- as.data.table(mat)  
set(DT, i=NULL, j=Vsub,DT[,Vsub,with=FALSE] + as.numeric(toinsert))

Within the set() function using i=NULL to reference all rows is incredibly fast but (presumably due to the way things are stored under the hood) there is no comparable option for j. @Roland notes in the comments that one option would be to convert to a triple representation (row number, col number, value) and use data.tables binary search to speed retrieval. I tested this manually and while it is quick, it does approximately triple the memory requirements for the matrix. I would like to avoid this if possible.

Following the question here: Time in getting single elemets from data.table and data.frame objects. Hadley Wickham gave an incredibly fast solution for a single index

Vone <- Vsub[1]
toinsert.one <- toinsert[,1]
set(DT, i=NULL, j=Vone,(.subset2(DT, Vone) + toinsert.one))

however since the .subset2(DT,i) is just DT[[i]] without the methods dispatch there is no way (to my knowledge) to grab several columns at once although it certainly seems like it should be possible. As in the previous question, it seems like since we can overwrite the values quickly we should be able to read them quickly.

Any suggestions? Also please let me know if there is a better solution than data.table for this problem. I realized its not really the intended use case in many respects but I'm trying to avoid porting the whole series of operations to C.

Here are a sequence of timings of elements discussed- the first two are all columns, the second two are just one column.

microbenchmark(mat[,Vsub] <- mat[,Vsub] + toinsert,
              set(DT, i=NULL, j=Vsub,DT[,Vsub,with=FALSE] + as.numeric(toinsert)),
              mat[,Vone] <- mat[,Vone] + toinsert.one,
              set(DT, i=NULL, j=Vone,(.subset2(DT, Vone) + toinsert.one)),
              times=1000L)

Unit: microseconds
                  expr      min       lq   median       uq       max neval
                Matrix   51.970   53.895   61.754   77.313   135.698  1000
            Data.Table 4751.982 4962.426 5087.376 5256.597 23710.826  1000
     Matrix Single Col    8.021    9.304   10.427   19.570 55303.659  1000
 Data.Table Single Col    6.737    7.700    9.304   11.549    89.824  1000


Answer and Lessons Learned:

Comments identified the most expensive part of the operation as the assignment process. Both solutions give answers based on C code which modify the matrix in place breaking R convention of not modifying the argument to a function but providing a much faster result.

Hadley Wickham stopped by in the comments to note that the matrix modification is actually done in place as long as the object mat is not referenced elsewhere (see http://adv-r.had.co.nz/memory.html#modification-in-place). This points to an interesting and subtle point. I was performing my evaluations in RStudio. RStudio as Hadley notes in his book creates an additional reference for every object not within a function. Thus while in the performance case of a function the modification would happen in place, at the command line it was producing a copy-on-change effect. Hadley's package pryr has some nice functions for tracking references and addresses of memory.

解决方案

Fun with Rcpp:

You can use Eigen's Map class to modify an R object in place.

library(RcppEigen)
library(inline)

incl <- '
using  Eigen::Map;
using  Eigen::MatrixXd;
using  Eigen::VectorXi;
typedef  Map<MatrixXd>  MapMatd;
typedef  Map<VectorXi>  MapVeci;
'

body <- '
MapMatd              A(as<MapMatd>(AA));
const MapMatd        B(as<MapMatd>(BB));
const MapVeci        ix(as<MapVeci>(ind));
const int            mB(B.cols());
for (int i = 0; i < mB; ++i) 
{
A.col(ix.coeff(i)-1) += B.col(i);
}
'

funRcpp <- cxxfunction(signature(AA = "matrix", BB ="matrix", ind = "integer"), 
                       body, "RcppEigen", incl)

set.seed(94253)
K <- 100
V <- 100000
mat2 <-  mat <-  matrix(runif(K*V),nrow=K,ncol=V)

Vsub <- sample(1:V, 20)
toinsert <- matrix(runif(K*length(Vsub)), nrow=K, ncol=length(Vsub))
mat[,Vsub] <- mat[,Vsub] + toinsert

invisible(funRcpp(mat2, toinsert, Vsub))
all.equal(mat, mat2)
#[1] TRUE

library(microbenchmark)
microbenchmark(mat[,Vsub] <- mat[,Vsub] + toinsert,
               funRcpp(mat2, toinsert, Vsub))
# Unit: microseconds
#                                  expr    min     lq  median      uq       max neval
# mat[, Vsub] <- mat[, Vsub] + toinsert 49.273 49.628 50.3250 50.8075 20020.400   100
#         funRcpp(mat2, toinsert, Vsub)  6.450  6.805  7.6605  7.9215    25.914   100

I think this is basically what @Joshua Ulrich proposed. His warnings regarding breaking R's functional paradigm apply.

I do the addition in C++, but it is trivial to change the function to only do assignment.

Obviously, if you can implement your whole loop in Rcpp, you avoid repeated function calls at the R level and will gain performance.

这篇关于如何优化 R 中矩阵子部分的读写(可能使用 data.table)的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆