Big matrix and memory problems

Views: 320  Published: 2020/5/7 19:19:15  Tags: r, matrix, r-bigmemory, bigdata

Question

I am working on a huge dataset and would like to derive the distribution of a test statistic. To do so, I need to do calculations with huge matrices (200000 x 200000) and, as you might expect, I run into memory issues. More precisely, I get the following: Error: cannot allocate vector of size ... Gb. I am using the 64-bit version of R and have 8 GB of RAM. I tried the bigmemory package, but without much success.
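
For a sense of scale, here is a rough back-of-the-envelope estimate of the memory involved, assuming every entry is stored as an 8-byte double (the variable names are only illustrative):

```r
# Rough memory estimate for an n x n matrix of doubles.
n <- 200000
bytes_per_double <- 8

full_bytes <- n^2 * bytes_per_double              # full square matrix
dist_bytes <- n * (n - 1) / 2 * bytes_per_double  # lower triangle only, as a "dist" object stores it

full_gib <- full_bytes / 2^30
dist_gib <- dist_bytes / 2^30
cat(sprintf("full matrix: %.0f GiB, dist object: %.0f GiB\n", full_gib, dist_gib))
```

Both figures (roughly 298 GiB and 149 GiB) are far beyond 8 GB of RAM, so at this size even the half-matrix "dist" object would need disk-backed storage or a smaller problem.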

The first issue comes when I have to calculate the distance matrix. I found a nice function called Dist in the amap package that calculates the distances between the columns of a data frame in parallel. It works well, but it produces only the lower/upper triangle. I need the full distance matrix to perform matrix multiplications, and unfortunately I cannot do that with half of the matrix. When I use the as.matrix function to make it full, I run into memory issues again.

So my question is: how can I convert a dist object to a big.matrix while skipping the as.matrix step? I suppose this might be an Rcpp question; please keep in mind that I am really new to Rcpp.

Thanks in advance!

Recommended answer

On converting a "dist" object to a "(big.)matrix": stats:::as.matrix.dist contains calls to row, col and t, plus operators that create large intermediate objects. To avoid these you could, among other alternatives, use something like the following.

Given some data:

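nr = 1e4
# note: runif(nr) is recycled to fill all 10 columns, so the columns are identical;
# use runif(nr * 10) for independent columns
m = matrix(runif(nr), nr, 10)
d = dist(m)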

Then, slowly, allocate and fill a "matrix":

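# as.matrix(d)  # this gives an error on my machine
n = attr(d, "Size")
md = matrix(0, n, n)
# start position in "d" of each column of the lower triangle
# (as.numeric avoids integer overflow in cumsum for very large n)
id = cumsum(c(1, as.numeric((n - 1L) - 0:(n - 2L))))
for(j in 1:(n - 1L)) {
    i = (j + 1L):n  # rows below the diagonal in column j
    # fill column j below the diagonal and, by symmetry, row j to its right
    md[i, j] = md[j, i] = d[id[j]:(id[j] + (n - (j + 1L)))]
}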

(Allocating "md" as big.matrix(n, n, init = 0) seems to work equally well.)

md[2:5, 1]
#[1] 2.64625973 2.01071637 0.09207748 0.09346157
d[1:4]
#[1] 2.64625973 2.01071637 0.09207748 0.09346157
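
The check above reflects how a "dist" object is laid out: the lower triangle of the distance matrix, stored column by column. If only individual entries are needed, they can be read straight from "d" using the index formula given in ?dist. A minimal sketch, where dist_at is a hypothetical helper name (not part of base R) and the small test data are made up for illustration:

```r
# Hypothetical helper: distance between observations i and j,
# read directly from a "dist" object without expanding it.
dist_at <- function(d, i, j) {
    n <- as.numeric(attr(d, "Size"))  # numeric to avoid integer overflow for large n
    if (i == j) return(0)             # the diagonal is not stored; it is always 0
    lo <- as.numeric(min(i, j))
    hi <- as.numeric(max(i, j))
    # position of entry (hi, lo), hi > lo, in the column-wise lower triangle
    k <- n * (lo - 1) - lo * (lo - 1) / 2 + hi - lo
    d[k]
}

# Small made-up example to compare against the full matrix
set.seed(1)
m_small <- matrix(runif(50), 10, 5)
d_small <- dist(m_small)
full_small <- as.matrix(d_small)
dist_at(d_small, 7, 3) == full_small[7, 3]
```

This trades speed for memory: each lookup is cheap, but looping over all pairs this way in plain R would be slow.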

Using a smaller "nr", we can test:

all.equal(as.matrix(md), as.matrix(d), check.attributes = FALSE)
#[1] TRUE
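
Since the end goal in the question is matrix multiplication, another option may be to skip the full matrix altogether and multiply straight from the "dist" object. The following is only a sketch under that assumption: dist_times_vec is a hypothetical helper (not from any package) that computes the product of the symmetric distance matrix with a vector, one column of the lower triangle at a time.

```r
# Hypothetical helper: computes as.matrix(d) %*% x without building the full matrix.
dist_times_vec <- function(d, x) {
    n <- attr(d, "Size")
    stopifnot(length(x) == n)
    y <- numeric(n)
    start <- 1  # position in "d" where column j begins (numeric, to avoid integer overflow)
    for (j in seq_len(n - 1L)) {
        i <- (j + 1L):n
        col_j <- d[start:(start + n - j - 1)]  # distances from j to j+1, ..., n
        y[i] <- y[i] + col_j * x[j]            # lower-triangle contribution
        y[j] <- y[j] + sum(col_j * x[i])       # mirrored upper-triangle contribution
        start <- start + (n - j)
    }
    y
}

# Small made-up example to compare against the dense product
set.seed(42)
m_test <- matrix(runif(60), 20, 3)
d_test <- dist(m_test)
x_test <- rnorm(20)
all.equal(dist_times_vec(d_test, x_test), as.vector(as.matrix(d_test) %*% x_test))
```

Multiplying by a matrix could be done by applying the same idea to each of its columns. The R-level loop will be slow for large n, so this is the kind of routine that might be worth porting to Rcpp.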
