Efficient Way to Convert CSV of Sparse Distances to Dist Object R


Problem description

I have a very large CSV file (about 91 million rows, so a for loop takes too long in R) of similarities between keywords (about 50,000 unique keywords). When I read it into a data.frame, it looks like this:

> df   
kwd1 kwd2 similarity  
a  b  1  
b  a  1  
c  a  2  
a  c  2 

It is a sparse list and I can convert it into a sparse matrix using sparseMatrix():

> myMatrix 
  a b c  
a . 1 2
b 1 . .
c 2 . .
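
For readers who want to reproduce this step, here is a minimal sketch of one way the sparse matrix might be built from the data.frame. It assumes the Matrix package and the column names shown above; the `keywords`, `i`, and `j` helper variables are introduced here for illustration and are not from the original post.

```r
library(Matrix)

# Toy data matching the example above.
df <- data.frame(
  kwd1 = c("a", "b", "c", "a"),
  kwd2 = c("b", "a", "a", "c"),
  similarity = c(1, 1, 2, 2),
  stringsAsFactors = FALSE
)

# Map every keyword to one integer index shared by rows and columns.
keywords <- sort(unique(c(df$kwd1, df$kwd2)))
i <- match(df$kwd1, keywords)
j <- match(df$kwd2, keywords)

# Only the listed (i, j) pairs are stored; everything else stays implicit.
myMatrix <- sparseMatrix(
  i = i, j = j, x = df$similarity,
  dims = c(length(keywords), length(keywords)),
  dimnames = list(keywords, keywords)
)
myMatrix
```

Note that sparseMatrix() sums the values of repeated (i, j) pairs by default, so duplicated rows in the CSV would need to be handled before this step.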

However, I would now like to convert this into a dist object. I tried as.dist(myMatrix), but I got an error saying the problem was too large for as.dist(). I also tried converting the sparse matrix to a lower-triangular sparse matrix first and then to a dist object (thinking this might work better), using myMatrix = myMatrix * lower.tri(myMatrix), but I then got the same error, this time from the lower.tri function.

Thanks for your help!

Recommended answer

An object of class "dist" is a dense object. To go from the sparse representation will require a vector on the order of

R> 0.5*(91000000*90999999)
[1] 4.1405e+15

elements (give or take for the diagonal). In R, the maximum length of a vector is 2^31 - 1:

R> 2^31 - 1
[1] 2147483647

which is far smaller than the number of elements needed to store the dense "dist" object, so it won't be possible; that is the reason for the error from as.dist(). For similar reasons, you won't be able to store the lower-triangle version of the data as a dense object, as it too is held as a vector with the same length limit.
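
The arithmetic can be checked for any number of objects with a small helper. This is only a sketch: `dist_length()` is a hypothetical name introduced here, and the two counts are the figures quoted in the question (91 million rows, about 50,000 unique keywords). For comparison, it also evaluates the size of a full dense square matrix over the 50,000 keywords.

```r
# Number of entries in the lower triangle of an n x n distance matrix,
# which is what a dense "dist" object stores.
dist_length <- function(n) n * (n - 1) / 2

n_rows <- 91e6      # figure used in the calculation above
n_keywords <- 5e4   # unique keywords reported in the question

# Compare against the 2^31 - 1 limit quoted above.
dist_length(n_rows) > .Machine$integer.max

# Lower-triangle length for the unique keywords only.
dist_length(n_keywords)

# A full dense keyword-by-keyword matrix, for comparison.
n_keywords^2 > .Machine$integer.max
```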

At this point I think you'll need to explain more about the actual problem and what you want the dissimilarity object for (in another Question!). Do you need all dissimilarities between the 91 million objects, or could you get by with a sample that fits within the current length limit for R's vectors?
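
If a sample turns out to be enough, one possible approach is to subset the sparse matrix to the sampled keywords and make only that block dense. The following is a rough sketch under stated assumptions, not a recommendation from the original answer: the toy matrix `sim`, the sample size, and the seed are all hypothetical. Pairs missing from the sparse data become 0 in the dense block, and the values are still similarities, so whether and how to turn them into dissimilarities is left to the reader.

```r
library(Matrix)

# Toy stand-in for the full sparse similarity matrix (hypothetical data).
keywords <- c("a", "b", "c", "d")
sim <- sparseMatrix(
  i = c(1, 2, 1, 3, 2, 4),
  j = c(2, 1, 3, 1, 4, 2),
  x = c(1, 1, 2, 2, 3, 3),
  dims = c(4, 4), dimnames = list(keywords, keywords)
)

set.seed(1)                          # arbitrary seed, for reproducibility
keep <- sort(sample(nrow(sim), 3))   # indices of the sampled keywords

# Only the sampled block is converted to a dense matrix.
sub <- as.matrix(sim[keep, keep])

# as.dist() keeps the lower triangle of the dense block.
d <- as.dist(sub)
d
```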
