tm package error: "Cannot convert DocumentTermMatrix into normal matrix since vector is too large"

Problem description

I have created a DocumentTermMatrix that contains 1859 documents (rows) and 25722 terms (columns). To perform further calculations on this matrix, I need to convert it to a regular matrix, so I tried the as.matrix() command. However, it returns the following error: cannot allocate vector of size 364.8 MB.

> corp
A corpus with 1859 text documents
> mat<-DocumentTermMatrix(corp)
> dim(mat)
[1]  1859 25722
> is(mat)
[1] "DocumentTermMatrix"
> mat2<-as.matrix(mat)
Fehler: kann Vektor der Größe 364.8 MB nicht allozieren # cannot allocate vector of size 364.8 MB
> object.size(mat)
5502000 bytes

For some reason, the size of the object seems to increase dramatically when it is converted to a regular matrix. How can I avoid this?

Or is there an alternative way to perform regular matrix operations on a DocumentTermMatrix?

Recommended answer

The quick and dirty way is to export your data into a sparse matrix object from an external package such as Matrix.
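As for why the object grows so much: a regular matrix stores every cell, zeros included. The arithmetic below is a rough sketch that assumes each cell is stored as an 8-byte double (an assumption, though the result is consistent with the 364.8 MB in the error message). The variable names are made up for illustration.

```r
# Dimensions reported by dim(mat) in the question
n_docs  <- 1859
n_terms <- 25722

# A dense matrix needs one slot per cell, whether or not the cell is zero
cells <- n_docs * n_terms

# Assumption: cells are stored as doubles, 8 bytes each
dense_bytes <- cells * 8
dense_mib   <- dense_bytes / 1024^2

# The sparse DocumentTermMatrix was about 5.5 MB according to object.size(mat)
sparse_bytes <- 5502000

round(dense_mib, 1)                 # size of the dense allocation in MB
round(dense_bytes / sparse_bytes)   # roughly how many times larger it is
```

So the conversion is not inflating the data by accident; it is simply writing out all the zeros that the sparse form leaves implicit.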

> attributes(dtm)
$names
[1] "i"        "j"        "v"        "nrow"     "ncol"     "dimnames"

$class
[1] "DocumentTermMatrix"    "simple_triplet_matrix"

$Weighting
[1] "term frequency" "tf"            

The dtm object has the components i, j and v, which are the internal (triplet) representation of your DocumentTermMatrix. Use:

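library("Matrix")
# Build a sparse matrix directly from the triplets stored in the dtm:
# i = row (document) indices, j = column (term) indices, v = the non-zero counts
mat <- sparseMatrix(
           i=dtm$i,
           j=dtm$j,
           x=dtm$v,
           dims=c(dtm$nrow, dtm$ncol)
           )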

and you're done.
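To try the conversion on something small, here is a minimal sketch that uses a hand-built list as a stand-in for a DocumentTermMatrix. The `toy` object and its values are hypothetical; the only thing it shares with a real dtm is the set of fields `i`, `j`, `v`, `nrow`, `ncol` and `dimnames` listed in the `attributes()` output. Passing `dimnames` is optional and simply carries the document and term labels over.

```r
library(Matrix)

# Hypothetical stand-in for a DocumentTermMatrix: 3 documents, 4 terms,
# stored as triplets (row index, column index, value), like dtm$i, dtm$j, dtm$v
toy <- list(
  i = c(1L, 1L, 2L, 3L, 3L),
  j = c(1L, 3L, 2L, 1L, 4L),
  v = c(2, 1, 5, 1, 3),
  nrow = 3L,
  ncol = 4L,
  dimnames = list(c("d1", "d2", "d3"),
                  c("apple", "banana", "cherry", "date"))
)

# Same call as for the real dtm, plus the optional labels
sp <- sparseMatrix(
  i = toy$i, j = toy$j, x = toy$v,
  dims = c(toy$nrow, toy$ncol),
  dimnames = toy$dimnames
)

# Only the non-zero cells are stored in the sparse object
n_stored <- length(sp@x)

# Converting to a dense matrix is harmless for a toy example;
# it is this step that runs out of memory at full scale
dense <- as.matrix(sp)
```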

A naive comparison between your objects:

> mat[1,1:100]
> head(as.vector(dtm[1,]), 100)

will each give you the exact same output.
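On the second part of the question, many regular matrix operations can be applied to the sparse object directly, without going through as.matrix(). The sketch below uses a small made-up sparse matrix in place of the converted dtm; which operations you actually need will depend on your analysis.

```r
library(Matrix)

# Small made-up sparse matrix standing in for the converted dtm
# (3 documents as rows, 4 terms as columns)
sp <- sparseMatrix(
  i = c(1, 1, 2, 3, 3),
  j = c(1, 3, 2, 1, 4),
  x = c(2, 1, 5, 1, 3),
  dims = c(3, 4)
)

doc_lengths <- rowSums(sp)   # total term count per document
term_totals <- colSums(sp)   # total count per term across all documents

# Document-by-document dot products; the result is itself a sparse matrix
doc_sim <- tcrossprod(sp)

# Indexing a single row returns an ordinary numeric vector
first_doc <- sp[1, ]
```

If a dense result is unavoidable for one step, converting only the rows or columns involved (for example `as.matrix(sp[1:10, ])`) keeps the allocation small.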
