Converting a Document Term Matrix into a Matrix with lots of data causes overflow


Problem description

Let's do some text mining.

Here I stand with a document term matrix (from the tm package):

dtm <- TermDocumentMatrix(
     myCorpus,
     control = list(
         weighting = weightTfIdf,   # tm's control option is named "weighting"
         tolower=TRUE,
         removeNumbers = TRUE,
         minWordLength = 2,
         removePunctuation = TRUE,
         stopwords=stopwords("german")
      ))

When I run

typeof(dtm)

I see that it is a "list", and the structure looks like this:

Docs
Terms        1 2 ...
  lorem      0 0 ...
  ipsum      0 0 ...
  ...        .......

So I try

wordMatrix = as.data.frame( t(as.matrix(  dtm )) ) 

That works for 1000 documents.

But when I try it with 40000 documents, it no longer works.

I get this error:

Fehler in vector(typeof(x$v), nr * nc) : Vektorgröße kann nicht NA sein
Zusätzlich: Warnmeldung:
In nr * nc : NAs durch Ganzzahlüberlauf erzeugt

In English: Error in vector ... : vector size cannot be NA. In addition, a warning: in nr * nc, NAs produced by integer overflow.

So I looked at as.matrix, and it turns out that the function first converts the object to a vector with as.vector and then to a matrix. The conversion to a vector works, but the conversion from the vector to the matrix doesn't.

Do you have any suggestions as to what the problem could be?

Thanks, The Captain

Recommended answer

Integer overflow tells you exactly what the problem is: with 40000 documents, you have too much data. The problem begins in the conversion to a matrix, by the way, which you can see if you look at the code of the underlying function:

class(dtm)
[1] "TermDocumentMatrix"    "simple_triplet_matrix"

getAnywhere(as.matrix.simple_triplet_matrix)

A single object matching ‘as.matrix.simple_triplet_matrix’ was found
...
function (x, ...) 
{
    nr <- x$nrow
    nc <- x$ncol
    y <- matrix(vector(typeof(x$v), nr * nc), nr, nc)
   ...
}

This is the line referenced by the error message. What's going on can easily be simulated with:

as.integer(40000 * 60000) # 40000 documents is 40000 rows in the resulting frame
[1] NA
Warning message:
NAs introduced by coercion 

The function vector() takes the length as an argument, in this case nr * nc. If that integer product is larger than approx. 2e9 (.Machine$integer.max), it overflows and is replaced by NA. This NA is not valid as a length argument for vector().
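
To make the limit concrete, here is a small base-R sketch of the overflow and of what a dense matrix of that shape would cost. The 40000 x 60000 dimensions are the illustrative figures from the simulation above, not the measured size of the asker's matrix, and the 8 bytes per cell assumes double-valued weights such as tf-idf.

```r
# Dimensions stored as integers, as they are in a simple_triplet_matrix
nr <- 40000L
nc <- 60000L

# Integer multiplication overflows: the result is NA (plus a warning)
cells_int <- suppressWarnings(nr * nc)
is.na(cells_int)

# The same product in double precision does not overflow
cells_dbl <- as.numeric(nr) * as.numeric(nc)
cells_dbl > .Machine$integer.max

# Approximate size in GiB of a dense double matrix with that many cells
dense_gib <- cells_dbl * 8 / 2^30
dense_gib
```

Even if the length were computed without overflow, a dense matrix of this shape would need roughly 18 GiB, so a sparse representation is the more practical route.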

Bottom line: you're running into the limits of R. For now, working in 64-bit won't help you. You'll have to resort to different methods. One possibility would be to continue working with the list you have (dtm is a list), select the data you need using list manipulation, and go from there.
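
One way to act on that advice, as a sketch under assumptions: the fields of the triplet list (i, j, v, nrow, ncol, dimnames) can be passed to a sparse matrix class instead of a dense one. The example uses sparseMatrix() from the Matrix package, which is my choice and not something mentioned in the question, and a tiny hand-built list `stm` as a hypothetical stand-in for the real dtm so that it runs without tm.

```r
library(Matrix)

# Hypothetical stand-in for dtm, with the fields of a simple_triplet_matrix
stm <- list(
  i = c(1L, 2L, 2L, 3L),        # term (row) indices of the non-zero cells
  j = c(1L, 1L, 3L, 2L),        # document (column) indices of those cells
  v = c(0.5, 1.2, 0.7, 2.0),    # weights of those cells
  nrow = 3L,
  ncol = 3L,
  dimnames = list(c("lorem", "ipsum", "dolor"), c("1", "2", "3"))
)

# Sparse terms-by-documents matrix: only the non-zero cells are stored
sm <- sparseMatrix(i = stm$i, j = stm$j, x = stm$v,
                   dims = c(stm$nrow, stm$ncol),
                   dimnames = stm$dimnames)

# Documents as rows, like t(as.matrix(dtm)), but still sparse
doc_term <- t(sm)

# Example follow-up: total weight per term, computed without densifying
term_totals <- colSums(doc_term)
term_totals
```

With a real dtm the same call would take dtm$i, dtm$j, dtm$v, dtm$nrow, dtm$ncol and dtm$dimnames. Reducing the vocabulary first, for example with tm's removeSparseTerms(), may also shrink the problem enough for the rest of the analysis.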

PS: I made a dtm object with

require(tm)
data("crude")
dtm <- TermDocumentMatrix(crude,
                          control = list(weighting = weightTfIdf,
                                         stopwords = TRUE))
