Big document-term matrix: error when counting the number of characters of documents

Problem description

I have built a large document-term matrix with the RTextTools package.

Now I am trying to count the number of characters in the matrix rows so that I can remove empty documents before performing topic modeling.

My code gives no errors when I apply it to a sample of my corpus, which yields a smaller matrix. But when I try to count the row lengths of the documents in the matrix produced from my entire corpus (~75000 tweets), I get the following error message:

Error in vector(typeof(x$v), nr * nc) : 
  the dimension of the vector no cannot be NA
And: Warning message:
In nr * nc : NA produced by integer overflow

Here is my code:

matrix <- create_matrix(data$clean_text, language="french", stemWords=TRUE, removeStopwords=TRUE, removeNumbers=TRUE, stripWhitespace=TRUE, toLower=TRUE, removePunctuation=TRUE, minWordLength=3)

rowTotals <- apply(matrix, 1, sum)

If I try with a matrix of 25000 documents, I get the following error:

message: rowTotals <- apply(matrix, 1, sum) 
Errore: cannot allocate vector of size 7.1 Gb

Recommended answer

You might be able to work around this if you keep your data in the dtm, which uses a sparse matrix representation that is much more memory-efficient than a regular matrix.

The apply function fails because it converts the sparse matrix into a regular (dense) matrix. (The object is named matrix in your question; by the way, it is poor style to give data objects the names of functions, especially base functions.) This means R has to allocate memory for every cell of the dtm, and since most cells are typically zero, that is a lot of memory spent on zeros. With a sparse matrix, R does not need to store any of the zeros.
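A rough back-of-the-envelope check in base R suggests why both error messages appear. The number of terms is not given in the question; the value below is an assumption inferred from the 7.1 Gb figure, assuming a dense matrix stores each cell as an 8-byte double.

```r
# Assumption: each cell of the dense matrix is an 8-byte double.
bytes_per_cell <- 8

# The 25000-document matrix needed 7.1 Gb, which suggests roughly this many terms.
n_terms <- round(7.1 * 1024^3 / (bytes_per_cell * 25000))
n_terms

# Memory a dense version of the full 75000-document matrix would need, in Gb.
75000 * n_terms * bytes_per_cell / 1024^3

# The cell count nr * nc is computed with integer arithmetic, and R integers
# cannot exceed .Machine$integer.max, so the product overflows to NA.
n_cells <- suppressWarnings(75000L * as.integer(n_terms))
is.na(n_cells)

# The 25000-row case still fits in an integer, so it fails on memory instead.
25000L * as.integer(n_terms) <= .Machine$integer.max
```

Under these assumptions the full corpus would need over 20 Gb as a dense matrix, and its cell count does not even fit in an R integer, which matches the "NA produced by integer overflow" warning.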

Here are the first few lines of the source for apply; the last line shown is where the conversion to a regular matrix happens:

apply
function (X, MARGIN, FUN, ...) 
{
    FUN <- match.fun(FUN)
    dl <- length(dim(X))
    if (!dl) 
        stop("dim(X) must have a positive length")
    if (is.object(X)) 
        X <- if (dl == 2L) 
            as.matrix(X) # this is where your memory gets filled with zeros

So how do you avoid that conversion? Here is one way to loop over the rows and get their sums while keeping the sparse matrix format:

sapply(seq(nrow(matrix)), function(i) sum(matrix[i,]))
[1] 2 1 2 2 1

Subsetting this way preserves the sparse format and does not convert the object to the more memory-expensive regular matrix representation. We can check the representation:

str(matrix[1,])
List of 6
 $ i       : int [1:2] 1 1
 $ j       : int [1:2] 1 3
 $ v       : num [1:2] 1 1
 $ nrow    : int 1
 $ ncol    : int 6
 $ dimnames:List of 2
  ..$ Docs : chr "1"
  ..$ Terms: chr [1:6] "document" "file" "first" "second" ...
 - attr(*, "class")= chr [1:2] "DocumentTermMatrix" "simple_triplet_matrix"
 - attr(*, "weighting")= chr [1:2] "term frequency" "tf"

So inside the sapply call we are always working on a sparse matrix. Even if sum (or whatever function you use there) does some kind of conversion, it only converts one row of the dtm at a time rather than the entire thing.

The general principle when working with largish text data in R is to keep your dtm as a sparse matrix; you should then be able to stay within memory limits.
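As a sketch of the same idea without a per-row loop, the row totals can be computed directly from the triplet fields (i, v, nrow) that appear in the str() output. The toy object below is a plain list that only mimics that structure, with made-up values; on a real dtm, slam::row_sums() should give the same totals without building a dense matrix.

```r
# Toy stand-in for a dtm: a plain list with the triplet fields (hypothetical data).
toy <- list(
  i = c(1L, 1L, 2L, 4L),   # row (document) index of each non-zero cell
  j = c(1L, 3L, 2L, 3L),   # column (term) index of each non-zero cell
  v = c(1, 1, 2, 1),       # the non-zero counts
  nrow = 5L, ncol = 6L
)

# Group the non-zero values by row. Rows with no entries get an empty group,
# and sum(numeric(0)) is 0, so empty documents end up with a total of 0.
groups <- split(toy$v, factor(toy$i, levels = seq_len(toy$nrow)))
row_totals <- unname(vapply(groups, sum, numeric(1)))
row_totals

# Indices of the non-empty documents to keep before topic modeling.
keep <- which(row_totals > 0)
keep
```

With a real dtm, the filtering step would then be something like dtm[keep, ], which subsets the sparse object rather than converting it.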
