Remove empty documents from DocumentTermMatrix in R topicmodels?

Question

I am doing topic modelling using the topicmodels package in R. I create a Corpus object, do some basic preprocessing, and then create a DocumentTermMatrix:

library(tm)           # Corpus, tm_map, DocumentTermMatrix
library(topicmodels)  # LDA

corpus <- Corpus(VectorSource(vec), readerControl=list(language="en"))
corpus <- tm_map(corpus, tolower)
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeWords, stopwords("english"))
corpus <- tm_map(corpus, stripWhitespace)
corpus <- tm_map(corpus, removeNumbers)
# ...snip removing several custom lists of stopwords...
corpus <- tm_map(corpus, stemDocument)
dtm <- DocumentTermMatrix(corpus, control=list(minDocFreq=2, minWordLength=2))

I then run LDA:

LDA(dtm, 30)

This final call to LDA() returns the error:

  "Each row of the input matrix needs to contain at least one non-zero entry". 

I assume this means that at least one document has no terms left in it after preprocessing. Is there an easy way to remove documents that contain no terms from a DocumentTermMatrix?

I looked in the documentation for the topicmodels package and found the function removeSparseTerms, which removes terms that do not appear in any document, but there is no analogue for removing documents.

Recommended answer

"Each row of the input matrix needs to contain at least one non-zero entry"

The error means that the sparse matrix contains at least one row with no entries (words). One idea is to compute the sum of words in each row and keep only the rows whose total is greater than zero:

rowTotals <- apply(dtm, 1, sum)    # find the sum of words in each document
dtm.new   <- dtm[rowTotals > 0, ]  # remove all docs without words
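
The same filtering step can be sketched on a small base-R matrix that stands in for the document-term matrix. The toy counts and the names `counts`, `row_totals`, `keep`, `counts_new`, and `dropped` are illustrative assumptions, not part of the original answer. Keeping the logical index around also makes it possible to see which documents were dropped, which may help if the corpus needs to stay aligned with the matrix.

```r
# Toy document-term counts (an assumption for illustration):
# rows are documents, columns are terms.
# doc2 has lost all of its terms during preprocessing.
counts <- matrix(
  c(1, 0, 2,
    0, 0, 0,
    0, 3, 1),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("doc1", "doc2", "doc3"), c("alpha", "beta", "gamma"))
)

row_totals <- rowSums(counts)  # total term count per document
keep <- row_totals > 0         # TRUE for documents with at least one term

counts_new <- counts[keep, , drop = FALSE]  # drop the empty documents
dropped <- rownames(counts)[!keep]          # names of the removed documents
```

On an actual DocumentTermMatrix, `slam::row_sums(dtm)` should give the same row totals as `apply(dtm, 1, sum)` without first converting the sparse matrix to a dense one, which may matter for a large corpus.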
