Trying to remove words from a DocumentTermMatrix in order to use topicmodels
Question
So, I am trying to use the topicmodels package for R (100 topics on a corpus of ~6400 documents, each ~1000 words long). The process runs and then dies, I think because it is running out of memory.
So I am trying to shrink the document-term matrix that the LDA() function takes as input. I figured I could do that by using the minDocFreq option when I generate my document-term matrix, but when I use it, it doesn't seem to make any difference. Here is the relevant code:
> corpus <- Corpus(DirSource('./chunks/'),fileEncoding='utf-8')
> dtm <- DocumentTermMatrix(corpus)
> dim(dtm)
[1] 6423 41613
# So, I assume this next command will make my document term matrix smaller, i.e.
# fewer columns. I've chosen a larger number, 100, to illustrate the point.
> smaller <- DocumentTermMatrix(corpus, control=list(minDocFreq=100))
> dim(smaller)
[1] 6423 41613
Same dimensions, and same number of columns (that is, the same number of terms).
Any idea what I'm doing wrong? Thanks.
Recommended answer
The answer to your question is over here: https://stackoverflow.com/a/13370840/1036500 (give it an upvote!)
In brief, more recent versions of the tm package do not include minDocFreq but instead use bounds. For example, your
smaller <- DocumentTermMatrix(corpus, control=list(minDocFreq=100))
should now be written with bounds. Here is the same idea on the crude example corpus bundled with tm:
require(tm)
data("crude")
smaller <- DocumentTermMatrix(crude, control=list(bounds = list(global = c(5,Inf))))
dim(smaller) # after Terms that appear in <5 documents are discarded
[1] 20 67
smaller <- DocumentTermMatrix(crude, control=list(bounds = list(global = c(10,Inf))))
dim(smaller) # after Terms that appear in <10 documents are discarded
[1] 20 17
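To make concrete what a global bound does, here is a minimal base-R sketch on a toy count matrix. The data, the term names, and the helper filter_by_doc_freq are all hypothetical and only illustrate the idea of keeping terms whose document frequency falls inside a range; this is not how tm implements it internally.

```r
# Toy document-term count matrix (hypothetical data): 4 documents, 5 terms.
counts <- matrix(
  c(2, 0, 1, 0, 3,
    1, 0, 0, 0, 1,
    0, 4, 1, 0, 2,
    1, 0, 0, 0, 1),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("doc", 1:4),
                  c("oil", "opec", "price", "rare", "the"))
)

# Document frequency: the number of documents in which each term
# occurs at least once.
doc_freq <- colSums(counts > 0)

# Hypothetical helper: keep only the terms whose document frequency lies
# in [lower, upper], mirroring the idea behind
# bounds = list(global = c(lower, upper)).
filter_by_doc_freq <- function(m, lower, upper = Inf) {
  df <- colSums(m > 0)
  m[, df >= lower & df <= upper, drop = FALSE]
}

# Drop terms that appear in fewer than 2 documents.
trimmed <- filter_by_doc_freq(counts, lower = 2)
dim(trimmed)
```

Raising lower removes rare terms and so reduces the number of columns, which is the effect the question was after. Setting a finite upper would also drop terms that occur in almost every document.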