Trying to remove words from a DocumentTermMatrix in order to use topicmodels

Question

So, I am trying to use the topicmodels package for R (100 topics on a corpus of ~6400 documents, which are each ~1000 words). The process runs and then dies, I think because it is running out of memory.

So I tried to shrink the document term matrix that the lda() function takes as input; I figured I could do that by using the minDocFreq option when I generate my document term matrices. But when I use it, it doesn't seem to make any difference. Here is the relevant code:

> corpus <- Corpus(DirSource('./chunks/'),fileEncoding='utf-8')
> dtm <- DocumentTermMatrix(corpus)
> dim(dtm)
[1] 6423 4163
# So, I assume this next command will make my document term matrix smaller, i.e.
# fewer columns. I've chosen a larger number, 100, to illustrate the point.
> smaller <- DocumentTermMatrix(corpus, control=list(minDocFreq=100))
> dim(smaller)
[1]  6423 41613

Same dimensions, and same number of columns (that is, same number of terms).

Any sense what I'm doing wrong? Thanks.

Recommended answer

The answer to your question is over here: https://stackoverflow.com/a/13370840/1036500 (give it an upvote!)

In brief, more recent versions of the tm package do not include minDocFreq but instead use bounds, for example, your

smaller <- DocumentTermMatrix(corpus, control=list(minDocFreq=100))

should now be written with bounds, as in this example on the crude dataset bundled with tm:

library(tm)
data("crude")

# Discard terms that appear in fewer than 5 documents
smaller <- DocumentTermMatrix(crude, control = list(bounds = list(global = c(5, Inf))))
dim(smaller)
# [1] 20 67

# Discard terms that appear in fewer than 10 documents
smaller <- DocumentTermMatrix(crude, control = list(bounds = list(global = c(10, Inf))))
dim(smaller)
# [1] 20 17
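
Two follow-up points may help when the pruned matrix is meant for topicmodels. The sketch below is not part of the original answer. It assumes the tm and slam packages are installed (slam is a dependency of tm) and uses the crude dataset as a stand-in for the real corpus. It shows removeSparseTerms() as one way to shrink a matrix that has already been built, and it drops documents left with no terms after pruning, since LDA() rejects rows that contain no non-zero entry. The final LDA() call only runs if topicmodels is available, and k = 2 is an arbitrary illustrative value.

```r
library(tm)
library(slam)  # row_sums() works directly on the sparse matrix

data("crude")

# Option 1: prune while building the matrix (terms in at least 5 documents)
dtm <- DocumentTermMatrix(crude, control = list(bounds = list(global = c(5, Inf))))

# Option 2: prune a matrix that already exists; terms that are empty in
# roughly 75% or more of the documents are removed
full   <- DocumentTermMatrix(crude)
pruned <- removeSparseTerms(full, sparse = 0.75)

# Pruning terms can leave some documents empty; remove those rows before LDA
dtm <- dtm[row_sums(dtm) > 0, ]

dim(full)
dim(pruned)
dim(dtm)

# Optional: fit a small model if topicmodels is installed (k = 2 is arbitrary)
if (requireNamespace("topicmodels", quietly = TRUE)) {
  fit <- topicmodels::LDA(dtm, k = 2, control = list(seed = 1))
  print(topicmodels::terms(fit, 5))
}
```

On a corpus the size of the one in the question, comparing dim() before and after pruning is a quick check that the control option was actually honoured.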
