tf-idf document term matrix and LDA: Error messages in R


Question

Can we input a tf-idf document-term matrix into Latent Dirichlet Allocation (LDA)? If yes, how?

It does not work in my case: the LDA function requires a document-term matrix with 'term-frequency' weighting.

Thanks

(I have kept the question as concise as possible, so if you need more details, I can add them.)

##########################################################################
                           TF-IDF Document matrix construction
##########################################################################    

> DTM_tfidf <- DocumentTermMatrix(corpora, control = list(weighting =
+   function(x) weightTfIdf(x, normalize = FALSE)))
> str(DTM_tfidf)
List of 6
$ i       : int [1:4466] 1 1 1 1 1 1 1 1 1 1 ...
$ j       : int [1:4466] 6 10 22 26 28 36 39 41 47 48 ...
$ v       : num [1:4466] 6 2.09 1.05 3.19 2.19 ...
$ nrow    : int 64
$ ncol    : int 297
$ dimnames:List of 2
  ..$ Docs : chr [1:64] "1" "2" "3" "4" ...
  ..$ Terms: chr [1:297] "accommod" "account" "achiev" "act" ...
- attr(*, "class")= chr [1:2] "DocumentTermMatrix" "simple_triplet_matrix"
- attr(*, "weighting")= chr [1:2] "term frequency - inverse document frequency" "tf-idf"

##########################################################################
                           LDA section
##########################################################################

> LDA_results <- LDA(DTM_tfidf, k, method = "Gibbs", control = list(nstart = nstart,
+                    seed = seed, best = best,
+                    burnin = burnin, iter = iter, thin = thin))

##########################################################################
                           Error messages
##########################################################################
Error in LDA(DTM_tfidf, k, method = "Gibbs", control = list(nstart = nstart,  :
  The DocumentTermMatrix needs to have a term frequency weighting

Recommended answer

If you look at the documentation for LDA topic modeling in the topicmodels package, for example by typing ?LDA in the R console, you'll see that the function expects a document-term matrix with term-frequency weighting, not tf-idf weighting:

"Object of class "DocumentTermMatrix" with term-frequency weighting or an object coercible..."

So the answer is no: you cannot use a tf-idf-weighted DTM directly in this function. If you already have a tf-idf-weighted DTM, you can convert it using tm::weightTf() to get the necessary weighting. If you are building a document-term matrix from scratch, don't weight it by tf-idf.
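As a rough illustration of the "build it from scratch" route, here is a minimal sketch that creates a DTM with the default term-frequency weighting and passes it to LDA. The toy documents, the object names (docs, corpus, dtm_counts, dtm_tfidf, lda_fit), and the control values are made up for this example and are not taken from the question. One caveat, stated as an assumption: as far as I can tell, tm::weightTf() mainly relabels the weighting attribute rather than recomputing raw counts, so rebuilding the matrix from the corpus seems like the safer option when the corpus is still available.

```r
library(tm)
library(topicmodels)

# Toy documents, invented for this sketch.
docs <- c(
  "the bank approved the loan and the account",
  "the river bank flooded after heavy rain",
  "loan rates and account fees rose at the bank",
  "heavy rain raised the river level again",
  "the account holder repaid the loan early",
  "rain and river water covered the road"
)
corpus <- VCorpus(VectorSource(docs))

# With no weighting given, DocumentTermMatrix() uses term frequency (raw counts).
dtm_counts <- DocumentTermMatrix(corpus)
attr(dtm_counts, "weighting")

# tf-idf version, built the same way as in the question, for comparison only.
dtm_tfidf <- DocumentTermMatrix(
  corpus,
  control = list(weighting = function(x) weightTfIdf(x, normalize = FALSE))
)
attr(dtm_tfidf, "weighting")

# Drop any documents that ended up with no terms before fitting.
dtm_counts <- dtm_counts[slam::row_sums(dtm_counts) > 0, ]

# Fit LDA on the count matrix; k and the control values are arbitrary here.
lda_fit <- LDA(
  dtm_counts, k = 2, method = "Gibbs",
  control = list(seed = 1234, burnin = 100, iter = 200, thin = 10)
)

# Top three terms per topic.
terms(lda_fit, 3)
```

Comparing the two attr(..., "weighting") results shows which label each matrix carries, which is the label the error message in the question refers to. In a real analysis, tf-idf scores can still be useful for choosing which terms to keep, as long as the matrix passed to LDA holds the counts.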
