Output of lda.collapsed.gibbs.sampler command from R lda package


Question

I don't understand this part of the output from the lda.collapsed.gibbs.sampler command. Why are the numbers for the same word different in different topics? For example, why does the word "test" have a count of 4 in topic 2 while topic 8 has 37? Shouldn't the count for a given word be either the same integer in every topic or 0?

Or have I misunderstood something, and these numbers do not stand for the number of times a word occurs in a topic?

$topics
      tests-loc fail  test testmultisendcookieget
 [1,]         0    0     0                      0
 [2,]         0    0     4                      0
 [3,]         0    0     0                      0
 [4,]         0    1     0                      0
 [5,]         0    0     0                      0
 [6,]         0    0     0                      0
 [7,]         0    0     0                      0
 [8,]         0    0    37                      0
 [9,]         0    0     0                      0
[10,]         0    0     0                      0
[11,]         0    0     0                      0
[12,]         0    2     0                      0
[13,]         0    0     0                      0
[14,]         0    0     0                      0
[15,]         0    0     0                      0
[16,]         0    0     0                      0
[17,]         0    0     0                      0
[18,]         0    0     0                      0
[19,]         0    0     0                      0
[20,]         0    0     0                      0
[21,]         0    0     0                      0
[22,]         0  361  1000                      0
[23,]         0    0     0                      0
[24,]         0    0     0                      0
[25,]         0    0     0                      0
[26,]         0    0     0                      0
[27,]         0    0     0                      0
[28,]         0 1904 12617                      0
[29,]         0    0     0                      0
[30,]         0    0     0                      0
[31,]         0    0     0                      0
[32,]         0 1255  3158                      0
[33,]         0    0     0                      0
[34,]         0    0     0                      0
[35,]         0    0     0                      0
[36,]         1    0     0                      1
[37,]         0    1     0                      0
[38,]         0    0     0                      0
[39,]         0    0     0                      0
[40,]         0    0     0                      0
[41,]         0    0     0                      0
[42,]         0    0     0                      0
[43,]         0    0     0                      0
[44,]         0    0     0                      0
[45,]         0    2     0                      0
[46,]         0    0     0                      0
[47,]         0    0     0                      0
[48,]         0    0     4                      0
[49,]         0    0     0                      0
[50,]         0    1     0                      0

This is the code I ran:

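library(lda)

# Load the corpus (LDA-C format) and the vocabulary
data <- read.documents(filename = "data.ldac")
vocab <- read.vocab(filename = "words.csv")

K <- 100               # number of topics
num.iterations <- 100  # number of Gibbs sampling sweeps
alpha <- 1             # Dirichlet prior on document-topic proportions
eta <- 1               # Dirichlet prior on topic-word distributions

result <- lda.collapsed.gibbs.sampler(data, K, vocab, num.iterations, alpha, eta,
                                      initial = NULL, burnin = NULL,
                                      compute.log.likelihood = FALSE,
                                      trace = 0L, freeze.topics = FALSE)

# Raise the print limit so the whole result is shown
options(max.print = 100000000)
result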

P.S. Sorry for the long post and my poor English.

Recommended answer

The topic distributions in LDA are just that: multinomial distributions over the vocabulary. Each one corresponds to a row of the matrix you have above. The probability of seeing a word is not constrained to be the same fixed value (or zero) in every topic. That is, the word 'test' can have a 3% chance of occurring in one topic and a 1% chance of occurring in another.

N.B. If you want to convert the matrix to probabilities, add the smoothing constant from your prior (eta in your call) to every count and then normalize each row. The function itself just returns the raw number of assignments from the last Gibbs sampling sweep.
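
As a rough illustration of that conversion, here is a minimal sketch in base R. It does not call the lda package: topic_counts is a hypothetical stand-in for result$topics, filled with the counts from rows 2, 8 and 22 of the output in the question, and it assumes the vocabulary consists only of the four words shown (the real matrix has one column per vocabulary word). The names topic_counts, smoothed and topic_probs are illustrative.

```r
# Toy stand-in for result$topics: rows are topics, columns are vocabulary words.
# Counts are copied from rows 2, 8 and 22 of the output in the question.
topic_counts <- matrix(
  c(0,   0,    4, 0,
    0,   0,   37, 0,
    0, 361, 1000, 0),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("topic2", "topic8", "topic22"),
                  c("tests-loc", "fail", "test", "testmultisendcookieget"))
)

eta <- 1  # same smoothing constant that was passed to the sampler in the question

# Add the prior pseudo-count to every cell, then divide each row by its sum.
# Dividing a matrix by a vector of length nrow() scales row i by element i.
smoothed <- topic_counts + eta
topic_probs <- smoothed / rowSums(smoothed)

# Each row is now a probability distribution over the vocabulary.
print(round(topic_probs, 3))
```

With the real sampler output, the same two lines would presumably be applied to result$topics. Each row then sums to 1, and the same word can have a different probability in each row, which is why the raw counts for "test" differ between topics.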
