Predicting LDA topics for new data

Question

It looks like this question may have been asked a few times before (here and here), but it has yet to be answered. I'm hoping this is due to the ambiguity of the earlier questions, as indicated by the comments. I apologize if I am breaking protocol by asking a similar question again; I just assumed that those questions would not be seeing any new answers.

Anyway, I am new to Latent Dirichlet Allocation and am exploring its use as a means of dimension reduction for textual data. Ultimately I would like to extract a smaller set of topics from a very large bag of words and build a classification model using those topics as a few of the variables in the model. I've had success running LDA on a training set, but the problem I am having is predicting which of those same topics appear in some other test set of data. I am using R's topicmodels package right now, but if there is another way to do this using some other package, I am open to that as well.

Here is an example of what I am trying to do:

library(topicmodels)
data(AssociatedPress)

train <- AssociatedPress[1:100]
test <- AssociatedPress[101:150]

train.lda <- LDA(train,5)
topics(train.lda)

# How can I predict the most likely topic(s) from "train.lda" for each document in "test"?

Recommended answer

With the help of Ben's superior document reading skills, I believe this is possible using the posterior() function.

library(topicmodels)
data(AssociatedPress)

train <- AssociatedPress[1:100]
test <- AssociatedPress[101:150]

train.lda <- LDA(train,5)
(train.topics <- topics(train.lda))
#  [1] 4 5 5 1 2 3 1 2 1 2 1 3 2 3 3 2 2 5 3 4 5 3 1 2 3 1 4 4 2 5 3 2 4 5 1 5 4 3 1 3 4 3 2 1 4 2 4 3 1 2 4 3 1 1 4 4 5
# [58] 3 5 3 3 5 3 2 3 4 4 3 4 5 1 2 3 4 3 5 5 3 1 2 5 5 3 1 4 2 3 1 3 2 5 4 5 5 1 1 1 4 4 3

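# posterior() infers topic distributions for the new documents under the fitted model;
# its $topics element is a documents-by-topics probability matrix.
test.topics <- posterior(train.lda,test)
# Keep the most probable topic for each test document.
(test.topics <- apply(test.topics$topics, 1, which.max))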
#  [1] 3 5 5 5 2 4 5 4 2 2 3 1 3 3 2 4 3 1 5 3 5 3 1 2 2 3 4 1 2 2 4 4 3 3 5 5 5 2 2 5 2 3 2 3 3 5 5 1 2 2
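
Since the stated goal is to use the topics as variables in a classification model, it may be worth keeping the whole probability matrix rather than only the single most likely topic. The sketch below uses base R only and a small hand-made matrix as a stand-in for the `topics` element returned by posterior(); the numbers and the names `doc.topics`, `top.topic`, `top.prob`, and `features` are made up for illustration and are not output from the AssociatedPress model.

```r
# Hypothetical document-topic probabilities for 3 documents and 5 topics,
# shaped like the `topics` element of posterior(): one row per document,
# one column per topic, each row summing to 1.
doc.topics <- matrix(
  c(0.10, 0.05, 0.70, 0.10, 0.05,
    0.20, 0.20, 0.20, 0.10, 0.30,
    0.60, 0.10, 0.10, 0.10, 0.10),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("doc1", "doc2", "doc3"), as.character(1:5))
)

# Most likely topic per document (the same row-wise which.max step as above).
top.topic <- apply(doc.topics, 1, which.max)

# Probability of that topic, a rough indication of how clear-cut the assignment is.
top.prob <- apply(doc.topics, 1, max)

# Alternatively, keep every topic probability as a feature column
# for a downstream classifier.
features <- as.data.frame(doc.topics)
names(features) <- paste0("topic", 1:5)

print(top.topic)
print(features)
```

With the real model, `doc.topics` would be replaced by `posterior(train.lda, test)$topics`. Because each row sums to 1, the columns are linearly dependent, so some classifiers may need one topic column dropped.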
