How to use Topic Model (LDA) output to match and retrieve new, same-topic documents

Question

I am using an LDA model on a corpus to learn the topics it covers. I am using the gensim package (e.g., gensim.models.ldamodel.LdaModel); I can easily switch to other LDA implementations if necessary.

My question is: what is the most efficient way to use the parameterized model and/or its topic words or topic IDs to find and retrieve new documents that contain the topic?

Concretely, I want to scrape a media API to find new articles (out-of-sample documents) that relate to the topics contained in my original corpus. Because I am doing a 'blind search', running LDA inference on every new document may be too cumbersome; most new documents will not contain the topic.

I could, of course, simply retrieve new documents that contain one to n of the most frequent words of the LDA-learned topics, and then apply LDA to the returned documents for further confidence.
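
For instance, a minimal sketch of building such a keyword query from a trained gensim model; the model variable lda and the choice of topn=10 are assumptions, not part of the original question:

# Assumes `lda` is a trained gensim.models.ldamodel.LdaModel.
# Collect each topic's top-n most probable words to use as a
# keyword query against the media API.
topic_keywords = {}
for topic_id in range(lda.num_topics):
    # show_topic returns a list of (word, probability) pairs
    top_words = lda.show_topic(topic_id, topn=10)
    topic_keywords[topic_id] = [word for word, prob in top_words]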

I am wondering whether there is a more sophisticated method that gives better confidence that new out-of-sample articles actually contain the same topic, as opposed to coincidentally containing one or two of the topic words.

I am looking at TopicTiling algorithms but am not sure whether they are applicable here.

Answer

I do not think you can search in the topic space without transforming everything into the topic space. One could argue for creating functions that return similarity in the topic space without performing the transformation (for instance with neural networks), but I think that is beyond the scope of the question.

Since the above is not really helpful, there are many methods one can think of that will generate candidates better than simple keyword matching; I will describe a couple of them.

The topics are simply distributions over words, so you could treat them as documents and compute the cosine similarity between them and a test document to get an estimate of a topic's probability in that document.
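
A minimal sketch of this idea with gensim and numpy; the trained model variable lda and the tokenized input tokens are assumptions:

import numpy as np

def topic_document_similarity(lda, tokens):
    # get_topics() returns a (num_topics, vocab_size) matrix of word probabilities
    topics = lda.get_topics()
    # Build a dense term-count vector for the test document over the model's vocabulary
    doc_vec = np.zeros(topics.shape[1])
    for word_id, count in lda.id2word.doc2bow(tokens):
        doc_vec[word_id] = count
    doc_norm = np.linalg.norm(doc_vec)
    if doc_norm == 0:
        return np.zeros(topics.shape[0])
    # Cosine similarity between each topic's word distribution and the document
    return topics @ doc_vec / (np.linalg.norm(topics, axis=1) * doc_norm)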

You could use k documents from the training set for each topic as exemplars and compute the similarity of those documents with a test document to get an estimate of that topic's probability in the document.
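
A sketch of this exemplar approach using scikit-learn's TF-IDF vectors and cosine similarity; the exemplars mapping and the use of scikit-learn are my assumptions, not part of the original answer:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def exemplar_similarity(exemplars, new_doc):
    # exemplars: dict mapping topic_id -> list of k raw training documents
    # that score highest on that topic; new_doc is the raw text to test.
    scores = {}
    for topic_id, docs in exemplars.items():
        vectorizer = TfidfVectorizer()
        vectors = vectorizer.fit_transform(docs + [new_doc])
        # Mean similarity between the new document and the k exemplars
        sims = cosine_similarity(vectors[-1], vectors[:-1])
        scores[topic_id] = sims.mean()
    return scores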

With both of the above techniques you could also use locality-sensitive hashing, for instance simhash, to generate candidates from large corpora more efficiently.

To make my last point clearer, I would use the following pipeline (in pseudo-Python):

# t is a topic; simhash(), lda.infer(), and argmax() are placeholders
ht = simhash(t)  # a short fingerprint built from the topic's top words
candidates = []
final_texts = []
# First pass: cheap LSH filter to generate candidates
for text in new_texts:
    # in practice, compare Hamming distance to a threshold rather than strict equality
    if simhash(text) == ht:
        candidates.append(text)
# Second pass: run full LDA inference only on the surviving candidates
for text in candidates:
    topic_distribution = lda.infer(text)
    if argmax(topic_distribution) == t:
        final_texts.append(text)
