TopicModel: How to query documents by topic model "topic"?


Question

Below is a full reproducible example that computes a topic model for a given DataFrame.

import numpy as np  
import pandas as pd

data = pd.DataFrame({'Body': ['Here goes one example sentence that is generic',
                  'My car drives really fast and I have no brakes',
                  'Your car is slow and needs no brakes', 
                  'Your and my vehicle are both not as fast as the airplane']})

from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(lowercase = True, analyzer = 'word')

data_vectorized = vectorizer.fit_transform(data.Body)
lda_model = LatentDirichletAllocation(n_components=4, 
                                      learning_method='online', 
                                      random_state=0,
                                      verbose=1)
lda_topic_matrix = lda_model.fit_transform(data_vectorized)

Question: How can I filter documents by topic? Can a document have multiple topic tags, or is a threshold needed?

In the end, I would like to tag each document with "1" if it has a high loading on topic 2 and topic 3, and with "0" otherwise.

Recommended answer

lda_topic_matrix contains, for each document, a probability distribution over the topics/tags. In plain terms, each row sums to 1, and the value at each index is the probability that the document belongs to that topic. So every document carries all topic tags, each to a different degree. With 4 topics, a document that has all tags equally will have a row in lda_topic_matrix similar to [0.25, 0.25, 0.25, 0.25]. The row of a document with only a single topic ("0") will look something like [0.97, 0.01, 0.01, 0.01], and a document with two topics ("1" and "2") will have a distribution like [0.01, 0.54, 0.44, 0.01].
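To make the "multiple tags" reading concrete, here is a minimal sketch that uses the three illustrative rows from the paragraph above (not real model output) and an assumed cutoff of 0.3. The names example_matrix, threshold, and topics_per_document are hypothetical and chosen only for this illustration.

```python
import numpy as np

# Illustrative rows copied from the explanation above; not real model output.
example_matrix = np.array([
    [0.25, 0.25, 0.25, 0.25],  # all topics equally likely
    [0.97, 0.01, 0.01, 0.01],  # dominated by topic 0
    [0.01, 0.54, 0.44, 0.01],  # split between topics 1 and 2
])

# Assumed cutoff; the right value depends on the data and the number of topics.
threshold = 0.3

# Each row is a probability distribution, so it should sum to 1.
row_sums = example_matrix.sum(axis=1)

# For every document, collect the (zero-based) topic indices at or above the cutoff.
topics_per_document = [
    np.flatnonzero(row >= threshold).tolist() for row in example_matrix
]

print(row_sums)
print(topics_per_document)
```

With this cutoff the evenly spread document gets no tag at all, which shows why a threshold is a judgment call: a document can end up with zero, one, or several topic tags.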

So the simplest approach is to select the topic with the highest probability for each document and check whether it is 2 or 3:

main_topic_of_document = np.argmax(lda_topic_matrix, axis=1)
tagged = ((main_topic_of_document==2) | (main_topic_of_document==3)).astype(np.int64)
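The argmax approach above assigns exactly one topic per document. If the goal is instead "high loading" on topic 2 and/or topic 3, a threshold can be applied to those columns directly. The following is a sketch under stated assumptions: the helper tag_by_topic_loading, its parameters, the default cutoff of 0.3, and demo_matrix are all hypothetical and not part of the original answer or of scikit-learn. Topic numbers are treated as zero-based column indices, as in the argmax code.

```python
import numpy as np

def tag_by_topic_loading(doc_topic_matrix, topics=(2, 3), threshold=0.3, require_all=False):
    """Return 1 for documents whose loading on the given topics reaches the threshold, else 0.

    Hypothetical helper: with require_all=False a single topic above the
    threshold is enough; with require_all=True every listed topic must reach it.
    """
    doc_topic_matrix = np.asarray(doc_topic_matrix)
    # Boolean matrix: one column per selected topic, True where the loading is high.
    above = doc_topic_matrix[:, list(topics)] >= threshold
    hits = above.all(axis=1) if require_all else above.any(axis=1)
    return hits.astype(np.int64)

# Made-up document-topic matrix for demonstration; in practice pass lda_topic_matrix.
demo_matrix = np.array([
    [0.97, 0.01, 0.01, 0.01],
    [0.05, 0.05, 0.45, 0.45],
    [0.10, 0.10, 0.70, 0.10],
    [0.25, 0.25, 0.25, 0.25],
])

tags_any = tag_by_topic_loading(demo_matrix)                    # topic 2 OR topic 3 is high
tags_all = tag_by_topic_loading(demo_matrix, require_all=True)  # topic 2 AND topic 3 are high

print(tags_any)
print(tags_all)
```

Whether "topic 2 and topic 3" should mean either topic or both at once is a modelling decision; the require_all flag only makes that choice explicit.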

This article provides a good explanation of the inner mechanics of LDA.
