How to improve word assignment in different topics in LDA

Problem description

I am working on a language that is not English, and I have scraped data from different sources. I have done preprocessing such as punctuation removal, stop-word removal and tokenization. Now I want to extract domain-specific lexicons. Let's say I have data related to sports, entertainment, etc., and I want to extract words related to these particular fields (like cricket) and place them in topics that are closely related. I tried to use LDA for this, but I am not getting the correct clusters. Also, a word that is part of one topic frequently appears in other topics as well.

How can I improve these results?

    # URDU STOP WORDS REMOVAL
    from gensim import corpora, models

    doc_clean = []
    stopwords_corpus = UrduCorpusReader('./data', ['stopwords-ur.txt'])
    stopwords = stopwords_corpus.words()

    # `wordlists` is assumed to be defined earlier as an UrduCorpusReader
    # over the scraped document files, in the same way as `stopwords_corpus`
    for infile in wordlists.fileids():
        words = wordlists.words(infile)
        finalized_words = remove_urdu_stopwords(stopwords, words)
        doc_clean.append(finalized_words)  # append returns None, so don't assign it

        print("\n==== WITHOUT STOPWORDS ===========\n")
        print(finalized_words)

    # build the dictionary and convert each tokenized document
    # into a bag-of-words vector (document-term matrix)
    dictionary = corpora.Dictionary(doc_clean)
    doc_term_matrix = [dictionary.doc2bow(text) for text in doc_clean]

    # train the LDA model
    lda = models.ldamodel.LdaModel(corpus=doc_term_matrix, id2word=dictionary,
                                   num_topics=5, passes=10)

    print("\n=== topics from files ===\n")
    for top in lda.print_topics():
        print(top)

Recommended answer

LDA and its drawbacks: The idea of LDA is to uncover latent topics from your corpus. A drawback of this unsupervised machine learning approach is that you will end up with topics that may be hard for humans to interpret. Another drawback is that you will most likely end up with some generic topics that include words appearing in every document (like 'introduction', 'date', 'author', etc.). Thirdly, you will not be able to uncover latent topics that are simply not represented strongly enough: if you have only one article about cricket, it will not be recognised by the algorithm.
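
To see this overlap concretely in a gensim model like the one in the question, you can query how strongly a single word is associated with each topic. The following is only a minimal sketch: it assumes the `lda` model and `dictionary` built in the question's code, and the example token is a placeholder.

    # Minimal sketch (not part of the original answer): it assumes the `lda`
    # model and the `dictionary` from the question's code. Words that receive
    # a noticeable probability in several topics are exactly the overlapping,
    # hard-to-interpret terms described above.
    word = "کرکٹ"  # placeholder token; use any word from your corpus
    if word in dictionary.token2id:
        word_id = dictionary.token2id[word]
        # (topic_id, probability) pairs in which this word is significant
        print(lda.get_term_topics(word_id, minimum_probability=0.001))

    # per-topic word lists for manual inspection of the overlap
    for topic_id in range(lda.num_topics):
        print(topic_id, [w for w, p in lda.show_topic(topic_id, topn=10)])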

Why LDA doesn't fit your case: You are searching for explicit topics like cricket, and you want to learn something about cricket vocabulary, correct? However, LDA will output some topics, and you need to recognise cricket vocabulary yourself in order to determine that, e.g., topic 5 is concerned with cricket. Often, LDA will identify topics that are mixed with other, related topics. Keeping this in mind, there are three scenarios:

  1. You know nothing about cricket, but you are able to identify the topic that is concerned with cricket.
  2. You are a cricket expert and already know the cricket vocabulary.
  3. You know nothing about cricket and cannot recognise the semantic topics that LDA produces.

In the first case, you will have the problem that you are likely to associate words with cricket that are actually not related to cricket, because you count on the LDA output to provide high-quality topics that are only concerned with cricket and not with other related topics or generic terms. In the second case, you don't need the analysis in the first place, because you already know the cricket vocabulary! The third case is likely when you rely on your computer to interpret the topics. However, with LDA you always rely on humans to give a semantic interpretation of the output.

So what to do: There's a paper called Targeted Topic Modeling for Focused Analysis (Wang 2016), which tries to identify which documents are concerned with a pre-defined topic (like cricket). If you have a list of topics for which you'd like to get some topic-specific vocabulary (cricket, basketball, romantic comedies, ...), a starting point could be to first identify the relevant documents and then analyse the word distributions of the documents related to each topic.
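
As a rough, hand-rolled approximation of that two-step idea (explicitly not the algorithm from Wang 2016), one could first filter the documents with a small seed list of domain words and then look at the word distribution of the retained subset. A minimal sketch, assuming the tokenized `doc_clean` list from the question's code and placeholder seed terms:

    # Rough sketch of the two-step idea (keyword seeding, then word counts);
    # this is NOT the algorithm from Wang (2016). `doc_clean` is the list of
    # tokenized documents from the question's code, and the seed words are
    # placeholders to be replaced with real domain terms.
    from collections import Counter

    seed_words = {"کرکٹ", "وکٹ", "بلے"}  # placeholder cricket seed terms

    # step 1: keep documents that mention at least two distinct seed words
    targeted_docs = [doc for doc in doc_clean
                     if len(seed_words.intersection(doc)) >= 2]

    # step 2: the word distribution of that subset is a candidate domain lexicon
    counts = Counter(word for doc in targeted_docs for word in doc)
    domain_lexicon = [word for word, freq in counts.most_common(100)
                      if word not in seed_words]
    print(domain_lexicon[:20])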

Note that there may be completely different methods that do exactly what you're looking for. If you want to stay within the LDA-related literature, I'm relatively confident that the article I linked is your best shot.

Edit: If this answer is useful to you, you may find my paper interesting, too. It takes a labeled dataset of academic economics papers (600+ possible labels) and tries various LDA flavours to get the best predictions on new academic papers. The repo contains my code, documentation, and the paper itself.
