Can we use a self-made corpus to train LDA using gensim?


Question

I have to apply LDA (Latent Dirichlet Allocation) to extract the possible topics from a database of 20,000 documents that I collected.

How can I use these documents as the training corpus, rather than an existing corpus such as the Brown Corpus or English Wikipedia?

You can refer to the page.

Recommended answer

After going through the documentation of the Gensim package, I found four formats in which a text repository can be stored as a corpus.

The four corpus formats are:

  1. Matrix Market (.mm)
  2. SVMlight (.svmlight)
  3. Blei's LDA-C format (.lda-c)
  4. Low format (.low), as used by GibbsLDA++
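To give a feel for what the first of these formats stores, here is a minimal pure-Python sketch that renders a toy bag-of-words corpus as Matrix Market coordinate text. The helper name `to_matrix_market` and the toy data are hypothetical and not part of gensim; in practice `corpora.MmCorpus.serialize` writes the file for you, and its output may differ in details such as number formatting.

```python
def to_matrix_market(bow_corpus, num_terms):
    """Render a bag-of-words corpus as Matrix Market coordinate text.

    bow_corpus: list of documents, each a list of (term_id, count) pairs
    with 0-based term ids. Hypothetical helper, for illustration only.
    """
    entries = []
    for doc_no, doc in enumerate(bow_corpus):
        for term_id, count in sorted(doc):
            # Matrix Market indices are 1-based: row = document, column = term.
            entries.append(f"{doc_no + 1} {term_id + 1} {count}")
    header = "%%MatrixMarket matrix coordinate real general"
    # Size line: number of rows, number of columns, number of non-zero entries.
    size_line = f"{len(bow_corpus)} {num_terms} {len(entries)}"
    return "\n".join([header, size_line] + entries)


# Two toy documents over a three-term vocabulary.
toy_corpus = [[(0, 1), (1, 2)], [(1, 1), (2, 1)]]
mm_text = to_matrix_market(toy_corpus, num_terms=3)
print(mm_text)
```

Each data line is one non-zero cell of the document-term matrix, which is why the format stays compact even for a large, sparse corpus.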

The database in this problem holds 19,188 documents. Each document has to be read, and stopwords and punctuation have to be removed from its sentences, which can be done using nltk.
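The answer does not show its preprocessing step, so the following is only a rough sketch of what it might look like. To stay self-contained it uses plain Python with a tiny, hypothetical stop list instead of nltk; with nltk installed and its data downloaded, `nltk.corpus.stopwords.words('english')` would supply a fuller list. The names `STOPWORDS`, `preprocess`, and `raw_documents` are assumptions made for illustration.

```python
import string

# Tiny hypothetical stop list; nltk's English stopword list is far more complete.
STOPWORDS = {"a", "an", "the", "is", "are", "of", "to", "and", "in", "how", "do", "i"}


def preprocess(document):
    """Lowercase, strip ASCII punctuation, split on whitespace, drop stopwords."""
    no_punct = document.lower().translate(str.maketrans("", "", string.punctuation))
    return [tok for tok in no_punct.split() if tok not in STOPWORDS]


# Stand-in for the documents read from the database.
raw_documents = [
    "How do I train a topic model?",
    "The corpus is a list of documents.",
]

# One list of tokens per document, the shape corpora.Dictionary expects.
final_text = [preprocess(doc) for doc in raw_documents]
print(final_text)
```

The result is a list of token lists, one per document, which is the form the `final_text` variable needs to have.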

import gensim
from gensim import corpora

# Text preprocessing is done here using nltk.
# final_text must end up as a list of token lists, one list per document.

# Build the dictionary (token -> integer id) and save it.
dictionary = corpora.Dictionary(final_text)
dictionary.save('questions.dict')

# Convert each document to a bag-of-words vector and write the corpus to disk.
# Only questions.mm is loaded below; the other three formats are optional.
corpus = [dictionary.doc2bow(text) for text in final_text]
corpora.MmCorpus.serialize('questions.mm', corpus)
corpora.SvmLightCorpus.serialize('questions.svmlight', corpus)
corpora.BleiCorpus.serialize('questions.lda-c', corpus)
corpora.LowCorpus.serialize('questions.low', corpus)

# Load the serialized corpus and train LDA on it.
# update_every=0 selects batch learning; chunksize is set to the number of documents.
mm = corpora.MmCorpus('questions.mm')
lda = gensim.models.ldamodel.LdaModel(corpus=mm, id2word=dictionary, num_topics=100, update_every=0, chunksize=19188, passes=20)

This way, one can convert a custom dataset into a corpus and train an LDA topic model on it with the gensim package.
