How to add tokens to a gensim dictionary

Question

I use gensim to build a dictionary from a collection of documents. Each document is a list of tokens. This is my code:

from gensim import corpora, models, similarities

def constructModel(self, docTokens):
    """Given document tokens, construct the tf-idf and similarity models."""

    # Build the dictionary for the bag-of-words (vector-space) model.
    # A Dictionary maps words to integer ids, i.e. (word_id, word_string) pairs.
    self.dictionary = corpora.Dictionary(docTokens)

    # Optionally prune the dictionary: drop words that appear too rarely or too often.
    # The filter_extremes call is currently commented out, so both sizes are the same.
    print("dictionary size before filter_extremes:", len(self.dictionary))
    #self.dictionary.filter_extremes(no_below=1, no_above=0.9, keep_n=100000)
    #self.dictionary.compactify()

    print("dictionary size after filter_extremes:", len(self.dictionary))

    # Build the corpus bag-of-words vectors; each vector is a list of (word_id, word_frequency) pairs.
    corpus_bow = [self.dictionary.doc2bow(doc) for doc in docTokens]

    # Build the tf-idf model.
    self.model = models.TfidfModel(corpus_bow, normalize=True)
    # Transform each raw bag-of-words vector in the corpus into the tf-idf vector space.
    corpus_tfidf = self.model[corpus_bow]
    # Build the term-document similarity index.
    self.similarityModel = similarities.MatrixSimilarity(corpus_tfidf)

My question is how to add a new document (a list of tokens) to this dictionary and update it. I searched the gensim documentation but didn't find a solution.

Recommended answer

The gensim documentation covers this: create another dictionary from the new documents, then merge it into the existing one.

from gensim import corpora

dict1 = corpora.Dictionary(firstDocs)
dict2 = corpora.Dictionary(moreDocs)
dict1.merge_with(dict2)

According to the docs, merge_with will map "same tokens to the same ids and new tokens to new ids".
