Retrieve string version of document by ID in Gensim


Question


I am using Gensim for some topic modelling, and I have gotten to the point where I am doing similarity queries using the LSI and tf-idf models. I get back a set of IDs and similarities, e.g. (299501, 0.64505910873413086).

How do I get the text document that is related to the ID, in this case 299501?

I have looked at the docs for corpus, dictionary, index, and the model and cannot seem to find it.

Solution

I have just gone through the same process and reached the same point of having "sims" with a document ID but wanting my original "article code". Although the feature isn't fully documented, there is metadata support throughout the Gensim library and its examples which can help. I'll answer this while I remember what I had to do, in case it helps any future visitors to this old question.

See gensim.corpora.textcorpus.TextCorpus#get_texts, which yields either the text alone, or the text paired with a single item of metadata (the line number) when the metadata flag is enabled:

def get_texts(self):
    """Iterate over the collection, yielding one document at a time. A document
    is a sequence of words (strings) that can be fed into `Dictionary.doc2bow`.
    Each document will be fed through `preprocess_text`. That method should be
    overridden to provide different preprocessing steps. This method will need
    to be overridden if the metadata you'd like to yield differs from the line
    number.
    Returns:
        generator of lists of tokens (strings); each list corresponds to a preprocessed
        document from the corpus `input`.
    """
    lines = self.getstream()
    if self.metadata:
        for lineno, line in enumerate(lines):
            yield self.preprocess_text(line), (lineno,)
    else:
        for line in lines:
            yield self.preprocess_text(line)

I had already implemented a custom make_corpus.py script, and a trial classifier script which uses similarity to find related documents to a search document. The changes I made to utilise the metadata from that point were as follows:

In the make_corpus script, I enabled metadata in the constructor of my TextCorpus subclass:

corpus = SysRevArticleCorpus(inp, lemmatize=lemmatize, metadata=True)

I also needed to serialise the metadata, as I'm not doing the processing immediately after corpus generation (as some of the examples do), so you need to turn on metadata in the serialise step too:

MmCorpus.serialize(outp + '_bow.mm', corpus, progress_cnt=10000, metadata=True)

This makes gensim.matutils.MmWriter#write_corpus save a "xxx_bow.mm.metadata.cpickle" file with your corpus .mm files.

To add more items into the metadata, you need to implement and override a few things in a TextCorpus subclass. I had already based mine on the WikiCorpus example class, as I have my own existing corpus to read.

The constructor needs to receive the metadata flag e.g.:

def __init__(self, fname, processes=None, lemmatize=utils.has_pattern(), 
    dictionary=None, metadata=False,
...
    self.metadata = metadata

    if dictionary is None:
        # temporarily disable metadata to make internal dict
        metadata_setting = self.metadata
        self.metadata = False
        self.dictionary = Dictionary(self.get_texts())
        self.metadata = metadata_setting
    else:
        self.dictionary = dictionary
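The temporary toggle in the constructor matters because, with metadata on, get_texts() yields (tokens, metadata) tuples rather than plain token lists, and a vocabulary builder like Dictionary expects the latter. Here is a minimal stand-alone sketch of the pattern (no gensim needed; the corpus class and documents are made-up stand-ins):

```python
# Why metadata must be toggled off while building the vocabulary: with
# metadata on, get_texts() yields (tokens, metadata) tuples, which a
# bag-of-words builder cannot consume.

class ToyCorpus:
    def __init__(self, docs, metadata=False):
        self.docs = docs
        self.metadata = metadata

    def get_texts(self):
        for docno, doc in enumerate(self.docs):
            tokens = doc.lower().split()
            if self.metadata:
                yield tokens, (docno,)
            else:
                yield tokens

def build_vocab(corpus):
    # Expects plain token lists, like gensim's Dictionary does.
    vocab = set()
    for tokens in corpus.get_texts():
        vocab.update(tokens)
    return vocab

corpus = ToyCorpus(["cats chase mice", "mice eat cheese"], metadata=True)

# Temporarily disable metadata, exactly as in the constructor above.
metadata_setting = corpus.metadata
corpus.metadata = False
vocab = build_vocab(corpus)
corpus.metadata = metadata_setting

print(sorted(vocab))  # ['cats', 'chase', 'cheese', 'eat', 'mice']
```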

I'm actually reading in from a JSON corpus so I'd already written a custom parser. My articles have a "code" property which is my canonical document ID. I also want to store the "title", and the document body is in the "text" property. (This replaces the XML parsing in the wiki example).

def extract_articles(f, filter_namespaces=False):
    """
    Extract articles from a SYSREV article export JSON (an open file-like object `f`).

    Return an iterable over (str, str, str) which generates (title, content, pageid) triplets.
    """
    elems = (elem for elem in f)
    for elem in elems:
        yield elem["title"], elem["text"] or "", elem["code"]
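For a feel of what the parser produces, here it is run over a couple of in-memory records shaped like my (project-specific) export, where each doc is a dict with "title", "text" and "code" keys; the record contents are made up:

```python
# Toy demonstration of the parser above on in-memory records.
def extract_articles(f, filter_namespaces=False):
    for elem in f:
        # "text" may be null in the export, so coerce None to ''
        yield elem["title"], elem["text"] or "", elem["code"]

docs = [
    {"title": "Smoking cessation", "text": "Body text...", "code": "ShiShani2008"},
    {"title": "No body yet", "text": None, "code": "Draft2020"},
]

triplets = list(extract_articles(docs))
print(triplets[0])  # ('Smoking cessation', 'Body text...', 'ShiShani2008')
print(triplets[1])  # ('No body yet', '', 'Draft2020')
```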

This is called from within the overridden get_texts (in the parent class it mentions you need to override this to use custom metadata). Summarised:

def get_texts(self):
...
    with open(self.fname) as data_file:    
        corpusdata = json.load(data_file)
    texts = \
        ((text, self.lemmatize, title, pageid)
         for title, text, pageid
         in extract_articles(corpusdata['docs'], self.filter_namespaces))

... (skipping pool processing stuff for clarity)

    for tokens, title, pageid in pool.imap(process_article, group):

        if self.metadata:
            yield (tokens, (pageid, title))
        else:
            yield tokens

So this should get you saving metadata alongside your corpus .mm files. When you want to re-read it in a later script, you will need to load the pickle file back in - there doesn't seem to be any built-in method to re-read the metadata. Fortunately it's just a dictionary indexed by the Gensim-generated document ID, so it's easy to load and use. (See wiki-sim-search)
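The round trip is plain pickle: in my case the .metadata.cpickle file holds a dict mapping Gensim's document number to the (pageid, title) tuple yielded by get_texts. A stand-alone sketch (file name and contents are made up):

```python
import os
import pickle
import tempfile

# The .metadata.cpickle file is just a pickled dict: docno -> metadata tuple.
metadata = {
    0: ('ShiShani2008ProCarNur', 'Jordanian nurses and physicians learning needs...'),
    1: ('Smith2010Foo', 'Another article title'),
}

path = os.path.join(tempfile.mkdtemp(), 'xxx_bow.mm.metadata.cpickle')
with open(path, 'wb') as f:
    pickle.dump(metadata, f)

with open(path, 'rb') as f:
    loaded = pickle.load(f)

docID = 0  # e.g. the top hit from a similarity query
print(loaded[docID][0])  # ShiShani2008ProCarNur
```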

For example, in my trial classifier I just added two things: metadata = pickle.load() and metadata[docID] to finally find the original article.

# re-load everything...
dictionary = corpora.Dictionary.load_from_text(datapath + '/en_wordids.txt')
corpus = corpora.MmCorpus(datapath + '/xxx_bow.mm')
metadata = pickle.load(open(datapath + '/xxx_bow.mm.metadata.cpickle', 'rb'))

lsiModel = models.LsiModel(corpus, id2word=dictionary, num_topics=4)
index = similarities.MatrixSimilarity(lsiModel[corpus])

# example search
doc = "electronic cognitive simulation"
vec_bow = dictionary.doc2bow(doc.lower().split())
vec_lsi = lsiModel[vec_bow]  # convert the query to LSI space

# perform a similarity query against the corpus
sims = index[vec_lsi]  
sims = sorted(enumerate(sims), key=lambda item: -item[1])

# Look up the original article metadata for the top hit
(docID, sim) = sims[0]  # (document number, cosine similarity)
print(metadata[docID])

# Prints (CODE, TITLE)
('ShiShani2008ProCarNur', 'Jordanian nurses and physicians learning needs for promoting smoking cessation.')

I know this doesn't provide the original text as you asked (I don't need it myself), but you could very easily add the text to the "metadata" (although this rather stretches the definition of metadata and could be very big!). I guess Gensim presumes you will already have some database of your original documents, and therefore it would be out of scope. However I feel there needs to be a mapping between the Gensim-generated IDs and the original document identifiers, which the metadata feature fulfils quite well.
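If you do want the raw text back, the only change (in this sketch; class, docs and IDs are made up, not gensim code) is to widen the tuple that get_texts yields so the serialized mapping carries the document body too:

```python
# Toy sketch: extend the metadata tuple yielded by get_texts to carry the raw
# text, so metadata[docID] returns (code, title, text) - at the cost of a
# pickle roughly the size of the whole corpus.

class ToyCorpus:
    def __init__(self, docs, metadata=False):
        self.docs = docs          # list of (code, title, text)
        self.metadata = metadata

    def get_texts(self):
        for code, title, text in self.docs:
            tokens = text.lower().split()
            if self.metadata:
                # carry the raw text alongside the usual (code, title)
                yield tokens, (code, title, text)
            else:
                yield tokens

docs = [('ShiShani2008ProCarNur',
         'Jordanian nurses and physicians learning needs...',
         'Full article body would live here.')]

corpus = ToyCorpus(docs, metadata=True)
metadata = {docno: meta for docno, (tokens, meta) in enumerate(corpus.get_texts())}

print(metadata[0][2])  # 'Full article body would live here.'
```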
