How to perform efficient queries with Gensim doc2vec?

Question

I’m working on a sentence similarity algorithm with the following use case: given a new sentence, I want to retrieve its n most similar sentences from a given set. I am using Gensim v.3.7.1, and I have trained both word2vec and doc2vec models. The results of the latter outperform word2vec’s, but I’m having trouble performing efficient queries with my Doc2Vec model. This model uses the distributed bag of words implementation (dm = 0).

I used to infer similarity using the built-in method model.most_similar(), but this became impossible once I started training on more data than the set I want to query against. That is to say, I want to find the most similar sentence among a subset of my training dataset. My quick fix was to compare the vector of the new sentence with every vector in my set using cosine similarity, but obviously this does not scale, as I have to compute loads of embeddings and make a lot of comparisons.
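
The brute-force quick fix described above can be sketched as follows. The query and candidate vectors would normally come from d2v_model.infer_vector(), but here they are stubbed as plain lists so the ranking logic stands alone; the vectors and the n=2 cutoff are illustrative assumptions:

```python
import math

def cosine(u, v):
    # Cosine similarity between two equal-length vectors.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def top_n_similar(query_vec, candidate_vecs, n=10):
    # Compare the query against every candidate: O(N) work per query,
    # which is exactly the part that does not scale for large sets.
    scored = [(idx, cosine(query_vec, vec)) for idx, vec in enumerate(candidate_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:n]

# Illustrative stand-ins for inferred doc-vectors:
query = [1.0, 0.0, 1.0]
candidates = [[1.0, 0.0, 1.0], [0.0, 1.0, 0.0], [1.0, 1.0, 0.0]]
print(top_n_similar(query, candidates, n=2))  # index 0 first, then index 2
```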

I successfully used Word Mover's Distance with both word2vec and doc2vec, but I get better results for doc2vec when using cosine similarity. How can I efficiently query a new document against my set using the PV-DBOW Doc2Vec model and a method from the Similarity class?

I'm looking for an approach similar to what I do with WMD, but using doc2vec cosine similarity:

import gensim

# set_to_query contains ~10% of the training data + some future updates
set_to_query_tokenized = [sentence.split() for sentence in set_to_query]
w2v_model = gensim.models.Word2Vec.load("my_w2v_model")
w2v_to_query = gensim.similarities.WmdSimilarity(
    corpus=set_to_query_tokenized,
    w2v_model=w2v_model,
    num_best=10,
)
new_query = "I want to find the most similar sentence to this one".split()
most_similar = w2v_to_query[new_query]

Answer

Creating your own subset of vectors, as a KeyedVectors instance, isn't quite as easy as it could or should be.

But, you should be able to use a WordEmbeddingsKeyedVectors (even though you're working with doc-vectors) that you load with just the vectors of interest. I haven't tested this, but assuming d2v_model is your Doc2Vec model, and list_of_tags are the tags you want in your subset, try something like:

from gensim.models.keyedvectors import WordEmbeddingsKeyedVectors

# vector_size must match the Doc2Vec model's size, i.e. d2v_model.vector_size
subset_vectors = WordEmbeddingsKeyedVectors(vector_size)
subset_vectors.add(list_of_tags, d2v_model.docvecs[list_of_tags])

Then you can perform the usual operations, like most_similar() on subset_vectors.
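
For repeated queries against a fixed subset, the cosine comparisons can also be batched by hand: unit-normalize the subset's vectors once, then each query is a single matrix-vector product, which is essentially what most_similar() does internally. A minimal numpy sketch, where the tags and vectors are made-up stand-ins for d2v_model.docvecs[tag]:

```python
import numpy as np

# Stand-ins for the doc-vectors of the tags in the subset (assumed data).
tags = ["doc_a", "doc_b", "doc_c"]
vectors = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])

# Normalize once; afterwards, dot products ARE cosine similarities.
unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)

def most_similar(query_vec, topn=2):
    q = np.asarray(query_vec, dtype=float)
    q = q / np.linalg.norm(q)
    sims = unit @ q                       # one matrix-vector product
    best = np.argsort(sims)[::-1][:topn]  # indices, highest similarity first
    return [(tags[i], float(sims[i])) for i in best]

print(most_similar([1.0, 0.0]))  # doc_a first, then doc_c
```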
