Doc2Vec: Get most similar documents

Question

I am trying to build a document retrieval model that returns the most relevant documents, ordered by their relevance to a query or search string. For this I trained a doc2vec model using the Doc2Vec class in gensim. My dataset is a pandas DataFrame in which each document is stored as a string, one per row. This is the code I have so far:

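import multiprocessing
import re

import gensim
import pandas as pd
from gensim.models.doc2vec import LabeledSentence  # older gensim releases; superseded by TaggedDocument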

# TOKENIZER
def tokenizer(input_string):
    return re.findall(r"[\w']+", input_string)

# IMPORT DATA
data = pd.read_csv('mp_1002_prepd.txt')
data.columns = ['merged']
data.loc[:, 'tokens'] = data.merged.apply(tokenizer)
sentences= []
for item_no, line in enumerate(data['tokens'].values.tolist()):
    sentences.append(LabeledSentence(line,[item_no]))

# MODEL PARAMETERS
dm = 1 # 1 for distributed memory(default); 0 for dbow 
cores = multiprocessing.cpu_count()
size = 300
context_window = 50
seed = 42
min_count = 1
alpha = 0.5
max_iter = 200

# BUILD MODEL
model = gensim.models.doc2vec.Doc2Vec(
    documents=sentences,
    dm=dm,
    alpha=alpha,            # initial learning rate
    seed=seed,
    min_count=min_count,    # ignore words with frequency below min_count
    max_vocab_size=None,    # no cap on vocabulary size
    window=context_window,  # number of words before and after used as context
    size=size,              # dimensionality of the feature vectors
    sample=1e-4,            # downsampling threshold for high-frequency words
    negative=5,             # number of noise words drawn for negative sampling
    workers=cores,          # number of worker threads
    iter=max_iter,          # number of iterations (epochs) over the corpus
)

# QUERY BASED DOC RANKING ??

The part I am struggling with is finding the documents that are most similar or relevant to a query. I tried infer_vector, but then realised that it treats the query as a document, updates the model, and returns the results. I also tried the most_similar and most_similar_cosmul methods, but they return words along with what I assume is a similarity score. What I want is to enter a search string (a query) and get back the ids of the most relevant documents together with a similarity score (cosine, etc.). How do I get this part done?

Recommended answer

You need to use infer_vector to get a document vector for the new text. Calling it does not alter the underlying model.

Here is how you do it:

tokens = "a new sentence to match".split()

new_vector = model.infer_vector(tokens)
sims = model.docvecs.most_similar([new_vector]) #gives you top 10 document tags and their cosine similarity
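
To make the ranking step concrete, here is a minimal pure-Python sketch of what a cosine-similarity lookup does conceptually: it compares a query vector with each stored document vector and returns the best-scoring tags. The function names (cosine, rank_documents) and the toy vectors are illustrative assumptions, not gensim's actual implementation.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # treat a zero vector as dissimilar to everything
    return dot / (norm_u * norm_v)

def rank_documents(query_vec, doc_vecs, topn=10):
    """Return up to topn (tag, similarity) pairs, most similar first."""
    scored = [(tag, cosine(query_vec, vec)) for tag, vec in doc_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:topn]

# Toy document vectors keyed by tag (hypothetical values)
doc_vecs = {0: [1.0, 0.0], 1: [0.0, 1.0], 2: [1.0, 1.0]}
query_vec = [2.0, 0.0]  # stands in for the output of infer_vector

ranking = rank_documents(query_vec, doc_vecs, topn=2)
print(ranking)
```

The returned list has the same shape as the result described in the snippet above: document tags paired with their cosine similarity, best match first.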

Here is an example showing that the underlying model does not change after infer_vector is called.

import numpy as np

words = "king queen man".split()

len_before =  len(model.docvecs) #number of docs

#word vectors for king, queen, man
w_vec0 = model[words[0]]
w_vec1 = model[words[1]]
w_vec2 = model[words[2]]

new_vec = model.infer_vector(words)

len_after =  len(model.docvecs)

print(np.array_equal(model[words[0]], w_vec0))  # True
print(np.array_equal(model[words[1]], w_vec1))  # True
print(np.array_equal(model[words[2]], w_vec2))  # True

print(len_before == len_after)  # True
