How to interpret doc2vec results on previously seen data?


Problem description

I use gensim 4.0.1 and train doc2vec:

from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Three tiny 2-word documents, each tagged with its list index.
sentences = [['hello', 'world'], ['james', 'bond'], ['adam', 'smith']]
documents = [TaggedDocument(doc, [i]) for i, doc in enumerate(sentences)]

# min_count=0 keeps every word, even ones that appear only once.
model = Doc2Vec(documents, vector_size=5, window=5, min_count=0, workers=4)

documents
[TaggedDocument(words=['hello', 'world'], tags=[0]),
 TaggedDocument(words=['james', 'bond'], tags=[1]),
 TaggedDocument(words=['adam', 'smith'], tags=[2])]

model.dv[0], model.dv[1], model.dv[2]
(array([-0.10461631, -0.11958256, -0.1976151 ,  0.1710569 ,  0.0713223 ], dtype=float32),
 array([ 0.00526548, -0.19761242, -0.10334401, -0.19437183,  0.04021204], dtype=float32),
 array([ 0.05662392,  0.09290017, -0.08597242, -0.06293383, -0.06159503], dtype=float32))

I expect to get a match on TaggedDocument #1:

seen = ['james','bond']

Surprisingly, that known text (james bond) produces a completely "unseen" vector:

new_vector = model.infer_vector(seen)
new_vector
array([-0.07762126,  0.03976333, -0.02985927,  0.07899596, -0.03556045], dtype=float32)

The most_similar() results do not point to the expected Tag=1. Moreover, all 3 scores are quite weak, implying completely unseen data.

model.dv.most_similar_cosmul(positive=[new_vector]) 
[(0, 0.5322251915931702), (2, 0.4972134530544281), (1, 0.46321794390678406)]
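
One detail worth checking: infer_vector() is itself a small stochastic training loop, so on a model this tiny its output also shifts from call to call. A minimal sketch, using infer_vector()'s documented epochs parameter:

# Sketch: repeated inference on the same tokens. On a model this small the
# resulting vectors typically differ noticeably between calls; extra epochs
# reduce, but do not remove, the run-to-run jitter.
for _ in range(3):
    print(model.infer_vector(seen, epochs=50))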

What is wrong here, any ideas?

Answer

Five dimensions is still too many for a toy-sized dataset of just 6 total words, 6 unique words, and three 2-word texts.

None of the Word2Vec/Doc2Vec/FastText-type algorithms work well on tiny amounts of contrived data. They learn their patterns only from many, subtly contrasting usages of words in varied contexts.

Their real strengths only emerge with vectors that are 50, 100, or hundreds of dimensions wide, and training that many dimensions requires a unique vocabulary of (at least) many thousands of words, ideally tens or hundreds of thousands, with many usage examples of each. (For a variant like Doc2Vec, you'd similarly want many thousands of varied documents.)

You'll see improved correlations with expected results when using sufficient training data.
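
As a hedged illustration, here is a minimal sketch using the small Lee corpus bundled with gensim's test data (roughly 300 short news documents: still modest, but far richer than three 2-word texts). After training, re-inferring a training document's own text should rank its tag at or near the top:

from gensim.test.utils import datapath
from gensim.models.doc2vec import Doc2Vec, TaggedLineDocument

# Lee background corpus: ~300 short news documents shipped with gensim.
# TaggedLineDocument treats each line as one document, tagged by line number.
corpus = list(TaggedLineDocument(datapath('lee_background.cor')))

model = Doc2Vec(corpus, vector_size=50, min_count=2, epochs=40, workers=4)

# Self-similarity sanity check: re-infer a training document's words and
# see whether its own tag ranks first among the learned document vectors.
doc_id = 0
inferred = model.infer_vector(corpus[doc_id].words, epochs=40)
print(model.dv.most_similar([inferred], topn=3))

This is the same self-assessment used in gensim's Doc2Vec tutorial; with adequate data and training epochs, most (though not necessarily all) documents rank their own tag first.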
