How can I access output embedding (output vector) in gensim word2vec?


Question


I want to use the output embeddings of word2vec, as in this paper (Improving Document Ranking with Dual Word Embeddings).


I know the input vectors are stored in syn0, and the output vectors in syn1, or in syn1neg when negative sampling is used.
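The key property here is that syn0 and syn1neg share one row ordering: row i of each matrix belongs to index2word[i]. A minimal pure-numpy sketch of that layout, using a tiny synthetic vocabulary in place of a trained model (the function name and the toy data are illustrative, not gensim API):

```python
import numpy as np

def word_in_out(index2word, syn0, syn1neg, word):
    """Return (IN vector, OUT vector) for a word; both matrices share one row ordering."""
    idx = index2word.index(word)  # the same index works for both matrices
    return syn0[idx], syn1neg[idx]

# Tiny synthetic stand-in for a trained model's weight matrices
index2word = ['the', 'of', 'cousin']
syn0 = np.arange(6.0).reshape(3, 2)      # IN (input) vectors, one row per word
syn1neg = -np.arange(6.0).reshape(3, 2)  # OUT (output) vectors, same ordering

iv, ov = word_in_out(index2word, syn0, syn1neg, 'cousin')
```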


But when I calculate most_similar with an output vector, I get the same results over some ranges, perhaps because syn1 or syn1neg has been removed.

This is what I get:

IN[1]: model = Word2Vec.load('test_model.model')

IN[2]: model.most_similar([model.syn1neg[0]])

OUT[2]: [('of', -0.04402521997690201),
('has', -0.16387106478214264),
('in', -0.16650712490081787),
('is', -0.18117375671863556),
('by', -0.2527652978897095),
('was', -0.254993200302124),
('from', -0.2659570872783661),
('the', -0.26878535747528076),
('on', -0.27521973848342896),
('his', -0.2930959463119507)]


But another syn1neg numpy vector already gives similar output:

IN[3]: model.most_similar([model.syn1neg[50]])

OUT[3]: [('of', -0.07884830236434937),
('has', -0.16942456364631653),
('the', -0.1771494299173355),
('his', -0.2043554037809372),
('is', -0.23265135288238525),
('in', -0.24725285172462463),
('by', -0.27772971987724304),
('was', -0.2979024648666382),
('time', -0.3547973036766052),
('he', -0.36455872654914856)]


I want to get the output numpy arrays (whether trained with negative sampling or not) as preserved during training.


Please let me know how I can access the raw syn1 or syn1neg, or point me to code or a word2vec module that can retrieve the output embeddings.

Answer


With negative-sampling, syn1neg weights are per-word, and in the same order as syn0.


The mere fact that your two examples give similar results doesn't necessarily indicate anything is wrong. The words are by default sorted by frequency, so the early words (including those in position 0 and 50) are very-frequent words with very-generic cooccurrence-based meanings (that may all be close to each other).


Pick a medium-frequency word with a more distinct meaning, and you may get more meaningful results (if your corpus/settings/needs are sufficiently like those of the 'dual word embeddings' paper). For example, you might want to compare:

model.most_similar('cousin')

...with...

model.most_similar(positive=[model.syn1neg[model.vocab['cousin'].index]])


However, in all cases the existing most_similar() method only looks for similar-vectors in syn0 – the 'IN' vectors of the paper's terminology. So I believe the above code would only really be computing what the paper might call 'OUT-IN' similarity: a list of which IN vectors are most similar to a given OUT vector. They actually seem to tout the reverse, 'IN-OUT' similarity, as something useful. (That'd be the OUT vectors most similar to a given IN vector.)
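That IN-OUT ranking can also be computed directly with numpy, without going through most_similar() at all. A minimal sketch (the function name is mine; the two arrays stand in for what would be model.wv.syn0 and model.syn1neg on a real model), demonstrated on tiny synthetic matrices:

```python
import numpy as np

def in_out_most_similar(in_vecs, out_vecs, index, topn=3):
    """Rank all OUT vectors by cosine similarity to one IN vector.

    in_vecs, out_vecs: 2-D arrays of shape (vocab_size, dim) sharing
    one row-per-word ordering (like syn0 and syn1neg).
    """
    query = in_vecs[index]
    query = query / np.linalg.norm(query)
    out_norm = out_vecs / np.linalg.norm(out_vecs, axis=1, keepdims=True)
    sims = out_norm @ query            # cosine similarity to every OUT row
    best = np.argsort(-sims)[:topn]    # highest similarity first
    return [(int(i), float(sims[i])) for i in best]

# Synthetic demo: OUT row 0 points the same way as IN row 0
in_vecs = np.array([[1.0, 0.0], [0.0, 1.0]])
out_vecs = np.array([[2.0, 0.0], [0.0, 3.0]])
print(in_out_most_similar(in_vecs, out_vecs, 0, topn=1))  # [(0, 1.0)]
```

Swapping which matrix supplies the query and which supplies the candidates gives the OUT-IN direction instead.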


The latest versions of gensim introduce a KeyedVectors class for representing a set of word-vectors, keyed by string, separate from the specific Word2Vec model or other training method. You could potentially create an extra KeyedVectors instance that replaces the usual syn0 with syn1neg, to get lists of OUT vectors similar to a target vector (and thus calculate top-n 'IN-OUT' similarities or even 'OUT-OUT' similarities).


For example, this might work (I haven't tested it):

from gensim.models import KeyedVectors

outv = KeyedVectors()
outv.vocab = model.wv.vocab  # same vocabulary mapping
outv.index2word = model.wv.index2word  # same index-to-word ordering
outv.syn0 = model.syn1neg  # different: the OUT weights instead of IN
inout_similars = outv.most_similar(positive=[model['cousin']])


syn1 only exists when using hierarchical softmax, and it's less clear what an "output embedding" for an individual word would be there. (There are multiple output nodes corresponding to predicting any one word, and they all need to be closer to their proper respective 0/1 values to predict a single word. So unlike with syn1neg, there's no one place to read a vector that means a single word's output. You might have to calculate/approximate some set of hidden->output weights that would drive those multiple output nodes to the right values.)
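To make that last point concrete, here is one possible, purely illustrative way such an approximation could look: combine the syn1 rows along a word's Huffman path, signed by the branch each node should take. This is my own assumption about a reasonable proxy, not anything from the answer or gensim's API; on a real model the path indices and branch labels would come from vocab[word].point and vocab[word].code, but the demo below uses synthetic data:

```python
import numpy as np

def hs_output_embedding(syn1, point, code):
    """Crude, assumed approximation of an HS 'output embedding':
    average the hidden->output rows along the word's Huffman path,
    signed so each row pushes the dot product toward that word's branch.

    syn1:  (n_inner_nodes, dim) hidden->output weight matrix
    point: inner-node indices on the word's path (like vocab[word].point)
    code:  0/1 branch labels on that path (like vocab[word].code)
    """
    signs = 1.0 - 2.0 * np.asarray(code, dtype=float)  # code 0 -> +1, code 1 -> -1
    rows = syn1[np.asarray(point)]
    return (signs[:, None] * rows).mean(axis=0)

# Synthetic demo: 4 inner nodes, 2 dims, a path through nodes 0 and 2
syn1 = np.array([[1.0, 0.0],
                 [0.0, 1.0],
                 [3.0, 0.0],
                 [0.0, 3.0]])
vec = hs_output_embedding(syn1, point=[0, 2], code=[0, 1])
```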
