gensim word2vec: accessing in/out vectors

Question

In the word2vec model, there are two linear transforms that take a word in vocab space to a hidden layer (the "in" vector), and then back to the vocab space (the "out" vector). Usually this out vector is discarded after training. I'm wondering if there's an easy way of accessing the out vector in gensim python? Equivalently, how can I access the out matrix?

Motivation: I would like to implement the ideas presented in this recent paper: A Dual Embedding Space Model for Document Ranking

Here are more details. From the reference above we have the following word2vec model:

Here, the input layer is of size $V$ (the vocabulary size), the hidden layer is of size $d$, and the output layer is of size $V$. The two matrices are $W_{IN}$ and $W_{OUT}$. Usually, the word2vec model keeps only the $W_{IN}$ matrix. This is what is returned when, after training a word2vec model in gensim, you look up a word and get something like:

model['potato']=[-0.2,0.5,2,...]
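To make the roles of the two matrices concrete, here is a minimal pure-Python sketch of the two linear maps described above. It is an illustration only, not gensim's implementation: the helper names `in_vector` and `output_scores` are hypothetical, both matrices are assumed to be stored with one row per vocabulary word ($V \times d$), and the final softmax is left out.

```python
def in_vector(w_in, word_index):
    """Return the "in" vector of a word.

    Multiplying a one-hot vector of size V by a V x d matrix just selects
    one row, so the lookup is a plain index operation.
    """
    return w_in[word_index]


def output_scores(w_out, hidden):
    """Map a hidden vector of size d back to vocab space (size V).

    Assumes w_out is stored as V x d, i.e. row j is the "out" vector of
    word j, so score j is the dot product of `hidden` with that row.
    """
    return [sum(h * w for h, w in zip(hidden, row)) for row in w_out]


# Toy example with V = 3 words and d = 2 hidden units (made-up numbers).
W_IN = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
W_OUT = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

hidden = in_vector(W_IN, 2)            # "in" vector of word 2
scores = output_scores(W_OUT, hidden)  # one unnormalised score per word
```

In this picture, `model['potato']` corresponds to a row of `W_IN`; the rows of `W_OUT` are the vectors the question is asking about.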

How can I access, or retain W_{OUT}? This is likely quite computationally expensive, and I'm really hoping for some built in methods in gensim to do this because I'm afraid that if I code this from scratch, it would not give good performance.

Recommended answer

While this might not be a proper answer (can't comment yet) and no one has pointed this out, take a look here. The creator seems to answer a similar question. Also, that's the place where you have a higher chance of getting a valid answer.

Digging around in the link he posted: in the word2vec source code you could change the syn1 deletion to suit your needs. Just remember to delete it after you're done, since it proves to be a memory hog.
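As a rough sketch of what reading the retained output weights could look like, the helper below assumes a gensim 4.x-style model trained with negative sampling, where the output matrix is exposed as `model.syn1neg` and its rows follow the order of `model.wv.key_to_index`. Those attribute names are assumptions about the installed version (older releases lay things out differently), and `out_vector` is a hypothetical helper, not part of gensim. With hierarchical softmax the corresponding `syn1` rows belong to inner nodes of the Huffman tree rather than to words, so this per-word lookup would not apply there.

```python
def out_vector(model, word):
    """Return the output-side ("out") vector for `word`.

    Assumes `model.syn1neg` holds the negative-sampling output weights
    (one row per vocabulary word) and `model.wv.key_to_index` maps each
    word to its row index. Both names are assumptions about the gensim
    version in use.
    """
    w_out = getattr(model, "syn1neg", None)
    if w_out is None:
        raise AttributeError(
            "no syn1neg on this model; it may have been trained without "
            "negative sampling, or the output weights were discarded"
        )
    return w_out[model.wv.key_to_index[word]]


# Hypothetical usage with gensim installed (4.x parameter names assumed):
#   from gensim.models import Word2Vec
#   model = Word2Vec(sentences, vector_size=50, min_count=1, hs=0, negative=5)
#   in_vec = model.wv["potato"]            # row of W_IN
#   out_vec = out_vector(model, "potato")  # row of W_OUT
```

The helper only reads attributes, so it raises a clear error instead of failing silently if the output weights are not present on the model.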
