Gensim: how to load precomputed word vectors from text file

Question

I have a text file with my precomputed word vectors in the following format (example):

'word -0.0762464299711 0.0128308048976 ... 0.0712385589283\n'

on each line for every word (with 297 extra floats in place of the ...). I am trying to load these with Gensim as KeyedVectors, because I ultimately would like to compute the cosine similarity, find most similar words, etc. Unfortunately I have not worked with Gensim before and from the documentation it's not quite clear to me how to do this. I have tried the following which I found here:

from gensim.models import KeyedVectors
word_vectors = KeyedVectors.load_word2vec_format('/embeddings/word.vectors', binary=False)

But this yields the following error:

ValueError: invalid literal for int() with base 10: 'the'

'the' is the first word in the text file, so I suspect that the loading function is expecting something to be there that is not. But I can't find any information on what should be there. I would highly appreciate a pointer to such information or any other solution to my problem. Thanks!

Recommended answer

You can see here an example of Word2Vec format. The first line is supposed to contain the number of words you have in your file followed by the dimension of your vectors. This is probably why your script is returning you an error.
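To illustrate where the error comes from, here is a minimal sketch of reading such a header line. This is not Gensim's actual code: `parse_header` is a hypothetical helper, and the assumption is simply that the loader converts the whitespace-separated tokens of the first line to integers.

```python
def parse_header(first_line):
    """Parse a word2vec text header such as '1 300' into (vocab_size, vector_size)."""
    # Every token on the header line is expected to be an integer.
    vocab_size, vector_size = (int(token) for token in first_line.split())
    return vocab_size, vector_size


# A well-formed header parses cleanly.
print(parse_header("1 300"))

try:
    # A file without a header starts directly with a word and its floats,
    # so the first token cannot be converted to an int.
    parse_header("the 0.1 0.2")
except ValueError as err:
    print(err)
```

The second call fails on the token 'the', which matches the message reported in the question.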

In your example:

1 300
word -0.0762464299711 0.0128308048976 ... 0.0712385589283
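For a file with many words, the header can be generated rather than typed by hand. The following is a rough sketch using only the standard library; `add_word2vec_header`, the file names, and the 3-dimensional demo vectors are all made up for illustration, and it assumes every non-empty line holds one word followed by its floats, separated by whitespace.

```python
import os
import tempfile


def add_word2vec_header(src_path, dst_path):
    """Copy src_path to dst_path, prepending a '<vocab_size> <vector_size>' line."""
    # First pass: count non-empty lines and take the dimension from the first one.
    vocab_size = 0
    vector_size = None
    with open(src_path, encoding="utf-8") as src:
        for line in src:
            parts = line.split()
            if not parts:
                continue
            if vector_size is None:
                vector_size = len(parts) - 1  # the first token is the word itself
            vocab_size += 1
    if vector_size is None:
        raise ValueError("no vectors found in " + src_path)

    # Second pass: write the header, then the original vector lines.
    with open(src_path, encoding="utf-8") as src, \
            open(dst_path, "w", encoding="utf-8") as dst:
        dst.write(f"{vocab_size} {vector_size}\n")
        for line in src:
            if line.strip():
                dst.write(line)
    return vocab_size, vector_size


# Tiny demonstration with made-up 3-dimensional vectors (hypothetical data).
with tempfile.TemporaryDirectory() as tmp:
    raw = os.path.join(tmp, "word.vectors")
    fixed = os.path.join(tmp, "word.vectors.w2v")
    with open(raw, "w", encoding="utf-8") as f:
        f.write("the 0.1 0.2 0.3\nword -0.07 0.01 0.07\n")
    header = add_word2vec_header(raw, fixed)
    with open(fixed, encoding="utf-8") as f:
        first_line = f.readline().strip()

print(header, first_line)
```

With Gensim installed, the converted file should then load with KeyedVectors.load_word2vec_format(fixed_path, binary=False). Recent Gensim releases (4.0 and later, as far as I know) also accept no_header=True on that call for text files, which would avoid rewriting the file; check the documentation for your installed version before relying on it.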
