Gensim: KeyError: "word not in vocabulary"


Problem description

I have a Word2vec model trained with Python's Gensim library. I have a tokenized list as shown below. The vocabulary size is 34, but I am showing only a few of the 34 entries:

b = ['let',
 'know',
 'buy',
 'someth',
 'featur',
 'mashabl',
 'might',
 'earn',
 'affili',
 'commiss',
 'fifti',
 'year',
 'ago',
 'graduat',
 '21yearold',
 'dustin',
 'hoffman',
 'pull',
 'asid',
 'given',
 'one',
 'piec',
 'unsolicit',
 'advic',
 'percent',
 'buy']

Model

import gensim

# b is the token list above; `size` is the Gensim 3.x name for the vector dimensionality
model = gensim.models.Word2Vec(b, min_count=1, size=32)
print(model)
# prints: Word2Vec(vocab=34, size=32, alpha=0.025)

If I try to get the similarity score for one of the words in the list by doing model['buy'], I get:

KeyError: "word 'buy' not in vocabulary"

Can you suggest what I am doing wrong, and how I can check the model so that it can then be used with PCA or t-SNE to visualize similar words forming a topic? Thank you.

Recommended answer

The first parameter passed to gensim.models.Word2Vec is an iterable of sentences. Each sentence is itself a list of words. From the docs:

Initialize the model from an iterable of sentences. Each sentence is a list of words (unicode strings) that will be used for training.

Right now, it treats each word in your list b as a sentence, so it runs Word2Vec on each character of each word rather than on each word in b. That is why a single-character lookup currently works:

model = gensim.models.Word2Vec(b,min_count=1,size=32)

print(model['a'])
array([  7.42487283e-03,  -5.65282721e-03,   1.28707094e-02, ... ]
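The character-level behaviour described above can be illustrated without Gensim. The following is only a sketch: build_vocab is a hypothetical helper, not a Gensim function, and it just mimics the "iterable of sentences, each a list of words" contract by looping over whatever it is given. The token list is shortened for brevity.

```python
from collections import Counter

def build_vocab(sentences):
    """Count tokens the way a sentence-iterating trainer would see them."""
    counts = Counter()
    for sentence in sentences:      # each item is treated as one sentence
        for token in sentence:      # iterating a str yields single characters
            counts[token] += 1
    return counts

b = ['let', 'know', 'buy', 'someth', 'featur']  # shortened token list

flat_vocab = build_vocab(b)       # each word is mistaken for a sentence
nested_vocab = build_vocab([b])   # one sentence made of words

print('buy' in flat_vocab)    # False: only characters were counted
print('b' in flat_vocab)      # True
print('buy' in nested_vocab)  # True
```

Passing the flat list produces a vocabulary of letters and digits, which is consistent with model['a'] succeeding while model['buy'] raises a KeyError.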

To make it work on words, simply wrap b in another list so that it is interpreted as a single sentence:

model = gensim.models.Word2Vec([b],min_count=1,size=32)

print(model['buy'])
array([-0.01331611,  0.00496594, -0.00165093, -0.01444992,  0.01393849, ... ]
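As for checking the model before using it further, one option is to test whether a word is in the vocabulary before looking it up. The sketch below assumes that the model's word vectors (model.wv in Gensim) support both the `in` operator and indexing by word; get_vector_or_none is a hypothetical helper, and a plain dict stands in for the word vectors so the example runs without Gensim installed.

```python
def get_vector_or_none(wv, word):
    """Return the vector for `word`, or None if it is not in the vocabulary."""
    return wv[word] if word in wv else None

# Stand-in for model.wv: a dict mapping words to made-up vectors.
fake_wv = {'buy': [0.1, 0.2], 'know': [0.3, 0.4]}

print(get_vector_or_none(fake_wv, 'buy'))   # [0.1, 0.2]
print(get_vector_or_none(fake_wv, 'b'))     # None, instead of a KeyError

# With a real model the call would look like this (not executed here):
#   model = gensim.models.Word2Vec([b], min_count=1, size=32)
#   vec = get_vector_or_none(model.wv, 'buy')
# In Gensim 4.x the `size` parameter is named `vector_size`, and vectors
# are read through model.wv['buy'] rather than model['buy'].
```

A None result for a word you expected to find is a quick sign that the training input was not shaped as a list of sentences.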
