word2vec - KeyError: "word X not in vocabulary"
Question
I am using the Word2Vec implementation from the gensim module to build word embeddings for the sentences in a plain text file. Even though the word happy is present in the data, I get the error KeyError: "word 'happy' not in vocabulary". I tried applying the answers given to a similar question, but they did not work, so I am posting my own question.
Here is the code:
try:
    data = []
    with open(TXT_PATH, 'r', encoding='utf-8') as txt_file:
        for line in txt_file:
            for part in line.split(' '):
                data.append(part.strip())
    # When I debug, both of the words 'happy' and 'birthday' exist in the variable 'data'
    word2vec = Word2Vec(data, min_count=5, size=10000, window=5, workers=4)
    # Print result
    word_1 = 'happy'
    word_2 = 'birthday'
    print(f'Similarity between {word_1} and {word_2} thru word2vec: {word2vec.similarity(word_1, word_2)}')
except Exception as err:
    print(f'An error happened! Detail: {str(err)}')
Recommended answer
When you get a "not in vocabulary" error like this from Word2Vec, you can trust it: 'happy' really isn't in the model.
Even if your visual check shows 'happy' inside your file, a few reasons why it might not wind up inside the model include:
- it doesn't appear at least min_count=5 times
- the data format isn't correct for Word2Vec, so it's not seeing the words you expect it to see.
Looking at how data is prepared by your code, it looks like a giant list of all words in your file. Word2Vec instead expects a sequence that has, as each item, a list-of-words for that one text. So: not a list-of-words, but a list where each item is a list-of-words.
If you supply...
[
    'happy',
    'birthday',
]
...instead of the expected...
[
    ['happy', 'birthday',],
]
...those single-word-strings will be seen as lists-of-characters, so Word2Vec will think you want to learn word-vectors for a bunch of one-character words. You can check if this has affected your model by seeing if the vocabulary size seems small (len(model.wv)) or if a sample of learned-words is only single-character words (model.wv.index2entity[:10]).
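The difference between the two shapes can be seen without training anything. The following is a minimal sketch, not the asker's actual pipeline: the helper name read_sentences and the in-memory sample text are assumptions made for illustration, and the file is assumed to hold one sentence per line. It mimics what a trainer sees when it iterates each "sentence" to get its tokens.

```python
import io

def read_sentences(lines):
    """Return one list of tokens per non-empty line (hypothetical helper)."""
    sentences = []
    for line in lines:
        tokens = line.split()  # split on any whitespace, dropping empty strings
        if tokens:
            sentences.append(tokens)
    return sentences

# Stand-in for the plain text file (assumed: one sentence per line).
sample_file = io.StringIO("happy birthday to you\nhappy new year\n")

flat_words = ['happy', 'birthday']        # the shape the question's loop builds
sentences = read_sentences(sample_file)   # a list where each item is a list-of-words

# Iterating a "sentence" yields its tokens. In the flat list every "sentence"
# is a string, so its "tokens" are single characters.
tokens_seen_flat = sorted({tok for sent in flat_words for tok in sent})
tokens_seen_nested = sorted({tok for sent in sentences for tok in sent})

print(tokens_seen_flat)    # only one-character "words"
print(tokens_seen_nested)  # whole words, including 'happy'
```

With gensim installed, passing sentences (rather than the flat list) as the first argument to Word2Vec should give a vocabulary of whole words. As far as I know, gensim 4.x renamed size to vector_size and replaced index2entity with model.wv.index_to_key, so the vocabulary check above may need that spelling on newer versions.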
If you supply a word in the right format, at least min_count times, as part of the training-data, it will wind up with a vector in the model.
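To see ahead of time which words clear the min_count threshold, the token frequencies can be counted directly. This is only a rough sketch: the function name words_surviving_min_count and the toy corpus are made up for illustration, and it assumes the data is already a list of token lists.

```python
from collections import Counter

def words_surviving_min_count(sentences, min_count):
    """Return the words that occur at least min_count times (hypothetical helper)."""
    counts = Counter(tok for sent in sentences for tok in sent)
    return {word for word, count in counts.items() if count >= min_count}

# Toy corpus: 'happy' occurs 6 times, 'birthday' 5 times, 'new' and 'year' once each.
corpus = [['happy', 'birthday']] * 5 + [['happy', 'new', 'year']]

kept = words_surviving_min_count(corpus, min_count=5)
print(kept)
```

If 'happy' is missing from the set computed on your own data, either it is too rare for the chosen min_count or the tokenization is not producing the token you expect.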
(Separately: size=10000 is a choice way outside the usual range of 100-400. I've never seen a project using such high-dimensionality for word-vectors, and it would only be theoretically justifiable if you had a massively-large vocabulary and training-set. Oversized vectors with smaller vocabularies/data are likely to create uselessly overfit results.)