word2vec - KeyError: "word X not in vocabulary"


Question

I am using the Word2Vec implementation from the gensim module to build word embeddings for the sentences I have in a plain text file. Although the word happy is present in the data, I get the error KeyError: "word 'happy' not in vocabulary". I tried applying the answers given to a similar question, but they did not work, so I am posting my own question.

Here is the code:

try:
    data = []
    with open(TXT_PATH, 'r', encoding='utf-8') as txt_file:
        for line in txt_file:
            for part in line.split(' '):
                data.append(part.strip())

    # When I debug, both of the words 'happy' and 'birthday' exist in the variable 'data'
    word2vec = Word2Vec(data, min_count=5, size=10000, window=5, workers=4)

    # Print result
    word_1 = 'happy'
    word_2 = 'birthday'
    print(f'Similarity between {word_1} and {word_2} thru word2vec: {word2vec.similarity(word_1, word_2)}')
except Exception as err:
    print(f'An error happened! Detail: {str(err)}')

Recommended answer

When you get a "not in vocabulary" error like this from Word2Vec, you can trust it: 'happy' really isn't in the model.

Even if a visual check shows 'happy' in your file, there are a few reasons why it might not end up in the model:

  • it doesn't appear at least min_count=5 times
  • the data format isn't correct for Word2Vec, so it's not seeing the words you expect it to see.

Looking at how your code prepares data, it ends up as one giant flat list of all the words in your file. Word2Vec instead expects a sequence in which each item is the list of words for one text. So: not a list-of-words, but a list where each item is a list-of-words.
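A minimal sketch of preparing the data in that shape, assuming one text per line and plain whitespace tokenization. The helper name read_sentences and the sample lines are illustrative only and not part of gensim:

```python
def read_sentences(lines):
    """Turn an iterable of text lines into a list of token lists."""
    sentences = []
    for line in lines:
        tokens = line.split()  # split on any whitespace, dropping empty strings
        if tokens:             # skip blank lines
            sentences.append(tokens)
    return sentences


# Hypothetical stand-in for the lines of the text file.
sample_lines = ["happy birthday to you\n", "\n", "happy new year\n"]
sentences = read_sentences(sample_lines)
print(sentences)
# [['happy', 'birthday', 'to', 'you'], ['happy', 'new', 'year']]
```

With a real file, the open file object could be passed in place of sample_lines, and the resulting list handed to Word2Vec in place of the flat data list from the question.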

If you supply...

[
  'happy',
  'birthday',
]

...instead of the expected...

[
  ['happy', 'birthday',],
]

...those single-word strings will be seen as lists-of-characters, so Word2Vec will think you want to learn word-vectors for a bunch of one-character words. You can check whether this has affected your model by seeing if the vocabulary size seems small (len(model.wv)) or if a sample of the learned words contains only single-character words (model.wv.index2entity[:10]).
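The effect can be reproduced without gensim. The sketch below only imitates the way each corpus item is iterated as a sequence of tokens; it is not gensim's actual code, and the helper names are made up for illustration:

```python
def iterate_items(corpus):
    """Imitate treating every corpus item as a sequence of tokens."""
    return [list(item) for item in corpus]


def only_single_chars(vocab_sample):
    """Return True if a non-empty vocabulary sample holds only 1-character tokens."""
    return bool(vocab_sample) and all(len(token) == 1 for token in vocab_sample)


flat = ['happy', 'birthday']      # list-of-words: each string is split into characters
nested = [['happy', 'birthday']]  # list of lists-of-words: tokens stay whole

print(iterate_items(flat))
# [['h', 'a', 'p', 'p', 'y'], ['b', 'i', 'r', 't', 'h', 'd', 'a', 'y']]
print(iterate_items(nested))
# [['happy', 'birthday']]

print(only_single_chars(['h', 'a', 'p', 'y']))     # True: the symptom described above
print(only_single_chars(['happy', 'birthday']))    # False
```

In practice, the argument to only_single_chars would be a slice of the model's vocabulary such as the one mentioned above; the exact attribute name may differ between gensim versions.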

If you supply a word in the right format, at least min_count times, as part of the training data, it will end up with a vector in the model.
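One way to check the min_count condition before training is to count token frequencies directly. This is a rough sketch using only the standard library; the sample sentences are invented, and the comparison simply mirrors the "at least min_count times" rule stated above:

```python
from collections import Counter


def word_counts(sentences):
    """Count how often each token occurs across all token lists."""
    counts = Counter()
    for tokens in sentences:
        counts.update(tokens)
    return counts


# Hypothetical corpus in the list-of-lists format.
sentences = [['happy', 'birthday'], ['happy', 'new', 'year'], ['happy', 'days']]
counts = word_counts(sentences)

min_count = 5  # same value as in the question
for word in ('happy', 'birthday'):
    status = 'kept' if counts[word] >= min_count else 'dropped'
    print(word, counts[word], status)
```

With this tiny sample both words fall below min_count=5, so neither would receive a vector; lowering min_count or supplying more text changes that.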

(Separately: size=10000 is far outside the usual range of 100-400. I've never seen a project use such high dimensionality for word-vectors, and it would only be theoretically justifiable with a massive vocabulary and training set. Oversized vectors with smaller vocabularies or datasets are likely to produce uselessly overfit results.)
