Add new words to GoogleNews with gensim

Views: 222. Posted: 2020/11/13 6:18:56. Tags: python, word2vec, gensim, google-news


Problem description

I want to get word embeddings for the words in a corpus, and I decided to use the pretrained GoogleNews word vectors through the gensim library. However, my corpus contains some words that are not in the GoogleNews vocabulary. For each of these missing words, I want to use the arithmetic mean of the vectors of its n most similar words in GoogleNews. First, I load GoogleNews and check whether the word "to" is in it:

# Load the GoogleNews pretrained word2vec model
from gensim.models import KeyedVectors

model = KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300.bin", binary=True)
print(model["to"])

I get an error: KeyError: "word 'to' not in vocabulary". Is it possible that such a large dataset does not contain this word? The same happens for some other common words, such as "a".

To add the missing words to the word2vec model, I first want to get the indices of the corpus words that are in GoogleNews. For the missing words, I use index 0.

# Obtain the index of each corpus word
from collections import OrderedDict

word_to_idx = OrderedDict({w: 0 for w in corpus_words})
word_to_idx = OrderedDict({w: model.wv.vocab[w].index for w in corpus_words if w in model.wv.vocab})
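
A toy illustration of this index lookup, using a plain dict `vocab_index` as a hypothetical stand-in for the model vocabulary (no gensim required). It shows that rebinding the same name in a second assignment discards the entries created by the first, and that 0 is itself a valid row index, so a sentinel such as `None` may be a safer marker for missing words. This is a sketch under those assumptions, with made-up words and indices:

```python
from collections import OrderedDict

# Hypothetical stand-in for the model vocabulary: word -> row index.
vocab_index = {"king": 0, "queen": 1, "apple": 2}
corpus_words = ["king", "apple", "to", "a"]

# Two separate assignments: the second dict replaces the first,
# so words absent from the vocabulary are dropped entirely.
overwritten = OrderedDict({w: 0 for w in corpus_words})
overwritten = OrderedDict(
    {w: vocab_index[w] for w in corpus_words if w in vocab_index}
)

# One pass with a None sentinel keeps every corpus word and does not
# confuse "missing" with the real index 0.
word_to_idx = OrderedDict((w, vocab_index.get(w)) for w in corpus_words)
missing_words = [w for w, idx in word_to_idx.items() if idx is None]

print(list(overwritten))  # only the in-vocabulary words survive
print(missing_words)      # the words that still need a vector
```

With the sentinel version, a later loop can test `idx is None` instead of `value == 0`.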

Then I calculate the mean of the embedding vectors of the words most similar to each missing word.

missing_embd={}
for key,value in word_to_idx.items():
    if value==0:
        similar_words=model.wv.most_similar(key)
        similar_embeddings=[model.wv[a[0]] for a in similar_words]
        missing_embd[key]=mean(similar_embeddings)
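
The averaging step can be sketched without gensim. One caveat worth checking against the gensim documentation: as far as I know, `most_similar(key)` looks up the vector of `key` itself, so it raises a `KeyError` for a word that is not in the vocabulary, and the neighbours of a missing word have to come from somewhere else. The sketch below swaps in spelling similarity from the standard library's `difflib.get_close_matches` as a stand-in for "most similar". That substitution is an assumption of this example and is not what word2vec does. The toy `vectors` dict and the helper `mean_of_neighbours` are hypothetical.

```python
import difflib

# Toy embedding table with made-up values, standing in for the model.
vectors = {
    "color": [1.0, 0.0, 2.0],
    "colors": [3.0, 2.0, 0.0],
    "banana": [9.0, 9.0, 9.0],
}


def mean_of_neighbours(word, vectors, n=2, cutoff=0.6):
    """Average the vectors of up to n in-vocabulary words spelled like `word`.

    Returns None when no candidate passes the similarity cutoff.
    """
    neighbours = difflib.get_close_matches(word, list(vectors), n=n, cutoff=cutoff)
    if not neighbours:
        return None
    rows = [vectors[w] for w in neighbours]
    # Element-wise mean: one averaged value per embedding dimension.
    return [sum(column) / len(rows) for column in zip(*rows)]


missing_embd = {"colour": mean_of_neighbours("colour", vectors)}
print(missing_embd)
```

With NumPy arrays the same element-wise mean is `numpy.mean(rows, axis=0)`; without `axis=0`, NumPy averages every entry into a single number rather than a vector.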

Then I add these new embeddings to the word2vec model with:

for word,embd in missing_embd.items():
    # model.wv.build_vocab(word,update=True)
    model.wv.syn0[model.wv.vocab[word].index]=embd
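
Assigning to `syn0[vocab[word].index]` can only overwrite the row of a word that already has an index. A missing word has no vocabulary entry, so the lookup fails before anything is written. Appending requires growing the word-to-index map and the vector table together. The toy class below (hypothetical, plain Python, with made-up vectors) illustrates that bookkeeping. In gensim itself, I believe `KeyedVectors.add_vectors(keys, weights)` in 4.x and `KeyedVectors.add(entities, weights)` in later 3.x releases serve this purpose, but check the documentation for the installed version.

```python
class TinyKeyedVectors:
    """Hypothetical minimal stand-in for a word-vector store."""

    def __init__(self, vector_size):
        self.vector_size = vector_size
        self.key_to_index = {}  # word -> row number
        self.rows = []          # row number -> vector

    def add_vector(self, word, vector):
        """Append a new word, or overwrite the row of an existing one."""
        if len(vector) != self.vector_size:
            raise ValueError("vector has the wrong dimensionality")
        if word in self.key_to_index:
            self.rows[self.key_to_index[word]] = list(vector)
        else:
            # New word: extend the index map and the table in step.
            self.key_to_index[word] = len(self.rows)
            self.rows.append(list(vector))

    def __contains__(self, word):
        return word in self.key_to_index

    def __getitem__(self, word):
        return self.rows[self.key_to_index[word]]


kv = TinyKeyedVectors(vector_size=3)
kv.add_vector("king", [0.1, 0.2, 0.3])

missing_embd = {"to": [0.5, 0.5, 0.5]}  # made-up vector for illustration
for word, embd in missing_embd.items():
    kv.add_vector(word, embd)

print("to" in kv, kv.key_to_index["to"])
```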

There is an inconsistency. When I print missing_embd, it is empty, as if there were no missing words. But when I check with this:

for w in tokens_lower:
    if w not in model.wv.vocab:
        print(w)
        print("***********")

I found a lot of missing words. Now I have three questions:

1. Why is missing_embd empty when some words are missing?
2. Is it possible that GoogleNews does not contain words like "to"?
3. How can I append new embeddings to the word2vec model? I tried build_vocab and syn0.

Thanks.
