Add new words to GoogleNews with gensim
Question
I want to get word embeddings for the words in a corpus. I decided to use the pretrained GoogleNews word vectors through the gensim library. However, my corpus contains some words that are not in the GoogleNews vocabulary. For these missing words, I want to use the arithmetic mean of the vectors of the n most similar words in GoogleNews. First, I load GoogleNews and check whether the word "to" is in it:
from gensim.models import word2vec
#Load GoogleNews pretrained word2vec model
model=word2vec.KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300.bin",binary=True)
print(model["to"])
I receive an error: KeyError: "word 'to' not in vocabulary"
Is it possible that such a large dataset doesn't have this word? The same happens for other common words such as "a"!
To add the missing words to the word2vec model, I first want to get the indices of the words that are in GoogleNews. For missing words I have used index 0.
#obtain index of words
word_to_idx=OrderedDict({w:0 for w in corpus_words})
word_to_idx=OrderedDict({w:model.wv.vocab[w].index for w in corpus_words if w in model.wv.vocab})
Then I calculate, for each missing word, the mean of the embedding vectors of its most similar words.
missing_embd={}
for key,value in word_to_idx.items():
    if value==0:
        similar_words=model.wv.most_similar(key)
        similar_embeddings=[model.wv[a[0]] for a in similar_words]
        missing_embd[key]=mean(similar_embeddings)
Then I add these new embeddings to the word2vec model with:
for word,embd in missing_embd.items():
    # model.wv.build_vocab(word,update=True)
    model.wv.syn0[model.wv.vocab[word].index]=embd
There is an inconsistency. When I print missing_embd, it is empty, as if there were no missing words. But when I check with this:
for w in tokens_lower:
    if(w in model.wv.vocab)==False:
        print(w)
print("***********")
I find a lot of missing words. Now, I have 3 questions:
1. Why is missing_embd empty when there are missing words?
2. Is it possible that GoogleNews doesn't have words like "to"?
3. How can I append new embeddings to the word2vec model? I tried build_vocab and syn0.
Thanks.
Recommended answer
Here is a scenario in which we add a missing lowercase word.
from gensim.models import KeyedVectors
path = '../input/embeddings/GoogleNews-vectors-negative300/GoogleNews-vectors-negative300.bin'
embedding = KeyedVectors.load_word2vec_format(path, binary=True)
'Quoran' in embedding.vocab
Output : True
'quoran' in embedding.vocab
Output : False
Here Quoran is present, but the lowercase quoran is missing.
# add quoran in lower case
embedding.add('quoran',embedding.get_vector('Quoran'),replace=False)
'quoran' in embedding.vocab
Output : True
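On the first question, one likely cause is visible in the question's own snippet: word_to_idx is assigned twice, so the second dictionary (which only contains words found in the vocabulary) replaces the first one, and no entry ever has the value 0. In addition, 0 is itself a valid index of a real vocabulary entry, so it is a poor marker for "missing". The sketch below shows one way to build the index in a single pass and to look for a differently cased variant to borrow a vector from, as in the Quoran example. It uses a plain dict as a toy stand-in for the model vocabulary; MISSING, build_word_index, find_cased_variant and the toy data are hypothetical names introduced for illustration, not part of gensim.

```python
from collections import OrderedDict

# Sentinel for out-of-vocabulary words (assumption: any value that can
# never be a real index works; 0 cannot be used because it is a real index).
MISSING = -1


def build_word_index(corpus_words, vocab_index):
    """Map every corpus word to its vocabulary index, or MISSING if absent."""
    return OrderedDict((w, vocab_index.get(w, MISSING)) for w in corpus_words)


def find_cased_variant(word, vocab_index):
    """Return a differently cased form of `word` that is in the vocabulary, or None."""
    for candidate in (word.capitalize(), word.upper(), word.title()):
        if candidate in vocab_index:
            return candidate
    return None


# Toy stand-in for the model's word -> index mapping (not real GoogleNews data).
vocab_index = {"</s>": 0, "in": 1, "Quoran": 2, "news": 3}
corpus_words = ["quoran", "news", "to"]

word_to_idx = build_word_index(corpus_words, vocab_index)

# Words that really are missing from the vocabulary.
missing = [w for w, idx in word_to_idx.items() if idx == MISSING]

# For each missing word, a cased variant whose vector could be reused (or None).
donors = {w: find_cased_variant(w, vocab_index) for w in missing}

print(word_to_idx)
print(missing)
print(donors)
```

With a real model, a word whose donor is not None could then be added with embedding.add and embedding.get_vector exactly as in the answer above; words with no donor need some other source for their vector.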