Add new words to GoogleNews with gensim

Problem description

I want to get word embeddings for the words in a corpus. I decided to use the pretrained GoogleNews word vectors through the gensim library. But my corpus contains some words that are not in the GoogleNews vocabulary. For these missing words, I want to use the arithmetic mean of the n most similar words in GoogleNews. First I load GoogleNews and check whether the word "to" is in it:

# Load the GoogleNews pretrained word2vec model
from gensim.models import word2vec
model = word2vec.KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300.bin", binary=True)
print(model["to"])

I receive an error: KeyError: "word 'to' not in vocabulary". Is it possible that such a large dataset doesn't have this word? The same happens for some other common words, like "a"!

To add the missing words to the word2vec model, I first want to get the indices of the words that are in GoogleNews. For missing words I have used index 0.

#obtain index of words
word_to_idx=OrderedDict({w:0 for w in corpus_words})
word_to_idx=OrderedDict({w:model.wv.vocab[w].index for w in corpus_words if w in model.wv.vocab})

Then I calculate the mean of the embedding vectors of the words most similar to each missing word.

missing_embd={}
for key,value in word_to_idx.items():
    if value==0:
        similar_words=model.wv.most_similar(key)
        similar_embeddings=[model.wv[a[0]] for a in similar_words]
        missing_embd[key]=mean(similar_embeddings)
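
The `mean` called above is not defined in the question. If it is NumPy's `mean`, calling it without `axis=0` averages every number into a single scalar instead of producing one value per dimension. A minimal pure-Python sketch of the element-wise average the text describes, assuming all vectors have the same length (the helper name `mean_vector` and the toy numbers are made up for illustration):

```python
def mean_vector(vectors):
    """Element-wise arithmetic mean of equal-length vectors."""
    if not vectors:
        raise ValueError("need at least one vector")
    dim = len(vectors[0])
    if any(len(v) != dim for v in vectors):
        raise ValueError("all vectors must have the same length")
    # zip(*vectors) groups the values of each dimension together.
    return [sum(column) / len(vectors) for column in zip(*vectors)]


# Toy 3-dimensional stand-ins for the neighbours' embeddings (made-up numbers).
similar_embeddings = [
    [1.0, 2.0, 3.0],
    [3.0, 4.0, 5.0],
]
averaged = mean_vector(similar_embeddings)
print(averaged)  # [2.0, 3.0, 4.0]
```

Note also that `most_similar` looks up the vector of the word it is given, so it cannot be called with a word that has no vector yet.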

And then I add these new embeddings to the word2vec model with:

for word,embd in missing_embd.items():
    # model.wv.build_vocab(word,update=True)
    model.wv.syn0[model.wv.vocab[word].index]=embd

There is an inconsistency. When I print missing_embd, it is empty, as if there were no missing words. But when I check with this:

for w in tokens_lower:
    if w not in model.wv.vocab:
        print(w)
        print("***********")

I found a lot of missing words. Now I have 3 questions: 1- Why is missing_embd empty when some words are missing? 2- Is it possible that GoogleNews doesn't have words like "to"? 3- How can I append new embeddings to the word2vec model? I tried build_vocab and syn0. Thanks.

Recommended answer

Here is a scenario where we add a missing lowercase word.

from gensim.models import KeyedVectors
path = '../input/embeddings/GoogleNews-vectors-negative300/GoogleNews-vectors-negative300.bin'
embedding = KeyedVectors.load_word2vec_format(path, binary=True)

'Quoran' in embedding.vocab
 Output : True

'quoran' in embedding.vocab
 Output : False

Here Quoran is present, but the lowercase quoran is missing.

# add quoran in lower case
embedding.add('quoran',embedding.get_vector('Quoran'),replace=False)

'quoran' in embedding.vocab
 Output : True
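
On the first question in the post, one likely explanation is that the second `word_to_idx` assignment replaces the first one, so words outside the vocabulary never reach the loop that fills `missing_embd`. Index 0 is also a real vocabulary index, so it cannot safely mark a missing word. The sketch below reproduces this with a toy dictionary standing in for `model.wv.vocab`. The names `vocab_index`, `word_to_idx_fixed`, `flagged_by_zero` and `missing_words` and all the data are hypothetical, and using `None` as the marker is one possible choice, not the asker's method.

```python
from collections import OrderedDict

# Toy stand-in for model.wv.vocab: word -> index (hypothetical data).
vocab_index = {"news": 0, "google": 1, "Quoran": 2}
corpus_words = ["news", "quoran", "google", "to"]

# As in the question: the second assignment replaces the first dict,
# so the words that were given index 0 as "missing" are dropped.
word_to_idx = OrderedDict({w: 0 for w in corpus_words})
word_to_idx = OrderedDict(
    {w: vocab_index[w] for w in corpus_words if w in vocab_index}
)
print(list(word_to_idx))  # ['news', 'google']

# Testing for index 0 then flags a word that is actually present.
flagged_by_zero = [w for w, idx in word_to_idx.items() if idx == 0]
print(flagged_by_zero)  # ['news']

# Variant: build the mapping in one pass and mark missing words with None.
word_to_idx_fixed = OrderedDict((w, vocab_index.get(w)) for w in corpus_words)
missing_words = [w for w, idx in word_to_idx_fixed.items() if idx is None]
print(missing_words)  # ['quoran', 'to']
```

With the missing words collected this way, each one can be given a vector by a call like the `embedding.add(...)` shown above.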
