Gensim Word2Vec select minor set of word vectors from pretrained model
Question
I have a large pretrained Word2Vec model in gensim from which I want to use the pretrained word vectors for an embedding layer in my Keras model.
The problem is that the embedding size is enormous and I don't need most of the word vectors (because I know which words can occur as input). So I want to get rid of them to reduce the size of my embedding layer.
Is there a way to keep just the desired word vectors (including the corresponding indices!), based on a whitelist of words?
Answer
Thanks to this answer (I've changed the code a little bit to make it better), you can use this code to solve your problem.
We have all our minor set of words in restricted_word_set (it can be either a list or a set) and w2v is our model, so here is the function:
import numpy as np

def restrict_w2v(w2v, restricted_word_set):
    new_vectors = []
    new_vocab = {}
    new_index2entity = []
    new_vectors_norm = []

    for i in range(len(w2v.vocab)):
        word = w2v.index2entity[i]
        vec = w2v.vectors[i]
        vocab = w2v.vocab[word]
        vec_norm = w2v.vectors_norm[i]
        if word in restricted_word_set:
            # Re-index the kept word so indices stay contiguous
            vocab.index = len(new_index2entity)
            new_index2entity.append(word)
            new_vocab[word] = vocab
            new_vectors.append(vec)
            new_vectors_norm.append(vec_norm)

    # Overwrite the model's vocabulary and vectors with the filtered versions
    w2v.vocab = new_vocab
    w2v.vectors = np.array(new_vectors)
    w2v.index2entity = np.array(new_index2entity)
    w2v.index2word = np.array(new_index2entity)
    w2v.vectors_norm = np.array(new_vectors_norm)
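The key trick in the function above is the re-indexing step: each kept word gets the next free slot in the new arrays, so the vocabulary indices and the vector rows stay aligned. Here is a minimal stand-alone sketch of that same idea on toy data; the names (toy_vocab, toy_vectors, whitelist) are illustrative stand-ins, not gensim's API:

```python
import numpy as np

# Toy vocabulary stored as parallel structures: word -> row index, and a
# matrix whose row i is the vector of the word with index i.
toy_vocab = {"beer": 0, "cat": 1, "wine": 2, "dog": 3}
toy_vectors = np.array([[1.0, 0.0],
                        [0.0, 1.0],
                        [0.5, 0.5],
                        [0.2, 0.8]])

whitelist = {"beer", "wine"}

new_index2word = []
new_rows = []
new_vocab = {}
for word, old_idx in toy_vocab.items():
    if word in whitelist:
        # Re-index: the kept word gets the next free slot, exactly like
        # vocab.index = len(new_index2entity) in the function above.
        new_vocab[word] = len(new_index2word)
        new_index2word.append(word)
        new_rows.append(toy_vectors[old_idx])

new_vectors = np.array(new_rows)
print(new_vocab)          # {'beer': 0, 'wine': 1}
print(new_vectors.shape)  # (2, 2)
```

After filtering, looking a word up by its new index still returns its original vector, which is what keeps most_similar and the embedding lookups consistent.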
WARNING: when you first create the model, vectors_norm == None, so you will get an error if you use this function then. vectors_norm only gets a value of type numpy.ndarray after the first use. So before using the function, try something like most_similar("cat") so that vectors_norm is not equal to None.
It is based on Word2VecKeyedVectors.
Usage:
w2v = KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300.bin.gz", binary=True)
w2v.most_similar("beer")
[('beers', 0.8409687876701355),
('lager', 0.7733745574951172),
('Beer', 0.71753990650177),
('drinks', 0.668931245803833),
('lagers', 0.6570086479187012),
('Yuengling_Lager', 0.655455470085144),
('microbrew', 0.6534324884414673),
('Brooklyn_Lager', 0.6501551866531372),
('suds', 0.6497018337249756),
('brewed_beer', 0.6490240097045898)]
restricted_word_set = {"beer", "wine", "computer", "python", "bash", "lagers"}
restrict_w2v(w2v, restricted_word_set)
w2v.most_similar("beer")
[('lagers', 0.6570085287094116),
('wine', 0.6217695474624634),
('bash', 0.20583480596542358),
('computer', 0.06677375733852386),
('python', 0.005948573350906372)]
It can also be used for removing some words.
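Once the model is restricted, the original goal (a Keras embedding layer) just needs a weight matrix whose row order matches the restricted indices. A minimal sketch of building that matrix, where index2word and vectors are hypothetical stand-ins mirroring w2v.index2word and w2v.vectors after restriction (the Keras call itself is only shown as a comment):

```python
import numpy as np

# Stand-ins for the filtered model: word order and the matching vector rows.
index2word = ["beer", "wine", "lagers"]
vectors = np.array([[0.1, 0.2, 0.3],
                    [0.4, 0.5, 0.6],
                    [0.7, 0.8, 0.9]])

# Embedding weight matrix: row i+1 holds the vector of word i, with row 0
# reserved for padding/unknown tokens, a common Keras convention.
embedding_dim = vectors.shape[1]
embedding_matrix = np.zeros((len(index2word) + 1, embedding_dim))
embedding_matrix[1:] = vectors

# This matrix would then be passed to something like:
# keras.layers.Embedding(input_dim=embedding_matrix.shape[0],
#                        output_dim=embedding_dim,
#                        weights=[embedding_matrix],
#                        trainable=False)
print(embedding_matrix.shape)  # (4, 3)
```

Because the restricted model only keeps whitelisted words, this matrix has one row per whitelisted word (plus padding) instead of one per word in the full pretrained vocabulary, which is exactly the size reduction the question asks for.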