Using pretrained gensim Word2vec embedding in Keras


Problem Description

I have trained a word2vec model in gensim. In Keras, I want to use that word embedding to build a matrix for each sentence, but storing the matrices for all sentences is very space- and memory-inefficient. So I want to create an Embedding layer in Keras instead, so that it can feed further layers (LSTM). Can you tell me in detail how to do this?

PS: It is different from other questions because I am using gensim for word2vec training instead of Keras.

Recommended Answer

Let's say you have the following data that you need to encode:

docs = ['Well done!',
        'Good work',
        'Great effort',
        'nice work',
        'Excellent!',
        'Weak',
        'Poor effort!',
        'not good',
        'poor work',
        'Could have done better.']

You must then tokenize it using the Tokenizer from Keras and find the vocab_size:

from keras.preprocessing.text import Tokenizer

t = Tokenizer()
t.fit_on_texts(docs)
vocab_size = len(t.word_index) + 1

You can then encode it into sequences like this:

encoded_docs = t.texts_to_sequences(docs)
print(encoded_docs)

You can then pad the sequences so that all the sequences are of a fixed length:

from keras.preprocessing.sequence import pad_sequences

max_length = 4
padded_docs = pad_sequences(encoded_docs, maxlen=max_length, padding='post')

Then use the word2vec model to build the embedding matrix. The loading code below reads a word2vec-format text file, so the gensim model first needs to be exported to that format, as sketched next.
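
A minimal sketch of the export, assuming the trained gensim model is held in a variable named model (a hypothetical name; the output filename matches the one the loader below expects):

# export the gensim vectors to plain word2vec text format
# (assumes `model` is a trained gensim.models.Word2Vec instance)
model.wv.save_word2vec_format('embedding_word2vec.txt', binary=False)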

from numpy import asarray, zeros

# load embedding as a dict
def load_embedding(filename):
    # load embedding into memory, skip the header line
    file = open(filename, 'r')
    lines = file.readlines()[1:]
    file.close()
    # create a map of words to vectors
    embedding = dict()
    for line in lines:
        parts = line.split()
        # key is the string word, value is a numpy array for the vector
        embedding[parts[0]] = asarray(parts[1:], dtype='float32')
    return embedding

# create a weight matrix for the Embedding layer from a loaded embedding
def get_weight_matrix(embedding, vocab):
    # total vocabulary size plus 0 for unknown words
    vocab_size = len(vocab) + 1
    # define weight matrix dimensions with all 0
    weight_matrix = zeros((vocab_size, 100))
    # step through the vocab, store vectors using the Tokenizer's integer mapping
    for word, i in vocab.items():
        vector = embedding.get(word)
        # leave the row as zeros for words missing from the embedding
        if vector is not None:
            weight_matrix[i] = vector
    return weight_matrix

# load embedding from file
raw_embedding = load_embedding('embedding_word2vec.txt')
# get vectors in the right order
embedding_vectors = get_weight_matrix(raw_embedding, t.word_index)
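
Alternatively, if you would rather skip the text-file round trip, a minimal sketch (again assuming the trained gensim model is named model and uses 100-dimensional vectors) builds the same matrix directly from the in-memory vectors:

import numpy as np

# build the weight matrix straight from the gensim KeyedVectors
# (assumes `model` is the trained gensim Word2Vec with 100-dim vectors)
embedding_vectors = np.zeros((vocab_size, 100))
for word, i in t.word_index.items():
    if word in model.wv:
        embedding_vectors[i] = model.wv[word]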

Once you have the embedding matrix, you can use it in an Embedding layer like this:

from keras.layers import Embedding

e = Embedding(vocab_size, 100, weights=[embedding_vectors], input_length=4, trainable=False)

This layer can then be used to build a model like this:

from numpy import array
from keras.models import Sequential
from keras.layers import Dense, Flatten

# example binary labels for the ten docs above (assumed: first five positive, last five negative)
labels = array([1, 1, 1, 1, 1, 0, 0, 0, 0, 0])

model = Sequential()
e = Embedding(vocab_size, 100, weights=[embedding_vectors], input_length=4, trainable=False)
model.add(e)
model.add(Flatten())
model.add(Dense(1, activation='sigmoid'))
# compile the model
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['acc'])
# summarize the model
print(model.summary())
# fit the model
model.fit(padded_docs, labels, epochs=50, verbose=0)
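
Since the question mentions feeding the embedding into an LSTM, here is a minimal sketch of swapping the Flatten/Dense head for an LSTM; the layer size of 32 is an illustrative assumption:

from keras.layers import LSTM

model = Sequential()
model.add(Embedding(vocab_size, 100, weights=[embedding_vectors],
                    input_length=max_length, trainable=False))
model.add(LSTM(32))
model.add(Dense(1, activation='sigmoid'))
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['acc'])
model.fit(padded_docs, labels, epochs=50, verbose=0)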

All the code is adapted from this awesome blog post; follow it to learn more about embeddings using GloVe.

For using word2vec, see this post.
