Using pre-trained word2vec with LSTM for word generation

Problem description

LSTM/RNN can be used for text generation. This post shows one way to use pre-trained GloVe word embeddings with a Keras model.

  1. How do I use a pre-trained Word2Vec word embedding with a Keras LSTM model? This post did help.
  2. How do I predict / generate the next word when the model is given a sequence of words as its input?

Sample approach tried:

# Sample code to prepare word2vec word embeddings    
import gensim
documents = ["Human machine interface for lab abc computer applications",
             "A survey of user opinion of computer system response time",
             "The EPS user interface management system",
             "System and human system engineering testing of EPS",
             "Relation of user perceived response time to error measurement",
             "The generation of random binary unordered trees",
             "The intersection graph of paths in trees",
             "Graph minors IV Widths of trees and well quasi ordering",
             "Graph minors A survey"]
sentences = [[word for word in document.lower().split()] for document in documents]

word_model = gensim.models.Word2Vec(sentences, size=200, min_count = 1, window = 5)

# Code tried to prepare LSTM model for word generation
from keras.layers.recurrent import LSTM
from keras.layers.embeddings import Embedding
from keras.models import Model, Sequential
from keras.layers import Dense, Activation

embedding_layer = Embedding(input_dim=word_model.syn0.shape[0], output_dim=word_model.syn0.shape[1], weights=[word_model.syn0])

model = Sequential()
model.add(embedding_layer)
model.add(LSTM(word_model.syn0.shape[1]))
model.add(Dense(word_model.syn0.shape[0]))   
model.add(Activation('softmax'))
model.compile(optimizer='sgd', loss='mse')

Sample code / pseudocode to train the LSTM and predict the next word would be appreciated.

Answer

I've created a gist with a simple generator that builds on top of your initial idea: it's an LSTM network wired to the pre-trained word2vec embeddings, trained to predict the next word in a sentence. The data is a list of abstracts from the arXiv website.

I'll highlight the most important parts here.

Your code is fine, except for the number of iterations used to train word2vec. The default of 5 training epochs (iter=5 in older gensim, epochs in gensim 4.x) seems rather low. Besides, word2vec is definitely not the bottleneck -- LSTM training takes much longer. 100 epochs looks better.

word_model = gensim.models.Word2Vec(sentences, vector_size=100, min_count=1,
                                    window=5, epochs=100)
pretrained_weights = word_model.wv.vectors
vocab_size, embedding_size = pretrained_weights.shape
print('Result embedding shape:', pretrained_weights.shape)
print('Checking similar words:')
for word in ['model', 'network', 'train', 'learn']:
  most_similar = ', '.join('%s (%.2f)' % (similar, dist)
                           for similar, dist in word_model.wv.most_similar(word)[:8])
  print('  %s -> %s' % (word, most_similar))

def word2idx(word):
  return word_model.wv.key_to_index[word]
def idx2word(idx):
  return word_model.wv.index_to_key[idx]

The resulting embedding matrix is saved into the pretrained_weights array, which has the shape (vocab_size, embedding_size).

Your code is almost correct, except for the loss function. Since the model predicts the next word, it's a classification task, hence the loss should be categorical_crossentropy or sparse_categorical_crossentropy. I've chosen the latter for efficiency reasons: this way it avoids one-hot encoding, which is pretty expensive for a big vocabulary.

model = Sequential()
model.add(Embedding(input_dim=vocab_size, output_dim=embedding_size,
                    weights=[pretrained_weights]))
model.add(LSTM(units=embedding_size))
model.add(Dense(units=vocab_size))
model.add(Activation('softmax'))
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy')

Note that the pre-trained weights are passed to the embedding layer via the weights argument.
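
If you'd rather keep the word2vec vectors fixed instead of fine-tuning them together with the LSTM (a design choice; the answer above leaves the embedding trainable), the layer can be frozen with Keras' standard trainable flag. A minimal variation, not part of the original answer:

# Variation (assumption, not from the answer): freeze the pre-trained embedding
# so the word2vec vectors are not updated while the LSTM trains on top of them
frozen_embedding = Embedding(input_dim=vocab_size, output_dim=embedding_size,
                             weights=[pretrained_weights], trainable=False)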

In order to work with sparse_categorical_crossentropy loss, both sentences and labels must be word indices. Short sentences must be padded with zeros to the common length.

import numpy as np

# Pad every sentence with zeros up to the length of the longest one
max_sentence_len = max(len(sentence) for sentence in sentences)

train_x = np.zeros([len(sentences), max_sentence_len], dtype=np.int32)
train_y = np.zeros([len(sentences)], dtype=np.int32)
for i, sentence in enumerate(sentences):
  for t, word in enumerate(sentence[:-1]):
    train_x[i, t] = word2idx(word)
  train_y[i] = word2idx(sentence[-1])
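
The snippet above stops at preparing train_x and train_y; the training call itself isn't shown here. A minimal sketch of that step (the batch size and number of epochs are illustrative assumptions, not values from the answer):

# Fit the LSTM on the padded index sequences (hyperparameters are assumed values)
model.fit(train_x, train_y, batch_size=128, epochs=20)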

Sample generation

This is pretty straightforward: the model outputs a vector of probabilities, from which the next word is sampled and appended to the input. Note that the generated text is better and more diverse if the next word is sampled rather than picked as the argmax. The temperature-based random sampling I've used is described here.

def sample(preds, temperature=1.0):
  # temperature <= 0 falls back to greedy argmax decoding
  if temperature <= 0:
    return np.argmax(preds)
  # rescale the distribution by the temperature and renormalize
  preds = np.asarray(preds).astype('float64')
  preds = np.log(preds) / temperature
  exp_preds = np.exp(preds)
  preds = exp_preds / np.sum(exp_preds)
  # draw one word index from the adjusted distribution
  probas = np.random.multinomial(1, preds, 1)
  return np.argmax(probas)

def generate_next(text, num_generated=10):
  word_idxs = [word2idx(word) for word in text.lower().split()]
  for i in range(num_generated):
    prediction = model.predict(x=np.array(word_idxs))
    idx = sample(prediction[-1], temperature=0.7)
    word_idxs.append(idx)
  return ' '.join(idx2word(idx) for idx in word_idxs)
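
For example, generation can be kicked off from a short seed phrase (the seeds below match the samples that follow; the exact continuations vary from run to run):

# Generate ten more words after each seed phrase
print(generate_next('deep convolutional'))
print(generate_next('simple and effective'))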

Examples of generated text

deep convolutional... -> deep convolutional arithmetic initialization step unbiased effectiveness
simple and effective... -> simple and effective family of variables preventing compute automatically
a nonconvex... -> a nonconvex technique compared layer converges so independent onehidden markov
a... -> a function parameterization necessary both both intuitions with technique valpola utilizes

It doesn't make too much sense, but it is able to produce sentences that look at least grammatically sound (sometimes).

Here is the link to the complete runnable script.
