How to have an LSTM Autoencoder model predict over the whole vocabulary while presenting words as embeddings


Problem Description

So I have been working on an LSTM Autoencoder model, and I have created various versions of this model.

1. Create the model using already-trained word embeddings: in this scenario, I used the weights of pre-trained GloVe vectors as the weights of the features (text data). This is the structure:

from keras.layers import Input, Bidirectional, LSTM, Lambda, RepeatVector
from keras.models import Model
from keras.callbacks import ModelCheckpoint
# rev_entropy, train_gen, test_gen and the size constants are defined elsewhere in the script

inputs = Input(shape=(SEQUENCE_LEN, EMBED_SIZE), name="input")
encoded = Bidirectional(LSTM(LATENT_SIZE), merge_mode="sum", name="encoder_lstm")(inputs)
encoded = Lambda(rev_entropy)(encoded)
decoded = RepeatVector(SEQUENCE_LEN, name="repeater")(encoded)
decoded = Bidirectional(LSTM(EMBED_SIZE, return_sequences=True), merge_mode="sum", name="decoder_lstm")(decoded)
autoencoder = Model(inputs, decoded)
autoencoder.compile(optimizer="sgd", loss='mse')
autoencoder.summary()
checkpoint = ModelCheckpoint(filepath='checkpoint/{epoch}.hdf5')
history = autoencoder.fit_generator(train_gen, steps_per_epoch=num_train_steps, epochs=NUM_EPOCHS, validation_data=test_gen, validation_steps=num_test_steps, callbacks=[checkpoint])

2. In the second scenario, I implemented the word embedding layer in the model itself. This is the structure:

inputs = Input(shape=(SEQUENCE_LEN, ), name="input")
embedding = Embedding(input_dim=VOCAB_SIZE, output_dim=EMBED_SIZE, input_length=SEQUENCE_LEN,trainable=False)(inputs)
encoded = Bidirectional(LSTM(LATENT_SIZE), merge_mode="sum", name="encoder_lstm")(embedding)
decoded = RepeatVector(SEQUENCE_LEN, name="repeater")(encoded)
decoded = LSTM(EMBED_SIZE, return_sequences=True)(decoded)
autoencoder = Model(inputs, decoded)
autoencoder.compile(optimizer="sgd", loss='categorical_crossentropy')
autoencoder.summary()   
checkpoint = ModelCheckpoint(filepath=os.path.join('Data/', "simple_ae_to_compare"))
history = autoencoder.fit_generator(train_gen, steps_per_epoch=num_train_steps, epochs=NUM_EPOCHS,  validation_steps=num_test_steps)

3. In the third scenario, I did not use any embedding technique but used one-hot encoding for the features. This is the structure of the model:

inputs = Input(shape=(SEQUENCE_LEN, VOCAB_SIZE), name="input")
encoded = Bidirectional(LSTM(LATENT_SIZE, kernel_initializer="glorot_normal"), merge_mode="sum", name="encoder_lstm")(inputs)
encoded = Lambda(score_cooccurance, name='Modified_layer')(encoded)
decoded = RepeatVector(SEQUENCE_LEN, name="repeater")(encoded)
decoded = LSTM(VOCAB_SIZE, return_sequences=True)(decoded)
autoencoder = Model(inputs, decoded)
sgd = optimizers.SGD(lr=0.1, decay=1e-6, momentum=0.9, nesterov=True)
autoencoder.compile(optimizer=sgd, loss='categorical_crossentropy')
autoencoder.summary()
checkpoint = ModelCheckpoint(filepath='checkpoint/50/{epoch}.hdf5')
history = autoencoder.fit_generator(train_gen, steps_per_epoch=num_train_steps, epochs=NUM_EPOCHS, callbacks=[checkpoint])

As you can see, in the first and second models Embed_size in the decoding part is the number of neurons in that layer; this causes the output shape of the encoder layer to become [Latent_size, Embed_size].

In the third model, the output shape of the encoder is [Latent_size, Vocab_size].

Now my question:

Is it doable to change the structure of the model in such a way that I have embeddings representing my words to the model, while at the same time having vocab_size in the decoder layer?

I need the output_shape of the encoder layer to be [Latent_size, Vocab_size], and at the same time I don't want to represent my features as one-hot encodings, for the obvious reason.

I would appreciate it if you could share your ideas with me. One idea could be adding more layers; note that whatever the cost, I don't want to have Embed_size in the last layer.

Recommended Answer

Your question:

"Is it doable to change the structure of the model in such a way that I have embeddings representing my words to the model, while at the same time having vocab_size in the decoder layer?"

I like to use the TensorFlow transformer model as a reference: https://github.com/tensorflow/models/tree/master/official/transformer

In language translation tasks, the model input tends to be a token index, which is then subject to an embedding lookup, resulting in a shape of (sequence_length, embedding_dims); the encoder itself works on this shape. The decoder output also tends to have the shape (sequence_length, embedding_dims). For instance, the model above then transforms the decoder output into logits by taking a dot product between the output and the embedding vectors. This is the transformation they use: https://github.com/tensorflow/models/blob/master/official/transformer/model/embedding_layer.py#L94
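A rough sketch of what I mean (my own paraphrase of the idea, not the linked implementation; the function name and shapes are placeholders): the decoder outputs live in embedding space and are multiplied with the transposed embedding matrix, giving one logit per vocabulary entry at every timestep.

import tensorflow as tf

def embeddings_to_logits(decoder_outputs, embedding_matrix):
    """decoder_outputs: (batch, seq_len, embed_dims); embedding_matrix: (vocab_size, embed_dims)."""
    batch = tf.shape(decoder_outputs)[0]
    seq_len = tf.shape(decoder_outputs)[1]
    vocab_size = tf.shape(embedding_matrix)[0]
    embed_dims = tf.shape(embedding_matrix)[1]
    flat = tf.reshape(decoder_outputs, [-1, embed_dims])          # (batch * seq_len, embed_dims)
    logits = tf.matmul(flat, embedding_matrix, transpose_b=True)  # (batch * seq_len, vocab_size)
    return tf.reshape(logits, [batch, seq_len, vocab_size])

Each row of the result can then go through a softmax during training, or straight to an argmax at inference time.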

I would recommend an approach similar to the language translation models:

  • pre-stage:
    • input_shape = (sequence_length, 1)  [ i.e. token_index in [0 .. vocab_size) ]
  • encoder:
    • input_shape = (sequence_length, embedding_dims)
    • output_shape = (latent_dims)
  • decoder:
    • input_shape = (latent_dims)
    • output_shape = (sequence_length, embedding_dims)

Pre-processing converts token indexes into embedding_dims. This can be used to generate both the encoder input and the decoder targets.
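As a rough illustration of that pre-processing (all names and sizes below, including the random stand-in for the GloVe matrix, are placeholders of mine, not code from your script): an embedding lookup turns a batch of token indices into (sequence_length, embedding_dims) arrays that can serve both as encoder inputs and as decoder targets.

import numpy as np

VOCAB_SIZE, EMBED_SIZE, SEQUENCE_LEN = 5000, 100, 20
glove_matrix = np.random.rand(VOCAB_SIZE, EMBED_SIZE).astype("float32")  # stand-in for pre-trained GloVe weights

def embed_batch(token_ids):
    """token_ids: (batch, SEQUENCE_LEN) int array -> (batch, SEQUENCE_LEN, EMBED_SIZE) float array."""
    return glove_matrix[token_ids]

token_ids = np.random.randint(0, VOCAB_SIZE, size=(32, SEQUENCE_LEN))
encoder_input = embed_batch(token_ids)    # fed to the encoder
decoder_target = encoder_input.copy()     # autoencoder target in embedding space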

Post-processing then converts embedding_dims back into logits (in the vocab_index space).
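Putting the pieces together, here is one possible end-to-end sketch under my own assumptions (tf.keras, illustrative sizes, a frozen embedding that you could initialize from GloVe via weights=[glove_matrix]). It takes embedded words in and still scores the whole vocabulary at the output by reusing the embedding matrix, so the targets stay plain token indices rather than one-hot vectors.

import tensorflow as tf
from tensorflow.keras.layers import (Input, Embedding, Bidirectional, LSTM,
                                     RepeatVector, Lambda, Activation)
from tensorflow.keras.models import Model

SEQUENCE_LEN, VOCAB_SIZE, EMBED_SIZE, LATENT_SIZE = 20, 5000, 100, 64

inputs = Input(shape=(SEQUENCE_LEN,), name="input")                  # token indices
embedding_layer = Embedding(VOCAB_SIZE, EMBED_SIZE, trainable=False,
                            name="embedding")                        # e.g. add weights=[glove_matrix]
embedded = embedding_layer(inputs)                                   # (seq_len, embed_dims)

encoded = Bidirectional(LSTM(LATENT_SIZE), merge_mode="sum",
                        name="encoder_lstm")(embedded)               # (latent_dims,)
decoded = RepeatVector(SEQUENCE_LEN, name="repeater")(encoded)
decoded = LSTM(EMBED_SIZE, return_sequences=True,
               name="decoder_lstm")(decoded)                         # back to embedding space

# post-stage: dot product with the embedding matrix gives vocab_size logits per timestep
logits = Lambda(lambda x: tf.einsum("bse,ve->bsv", x, embedding_layer.embeddings),
                name="tied_logits")(decoded)
probs = Activation("softmax", name="vocab_softmax")(logits)

autoencoder = Model(inputs, probs)
# targets are the same token-index sequences as the inputs, so no one-hot encoding is needed
autoencoder.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
autoencoder.summary()

Tying the output projection to the embedding matrix keeps the last layer at embedding_dims neurons while still producing a score for every word in the vocabulary; a plain Dense(VOCAB_SIZE) at the end would also work, at the cost of a large extra weight matrix.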

"I need the output_shape of the encoder layer to be [Latent_size, Vocab_size], and at the same time I don't want to represent my features as one-hot encodings, for the obvious reason."

That doesn't sound right. Typically what one is trying to achieve with an autoencoder is to have an embedding vector for the sentence, so the output of the encoder is typically [latent_dims]. The output of the decoder needs to be translatable into [sequence_length, vocab_index (1)], which is usually done by converting from embedding space to logits and then taking the argmax to obtain a token index.
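For completeness, that final argmax step might look like this (variable names assumed):

import numpy as np

# probs: (batch, seq_len, vocab_size) softmax output of the decoder/post-stage
probs = np.random.rand(2, 20, 5000)
token_ids = probs.argmax(axis=-1)   # (batch, seq_len) reconstructed token indices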
