How to have an LSTM Autoencoder model predict over the whole vocabulary while representing words as embeddings


Question

So I have been working on an LSTM Autoencoder model. I have also created various versions of this model.

1. Create the model using already-trained word embeddings: in this scenario, I used the weights of the pre-trained GloVe vectors as the weights of the features (text data). This is the structure:

from keras.models import Model
from keras.layers import Input, Bidirectional, LSTM, Lambda, RepeatVector
from keras.callbacks import ModelCheckpoint

# Inputs are pre-computed GloVe vectors: one EMBED_SIZE vector per token.
inputs = Input(shape=(SEQUENCE_LEN, EMBED_SIZE), name="input")
encoded = Bidirectional(LSTM(LATENT_SIZE), merge_mode="sum", name="encoder_lstm")(inputs)
encoded = Lambda(rev_entropy)(encoded)
decoded = RepeatVector(SEQUENCE_LEN, name="repeater")(encoded)
decoded = Bidirectional(LSTM(EMBED_SIZE, return_sequences=True), merge_mode="sum", name="decoder_lstm")(decoded)
autoencoder = Model(inputs, decoded)
autoencoder.compile(optimizer="sgd", loss="mse")
autoencoder.summary()
checkpoint = ModelCheckpoint(filepath='checkpoint/{epoch}.hdf5')
history = autoencoder.fit_generator(train_gen, steps_per_epoch=num_train_steps, epochs=NUM_EPOCHS,
                                    validation_data=test_gen, validation_steps=num_test_steps,
                                    callbacks=[checkpoint])

2. In the second scenario, I implemented the word-embedding layer inside the model itself. This is the structure:

import os
from keras.layers import Embedding  # plus the imports from the first snippet

# The embedding lookup happens inside the model; token indices go in.
inputs = Input(shape=(SEQUENCE_LEN,), name="input")
embedding = Embedding(input_dim=VOCAB_SIZE, output_dim=EMBED_SIZE, input_length=SEQUENCE_LEN, trainable=False)(inputs)
encoded = Bidirectional(LSTM(LATENT_SIZE), merge_mode="sum", name="encoder_lstm")(embedding)
decoded = RepeatVector(SEQUENCE_LEN, name="repeater")(encoded)
decoded = LSTM(EMBED_SIZE, return_sequences=True)(decoded)
autoencoder = Model(inputs, decoded)
autoencoder.compile(optimizer="sgd", loss='categorical_crossentropy')
autoencoder.summary()
checkpoint = ModelCheckpoint(filepath=os.path.join('Data/', "simple_ae_to_compare"))
history = autoencoder.fit_generator(train_gen, steps_per_epoch=num_train_steps, epochs=NUM_EPOCHS,
                                    validation_data=test_gen, validation_steps=num_test_steps,
                                    callbacks=[checkpoint])

3. In the third scenario, I did not use any embedding technique but used one-hot encoding for the features. This is the structure of the model:

from keras import optimizers  # plus the imports from the first snippet

# Inputs are one-hot vectors: one VOCAB_SIZE vector per token.
inputs = Input(shape=(SEQUENCE_LEN, VOCAB_SIZE), name="input")
encoded = Bidirectional(LSTM(LATENT_SIZE, kernel_initializer="glorot_normal"), merge_mode="sum", name="encoder_lstm")(inputs)
encoded = Lambda(score_cooccurance, name='Modified_layer')(encoded)
decoded = RepeatVector(SEQUENCE_LEN, name="repeater")(encoded)
decoded = LSTM(VOCAB_SIZE, return_sequences=True)(decoded)
autoencoder = Model(inputs, decoded)
sgd = optimizers.SGD(lr=0.1, decay=1e-6, momentum=0.9, nesterov=True)
autoencoder.compile(optimizer=sgd, loss='categorical_crossentropy')
autoencoder.summary()
checkpoint = ModelCheckpoint(filepath='checkpoint/50/{epoch}.hdf5')
history = autoencoder.fit_generator(train_gen, steps_per_epoch=num_train_steps, epochs=NUM_EPOCHS, callbacks=[checkpoint])

As you can see, in the first and second models Embed_size in the decoding is the number of neurons in that layer; it causes the output shape of the encoder layer to become [Latent_size, Embed_size].

In the third model, the output shape of the encoder is [Latent_size, Vocab_size].

Now my question:

Is it doable to change the structure of the model so that I have embeddings representing my words to the model, and at the same time have vocab_size in the decoder layer?

I need the output_shape of the encoder layer to be [Latent_size, Vocab_size], and at the same time I don't want to represent my features as one-hot encodings, for the obvious reason.

I would appreciate it if you could share your ideas with me. One idea could be adding more layers; bear in mind that whatever the cost, I don't want Embed_size in the last layer.

Answer

Your question:

Is it doable to change the structure of the model so that I have embeddings representing my words to the model, and at the same time have vocab_size in the decoder layer?

I like to use the TensorFlow Transformer model as a reference: https://github.com/tensorflow/models/tree/master/official/transformer

In language translation tasks, the model input tends to be a token index, which is then subject to an embedding lookup, resulting in a shape of (sequence_length, embedding_dims); the encoder itself works on this shape. The decoder output also tends to have the shape (sequence_length, embedding_dims). For instance, the model above transforms the decoder output into logits by taking a dot product between the output and the embedding vectors. This is the transformation they use: https://github.com/tensorflow/models/blob/master/official/transformer/model/embedding_layer.py#L94
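
For illustration, here is a minimal NumPy sketch of that dot-product projection; the array names and toy sizes are mine, not code from the repo:

import numpy as np

# Toy sizes, assumed for the example only.
SEQUENCE_LEN, EMBED_SIZE, VOCAB_SIZE = 10, 50, 5000

# The matrix used for the embedding lookup on the way in...
embedding_matrix = np.random.randn(VOCAB_SIZE, EMBED_SIZE).astype("float32")

# ...is reused on the way out. Decoder output: one vector per position.
decoder_output = np.random.randn(SEQUENCE_LEN, EMBED_SIZE).astype("float32")

# Dot product against every embedding row scores each vocab entry per position.
logits = decoder_output @ embedding_matrix.T  # shape: (SEQUENCE_LEN, VOCAB_SIZE)

Tying the output projection to the embedding matrix like this is what lets the model predict over the whole vocabulary without ever materializing one-hot inputs.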

I would recommend an approach similar to the language translation models:

  • Pre-processing:
    • input_shape=(sequence_length, 1) [ i.e. token_index in [0..vocab_size) ]
  • Encoder:
    • input_shape=(sequence_length, embedding_dims)
    • output_shape=(latent_dims)
  • Decoder:
    • input_shape=(latent_dims)
    • output_shape=(sequence_length, embedding_dims)

Pre-processing converts token indexes into embedding_dims. This can be used to generate both the encoder input and the decoder targets.

Post-processing converts embedding_dims back to logits (in the vocab_index space), as shown in the sketch below.
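
Putting those pieces together, a rough Keras sketch of the recommended layout could look like this; the toy sizes, layer names, and the final softmax projection are my assumptions, not code from the Transformer repo:

from keras.models import Model
from keras.layers import Input, Embedding, Bidirectional, LSTM, RepeatVector, TimeDistributed, Dense

SEQUENCE_LEN, VOCAB_SIZE, EMBED_SIZE, LATENT_SIZE = 10, 5000, 50, 20  # toy sizes

# Pre-processing inside the model: token indices -> (SEQUENCE_LEN, EMBED_SIZE)
inputs = Input(shape=(SEQUENCE_LEN,), name="input")
embedded = Embedding(input_dim=VOCAB_SIZE, output_dim=EMBED_SIZE)(inputs)

# Encoder: (SEQUENCE_LEN, EMBED_SIZE) -> (LATENT_SIZE,)
encoded = Bidirectional(LSTM(LATENT_SIZE), merge_mode="sum", name="encoder_lstm")(embedded)

# Decoder: (LATENT_SIZE,) -> (SEQUENCE_LEN, EMBED_SIZE)
decoded = RepeatVector(SEQUENCE_LEN, name="repeater")(encoded)
decoded = LSTM(EMBED_SIZE, return_sequences=True, name="decoder_lstm")(decoded)

# Post-processing: project each position from EMBED_SIZE to a distribution over VOCAB_SIZE.
outputs = TimeDistributed(Dense(VOCAB_SIZE, activation="softmax"), name="vocab_projection")(decoded)

autoencoder = Model(inputs, outputs)
# Targets are the token indices themselves, so no one-hot encoding of the data is needed.
autoencoder.compile(optimizer="sgd", loss="sparse_categorical_crossentropy")
autoencoder.summary()

This keeps the encoder output at [latent_dims] while the decoder still predicts over the whole vocabulary.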

I need the output_shape of the encoder layer to be [Latent_size, Vocab_size], and at the same time I don't want to represent my features as one-hot encodings, for the obvious reason.

That doesn't sound right. Typically what one is trying to achieve with an autoencoder is to have an embedding vector for the sentence, so the output of the encoder is typically [latent_dims]. The output of the decoder needs to be translatable into [sequence_length, vocab_index (1)], which is typically done by converting from the embedding space to logits and then taking the argmax to convert to a token index.
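
As a small self-contained illustration of that last argmax step (shapes made up for the example):

import numpy as np

# Hypothetical decoder output: one distribution over the vocab per position.
SEQUENCE_LEN, VOCAB_SIZE = 10, 5000
probs = np.random.rand(SEQUENCE_LEN, VOCAB_SIZE)

# argmax over the vocab axis recovers one token index per position.
token_ids = np.argmax(probs, axis=-1)  # shape: (SEQUENCE_LEN,)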

