Getting error while adding embedding layer to LSTM autoencoder


Problem Description

I have a seq2seq model that is working fine. I want to add an embedding layer to this network, but I ran into an error.

This is my architecture using pretrained word embeddings, which works fine (the code is almost the same as the code available here, but I want to include the Embedding layer in the model rather than using the pretrained embedding vectors):

import os

# assuming standalone Keras imports; adjust if using tf.keras
from keras.layers import Input, Bidirectional, LSTM, Lambda, RepeatVector
from keras.models import Model
from keras.callbacks import ModelCheckpoint

LATENT_SIZE = 20

inputs = Input(shape=(SEQUENCE_LEN, EMBED_SIZE), name="input")

encoded = Bidirectional(LSTM(LATENT_SIZE), merge_mode="sum", name="encoder_lstm")(inputs)
encoded = Lambda(rev_ent)(encoded)  # rev_ent is the rev_entropy function shown further below
decoded = RepeatVector(SEQUENCE_LEN, name="repeater")(encoded)
decoded = Bidirectional(LSTM(EMBED_SIZE, return_sequences=True), merge_mode="sum", name="decoder_lstm")(decoded)
autoencoder = Model(inputs, decoded)
autoencoder.compile(optimizer="sgd", loss='mse')
autoencoder.summary()

NUM_EPOCHS = 1
num_train_steps = len(Xtrain) // BATCH_SIZE
num_test_steps = len(Xtest) // BATCH_SIZE

checkpoint = ModelCheckpoint(filepath=os.path.join('Data/', "simple_ae_to_compare"), save_best_only=True)
history = autoencoder.fit_generator(train_gen, steps_per_epoch=num_train_steps, epochs=NUM_EPOCHS,
                                    validation_data=test_gen, validation_steps=num_test_steps,
                                    callbacks=[checkpoint])

This is the summary:

Layer (type)                 Output Shape              Param #   
=================================================================
input (InputLayer)           (None, 45, 50)            0         
_________________________________________________________________
encoder_lstm (Bidirectional) (None, 20)                11360     
_________________________________________________________________
lambda_1 (Lambda)            (512, 20)                 0         
_________________________________________________________________
repeater (RepeatVector)      (512, 45, 20)             0         
_________________________________________________________________
decoder_lstm (Bidirectional) (512, 45, 50)             28400  

When I change the code to add the embedding layer like this:

from keras.layers import Embedding  # assuming standalone Keras imports, as above

inputs = Input(shape=(SEQUENCE_LEN,), name="input")
embedding = Embedding(output_dim=EMBED_SIZE, input_dim=VOCAB_SIZE, input_length=SEQUENCE_LEN, trainable=True)(inputs)
encoded = Bidirectional(LSTM(LATENT_SIZE), merge_mode="sum", name="encoder_lstm")(embedding)

I get this error:

expected decoder_lstm to have 3 dimensions, but got array with shape (512, 45)

So my question is: what is wrong with my model?

Update

This error is raised in the training phase. I also checked the dimension of the data being fed to the model: it is (61598, 45), which clearly lacks the feature dimension (here, EMBED_SIZE).

But why is this error raised in the decoder part? In the encoder part I have included the Embedding layer, so that part is fine. But when execution reaches the decoder, there is no embedding layer there, so the 2-D target cannot be matched against the decoder's 3-D output.
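To make the shape mismatch concrete (a minimal diagnostic sketch, not from the original post; it assumes the modified model above is in scope):

import numpy as np

print(autoencoder.output_shape)           # (512, 45, 50) per the summary above: 3-D output

ids = np.zeros((512, SEQUENCE_LEN))       # a 2-D batch of word ids, like the one being fed
# autoencoder.fit(ids, ids)               # fails: expected decoder_lstm to have 3 dimensions
y = np.zeros((512, SEQUENCE_LEN, EMBED_SIZE))
# autoencoder.fit(ids, y)                 # shapes line up once the target is 3-D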

Now the question is why this does not happen in similar code. This is my view; correct me if I'm wrong. Seq2seq code is usually used for translation or summarization, and in those codes the decoder part also has an input (in the translation case, the other language is fed to the decoder, so having an embedding in the decoder part makes sense). Here I do not have a separate input, which is why I do not need any separate embedding in the decoder part. Still, I don't know how to fix the problem; I only know why it is happening :|

Update 2

This is the data being fed to the model:

import math
import nltk
import numpy as np

# map each padded sentence to a row of integer word ids
sent_wids = np.zeros((len(parsed_sentences), SEQUENCE_LEN), 'int32')
sample_seq_weights = np.zeros((len(parsed_sentences), SEQUENCE_LEN), 'float')
for index_sentence in range(len(parsed_sentences)):
    temp_sentence = parsed_sentences[index_sentence]
    temp_words = nltk.word_tokenize(temp_sentence)
    for index_word in range(SEQUENCE_LEN):
        if index_word < sent_lens[index_sentence]:
            sent_wids[index_sentence, index_word] = lookup_word2id(temp_words[index_word])
        else:
            sent_wids[index_sentence, index_word] = lookup_word2id('PAD')

def sentence_generator(X, embeddings, batch_size, sample_weights):
    while True:
        # loop once per epoch
        num_recs = X.shape[0]
        indices = np.random.permutation(np.arange(num_recs))
        num_batches = num_recs // batch_size
        for bid in range(num_batches):
            sids = indices[bid * batch_size : (bid + 1) * batch_size]
            temp_sents = X[sids, :]
            # look up pretrained vectors: shape (batch_size, SEQUENCE_LEN, EMBED_SIZE)
            Xbatch = embeddings[temp_sents]
            weights = sample_weights[sids, :]  # computed but not used below
            yield Xbatch, Xbatch

LATENT_SIZE = 60

train_size = 0.95
split_index = int(math.ceil(len(sent_wids) * train_size))
Xtrain = sent_wids[0:split_index, :]
Xtest = sent_wids[split_index:, :]
train_w = sample_seq_weights[0:split_index, :]
test_w = sample_seq_weights[split_index:, :]
train_gen = sentence_generator(Xtrain, embeddings, BATCH_SIZE, train_w)
test_gen = sentence_generator(Xtest, embeddings, BATCH_SIZE, test_w)

And parsed_sentences is 61598 sentences, which are padded.
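For reference, a quick shape check (illustrative, not part of the original post):

print(len(parsed_sentences))      # 61598 padded sentences
print(sent_wids.shape)            # (61598, 45): one row of integer word ids per sentence
print(next(train_gen)[0].shape)   # (BATCH_SIZE, 45, EMBED_SIZE) with the generator above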

Also, this is the Lambda layer I have in the model; I'm adding it here in case it has any effect:

import tensorflow as tf

def rev_entropy(x):
    def row_entropy(row):
        # Shannon entropy of the value distribution within one row
        _, _, count = tf.unique_with_counts(row)
        count = tf.cast(count, tf.float32)
        prob = count / tf.reduce_sum(count)
        prob = tf.cast(prob, tf.float32)
        rev = -tf.reduce_sum(prob * tf.log(prob))
        return rev

    nw = tf.reduce_sum(x, axis=1)
    rev = tf.map_fn(row_entropy, x)  # per-row entropy
    rev = tf.where(tf.is_nan(rev), tf.zeros_like(rev), rev)
    rev = tf.cast(rev, tf.float32)
    max_entropy = tf.log(tf.clip_by_value(nw, 2, LATENT_SIZE))
    concentration = (max_entropy / (1 + rev))
    new_x = x * (tf.reshape(concentration, [BATCH_SIZE, 1]))  # row-wise rescaling
    return new_x
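A minimal sanity check of this function under TensorFlow 1.x (an illustrative sketch; the small BATCH_SIZE and LATENT_SIZE values here are assumptions for the demo):

import numpy as np
import tensorflow as tf

BATCH_SIZE, LATENT_SIZE = 4, 20
x = tf.placeholder(tf.float32, shape=(BATCH_SIZE, LATENT_SIZE))
scaled = rev_entropy(x)

with tf.Session() as sess:
    out = sess.run(scaled, feed_dict={x: np.random.rand(BATCH_SIZE, LATENT_SIZE)})
    print(out.shape)  # (4, 20): same shape as the input, rescaled row by row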

Any help is appreciated :)

Recommended Answer

I tried the following example on Google Colab (TensorFlow version 1.13.1),

from tensorflow.python import keras
import numpy as np

SEQUENCE_LEN = 45
LATENT_SIZE = 20
EMBED_SIZE = 50
VOCAB_SIZE = 100

inputs = keras.layers.Input(shape=(SEQUENCE_LEN,), name="input")

embedding = keras.layers.Embedding(output_dim=EMBED_SIZE, input_dim=VOCAB_SIZE, input_length=SEQUENCE_LEN, trainable=True)(inputs)

encoded = keras.layers.Bidirectional(keras.layers.LSTM(LATENT_SIZE), merge_mode="sum", name="encoder_lstm")(embedding)
decoded = keras.layers.RepeatVector(SEQUENCE_LEN, name="repeater")(encoded)
decoded = keras.layers.Bidirectional(keras.layers.LSTM(EMBED_SIZE, return_sequences=True), merge_mode="sum", name="decoder_lstm")(decoded)
autoencoder = keras.models.Model(inputs, decoded)
autoencoder.compile(optimizer="sgd", loss='mse')
autoencoder.summary()

And then trained the model using some random data,


NUM_EPOCHS = 1  # not defined in the snippet above; any small value works for this demo
x = np.random.randint(0, 90, size=(10, 45))
y = np.random.normal(size=(10, 45, 50))
history = autoencoder.fit(x, y, epochs=NUM_EPOCHS)

This solution worked fine. I feel like the issue might be the way you are feeding in labels/outputs for the MSE calculation.

In the original problem, you are attempting to reconstruct word embeddings using a seq2seq model, where the embeddings are fixed and pre-trained. If you instead want a trainable embedding layer as part of the model, it becomes very difficult to model this problem, because you no longer have fixed targets: the targets change at every iteration of the optimization, since the embedding layer itself is changing. This also makes for a very unstable optimization problem.
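One common way around the moving-target issue, sketched here as a suggestion rather than part of the original answer, is to keep the trainable embedding on the input side but reconstruct the word ids themselves: add a per-timestep softmax over the vocabulary and train with sparse categorical cross-entropy, so the targets stay fixed.

from tensorflow.python import keras
import numpy as np

# Same encoder/decoder as above, but the decoder now predicts word ids.
inputs = keras.layers.Input(shape=(SEQUENCE_LEN,), name="input")
embedding = keras.layers.Embedding(output_dim=EMBED_SIZE, input_dim=VOCAB_SIZE,
                                   input_length=SEQUENCE_LEN, trainable=True)(inputs)
encoded = keras.layers.Bidirectional(keras.layers.LSTM(LATENT_SIZE),
                                     merge_mode="sum", name="encoder_lstm")(embedding)
decoded = keras.layers.RepeatVector(SEQUENCE_LEN, name="repeater")(encoded)
decoded = keras.layers.Bidirectional(keras.layers.LSTM(EMBED_SIZE, return_sequences=True),
                                     merge_mode="sum", name="decoder_lstm")(decoded)
# Per-timestep distribution over the vocabulary; the targets are the input ids themselves.
probs = keras.layers.TimeDistributed(
    keras.layers.Dense(VOCAB_SIZE, activation="softmax"))(decoded)
autoencoder = keras.models.Model(inputs, probs)
autoencoder.compile(optimizer="sgd", loss="sparse_categorical_crossentropy")
# autoencoder.fit(x, np.expand_dims(x, -1), epochs=NUM_EPOCHS)  # x: integer id batch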

If you do the following, you should be able to get the code working. Here embeddings is the pre-trained GloVe vector numpy.ndarray.

def sentence_generator(X, embeddings, batch_size):
    while True:
        # loop once per epoch
        num_recs = X.shape[0]
        embed_size = embeddings.shape[1]
        indices = np.random.permutation(np.arange(num_recs))
        # print(embeddings.shape)
        num_batches = num_recs // batch_size
        for bid in range(num_batches):
            sids = indices[bid * batch_size : (bid + 1) * batch_size]
            # Xbatch is a [batch_size, seq_length] array
            Xbatch = X[sids, :] 

            # Creating the Y targets
            Xembed = embeddings[Xbatch.reshape(-1),:]
            # Ybatch will be [batch_size, seq_length, embed_size] array
            Ybatch = Xembed.reshape(batch_size, -1, embed_size)
            yield Xbatch, Ybatch
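Wired into the original training setup, the fixed generator would be used roughly like this (a sketch under the same assumptions; autoencoder is the Embedding-input model from the Colab example above):

train_gen = sentence_generator(Xtrain, embeddings, BATCH_SIZE)
test_gen = sentence_generator(Xtest, embeddings, BATCH_SIZE)

# Inputs are integer ids; targets are the fixed GloVe vectors for those ids.
history = autoencoder.fit_generator(train_gen,
                                    steps_per_epoch=len(Xtrain) // BATCH_SIZE,
                                    epochs=NUM_EPOCHS,
                                    validation_data=test_gen,
                                    validation_steps=len(Xtest) // BATCH_SIZE)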

