Word-level Seq2Seq with Keras
Problem description
I was following the Keras Seq2Seq tutorial, and it works fine. However, this is a character-level model, and I would like to adapt it to a word-level model. The authors even include a paragraph with the required changes, but all my current attempts result in an error regarding wrong dimensions.
If you follow the character-level model, the input data has 3 dims: (#sequences, #max_seq_len, #num_chars), since each character is one-hot encoded. When I plot the summary for the model as used in the tutorial, I get:
Layer (type) Output Shape Param # Connected to
==================================================================================================
input_1 (InputLayer) (None, None, 71) 0
__________________________________________________________________________________________________
input_2 (InputLayer) (None, None, 94) 0
__________________________________________________________________________________________________
lstm_1 (LSTM) [(None, 256), (None, 335872 input_1[0][0]
__________________________________________________________________________________________________
lstm_2 (LSTM) [(None, None, 256), 359424 input_2[0][0]
lstm_1[0][1]
lstm_1[0][2]
__________________________________________________________________________________________________
dense_1 (Dense) (None, None, 94) 24158 lstm_2[0][0]
==================================================================================================
This compiles and trains just fine.
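For reference, the character-level inputs are built roughly like the following sketch (self-contained toy data; the variable names are illustrative, not the tutorial's exact code):

import numpy as np

# Toy character-level data (illustrative only, not the tutorial's dataset)
input_texts = ['hi', 'go on']
chars = sorted(set(''.join(input_texts)))
char_index = {c: i for i, c in enumerate(chars)}
max_seq_len = max(len(t) for t in input_texts)

# 3-dim one-hot input: (#sequences, #max_seq_len, #num_chars)
encoder_input_data = np.zeros((len(input_texts), max_seq_len, len(chars)), dtype='float32')
for i, text in enumerate(input_texts):
    for t, ch in enumerate(text):
        encoder_input_data[i, t, char_index[ch]] = 1.0

print(encoder_input_data.shape)  # (2, 5, 6) -- 3 dims, as described above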
Now this tutorial has a section "What if I want to use a word-level model with integer sequences?", and I've tried to follow those changes. Firstly, I encode all sequences using a word index. As such, the input and target data is now 2 dims: (#sequences, #max_seq_len), since I no longer one-hot encode but now use Embedding layers.
encoder_input_data_train.shape => (90000, 9)
decoder_input_data_train.shape => (90000, 16)
decoder_target_data_train.shape => (90000, 16)
For example, a sequence might look like this:
[ 826. 288. 2961. 3127. 1260. 2108. 0. 0. 0.]
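Such padded integer sequences can be produced, for instance, with Keras' Tokenizer and pad_sequences (a minimal sketch with illustrative data; the exact preprocessing is an assumption, not taken from the question):

from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences

# Illustrative raw data; in the question this would be the 90000 training texts
input_texts = ['how are you', 'see you soon']

tokenizer = Tokenizer()
tokenizer.fit_on_texts(input_texts)
num_encoder_tokens = len(tokenizer.word_index) + 1  # +1 because index 0 is reserved for padding

# 2-dim integer data: (#sequences, #max_seq_len)
encoder_input_data = pad_sequences(
    tokenizer.texts_to_sequences(input_texts),
    maxlen=9,            # max_input_seq_len in the question
    padding='post')      # zeros at the end, as in the example sequence above

print(encoder_input_data.shape, num_encoder_tokens)  # (2, 9) and the vocabulary size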
When I use the listed code:
from keras.layers import Input, Embedding, LSTM, Dense
from keras.models import Model

# encoder: embed the integer sequences and keep only the final LSTM states
encoder_inputs = Input(shape=(None,))
x = Embedding(num_encoder_tokens, latent_dim)(encoder_inputs)
x, state_h, state_c = LSTM(latent_dim, return_state=True)(x)
encoder_states = [state_h, state_c]

# decoder: embed the target sequences and initialise the LSTM with the encoder states
decoder_inputs = Input(shape=(None,))
x = Embedding(num_decoder_tokens, latent_dim)(decoder_inputs)
x = LSTM(latent_dim, return_sequences=True)(x, initial_state=encoder_states)
decoder_outputs = Dense(num_decoder_tokens, activation='softmax')(x)

model = Model([encoder_inputs, decoder_inputs], decoder_outputs)
the model compiles and looks like this:
Layer (type) Output Shape Param # Connected to
==================================================================================================
input_35 (InputLayer) (None, None) 0
__________________________________________________________________________________________________
input_36 (InputLayer) (None, None) 0
__________________________________________________________________________________________________
embedding_32 (Embedding) (None, None, 256) 914432 input_35[0][0]
__________________________________________________________________________________________________
embedding_33 (Embedding) (None, None, 256) 914432 input_36[0][0]
__________________________________________________________________________________________________
lstm_32 (LSTM) [(None, 256), (None, 525312 embedding_32[0][0]
__________________________________________________________________________________________________
lstm_33 (LSTM) (None, None, 256) 525312 embedding_33[0][0]
lstm_32[0][1]
lstm_32[0][2]
__________________________________________________________________________________________________
dense_21 (Dense) (None, None, 3572) 918004 lstm_33[0][0]
While this compiles, training with
model.fit([encoder_input_data, decoder_input_data], decoder_target_data, batch_size=32, epochs=1, validation_split=0.2)
fails with the following error: ValueError: Error when checking target: expected dense_21 to have 3 dimensions, but got array with shape (90000, 16), with the latter being the shape of the decoder input/target. Why does the Dense layer get an array of the shape of the decoder input data?
Things I've tried:
- I find it a bit strange that the decoder LSTM has return_sequences=True, since I thought I cannot feed sequences to a Dense layer (and the decoder of the original character-level model does not state this). However, simply removing it or setting return_sequences=False did not help. Of course, the Dense layer then has an output shape of (None, 3572).
- I don't quite get the need for the Input layers. I've set them to shape=(max_input_seq_len, ) and shape=(max_target_seq_len, ) respectively, so that the summary doesn't show (None, None) but the respective values, e.g., (None, 16). No change.
- In the Keras docs I've read that an Embedding layer should be used with input_length, otherwise a Dense layer upstream cannot compute its outputs. But again, it still errors when I set input_length accordingly.
I'm a bit at a deadlock here. Am I even on the right track, or am I missing something more fundamental? Is the shape of my data wrong? Why does the last Dense layer get an array with shape (90000, 16)? That seems rather off.
UPDATE: I figured out that the problem seems to be decoder_target_data, which currently has the shape (#samples, max_seq_len), e.g., (90000, 16). But I assume I need to one-hot encode the target output with respect to the vocabulary: (#samples, max_seq_len, vocab_size), e.g., (90000, 16, 3572).
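A minimal sketch of that full one-hot encoding (the vocabulary size is taken from the model summary above; the stand-in data and variable names are illustrative):

import numpy as np

num_decoder_tokens = 3572      # target vocabulary size, from the model summary above
max_target_seq_len = 16

# Small random stand-in for decoder_target_data_train (the real one has 90000 rows)
decoder_target_data_train = np.random.randint(
    0, num_decoder_tokens, size=(1000, max_target_seq_len))

# Full one-hot encoding of the targets: (#samples, max_seq_len, vocab_size)
decoder_target_onehot = np.zeros(
    (len(decoder_target_data_train), max_target_seq_len, num_decoder_tokens), dtype='float32')
for i, seq in enumerate(decoder_target_data_train):
    for t, word_index in enumerate(seq):
        decoder_target_onehot[i, t, word_index] = 1.0

# For all 90000 samples this array would need roughly
# 90000 * 16 * 3572 * 4 bytes ≈ 20 GB as float32, which explains the MemoryError below.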
Unfortunately, this throws a MemoryError. However, when I assume a vocabulary size of 10 for debugging purposes, i.e.:
decoder_target_data = np.zeros((len(input_sequences), max_target_seq_len, 10), dtype='float32')
and later in the decoder model:
x = Dense(10, activation='softmax')(x)
then the model trains without error. In case that's indeed my issue, I would have to train the model with manually generated batches so that I can keep the vocabulary size but reduce #samples, e.g., to 90 batches each of shape (1000, 16, 3572). Am I on the right track here?
Answer
Recently I was also facing this problem. There is no other solution than creating small batches, say batch_size=64, in a generator, and then using model.fit_generator instead of model.fit. I have attached my generate_batch code below:
import numpy as np

def generate_batch(X, y, batch_size=64):
    '''Generate one batch of ([encoder_input, decoder_input], decoder_target) at a time.'''
    while True:
        for j in range(0, len(X), batch_size):
            encoder_input_data = np.zeros((batch_size, max_encoder_seq_length), dtype='float32')
            decoder_input_data = np.zeros((batch_size, max_decoder_seq_length + 2), dtype='float32')
            # only the targets are one-hot encoded, and only batch_size rows at a time
            decoder_target_data = np.zeros((batch_size, max_decoder_seq_length + 2, num_decoder_tokens), dtype='float32')
            for i, (input_text_seq, target_text_seq) in enumerate(zip(X[j:j + batch_size], y[j:j + batch_size])):
                for t, word_index in enumerate(input_text_seq):
                    encoder_input_data[i, t] = word_index  # encoder input seq
                for t, word_index in enumerate(target_text_seq):
                    decoder_input_data[i, t] = word_index
                    if (t > 0) and (word_index <= num_decoder_tokens):
                        # decoder target is the decoder input shifted by one time step
                        decoder_target_data[i, t - 1, word_index - 1] = 1.
            yield ([encoder_input_data, decoder_input_data], decoder_target_data)
and then train like this:
import math

batch_size = 64
epochs = 2

# Run training
model.compile(optimizer='rmsprop', loss='categorical_crossentropy', metrics=['accuracy'])
model.fit_generator(
    generator=generate_batch(X=X_train_sequences, y=y_train_sequences, batch_size=batch_size),
    steps_per_epoch=math.ceil(len(X_train_sequences) / batch_size),
    epochs=epochs,
    verbose=1,
    validation_data=generate_batch(X=X_val_sequences, y=y_val_sequences, batch_size=batch_size),
    validation_steps=math.ceil(len(X_val_sequences) / batch_size),
    workers=1,
)
X_train_sequences is a list of lists like [[23, 34, 56], [2, 33544, 6, 10]], and similarly for the other inputs.
I also took help from this blog: word-level-english-to-marathi-nmt