Memory error while training a Keras model on 4,600,000 rows of data

Problem Description

I am working on an LSTM-based Encoder-Decoder spelling correction model that is trained on 4,600,000 rows of data. The training file consists of two columns: correct and incorrect sentences. The model worked fine when the data was as small as 200,000 rows, but when I increased it the training does not get beyond 2 epochs. It sometimes fails with terminate called after throwing an instance of 'std::bad_alloc', and sometimes the training simply stops without any error or warning. I tried the call below, but it didn't work; maybe I used it incorrectly.

keras.clear_session() 
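The helper is normally accessed through the backend module rather than the top-level keras package, for example:

from keras import backend as K
K.clear_session()  # or tf.keras.backend.clear_session() when using tf.keras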

I have also tried reducing the values of latent_dim and batch_size to 128, 64, 32, 16, 8, 4 and 1, but none of them worked for such a large dataset. Also, since the data was huge, I replaced

steps_per_epoch = train_samples//batch_size

with

steps_per_epoch = 2000

I cleared the cache to free up RAM, but the training still does not complete. Can someone suggest a way to train my model?

def generate_batch(X = X_train, y = y_train, batch_size = 128):
    # Generate a batch of data 
    while True:
        for j in range(0, len(X), batch_size):
            encoder_input_data = np.zeros((batch_size, max_length_src),dtype='float32')
            decoder_input_data = np.zeros((batch_size, max_length_tar),dtype='float32')
            decoder_target_data = np.zeros((batch_size, max_length_tar, num_decoder_tokens),dtype='float32')
            for i, (input_text, target_text) in enumerate(zip(X[j:j+batch_size], y[j:j+batch_size])):
                for t, word in enumerate(input_text.split()):
                    encoder_input_data[i, t] = input_token_index[word] # encoder input seq
                for t, word in enumerate(target_text.split()):
                    if t<len(target_text.split())-1:
                        decoder_input_data[i, t] = target_token_index[word] # decoder input seq
                    if t>0:
                        # decoder target sequence (one hot encoded)
                        # does not include the START_ token
                        # Offset by one timestep
                        decoder_target_data[i, t - 1, target_token_index[word]] = 1.
            yield([encoder_input_data, decoder_input_data], decoder_target_data)

latent_dim = 50

# Encoder
encoder_inputs = Input(shape=(None,))
enc_emb =  Embedding(num_encoder_tokens+1, latent_dim, mask_zero = True)(encoder_inputs)
encoder_lstm = LSTM(latent_dim, return_state=True)
encoder_outputs, state_h, state_c = encoder_lstm(enc_emb)
# We discard `encoder_outputs` and only keep the states.
encoder_states = [state_h, state_c]

# Set up the decoder, using `encoder_states` as initial state.
decoder_inputs = Input(shape=(None,))
dec_emb_layer = Embedding(num_decoder_tokens, latent_dim, mask_zero = True)
dec_emb = dec_emb_layer(decoder_inputs)
# We set up our decoder to return full output sequences,
# and to return internal states as well. We don't use the
# return states in the training model, but we will use them in inference.
decoder_lstm = LSTM(latent_dim, return_sequences=True, return_state=True)
decoder_outputs, _, _ = decoder_lstm(dec_emb,
                                     initial_state=encoder_states)
decoder_dense = Dense(num_decoder_tokens, activation='softmax')
decoder_outputs = decoder_dense(decoder_outputs)

# Define the model that will turn
# `encoder_input_data` & `decoder_input_data` into `decoder_target_data`
model = Model([encoder_inputs, decoder_inputs], decoder_outputs)

model.compile(optimizer='rmsprop', loss='categorical_crossentropy', metrics=['accuracy'])

train_samples = len(X_train)
val_samples = len(X_test)
batch_size = 128
epochs = 50

from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint

keras_callbacks   = [
      EarlyStopping(monitor ="val_loss", mode ="min", patience = 5, restore_best_weights = True),
      ModelCheckpoint('checkpoints.hdf5', monitor='val_loss', verbose=1, save_best_only=True, mode='min', save_freq=1)
]

model.fit_generator(generator = generate_batch(X_train, y_train, batch_size = batch_size),
                    #steps_per_epoch = train_samples//batch_size,
                    steps_per_epoch = 2000,
                    epochs=epochs,
                    verbose=1,
                    validation_data = generate_batch(X_test, y_test, batch_size = batch_size),
                    validation_steps = val_samples//batch_size,
                    callbacks=keras_callbacks)

model.save_weights('weights.h5')

Recommended Answer

The memory error occurs because you are using the "big" global-scope variable input_token_index inside the generate_batch function. That variable ends up being copied in memory many times while your data is generated.
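If you do want to keep a hand-rolled generator, one way to make it self-contained is to pass the lookup tables and size constants in as arguments instead of reading them from module scope. This is only a sketch of that idea, reusing the names from the question:

import numpy as np

def generate_batch(X, y, input_token_index, target_token_index,
                   max_length_src, max_length_tar, num_decoder_tokens,
                   batch_size=128):
    # Same batching logic as the question's generator, but every lookup
    # table and dimension is an explicit argument rather than a global.
    while True:
        for j in range(0, len(X), batch_size):
            encoder_input_data = np.zeros((batch_size, max_length_src), dtype='float32')
            decoder_input_data = np.zeros((batch_size, max_length_tar), dtype='float32')
            decoder_target_data = np.zeros((batch_size, max_length_tar, num_decoder_tokens), dtype='float32')
            for i, (input_text, target_text) in enumerate(zip(X[j:j+batch_size], y[j:j+batch_size])):
                for t, word in enumerate(input_text.split()):
                    encoder_input_data[i, t] = input_token_index[word]
                for t, word in enumerate(target_text.split()):
                    if t < len(target_text.split()) - 1:
                        decoder_input_data[i, t] = target_token_index[word]
                    if t > 0:
                        # one-hot target, offset by one timestep
                        decoder_target_data[i, t - 1, target_token_index[word]] = 1.
            yield [encoder_input_data, decoder_input_data], decoder_target_data

The generator is then constructed by passing those objects explicitly when you call fit_generator / fit.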

However, instead of fixing this particular problem, I would suggest using native TensorFlow functionality for text tokenisation, vectorisation and batching rather than writing your own implementations.

You can find more info about text tokenisation and vectorisation in the official TensorFlow tutorials. Specifically, you can use TensorFlow's TextVectorization layer, which combines tokenisation and padding. Alternatively, you can use the more mature Tokenizer together with the general-purpose pad_sequences.
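As a rough sketch of that approach (the vocabulary size and sequence length below are placeholders, not values from the question), a TextVectorization layer is adapted once on the raw sentences and then maps strings directly to padded integer sequences:

import tensorflow as tf

# Placeholder settings; choose max_tokens / output_sequence_length for your corpus.
vectorizer = tf.keras.layers.TextVectorization(
    max_tokens=20000,
    output_mode='int',
    output_sequence_length=50)

# Learn the vocabulary from the raw training sentences (X_train holds strings).
vectorizer.adapt(tf.data.Dataset.from_tensor_slices(X_train).batch(1024))

# Strings in, padded token ids out, shape (batch, 50).
token_ids = vectorizer(tf.constant(['a sentense with a speling error']))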

Creating batches is a very basic and general task for SGD-style training, so don't reinvent the wheel: just use model.fit, which batches your data for you automatically and for free.
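A minimal sketch of that, assuming encoder_ids, decoder_ids and decoder_targets are the already-vectorised arrays produced by a pipeline like the one above (these names are placeholders, not from the question):

# model.fit slices and batches in-memory arrays by itself.
model.fit(
    [encoder_ids, decoder_ids], decoder_targets,
    batch_size=128,
    epochs=50,
    validation_split=0.1,
    callbacks=keras_callbacks)

If the full arrays do not fit in RAM, model.fit also accepts a tf.data.Dataset, which streams batches instead of materialising everything at once.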
