Keras multi_gpu_model causes system to crash


Question


I am trying to train a rather large LSTM on a large dataset and have 4 GPUs to distribute the load. If I try to train on just one of them (any of them; I've tried each), it works correctly, but after adding the multi_gpu_model code it crashes my entire system when I try to run it. Here is my multi-GPU code:

import math
from keras.models import Sequential
from keras.layers import Masking, LSTM, Dropout, Dense
from keras.optimizers import RMSprop
from keras.utils import multi_gpu_model

batch_size = 8
model = Sequential()
model.add(Masking(mask_value=0., input_shape=(len(inputData[0]), len(inputData[0][0]))))
model.add(LSTM(256, return_sequences=True))
model.add(Dropout(.2))
model.add(LSTM(128, return_sequences=True))
model.add(Dropout(.2))
model.add(LSTM(128, return_sequences=True))
model.add(Dropout(.2))
model.add(Dense(len(outputData[0][0]), activation='softmax'))
rms = RMSprop()

# wrap the model so the workload is split across the 4 GPUs
p_model = multi_gpu_model(model, gpus=4)
p_model.compile(loss='categorical_crossentropy', optimizer=rms, metrics=['categorical_accuracy'])

print("Fitting")
p_model.fit_generator(songBatchGenerator(songList, batch_size), epochs=250, verbose=1, shuffle=False,
                      steps_per_epoch=math.ceil(len(songList) / batch_size))
pickleSave('kerasTrained.pickle', p_model)
print("Saved")

Changing this to

batch_size = 8
model = Sequential()
model.add(Masking(mask_value=0., input_shape=(len(inputData[0]), len(inputData[0][0]))))
model.add(LSTM(256, return_sequences=True))
model.add(Dropout(.2))
model.add(LSTM(128, return_sequences=True))
model.add(Dropout(.2))
model.add(LSTM(128, return_sequences=True))
model.add(Dropout(.2))
model.add(Dense(len(outputData[0][0]), activation='softmax'))
rms = RMSprop()

model.compile(loss='categorical_crossentropy', optimizer=rms, metrics=['categorical_accuracy'])

print("Fitting")
model.fit_generator(songBatchGenerator(songList, batch_size), epochs=250, verbose=1, shuffle=False,
                    steps_per_epoch=math.ceil(len(songList) / batch_size))
pickleSave('kerasTrained.pickle', model)
print("Saved")

works perfectly.


3 of the GPUs are Nvidia 1060 3GB and 1 is a 6GB, and the system has about 4GB of memory (although I doubt that's the issue since I'm using a generator).

Answer


Keras can run the computation on all 4 GPUs while the template model itself is instantiated on the CPU. You can try the code below. For more information, see the TensorFlow documentation: https://www.tensorflow.org/api_docs/python/tf/keras/utils/multi_gpu_model

import math
import tensorflow as tf
from keras.models import Sequential
from keras.layers import Masking, LSTM, Dropout, Dense
from keras.optimizers import RMSprop
from keras.utils import multi_gpu_model

batch_size = 8

def create_model():
    model = Sequential()
    model.add(Masking(mask_value=0., input_shape=(len(inputData[0]), len(inputData[0][0]))))
    model.add(LSTM(256, return_sequences=True))
    model.add(Dropout(.2))
    model.add(LSTM(128, return_sequences=True))
    model.add(Dropout(.2))
    model.add(LSTM(128, return_sequences=True))
    model.add(Dropout(.2))
    model.add(Dense(len(outputData[0][0]), activation='softmax'))
    return model


# we'll store a copy of the model on *every* GPU and then combine
# the results from the gradient updates on the CPU
# initialize the template model on the CPU
with tf.device("/cpu:0"):
    model = create_model()

# make the model parallel across the 4 GPUs
p_model = multi_gpu_model(model, gpus=4)

rms = RMSprop()
p_model.compile(loss='categorical_crossentropy', optimizer=rms, metrics=['categorical_accuracy'])

print("Fitting")
p_model.fit_generator(songBatchGenerator(songList, batch_size), epochs=250, verbose=1, shuffle=False,
                      steps_per_epoch=math.ceil(len(songList) / batch_size))
pickleSave('kerasTrained.pickle', p_model)
print("Saved")
