Using multiple_gpu_model on keras - causing resource exhaustion


Question

I built my network the following way:

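import tensorflow as tf
from keras import optimizers
from keras.callbacks import EarlyStopping, ModelCheckpoint
from keras.layers import Input, Lambda, Conv2D, MaxPooling2D, Conv2DTranspose, concatenate
from keras.models import Model
from keras.utils import multi_gpu_model

# IMG_HEIGHT, IMG_WIDTH, IMG_CHANNELS, mean_iou, X_train and Y_train are defined elsewhere (not shown).

# Build U-Net model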
inputs = Input((IMG_HEIGHT, IMG_WIDTH, IMG_CHANNELS))
s = Lambda(lambda x: x / 255) (inputs)
width = 64
c1 = Conv2D(width, (3, 3), activation='relu', padding='same') (s)
c1 = Conv2D(width, (3, 3), activation='relu', padding='same') (c1)
p1 = MaxPooling2D((2, 2)) (c1)

c2 = Conv2D(width*2, (3, 3), activation='relu', padding='same') (p1)
c2 = Conv2D(width*2, (3, 3), activation='relu', padding='same') (c2)
p2 = MaxPooling2D((2, 2)) (c2)

c3 = Conv2D(width*4, (3, 3), activation='relu', padding='same') (p2)
c3 = Conv2D(width*4, (3, 3), activation='relu', padding='same') (c3)
p3 = MaxPooling2D((2, 2)) (c3)

c4 = Conv2D(width*8, (3, 3), activation='relu', padding='same') (p3)
c4 = Conv2D(width*8, (3, 3), activation='relu', padding='same') (c4)
p4 = MaxPooling2D(pool_size=(2, 2)) (c4)

c5 = Conv2D(width*16, (3, 3), activation='relu', padding='same') (p4)
c5 = Conv2D(width*16, (3, 3), activation='relu', padding='same') (c5)

u6 = Conv2DTranspose(width*8, (2, 2), strides=(2, 2), padding='same') (c5)
u6 = concatenate([u6, c4])
c6 = Conv2D(width*8, (3, 3), activation='relu', padding='same') (u6)
c6 = Conv2D(width*8, (3, 3), activation='relu', padding='same') (c6)

u7 = Conv2DTranspose(width*4, (2, 2), strides=(2, 2), padding='same') (c6)
u7 = concatenate([u7, c3])
c7 = Conv2D(width*4, (3, 3), activation='relu', padding='same') (u7)
c7 = Conv2D(width*4, (3, 3), activation='relu', padding='same') (c7)

u8 = Conv2DTranspose(width*2, (2, 2), strides=(2, 2), padding='same') (c7)
u8 = concatenate([u8, c2])
c8 = Conv2D(width*2, (3, 3), activation='relu', padding='same') (u8)
c8 = Conv2D(width*2, (3, 3), activation='relu', padding='same') (c8)

u9 = Conv2DTranspose(width, (2, 2), strides=(2, 2), padding='same') (c8)
u9 = concatenate([u9, c1], axis=3)
c9 = Conv2D(width, (3, 3), activation='relu', padding='same') (u9)
c9 = Conv2D(width, (3, 3), activation='relu', padding='same') (c9)

outputs = Conv2D(1, (1, 1), activation='sigmoid') (c9)
with tf.device('/cpu:0'):
    model = Model(inputs=[inputs], outputs=[outputs])

sgd = optimizers.SGD(lr=0.03, decay=1e-6, momentum=0.9, nesterov=True)
parallel_model = multi_gpu_model(model, gpus=8)
parallel_model.compile(optimizer=sgd, loss='binary_crossentropy', metrics=[mean_iou])
model.summary()

Note that I am instantiating the base model on the CPU, as suggested by the Keras documentation. I then run the network using the following lines:

# Fit model
earlystopper = EarlyStopping(patience=20, verbose=1)
checkpointer = ModelCheckpoint('test.h5', verbose=1, save_best_only=True)
results = parallel_model.fit(X_train, Y_train, validation_split=0.05, batch_size=256, verbose=1, epochs=100,
                             callbacks=[earlystopper, checkpointer])

However, even though I am using multi_gpu_model, my code still results in the following error:

OOM when allocating tensor with shape[32,128,256,256] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc

This seems to indicate that the network is trying to run the batch size of 256 on just a single GPU instead of 8. Am I not implementing this properly? Do I need to use Xception as in the example?

Recommended answer

The first dimension of the tensor is the batch size, so everything is fine in your case. You specified a batch_size of 256 and you are using 8 GPUs, so each GPU receives a batch of 32, which is exactly what the shape in the error shows. The error also suggests that your model is still too big for your GPUs to handle, even with a batch size of 32 per GPU.
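
The arithmetic behind this can be checked with a few lines of plain Python. This is only a rough sketch: the helper names (per_gpu_batch, tensor_bytes) are made up for illustration, it assumes the batch is split evenly across the GPUs, and it assumes "type float" in the error means 4-byte float32. It sizes only the single tensor named in the error message; the real footprint per GPU also includes every other activation, the gradients and the weights, so it says nothing definite about how much memory the GPUs would need in total.

```python
from functools import reduce
from operator import mul

BYTES_PER_FLOAT32 = 4  # assumption: "type float" in the error is float32


def per_gpu_batch(global_batch, n_gpus):
    """Sub-batch each GPU receives if the global batch is split evenly."""
    return global_batch // n_gpus


def tensor_bytes(shape, bytes_per_element=BYTES_PER_FLOAT32):
    """Memory needed to hold one dense tensor of the given shape."""
    return reduce(mul, shape, 1) * bytes_per_element


# Values taken from the question: batch_size=256, gpus=8.
sub_batch = per_gpu_batch(256, 8)

# Shape reported in the OOM message; its first dimension matches sub_batch.
failed_shape = (sub_batch, 128, 256, 256)
size_gib = tensor_bytes(failed_shape) / 2**30

print(sub_batch, size_gib)  # prints: 32 1.0
```

So the one tensor that failed to allocate already takes 1 GiB on a single GPU. By the same calculation, passing batch_size=64 to fit would give 8 samples per GPU and shrink this particular tensor to 0.25 GiB; whether that is enough for the whole model depends on the GPUs' memory, which the question does not state.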
