tensorflow GPU crashes for 0 batch size CUDNN_STATUS_BAD_PARAM


Problem description

This issue seems to have existed for a long time and lots of users are facing it.

stream_executor/cuda/cuda_dnn.cc:444] could not convert BatchDescriptor {count: 0 feature_map_count: 64 spatial: 7 264 value_min: 0.000000 value_max: 0.000000 layout: BatchDepthYX} to cudnn tensor descriptor: CUDNN_STATUS_BAD_PARAM

The message is so mysterious that I do not know what happened in my code; however, my code works fine on CPU TensorFlow.

I heard that we can use tf.cond to get around this, but I'm new to tensorflow-gpu, so can someone please help me? My code uses Keras and takes a generator as input in order to avoid out-of-memory issues. The generator is built around a while True loop that yields data in batches of some fixed size (see the sketch after the model code below).

import numpy as np
from keras.layers import (Input, Reshape, Conv2D, MaxPooling2D, BatchNormalization,
                          Activation, add, Flatten, Dense, Dropout)
from keras.models import Model
from keras.optimizers import SGD

def resnet_model(bin_multiple):
    #input and reshape
    inputs = Input(shape=input_shape)
    reshape = Reshape(input_shape_channels)(inputs)
    #normal convnet layer (have to do one initially to get 64 channels)
    conv = Conv2D(64,(1,bin_multiple*note_range),padding="same",activation='relu')(reshape)
    pool = MaxPooling2D(pool_size=(1,2))(conv)
    for i in range(int(np.log2(bin_multiple))-1):
        print( i)
        #residual block
        bn = BatchNormalization()(pool)
        re = Activation('relu')(bn)
        freq_range = int((bin_multiple/(2**(i+1)))*note_range)
        print(freq_range)
        conv = Conv2D(64,(1,freq_range),padding="same",activation='relu')(re)
        #add and downsample
        ad = add([pool,conv])
        pool = MaxPooling2D(pool_size=(1,2))(ad)
    flattened = Flatten()(pool)
    fc = Dense(1024, activation='relu')(flattened)
    do = Dropout(0.5)(fc)
    fc = Dense(512, activation='relu')(do)
    do = Dropout(0.5)(fc)
    outputs = Dense(note_range, activation='sigmoid')(do)
    model = Model(inputs=inputs, outputs=outputs)
    return model

model = resnet_model(bin_multiple)
init_lr = float(args['init_lr'])
model.compile(loss='binary_crossentropy',
              optimizer=SGD(lr=init_lr, momentum=0.9),
              metrics=['accuracy', 'mae', 'categorical_accuracy'])
model.summary()
history = model.fit_generator(trainGen.next(), trainGen.steps(), epochs=epochs,
                              verbose=1, validation_data=valGen.next(),
                              validation_steps=valGen.steps(), callbacks=callbacks,
                              workers=8, use_multiprocessing=True)
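
trainGen and valGen are not shown in the question; a minimal sketch of the kind of while-True batch generator it describes might look like the class below. The name BatchGenerator and the in-memory x/y arrays are assumptions for illustration, not the asker's actual code.

import numpy as np

class BatchGenerator:
    """Simplified stand-in for trainGen/valGen: loops over (x, y) in fixed-size batches."""

    def __init__(self, x, y, batch_size):
        self.x, self.y, self.batch_size = x, y, batch_size

    def steps(self):
        # Number of batches needed to cover the data once; note the last one may be short.
        return int(np.ceil(len(self.x) / self.batch_size))

    def next(self):
        # The while-True loop described in the question: yields batches forever.
        while True:
            for start in range(0, len(self.x), self.batch_size):
                stop = start + self.batch_size
                yield self.x[start:stop], self.y[start:stop]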


Recommended answer

The problem occurs when your model receives a batch size of 0. In my case I hit the error because I had 1000 examples and ran on multiple GPUs (2 GPUs) with a batch size of 32. In my graph I split each batch into mini-batches so that each GPU took 16 examples. At step 31 (31 * 32) I had already consumed 992 examples, so only 8 examples were left; they went to GPU 1 and GPU 2 ended up with a batch size of zero, which is why I got the error above.
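
To make the failure mode concrete, here is the arithmetic for the case described above (1000 examples, batch size 32, 2 GPUs); the shard computation is a simplified stand-in for whatever splitting the multi-GPU graph actually performs:

# Reproducing the split described above: 1000 examples, batch size 32, 2 GPUs.
num_examples, batch_size, num_gpus = 1000, 32, 2

shard_size = batch_size // num_gpus                        # 16 examples per GPU on a full batch
full_steps, last_batch = divmod(num_examples, batch_size)  # 31 full steps, 8 examples left over

# The graph slices every incoming batch into fixed per-GPU shards, so the short
# final batch of 8 fills only the first shard and leaves the second one empty.
shards = [min(max(last_batch - g * shard_size, 0), shard_size) for g in range(num_gpus)]
print(full_steps, last_batch, shards)                      # 31 8 [8, 0]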

I still couldn't solve it and am still searching for a proper solution. I hope this helps you discover where in your code you receive a zero batch size.
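
One pragmatic way to avoid ever emitting the short final batch (and with it the empty per-GPU shard) is to have the generator wrap around the dataset so that every batch has exactly batch_size examples. This is only a sketch under that assumption, not a fix from the original thread; x and y stand for in-memory NumPy arrays, whereas the asker's real generator loads data lazily.

import numpy as np

def full_batch_generator(x, y, batch_size):
    """Yield (x_batch, y_batch) forever, always with exactly batch_size rows.

    Wrapping the index past the end of the dataset means the short tail is never
    yielded on its own, so no GPU ever receives a batch of size 0.
    """
    n = len(x)
    start = 0
    while True:
        take = np.arange(start, start + batch_size) % n  # indices wrap past the end
        yield x[take], y[take]
        start = (start + batch_size) % n

# steps_per_epoch = int(np.ceil(len(x) / batch_size)) still covers every example once;
# the final wrapped batch simply reuses a few examples from the start of the data.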
