OOM when allocating tensor

Problem Description

How do I solve the problem of ResourceExhaustedError: OOM when allocating tensor?

ResourceExhaustedError (see above for traceback): OOM when allocating tensor with shape[10000,32,28,28]

I have included nearly all of the code:

import tensorflow as tf
from tensorflow.examples.tutorials.mnist import input_data

# load the MNIST data set (one-hot labels; the data directory is illustrative)
mnist = input_data.read_data_sets("MNIST_data/", one_hot=True)

learning_rate = 0.0001
epochs = 10
batch_size = 50

# declare the training data placeholders
# input x - for 28 x 28 pixels = 784 - this is the flattened image data that is drawn from
# mnist.train.nextbatch()
x = tf.placeholder(tf.float32, [None, 784])
# dynamically reshape the input
x_shaped = tf.reshape(x, [-1, 28, 28, 1])
# now declare the output data placeholder - 10 digits
y = tf.placeholder(tf.float32, [None, 10])
def create_new_conv_layer(input_data, num_input_channels, num_filters, filter_shape, pool_shape, name):
    # setup the filter input shape for tf.nn.conv_2d
    conv_filt_shape = [filter_shape[0], filter_shape[1], num_input_channels,
                      num_filters]

    # initialise weights and bias for the filter
    weights = tf.Variable(tf.truncated_normal(conv_filt_shape, stddev=0.03),
                                      name=name+'_W')
    bias = tf.Variable(tf.truncated_normal([num_filters]), name=name+'_b')

    # setup the convolutional layer operation
    out_layer = tf.nn.conv2d(input_data, weights, [1, 1, 1, 1], padding='SAME')

    # add the bias
    out_layer += bias

    # apply a ReLU non-linear activation
    out_layer = tf.nn.relu(out_layer)

    # now perform max pooling
    ksize = [1, pool_shape[0], pool_shape[1], 1]
    strides = [1, pool_shape[0], pool_shape[1], 1]
    out_layer = tf.nn.max_pool(out_layer, ksize=ksize, strides=strides,
                               padding='SAME')

    return out_layer
# create some convolutional layers
layer1 = create_new_conv_layer(x_shaped, 1, 32, [5, 5], [2, 2], name='layer1')
layer2 = create_new_conv_layer(layer1, 32, 64, [5, 5], [2, 2], name='layer2')

flattened = tf.reshape(layer2, [-1, 7 * 7 * 64])

# setup some weights and bias values for this layer, then activate with ReLU
wd1 = tf.Variable(tf.truncated_normal([7 * 7 * 64, 1000], stddev=0.03), name='wd1')
bd1 = tf.Variable(tf.truncated_normal([1000], stddev=0.01), name='bd1')
dense_layer1 = tf.matmul(flattened, wd1) + bd1
dense_layer1 = tf.nn.relu(dense_layer1)

# another layer with softmax activations
wd2 = tf.Variable(tf.truncated_normal([1000, 10], stddev=0.03), name='wd2')
bd2 = tf.Variable(tf.truncated_normal([10], stddev=0.01), name='bd2')
dense_layer2 = tf.matmul(dense_layer1, wd2) + bd2
y_ = tf.nn.softmax(dense_layer2)
cross_entropy = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits=dense_layer2, labels=y))


# add an optimiser
optimiser = tf.train.AdamOptimizer(learning_rate=learning_rate).minimize(cross_entropy)

# define an accuracy assessment operation
correct_prediction = tf.equal(tf.argmax(y, 1), tf.argmax(y_, 1))
accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))

# setup the initialisation operator
init_op = tf.global_variables_initializer() 



with tf.Session() as sess:
    # initialise the variables
    sess.run(init_op)
    total_batch = int(len(mnist.train.labels) / batch_size)
    for epoch in range(epochs):
        avg_cost = 0
        for i in range(total_batch):
            batch_x, batch_y = mnist.train.next_batch(batch_size=batch_size)
            _, c = sess.run([optimiser, cross_entropy],
                            feed_dict={x: batch_x, y: batch_y})
            avg_cost += c / total_batch
        test_acc = sess.run(accuracy,
                            feed_dict={x: mnist.test.images, y: mnist.test.labels})
        print("Epoch:", (epoch + 1), "cost =", "{:.3f}".format(avg_cost),
              "test accuracy: {:.3f}".format(test_acc))

    print("\nTraining complete!")
    print(sess.run(accuracy, feed_dict={x: mnist.test.images, y: mnist.test.labels}))

The lines referenced in the error are the create_new_conv_layer function and the sess.run call in the training loop.

More errors I copied from the debugger output are listed below (there were more lines, but I think these are the main ones and the others are caused by them):

tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[10000,32,28,28] [[Node: Conv2D = Conv2D[T=DT_FLOAT, data_format="NHWC", padding="SAME", strides=[1, 1, 1, 1], use_cudnn_on_gpu=true, _device="/job:localhost/replica:0/task:0/gpu:0"](Reshape, layer1_W/read)]]

The second time I ran it, it issued the following error. I have both a CPU and a GPU, as can be seen in the output below. I understand that some of the CPU-related warnings are probably because my TensorFlow wasn't compiled to use those features. I installed CUDA 8 and cuDNN 6, Python 3.5, and TensorFlow 1.3.0 on Windows 10.

2017-10-03 03:53:58.944371: W C:\tf_jenkins\home\workspace\rel-win\M\windows-gpu\PY\35\tensorflow\core\platform\cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX instructions, but these are available on your machine and could speed up CPU computations.
2017-10-03 03:53:58.945563: W C:\tf_jenkins\home\workspace\rel-win\M\windows-gpu\PY\35\tensorflow\core\platform\cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX2 instructions, but these are available on your machine and could speed up CPU computations.
2017-10-03 03:53:59.230761: I C:\tf_jenkins\home\workspace\rel-win\M\windows-gpu\PY\35\tensorflow\core\common_runtime\gpu\gpu_device.cc:955] Found device 0 with properties:
name: Quadro K620 major: 5 minor: 0 memoryClockRate (GHz) 1.124
pciBusID 0000:01:00.0
Total memory: 2.00GiB
Free memory: 1.66GiB
2017-10-03 03:53:59.231109: I C:\tf_jenkins\home\workspace\rel-win\M\windows-gpu\PY\35\tensorflow\core\common_runtime\gpu\gpu_device.cc:976] DMA: 0
2017-10-03 03:53:59.231229: I C:\tf_jenkins\home\workspace\rel-win\M\windows-gpu\PY\35\tensorflow\core\common_runtime\gpu\gpu_device.cc:986] 0: Y
2017-10-03 03:53:59.231363: I C:\tf_jenkins\home\workspace\rel-win\M\windows-gpu\PY\35\tensorflow\core\common_runtime\gpu\gpu_device.cc:1045] Creating TensorFlow device (/gpu:0) -> (device: 0, name: Quadro K620, pci bus id: 0000:01:00.0)
2017-10-03 03:54:01.511141: E C:\tf_jenkins\home\workspace\rel-win\M\windows-gpu\PY\35\tensorflow\stream_executor\cuda\cuda_dnn.cc:371] could not create cudnn handle: CUDNN_STATUS_NOT_INITIALIZED
2017-10-03 03:54:01.511372: E C:\tf_jenkins\home\workspace\rel-win\M\windows-gpu\PY\35\tensorflow\stream_executor\cuda\cuda_dnn.cc:375] error retrieving driver version: Unimplemented: kernel reported driver version not implemented on Windows
2017-10-03 03:54:01.511862: E C:\tf_jenkins\home\workspace\rel-win\M\windows-gpu\PY\35\tensorflow\stream_executor\cuda\cuda_dnn.cc:338] could not destroy cudnn handle: CUDNN_STATUS_BAD_PARAM
2017-10-03 03:54:01.512074: F C:\tf_jenkins\home\workspace\rel-win\M\windows-gpu\PY\35\tensorflow\core\kernels\conv_ops.cc:672] Check failed: stream->parent()->GetConvolveAlgorithms(conv_parameters.ShouldIncludeWinogradNonfusedAlgo(), &algorithms)

Recommended Answer

The process failed with out-of-memory (OOM) because you pushed the whole test set through for evaluation at once (see this question). It's easy to see that 10000 * 32 * 28 * 28 * 4 bytes is almost 1 GB, while your GPU has only 1.66 GB available in total, and most of that is already taken by the network itself.
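As a quick sanity check of that number (pure arithmetic, not part of the original answer), the first conv layer's output for the full test set is 10000 images x 32 filters x a 28 x 28 feature map, in float32:

# bytes needed to hold the first conv layer's activations for all 10000 test images
print(10000 * 32 * 28 * 28 * 4)          # 1003520000 bytes
print(10000 * 32 * 28 * 28 * 4 / 2**30)  # ~0.93 GiB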

The solution is to feed the neural network in batches not only for training, but for testing as well; the resulting accuracy is then the average across all batches (a sketch follows below). Moreover, you don't need to do this after every epoch: are you really interested in the test results of every intermediate network?
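For example, a minimal sketch of batched test evaluation, to run inside the training session in place of the single full-test-set sess.run call. The eval_batch_size name and the value 500 are illustrative choices, not from the original code:

# Evaluate accuracy over the test set in chunks instead of all at once.
# Any eval_batch_size that fits in GPU memory works; a value that divides
# 10000 evenly keeps the averaged accuracy exact.
eval_batch_size = 500
num_test_batches = int(len(mnist.test.labels) / eval_batch_size)

test_acc = 0.0
for j in range(num_test_batches):
    start = j * eval_batch_size
    end = start + eval_batch_size
    test_acc += sess.run(accuracy,
                         feed_dict={x: mnist.test.images[start:end],
                                    y: mnist.test.labels[start:end]})
test_acc /= num_test_batches  # average accuracy across all test batches
print("test accuracy: {:.3f}".format(test_acc))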

Your second error message is very likely a result of the previous failure, because the cuDNN driver doesn't seem to work anymore. I'd suggest restarting your machine.
