Resource exhausted: OOM when allocating tensor only on GPU


Problem Description

I'm trying to run several different ML architectures, all vanilla, without any modification (git clone -> python train.py),
but the result is always the same: a segmentation fault, or Resource exhausted: OOM when allocating tensor.
When running only on my CPU, the program finishes successfully.
I'm running the session with

    import tensorflow as tf

    config = tf.ConfigProto()
    config.gpu_options.per_process_gpu_memory_fraction = 0.33
    config.gpu_options.allow_growth = True
    config.allow_soft_placement = True
    config.log_device_placement = True
    sess = tf.Session(config=config)

However, the result is

2019-03-11 20:23:26.845851: W tensorflow/core/common_runtime/bfc_allocator.cc:271] ***************************************************************x**********____**********____**_____*
2019-03-11 20:23:26.845885: W tensorflow/core/framework/op_kernel.cc:1273] OP_REQUIRES failed at cwise_ops_common.cc:70 : Resource exhausted: OOM when allocating tensor with shape[32,128,1024,40] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
Traceback (most recent call last):

2019-03-11 20:23:16.841149: W tensorflow/core/common_runtime/bfc_allocator.cc:211] Allocator (GPU_0_bfc) ran out of memory trying to allocate 1.59GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2019-03-11 20:23:16.841191: W tensorflow/core/common_runtime/bfc_allocator.cc:211] Allocator (GPU_0_bfc) ran out of memory trying to allocate 1.59GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2019-03-11 20:23:26.841486: W tensorflow/core/common_runtime/bfc_allocator.cc:267] Allocator (GPU_0_bfc) ran out of memory trying to allocate 640.00MiB.  Current allocation summary follows.
2019-03-11 20:23:26.841566: I tensorflow/core/common_runtime/bfc_allocator.cc:597] Bin (256):   Total Chunks: 195, Chunks in use: 195. 48.8KiB allocated for chunks. 48.8KiB in use in bin. 23.3KiB client-requested in use in bin.

ResourceExhaustedError (see above for traceback): OOM when allocating tensor with shape[32,128,1024,40] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
     [[node transform_net1/tconv2/bn/moments/SquaredDifference (defined at /home/dvir/CLionProjects/gml/Dvir/FlexKernels/utils/tf_util.py:504)  = SquaredDifference[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:GPU:0"](transform_net1/tconv2/BiasAdd, transform_net1/tconv2/bn/moments/mean)]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
     [[{{node div/_113}} = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_1730_div", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
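
As the hint in the log suggests, adding report_tensor_allocations_upon_oom to RunOptions makes the next OOM error include a list of the tensors that were allocated at the time. A minimal sketch of that for the TF 1.x Session API (train_op and feed_dict are placeholder names, not from the original post):

    # Ask TensorFlow to report live tensor allocations when an OOM occurs
    run_options = tf.RunOptions(report_tensor_allocations_upon_oom=True)
    sess.run(train_op, feed_dict=feed_dict, options=run_options)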

I'm running

tensorflow-gpu 1.12
tensorflow 1.13

The GPU is

GeForce RTX 2080TI

The model is Dynamic Graph CNN for Learning on Point Clouds, and it was tested successfully on another machine with a 1080 Ti.

Recommended Answer

For TensorFlow 2.2.0, this script works:

    import tensorflow as tf

    physical_devices = tf.config.list_physical_devices('GPU')
    if physical_devices:
        # Grow GPU memory on demand and cap usage at 4000 MB on the first GPU
        tf.config.experimental.set_memory_growth(physical_devices[0], enable=True)
        tf.config.experimental.set_virtual_device_configuration(
            physical_devices[0], [tf.config.experimental.VirtualDeviceConfiguration(memory_limit=4000)])
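
As a quick sanity check (a minimal sketch assuming the TF 2.x tf.config.list_logical_devices API; not part of the original answer), you can confirm that the capped GPU shows up as a single logical device:

    # The memory-limited virtual device appears as one logical GPU
    logical_gpus = tf.config.list_logical_devices('GPU')
    print(len(logical_gpus), "logical GPU(s) configured")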

https://stackoverflow.com/a/63123354/5884380
