Error using TensorFlow with GPU

Views: 61 | Published: 2021/12/9 22:54:33 | Tags: gpgpu, tensorflow

Question

I've tried a bunch of different TensorFlow examples. They work fine on the CPU but all produce the same error when I try to run them on the GPU. One small example is this:

import tensorflow as tf

# Creates a graph.
a = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[2, 3], name='a')
b = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[3, 2], name='b')
c = tf.matmul(a, b)
# Creates a session with log_device_placement set to True.
sess = tf.Session(config=tf.ConfigProto(log_device_placement=True))
# Runs the op.
print(sess.run(c))

The error is always the same, CUDA_ERROR_OUT_OF_MEMORY:

I tensorflow/stream_executor/dso_loader.cc:101] successfully opened CUDA library libcublas.so.7.0 locally
I tensorflow/stream_executor/dso_loader.cc:101] successfully opened CUDA library libcudnn.so.6.5 locally
I tensorflow/stream_executor/dso_loader.cc:101] successfully opened CUDA library libcufft.so.7.0 locally
I tensorflow/stream_executor/dso_loader.cc:101] successfully opened CUDA library libcuda.so locally
I tensorflow/stream_executor/dso_loader.cc:101] successfully opened CUDA library libcurand.so.7.0 locally
I tensorflow/core/common_runtime/local_device.cc:40] Local device intra op parallelism threads: 24
I tensorflow/core/common_runtime/gpu/gpu_init.cc:103] Found device 0 with properties: 
name: Tesla K80
major: 3 minor: 7 memoryClockRate (GHz) 0.8235
pciBusID 0000:0a:00.0
Total memory: 11.25GiB
Free memory: 105.73MiB
I tensorflow/core/common_runtime/gpu/gpu_init.cc:103] Found device 1 with properties: 
name: Tesla K80
major: 3 minor: 7 memoryClockRate (GHz) 0.8235
pciBusID 0000:0b:00.0
Total memory: 11.25GiB
Free memory: 133.48MiB
I tensorflow/core/common_runtime/gpu/gpu_init.cc:127] DMA: 0 1 
I tensorflow/core/common_runtime/gpu/gpu_init.cc:137] 0:   Y Y 
I tensorflow/core/common_runtime/gpu/gpu_init.cc:137] 1:   Y Y 
I tensorflow/core/common_runtime/gpu/gpu_device.cc:702] Creating TensorFlow device (/gpu:0) -> (device: 0, name: Tesla K80, pci bus id: 0000:0a:00.0)
I tensorflow/core/common_runtime/gpu/gpu_device.cc:702] Creating TensorFlow device (/gpu:1) -> (device: 1, name: Tesla K80, pci bus id: 0000:0b:00.0)
I tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:42] Allocating 105.48MiB bytes.
E tensorflow/stream_executor/cuda/cuda_driver.cc:932] failed to allocate 105.48M (110608384 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY
F tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:47] Check failed: gpu_mem != nullptr  Could not allocate GPU device memory for device 0. Tried to allocate 105.48MiB
Aborted (core dumped)

I guess the problem has to do with my configuration rather than with the memory usage of this tiny example. Does anyone have any idea?

Update: I've found out that the problem may be as simple as someone else running a job on the same GPU, which would explain the small amount of free memory. In that case: sorry for taking up your time...

Answer

There appear to be two issues here:

1. By default, TensorFlow allocates a large fraction (95%) of the available GPU memory (on each GPU device) when you create a tf.Session. It uses a heuristic that reserves 200MB of GPU memory for "system" uses, but doesn't set this aside if the amount of free memory is smaller than that.

2. It looks like you have very little free GPU memory on either of your GPU devices (105.73MiB and 133.48MiB). This means that TensorFlow will attempt to allocate memory that should probably be reserved for the system, and hence the allocation fails.

Is it possible that you have another TensorFlow process (or some other GPU-hungry code) running while you attempt to run this program? For example, a Python interpreter with an open session will attempt to allocate almost the entire GPU memory, even if it is not actively using the GPU.
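One way to check whether something else is already holding the GPU memory is to ask the driver before starting TensorFlow. The sketch below is only an illustration, assuming the `nvidia-smi` tool is installed and on the PATH; the helper names `parse_gpu_memory` and `query_gpu_memory` are made up for this example and are not part of TensorFlow.

```python
import subprocess

# nvidia-smi query that prints one CSV line per GPU: index, used MiB, total MiB.
QUERY = [
    "nvidia-smi",
    "--query-gpu=index,memory.used,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_gpu_memory(csv_text):
    """Parse 'index, used, total' lines (values in MiB) into a list of dicts."""
    rows = []
    for line in csv_text.strip().splitlines():
        index, used, total = [int(field) for field in line.split(",")]
        rows.append({
            "index": index,
            "used_mib": used,
            "total_mib": total,
            "free_mib": total - used,
        })
    return rows


def query_gpu_memory():
    """Run nvidia-smi (assumed to be installed) and return per-GPU memory usage."""
    output = subprocess.check_output(QUERY).decode("utf-8")
    return parse_gpu_memory(output)
```

Calling `query_gpu_memory()` on the affected machine and finding only around 100 MiB free on each device would point to another process occupying the GPUs, which matches the "Free memory" lines in the log above.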

Currently, the only way to restrict the amount of GPU memory that TensorFlow uses is the following configuration option (from a related question):

import tensorflow as tf

# Assume that you have 12GB of GPU memory and want to allocate ~4GB:
gpu_options = tf.GPUOptions(per_process_gpu_memory_fraction=0.333)

sess = tf.Session(config=tf.ConfigProto(gpu_options=gpu_options))
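The 0.333 above is just the target size divided by the total memory of the card (4 / 12). A minimal sketch of that arithmetic follows, using the same tf.Session / tf.GPUOptions API as the answer; the helper names `memory_fraction` and `make_limited_session` are hypothetical and exist only for illustration.

```python
def memory_fraction(target_gib, total_gib):
    """Return the fraction of total GPU memory that corresponds to target_gib."""
    if total_gib <= 0:
        raise ValueError("total_gib must be positive")
    if not 0 < target_gib <= total_gib:
        raise ValueError("target_gib must be in (0, total_gib]")
    return target_gib / float(total_gib)


def make_limited_session(target_gib, total_gib):
    """Create a session capped at roughly target_gib of memory per GPU."""
    # Imported lazily so the arithmetic helper can be used without TensorFlow.
    import tensorflow as tf

    fraction = memory_fraction(target_gib, total_gib)
    gpu_options = tf.GPUOptions(per_process_gpu_memory_fraction=fraction)
    return tf.Session(config=tf.ConfigProto(gpu_options=gpu_options))
```

For example, `make_limited_session(4, 12)` would request about a third of each device. The cap does not help when the device is already nearly full, as in the question: the requested fraction still has to be free at the time the session is created.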
