张量流中的 CUDA_ERROR_OUT_OF_MEMORY [英] CUDA_ERROR_OUT_OF_MEMORY in tensorflow

查看:46
本文介绍了张量流中的 CUDA_ERROR_OUT_OF_MEMORY的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

当我开始训练某个神经网络时,它遇到了CUDA_ERROR_OUT_OF_MEMORY,但训练可以继续进行而不会出错.因为我想使用真正需要的gpu内存,所以我设置了gpu_options.allow_growth = True.日志如下:

When I started to train some neural network, it met the CUDA_ERROR_OUT_OF_MEMORY but the training could go on without error. Because I wanted to use gpu memory as it really needs, so I set the gpu_options.allow_growth = True.The logs are as follows:

I tensorflow/stream_executor/dso_loader.cc:111] successfully opened CUDA library libcublas.so locally
I tensorflow/stream_executor/dso_loader.cc:111] successfully opened CUDA library libcudnn.so locally
I tensorflow/stream_executor/dso_loader.cc:111] successfully opened CUDA library libcufft.so locally
I tensorflow/stream_executor/dso_loader.cc:111] successfully opened CUDA library libcuda.so.1 locally
I tensorflow/stream_executor/dso_loader.cc:111] successfully opened CUDA library libcurand.so locally
I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:925] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
I tensorflow/core/common_runtime/gpu/gpu_device.cc:951] Found device 0 with properties:
name: GeForce GTX 1080
major: 6 minor: 1 memoryClockRate (GHz) 1.7335
pciBusID 0000:01:00.0
Total memory: 7.92GiB
Free memory: 7.81GiB
I tensorflow/core/common_runtime/gpu/gpu_device.cc:972] DMA: 0
I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] 0:   Y
I tensorflow/core/common_runtime/gpu/gpu_device.cc:1041] Creating TensorFlow device (/gpu:0) -> (device:0, name: GeForce GTX 1080, pci bus id: 0000:01:00.0)
E tensorflow/stream_executor/cuda/cuda_driver.cc:965] failed to allocate 4.00G (4294967296 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY
Iter 20, Minibatch Loss= 40491.636719
...

使用nvidia-smi命令后,得到:

+-----------------------------------------------------------------------------+   
| NVIDIA-SMI 367.27                 Driver Version: 367.27                            
|-------------------------------+----------------------+----------------------+   
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |  
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M.
|===============================+======================+======================|
|   0  GeForce GTX 1080    Off  | 0000:01:00.0     Off |                  N/A |   
| 40%   61C    P2    46W / 180W |   8107MiB /  8111MiB |     96%      Default |   
+-------------------------------+----------------------+----------------------+   
|   1  GeForce GTX 1080    Off  | 0000:02:00.0     Off |                  N/A |   
|  0%   40C    P0    40W / 180W |      0MiB /  8113MiB |      0%      Default |   
+-------------------------------+----------------------+----------------------+   
                                                                              │
+-----------------------------------------------------------------------------+   
| Processes:                                                       GPU Memory |   
|  GPU       PID  Type  Process name                               Usage      |   
|=============================================================================|   
|    0     22932    C   python                                        8105MiB |
+-----------------------------------------------------------------------------+ 

在我评论gpu_options.allow_growth = True后,我再次训练网络,一切正常.没有CUDA_ERROR_OUT_OF_MEMORY的问题.最后,运行 nvidia-smi 命令,得到:

After I commented the gpu_options.allow_growth = True, I trained the net again and everything was normal. There was no the problem of CUDA_ERROR_OUT_OF_MEMORY. Finally, ran the nvidia-smi command, it gets:

+-----------------------------------------------------------------------------+   
| NVIDIA-SMI 367.27                 Driver Version: 367.27                            
|-------------------------------+----------------------+----------------------+   
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |  
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M.
|===============================+======================+======================|
|   0  GeForce GTX 1080    Off  | 0000:01:00.0     Off |                  N/A |   
| 40%   61C    P2    46W / 180W |   7793MiB /  8111MiB |     99%      Default |   
+-------------------------------+----------------------+----------------------+   
|   1  GeForce GTX 1080    Off  | 0000:02:00.0     Off |                  N/A |   
|  0%   40C    P0    40W / 180W |      0MiB /  8113MiB |      0%      Default |   
+-------------------------------+----------------------+----------------------+   
                                                                              │
+-----------------------------------------------------------------------------+   
| Processes:                                                       GPU Memory |   
|  GPU       PID  Type  Process name                               Usage      |   
|=============================================================================|   
|    0     22932    C   python                                        7791MiB |
+-----------------------------------------------------------------------------+ 

我有两个问题.为什么CUDA_OUT_OF_MEMORY出来,程序正常进行?为什么注释allow_growth = True后内存使用量变小了.

I have two questions about it. Why did the CUDA_OUT_OF_MEMORY come out and the procedure went on normally? why did the memory usage become smaller after commenting allow_growth = True.

推荐答案

如果它仍然与某人相关,我在第一次运行中止后第二次尝试运行 Keras/Tensorflow 时遇到了这个问题.似乎GPU内存仍在分配,因此无法再次分配.解决方法是手动结束所有使用 GPU 的 Python 进程,或者关闭现有终端并在新的终端窗口中再次运行.

In case it's still relevant for someone, I encountered this issue when trying to run Keras/Tensorflow for the second time, after a first run was aborted. It seems the GPU memory is still allocated, and therefore cannot be allocated again. It was solved by manually ending all python processes that use the GPU, or alternatively, closing the existing terminal and running again in a new terminal window.

这篇关于张量流中的 CUDA_ERROR_OUT_OF_MEMORY的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆