CUDA_ERROR_OUT_OF_MEMORY in TensorFlow
Question
When I started to train a neural network, it hit CUDA_ERROR_OUT_OF_MEMORY, but the training went on without error. Because I wanted the process to use only as much GPU memory as it really needs, I set gpu_options.allow_growth = True. The logs are as follows:
I tensorflow/stream_executor/dso_loader.cc:111] successfully opened CUDA library libcublas.so locally
I tensorflow/stream_executor/dso_loader.cc:111] successfully opened CUDA library libcudnn.so locally
I tensorflow/stream_executor/dso_loader.cc:111] successfully opened CUDA library libcufft.so locally
I tensorflow/stream_executor/dso_loader.cc:111] successfully opened CUDA library libcuda.so.1 locally
I tensorflow/stream_executor/dso_loader.cc:111] successfully opened CUDA library libcurand.so locally
I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:925] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
I tensorflow/core/common_runtime/gpu/gpu_device.cc:951] Found device 0 with properties:
name: GeForce GTX 1080
major: 6 minor: 1 memoryClockRate (GHz) 1.7335
pciBusID 0000:01:00.0
Total memory: 7.92GiB
Free memory: 7.81GiB
I tensorflow/core/common_runtime/gpu/gpu_device.cc:972] DMA: 0
I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] 0: Y
I tensorflow/core/common_runtime/gpu/gpu_device.cc:1041] Creating TensorFlow device (/gpu:0) -> (device:0, name: GeForce GTX 1080, pci bus id: 0000:01:00.0)
E tensorflow/stream_executor/cuda/cuda_driver.cc:965] failed to allocate 4.00G (4294967296 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY
Iter 20, Minibatch Loss= 40491.636719
...
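For reference, the option described above can be enabled roughly like this (a minimal TF 1.x sketch; the tf.ConfigProto/tf.Session setup is the standard API, while the model and training code are placeholders, not taken from the original question):

import tensorflow as tf  # TensorFlow 1.x API

# Start with a small GPU allocation and grow it on demand, instead of
# reserving (almost) all free GPU memory up front.
config = tf.ConfigProto()
config.gpu_options.allow_growth = True

with tf.Session(config=config) as sess:
    pass  # build the graph and run training here (placeholder)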
After running the nvidia-smi command, I got:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 367.27                 Driver Version: 367.27                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 1080    Off  | 0000:01:00.0     Off |                  N/A |
| 40%   61C    P2    46W / 180W |   8107MiB /  8111MiB |     96%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX 1080    Off  | 0000:02:00.0     Off |                  N/A |
|  0%   40C    P0    40W / 180W |      0MiB /  8113MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    0     22932    C   python                                        8105MiB |
+-----------------------------------------------------------------------------+
After I commented out gpu_options.allow_growth = True, I trained the net again and everything was normal. There was no CUDA_ERROR_OUT_OF_MEMORY problem. Finally, running the nvidia-smi command gave:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 367.27                 Driver Version: 367.27                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 1080    Off  | 0000:01:00.0     Off |                  N/A |
| 40%   61C    P2    46W / 180W |   7793MiB /  8111MiB |     99%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX 1080    Off  | 0000:02:00.0     Off |                  N/A |
|  0%   40C    P0    40W / 180W |      0MiB /  8113MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    0     22932    C   python                                        7791MiB |
+-----------------------------------------------------------------------------+
I have two questions about this. Why did CUDA_ERROR_OUT_OF_MEMORY come out while the training went on normally? And why did the memory usage become smaller after I commented out allow_growth = True?
Answer
In case it's still relevant for someone: I encountered this issue when trying to run Keras/TensorFlow for the second time, after a first run was aborted. It seems the GPU memory was still allocated and therefore could not be allocated again. It was solved by manually ending all Python processes that use the GPU, or alternatively, by closing the existing terminal and running again in a new terminal window.
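For example, the leftover processes can be ended from a terminal roughly as follows (a hedged sketch; the PID 22932 is the one shown in the nvidia-smi tables above and will differ on your machine):

nvidia-smi        # list the processes currently holding GPU memory
kill 22932        # end the stale python process by its PID
kill -9 22932     # force-kill it only if it ignores the first signal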