TensorFlow throws CUDA_ERROR_LAUNCH_FAILED after a long run

Problem description

I am training a CNN. The following error appeared three times this week, each time after a long run (e.g., 419140 steps).

Here is part of the log:

2017-09-15 11:16:03.515396: step 419120, loss = 0.30 (4427.4 examples/sec; 0.029 sec/batch)
2017-09-15 11:16:03.766922: step 419130, loss = 0.38 (5089.0 examples/sec; 0.025 sec/batch)
2017-09-15 11:16:04.073978: step 419140, loss = 0.40 (4168.5 examples/sec; 0.031 sec/batch)
2017-09-15 20:48:03.734101: E tensorflow/stream_executor/cuda/cuda_event.cc:49] Error polling for event status: failed to query event: CUDA_ERROR_LAUNCH_FAILED
2017-09-15 20:48:03.734133: F tensorflow/core/common_runtime/gpu/gpu_event_mgr.cc:203] Unexpected Event status: 1

If I restart the training, TensorFlow does not use the GPU. Here is the relevant log:

2017-09-15 21:54:38.681074: E tensorflow/stream_executor/cuda/cuda_driver.cc:406] failed call to cuInit: CUDA_ERROR_UNKNOWN

To make the GPU work again, I have to restart my computer.

It appears the error happened in a C++ file, which I am not familiar with. Can someone give me some advice on how to debug or work around this error?

Recommended answer

I faced the same problem, and I found a suggestion about why it happens here: https://devtalk.nvidia.com/default/topic/1046479/gpu-occasionally-gets-lost-when-running-tensorflow-/

Apparently, this error can be thrown when the NVIDIA GPU overheats.
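
If you want to check whether heat is involved in your case, one option is to log the GPU temperature alongside training and see whether it climbs before the crash. The following is only a rough sketch, not something from the linked thread: it assumes nvidia-smi is on your PATH, and the 85 °C threshold, the 30-second interval, and the helper names (parse_temperatures, hot_gpus, read_gpu_temperatures) are made up for illustration. Check your card's specification for its real limits.

```python
import subprocess
import time

# Assumed warning threshold in degrees Celsius (illustrative, not an NVIDIA figure).
WARN_TEMP_C = 85
# Assumed polling interval in seconds (illustrative).
POLL_SECONDS = 30


def parse_temperatures(csv_text):
    """Turn nvidia-smi CSV output (one value per GPU per line) into a list.

    Lines that are not plain integers (for example "[N/A]") become None,
    so list positions still match GPU indices.
    """
    temps = []
    for line in csv_text.strip().splitlines():
        value = line.strip()
        temps.append(int(value) if value.isdigit() else None)
    return temps


def hot_gpus(temps, limit=WARN_TEMP_C):
    """Return the indices of GPUs whose temperature is at or above limit."""
    return [i for i, t in enumerate(temps) if t is not None and t >= limit]


def read_gpu_temperatures():
    """Ask nvidia-smi for the current temperature of every GPU."""
    out = subprocess.check_output(
        ["nvidia-smi",
         "--query-gpu=temperature.gpu",
         "--format=csv,noheader,nounits"],
        universal_newlines=True,
    )
    return parse_temperatures(out)


if __name__ == "__main__":
    # Run this in a second terminal while training is in progress.
    while True:
        temps = read_gpu_temperatures()
        stamp = time.strftime("%Y-%m-%d %H:%M:%S")
        print("{} GPU temperatures (C): {}".format(stamp, temps))
        for index in hot_gpus(temps):
            print("{} WARNING: GPU {} is at or above {} C".format(
                stamp, index, WARN_TEMP_C))
        time.sleep(POLL_SECONDS)
```

If the timestamps of high readings line up with the time of the CUDA_ERROR_LAUNCH_FAILED entry in the training log, that supports the overheating explanation; if the temperature stays low, the cause is probably something else.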
