TensorFlow throws CUDA_ERROR_LAUNCH_FAILED after a long run

Published: 2021/9/5 19:33:23. Tag: tensorflow


Problem description

I am training a CNN. The following error has appeared three times this week, each time after a long run (e.g., 419140 steps).

Here is part of the log:

2017-09-15 11:16:03.515396: step 419120, loss = 0.30 (4427.4 examples/sec; 0.029 sec/batch)
2017-09-15 11:16:03.766922: step 419130, loss = 0.38 (5089.0 examples/sec; 0.025 sec/batch)
2017-09-15 11:16:04.073978: step 419140, loss = 0.40 (4168.5 examples/sec; 0.031 sec/batch)
2017-09-15 20:48:03.734101: E tensorflow/stream_executor/cuda/cuda_event.cc:49] Error polling for event status: failed to query event: CUDA_ERROR_LAUNCH_FAILED
2017-09-15 20:48:03.734133: F tensorflow/core/common_runtime/gpu/gpu_event_mgr.cc:203] Unexpected Event status: 1
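When reading an excerpt like this, it can help to check how much time separates the last training step from the first error line. The sketch below is illustrative only: the helper names `entry_times` and `largest_gap` are hypothetical, and it assumes every entry starts with the timestamp format seen in the log above. Because only part of the log is quoted, the excerpt alone cannot show whether training was still running during the gap.

```python
import re
from datetime import datetime

# Timestamp format used by the log entries quoted above (assumed to be
# the same for every entry in the full log).
TS_PATTERN = re.compile(r"(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{6}):")


def entry_times(log_text):
    """Return the timestamps of all log entries, in order of appearance."""
    return [datetime.strptime(ts, "%Y-%m-%d %H:%M:%S.%f")
            for ts in TS_PATTERN.findall(log_text)]


def largest_gap(log_text):
    """Return (gap, before, after) for the longest pause between consecutive entries."""
    times = entry_times(log_text)
    before, after = max(zip(times, times[1:]), key=lambda pair: pair[1] - pair[0])
    return after - before, before, after


LOG = """\
2017-09-15 11:16:03.515396: step 419120, loss = 0.30 (4427.4 examples/sec; 0.029 sec/batch)
2017-09-15 11:16:03.766922: step 419130, loss = 0.38 (5089.0 examples/sec; 0.025 sec/batch)
2017-09-15 11:16:04.073978: step 419140, loss = 0.40 (4168.5 examples/sec; 0.031 sec/batch)
2017-09-15 20:48:03.734101: E tensorflow/stream_executor/cuda/cuda_event.cc:49] Error polling for event status: failed to query event: CUDA_ERROR_LAUNCH_FAILED
2017-09-15 20:48:03.734133: F tensorflow/core/common_runtime/gpu/gpu_event_mgr.cc:203] Unexpected Event status: 1
"""

gap, before, after = largest_gap(LOG)
print(f"Longest pause: {gap} (from {before} to {after})")
```

For the excerpt above, the longest pause lies between step 419140 and the first error line.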

If I restart the training, TensorFlow does not use the GPU. Here is the relevant log:

2017-09-15 21:54:38.681074: E tensorflow/stream_executor/cuda/cuda_driver.cc:406] failed call to cuInit: CUDA_ERROR_UNKNOWN

To get the GPU working again, I have to restart my computer.
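Since a restarted job silently runs without the GPU in this state, one possible safeguard is a preflight check before training begins. The following is a minimal sketch, not a fix for the underlying error: `gpu_visible` is a hypothetical helper, and it assumes the `nvidia-smi` tool is installed and that `nvidia-smi -L` fails or lists nothing when the driver is in the broken state. That assumption is unverified for this particular failure, and a successful `nvidia-smi` call does not guarantee that `cuInit` will succeed.

```python
import shutil
import subprocess


def gpu_visible(timeout_s=10):
    """Return True if `nvidia-smi -L` exits cleanly and lists at least one GPU."""
    exe = shutil.which("nvidia-smi")
    if exe is None:
        # nvidia-smi is not on PATH, so nothing can be confirmed.
        return False
    try:
        result = subprocess.run([exe, "-L"], capture_output=True,
                                text=True, timeout=timeout_s)
    except (subprocess.TimeoutExpired, OSError):
        # A hung or unlaunchable nvidia-smi is treated as "no usable GPU".
        return False
    # Each listed device appears on a line such as "GPU 0: <name> (UUID: ...)".
    return result.returncode == 0 and "GPU " in result.stdout


status = "GPU visible" if gpu_visible() else "no usable GPU reported; training would fall back to CPU"
print(status)
```

A training script could call `gpu_visible()` first and stop with a clear message instead of proceeding on the CPU.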

The error appears to originate in C++ files that I am not familiar with. Can someone give me advice on how to debug or work around this error?
