Tensorflow GPU stopped working


Problem description


Reproducing the issue

I had TensorFlow running a few days ago, but it has stopped working. When I test it with the tutorial code, both mnist_softmax.py and mnist_deep.py fail, while TensorFlow still runs the simple helloworld example successfully.

What I've tried

  • As with delton137, I've tried setting allow_growth to True and per_process_gpu_memory_fraction to 0.1 (as in the sketch after this list), but neither helps.
  • I've tried reinstalling my cudnn files.
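
For reference, here is how those two options are set with the TF 1.x API; a minimal sketch (the values are simply the ones I tried):

import tensorflow as tf

# Option 1: allocate GPU memory on demand instead of claiming it all upfront
config = tf.ConfigProto()
config.gpu_options.allow_growth = True

# Option 2 (tried separately): cap TensorFlow at 10% of the GPU's memory
# config.gpu_options.per_process_gpu_memory_fraction = 0.1

sess = tf.Session(config=config)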

Additional notes

I don't remember making any changes to my TensorFlow installation or my CUDA/cuDNN setup, so my best guess is that a driver that auto-updated caused this.

System information

  • Have I written custom code (as opposed to using a stock example script provided in TensorFlow): No. The issue is reproducible with code from the TensorFlow tutorials.
  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Ubuntu 16.04.3 LTS
  • TensorFlow installed from (source or binary): source
  • TensorFlow version (use command below): v1.3.0-rc2-20-g0787eee 1.3.0
  • Python version: Python 3.5.2 (default, Aug 18 2017, 17:48:00)
  • Bazel version (if compiling from source): N/A
  • CUDA/cuDNN version: CUDA release 8.0, V8.0.61 / libcudnn.so.6.0.21
  • GPU model and memory: GeForce GTX 1080, 8GB, on 384.90 driver

Source code / logs

For helloworld code in REPL

>>> import tensorflow as tf
>>> hello = tf.constant('Hello, TensorFlow!')
>>> sess = tf.Session()
2017-10-26 21:56:00.418991: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.1 instructions, but these are available on your machine and could speed up CPU computations.
2017-10-26 21:56:00.419027: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.2 instructions, but these are available on your machine and could speed up CPU computations.
2017-10-26 21:56:00.419036: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX instructions, but these are available on your machine and could speed up CPU computations.
2017-10-26 21:56:00.419046: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX2 instructions, but these are available on your machine and could speed up CPU computations.
2017-10-26 21:56:00.419054: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use FMA instructions, but these are available on your machine and could speed up CPU computations.
2017-10-26 21:56:00.565143: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:893] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2017-10-26 21:56:00.565417: I tensorflow/core/common_runtime/gpu/gpu_device.cc:955] Found device 0 with properties: 
name: GeForce GTX 1080
major: 6 minor: 1 memoryClockRate (GHz) 1.7335
pciBusID 0000:01:00.0
Total memory: 7.92GiB
Free memory: 6.48GiB
2017-10-26 21:56:00.565432: I tensorflow/core/common_runtime/gpu/gpu_device.cc:976] DMA: 0 
2017-10-26 21:56:00.565437: I tensorflow/core/common_runtime/gpu/gpu_device.cc:986] 0:   Y 
2017-10-26 21:56:00.565447: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1045] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GTX 1080, pci bus id: 0000:01:00.0)
>>> print(sess.run(hello))
b'Hello, TensorFlow!'

For python3 mnist_deep.py

2017-10-26 21:37:56.993479: E tensorflow/stream_executor/cuda/cuda_dnn.cc:371] could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
2017-10-26 21:37:56.993560: E tensorflow/stream_executor/cuda/cuda_dnn.cc:338] could not destroy cudnn handle: CUDNN_STATUS_BAD_PARAM
2017-10-26 21:37:56.993580: F tensorflow/core/kernels/conv_ops.cc:672] Check failed: stream->parent()->GetConvolveAlgorithms( conv_parameters.ShouldIncludeWinogradNonfusedAlgo<T>(), &algorithms) 

For python3 mnist_softmax.py

name: GeForce GTX 1080
major: 6 minor: 1 memoryClockRate (GHz) 1.7335
pciBusID 0000:01:00.0
Total memory: 7.92GiB
Free memory: 6.50GiB
2017-10-26 21:53:16.150706: I tensorflow/core/common_runtime/gpu/gpu_device.cc:976] DMA: 0 
2017-10-26 21:53:16.150712: I tensorflow/core/common_runtime/gpu/gpu_device.cc:986] 0:   Y 
2017-10-26 21:53:16.150723: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1045] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GTX 1080, pci bus id: 0000:01:00.0)
2017-10-26 21:53:16.422081: E tensorflow/stream_executor/cuda/cuda_blas.cc:366] failed to create cublas handle: CUBLAS_STATUS_NOT_INITIALIZED
2017-10-26 21:53:16.422132: W tensorflow/stream_executor/stream.cc:1756] attempting to perform BLAS operation using StreamExecutor without BLAS support
Traceback (most recent call last):
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1327, in _do_call
    return fn(*args)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1306, in _run_fn
    status, run_metadata)
  File "/usr/lib/python3.5/contextlib.py", line 66, in __exit__
    next(self.gen)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/errors_impl.py", line 466, in raise_exception_on_not_ok_status
    pywrap_tensorflow.TF_GetCode(status))
tensorflow.python.framework.errors_impl.InternalError: Blas GEMM launch failed : a.shape=(100, 784), b.shape=(784, 10), m=100, n=10, k=784
         [[Node: MatMul = MatMul[T=DT_FLOAT, transpose_a=false, transpose_b=false, _device="/job:localhost/replica:0/task:0/gpu:0"](_arg_Placeholder_0_0/_9, Variable/read)]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "mnist_softmax.py", line 78, in <module>
    tf.app.run(main=main, argv=[sys.argv[0]] + unparsed)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/platform/app.py", line 48, in run
    _sys.exit(main(_sys.argv[:1] + flags_passthrough))
  File "mnist_softmax.py", line 65, in main
    sess.run(train_step, feed_dict={x: batch_xs, y_: batch_ys})
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 895, in run
    run_metadata_ptr)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1124, in _run
    feed_dict_tensor, options, run_metadata)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1321, in _do_run
    options, run_metadata)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1340, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InternalError: Blas GEMM launch failed : a.shape=(100, 784), b.shape=(784, 10), m=100, n=10, k=784
         [[Node: MatMul = MatMul[T=DT_FLOAT, transpose_a=false, transpose_b=false, _device="/job:localhost/replica:0/task:0/gpu:0"](_arg_Placeholder_0_0/_9, Variable/read)]]

Caused by op 'MatMul', defined at:
  File "mnist_softmax.py", line 78, in <module>
    tf.app.run(main=main, argv=[sys.argv[0]] + unparsed)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/platform/app.py", line 48, in run
    _sys.exit(main(_sys.argv[:1] + flags_passthrough))
  File "mnist_softmax.py", line 42, in main
    y = tf.matmul(x, W) + b
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/ops/math_ops.py", line 1844, in matmul
    a, b, transpose_a=transpose_a, transpose_b=transpose_b, name=name)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/ops/gen_math_ops.py", line 1289, in _mat_mul
    transpose_b=transpose_b, name=name)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/op_def_library.py", line 767, in apply_op
    op_def=op_def)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/ops.py", line 2630, in create_op
    original_op=self._default_original_op, op_def=op_def)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/ops.py", line 1204, in __init__
    self._traceback = self._graph._extract_stack()  # pylint: disable=protected-access

InternalError (see above for traceback): Blas GEMM launch failed : a.shape=(100, 784), b.shape=(784, 10), m=100, n=10, k=784
         [[Node: MatMul = MatMul[T=DT_FLOAT, transpose_a=false, transpose_b=false, _device="/job:localhost/replica:0/task:0/gpu:0"](_arg_Placeholder_0_0/_9, Variable/read)]]

Here is the output of nvidia-smi:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 384.90                 Driver Version: 384.90                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 1080    Off  | 00000000:01:00.0  On |                  N/A |
| 34%   51C    P0    35W / 180W |   1340MiB /  8110MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      1250      G   /usr/lib/xorg/Xorg                           785MiB |
|    0      2426      G   compiz                                       359MiB |
|    0      3840      G   ...-token=44A975F4EE134A1BF9C8CD1C7223C977   107MiB |
|    0      4944      G   ...-token=4F87ADEE5575E9B5125D41E08D43BF0E    83MiB |
+-----------------------------------------------------------------------------+

Solution

Try closing the sessions that are active in other processes. Please follow this thread:

TensorFlow: InternalError: Blas SGEMM launch failed
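
As a side note on that advice, the context-manager pattern below is one way to guarantee a session is closed, and its GPU memory released, as soon as you are done with it; a minimal sketch:

import tensorflow as tf

# Using the session as a context manager guarantees sess.close() runs
# when the block exits, releasing the GPU memory the session held.
with tf.Session() as sess:
    hello = tf.constant('Hello, TensorFlow!')
    print(sess.run(hello))
# Here the session is closed, so another process can create its own
# cuDNN/cuBLAS handles without running out of GPU memory.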
