Segmentation fault (core dumped) on tf.Session()

Question

I am new to TensorFlow.

I just installed TensorFlow and, to test the installation, ran the following code. As soon as I start the TF session, I get a Segmentation fault (core dumped) error.

bafhf@remote-server:~$ python
Python 3.6.5 |Anaconda, Inc.| (default, Apr 29 2018, 16:14:56) 
[GCC 7.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import tensorflow as tf
/home/bafhf/anaconda3/envs/ismll/lib/python3.6/site-packages/h5py/__init__.py:36: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.
  from ._conv import register_converters as _register_converters
>>> tf.Session()
2018-05-15 12:04:15.461361: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1349] Found device 0 with properties: 
name: Tesla K80 major: 3 minor: 7 memoryClockRate(GHz): 0.8235
pciBusID: 0000:04:00.0
totalMemory: 11.17GiB freeMemory: 11.10GiB
Segmentation fault (core dumped)

My nvidia-smi output is:

Tue May 15 12:12:26 2018       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 390.30                 Driver Version: 390.30                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K80           On   | 00000000:04:00.0 Off |                    0 |
| N/A   38C    P8    26W / 149W |      0MiB / 11441MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla K80           On   | 00000000:05:00.0 Off |                    2 |
| N/A   31C    P8    29W / 149W |      0MiB / 11441MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

nvcc --version gives:

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2017 NVIDIA Corporation
Built on Fri_Sep__1_21:08:03_CDT_2017
Cuda compilation tools, release 9.0, V9.0.176

And gcc --version gives:

gcc (Ubuntu 5.4.0-6ubuntu1~16.04.9) 5.4.0 20160609
Copyright (C) 2015 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

This is my PATH:

/home/bafhf/bin:/home/bafhf/.local/bin:/usr/local/cuda/bin:/usr/local/cuda/lib:/usr/local/cuda/extras/CUPTI/lib:/home/bafhf/anaconda3/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin

And my LD_LIBRARY_PATH:

/usr/local/cuda/bin:/usr/local/cuda/lib:/usr/local/cuda/extras/CUPTI/lib


I am running this on a server and I don't have root privileges. Still I managed to install everything as per the instructions on the official website.

New observation:

It seems the GPU allocates memory for the process for about a second, and then the segmentation fault (core dumped) error is thrown.

Edit 2: Changed the TensorFlow version

I downgraded TensorFlow from v1.8 to v1.5. The issue remains.


Is there any way to address or debug this issue?

Recommended answer

If you look at the nvidia-smi output, the second GPU reports a value of 2 in the "Volatile Uncorr. ECC" column. This error shows up regardless of the CUDA or TF version, usually as a segfault and sometimes with the CUDA_ERROR_ECC_UNCORRECTABLE flag in the stack trace.
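
The same counter can be read programmatically instead of by scanning the table. The following is only a minimal sketch and not part of the original answer: it assumes nvidia-smi is on the PATH and that the installed version supports the `ecc.errors.uncorrected.volatile.total` query field (listed by `nvidia-smi --help-query-gpu`). The helper names `parse_ecc_csv` and `gpus_with_uncorrected_ecc` are hypothetical.

```python
import subprocess

# Query fields; assumed to be supported by the installed nvidia-smi.
QUERY = "index,name,ecc.errors.uncorrected.volatile.total"


def parse_ecc_csv(text):
    """Parse `--format=csv,noheader` output into (index, name, count) tuples.

    count is None when the driver reports a non-numeric value such as
    "[N/A]" (for example on cards without ECC memory).
    """
    rows = []
    for line in text.strip().splitlines():
        fields = [field.strip() for field in line.split(",")]
        index, name, count = fields[0], ", ".join(fields[1:-1]), fields[-1]
        rows.append((int(index), name, int(count) if count.isdigit() else None))
    return rows


def gpus_with_uncorrected_ecc():
    """Return the GPUs whose volatile uncorrected ECC count is non-zero."""
    out = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=" + QUERY, "--format=csv,noheader"],
        universal_newlines=True,
    )
    return [row for row in parse_ecc_csv(out) if row[2]]
```

On the server from the question, `gpus_with_uncorrected_ecc()` would be expected to flag GPU 1, since that is the card showing a count of 2 in the table.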

I reached this conclusion from this post:

"Uncorrectable ECC error" usually refers to a hardware failure. ECC is Error Correcting Code, a means to detect and correct errors in bits stored in RAM. A stray cosmic ray can disrupt one bit stored in RAM every once in a great while, but "uncorrectable ECC error" indicates that several bits are coming out of RAM storage "wrong" - too many for the ECC to recover the original bit values.

This could mean that you have a bad or marginal RAM cell in your GPU device memory.

Marginal circuits of any kind may not fail 100%, but are more likely to fail under the stress of heavy use - and associated rise in temperature.

A reboot is usually supposed to clear the ECC error. If it does not, the only option seems to be replacing the hardware.

  1. I tested my code on a separate machine with an NVIDIA 1050 Ti, and it ran perfectly fine.
  2. I made the code run only on the first card, whose ECC value was normal, just to narrow down the issue. I did this by setting the CUDA_VISIBLE_DEVICES environment variable, following this post.
  3. I then requested a restart of the Tesla K80 server to check whether a restart would fix the issue. It took a while, but the server was eventually restarted.
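
Step 2 can be sketched roughly as follows. This is an illustration under assumptions, not the exact code used: the helper name `restrict_to_gpu` is hypothetical, and setting `CUDA_DEVICE_ORDER=PCI_BUS_ID` is an extra precaution so that CUDA numbers the devices the same way nvidia-smi does. Both variables must be set before TensorFlow is imported.

```python
import os


def restrict_to_gpu(index):
    """Expose only one GPU to CUDA; call this before importing tensorflow."""
    # Number devices by PCI bus ID so the index matches nvidia-smi.
    os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
    os.environ["CUDA_VISIBLE_DEVICES"] = str(index)


# GPU 0 is the card that showed no uncorrected ECC errors.
restrict_to_gpu(0)

# import tensorflow as tf   # import only after the variables are set
# tf.Session()              # the second card is now hidden from TensorFlow
```

The same effect can be had from the shell by prefixing the command, for example `CUDA_VISIBLE_DEVICES=0 python`.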

Now the issue is gone and I can use both cards for my TensorFlow implementations.
