MemoryError in TensorFlow; and "successful NUMA node read from SysFS had negative value (-1)" with xen


Problem description

I am using TensorFlow version:

0.12.1

The CUDA toolkit version is 8:

lrwxrwxrwx  1 root root   19 May 28 17:27 cuda -> /usr/local/cuda-8.0

As documented here, I have downloaded and installed cuDNN. But while executing the following line from my Python script, I get the error messages mentioned in the title:

model.fit_generator(train_generator,
                    steps_per_epoch=len(train_samples),
                    validation_data=validation_generator,
                    validation_steps=len(validation_samples),
                    epochs=9)

The detailed error messages are as follows:

Using TensorFlow backend. 
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcublas.so locally 
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcudnn.so locally 
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcufft.so locally 
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcuda.so.1 locally 
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcurand.so locally 
Epoch 1/9
Exception in thread Thread-1:
Traceback (most recent call last):
  File " lib/python3.5/threading.py", line 914, in _bootstrap_inner
    self.run()
  File " lib/python3.5/threading.py", line 862, in run
    self._target(*self._args, **self._kwargs)
  File " lib/python3.5/site-packages/keras/engine/training.py", line 612, in data_generator_task
    generator_output = next(self._generator)
StopIteration

I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), 
 but there must be at least one NUMA node, so returning NUMA node zero 
I tensorflow/core/common_runtime/gpu/gpu_device.cc:885] Found device 0 with properties:
name: GRID K520
major: 3 minor: 0 memoryClockRate (GHz) 0.797
pciBusID 0000:00:03.0
Total memory: 3.94GiB
Free memory: 3.91GiB
I tensorflow/core/common_runtime/gpu/gpu_device.cc:906] DMA: 0 
I tensorflow/core/common_runtime/gpu/gpu_device.cc:916] 0:   Y 
I tensorflow/core/common_runtime/gpu/gpu_device.cc:975] 
 Creating TensorFlow device (/gpu:0) -> (device: 0, name: GRID K520, pci bus id: 0000:00:03.0) 
Traceback (most recent call last):
  File "model_new.py", line 82, in <module>
    model.fit_generator(train_generator, steps_per_epoch= len(train_samples),validation_data=validation_generator, validation_steps=len(validation_samples),epochs=9)
  File " lib/python3.5/site-packages/keras/legacy/interfaces.py", line 88, in wrapper
    return func(*args, **kwargs)
  File " lib/python3.5/site-packages/keras/models.py", line 1110, in fit_generator
    initial_epoch=initial_epoch)
  File " lib/python3.5/site-packages/keras/legacy/interfaces.py", line 88, in wrapper
    return func(*args, **kwargs)
  File " lib/python3.5/site-packages/keras/engine/training.py", line 1890, in fit_generator
    class_weight=class_weight)
  File " lib/python3.5/site-packages/keras/engine/training.py", line 1633, in train_on_batch
    outputs = self.train_function(ins)
  File " lib/python3.5/site-packages/keras/backend/tensorflow_backend.py", line 2229, in __call__
    feed_dict=feed_dict)
  File " lib/python3.5/site-packages/tensorflow/python/client/session.py", line 766, in run
    run_metadata_ptr)
  File " lib/python3.5/site-packages/tensorflow/python/client/session.py", line 937, in _run
    np_val = np.asarray(subfeed_val, dtype=subfeed_dtype)
  File " lib/python3.5/site-packages/numpy/core/numeric.py", line 531, in asarray
    return array(a, dtype, copy=False, order=order)
MemoryError

Any suggestion to resolve this error is appreciated.

The problem is fatal.

uname -a
Linux ip-172-31-76-109 4.4.0-78-generic #99-Ubuntu SMP
Thu Apr 27 15:29:09 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux

sudo lshw -short
[sudo] password for carnd:
H/W path    Device  Class      Description
==========================================
                    system     HVM domU
/0                  bus        Motherboard
/0/0                memory     96KiB BIOS
/0/401              processor  Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz
/0/402              processor  CPU
/0/403              processor  CPU
/0/404              processor  CPU
/0/405              processor  CPU
/0/406              processor  CPU
/0/407              processor  CPU
/0/408              processor  CPU
/0/1000             memory     15GiB System Memory
/0/1000/0           memory     15GiB DIMM RAM
/0/100              bridge     440FX - 82441FX PMC [Natoma]
/0/100/1            bridge     82371SB PIIX3 ISA [Natoma/Triton II]
/0/100/1.1          storage    82371SB PIIX3 IDE [Natoma/Triton II]
/0/100/1.3          bridge     82371AB/EB/MB PIIX4 ACPI
/0/100/2            display    GD 5446
/0/100/3            display    GK104GL [GRID K520]
/0/100/1f           generic    Xen Platform Device
/1          eth0    network    Ethernet interface

Edit 2:

This is an EC2 instance in the Amazon cloud, and all of the numa_node files hold the value -1:

:/sys$ find . -name numa_node -exec cat '{}' \;
find: ‘./fs/fuse/connections/39’: Permission denied
-1
-1
-1
-1
-1
-1
-1
find: ‘./kernel/debug’: Permission denied

After updating the numa_node files, the NUMA-related error disappeared. But all the other errors listed above remain, and I again get a fatal error.

Using TensorFlow backend.
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcublas.so locally
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcudnn.so locally
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcufft.so locally
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcuda.so.1 locally
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcurand.so locally
Epoch 1/9
Exception in thread Thread-1:
Traceback (most recent call last):
  File " lib/python3.5/threading.py", line 914, in _bootstrap_inner
    self.run()
  File " lib/python3.5/threading.py", line 862, in run
    self._target(*self._args, **self._kwargs)
  File " lib/python3.5/site-packages/keras/engine/training.py", line 612, in data_generator_task
    generator_output = next(self._generator)
StopIteration

I tensorflow/core/common_runtime/gpu/gpu_device.cc:885] Found device 0 with properties:
name: GRID K520
major: 3 minor: 0 memoryClockRate (GHz) 0.797
pciBusID 0000:00:03.0
Total memory: 3.94GiB
Free memory: 3.91GiB
I tensorflow/core/common_runtime/gpu/gpu_device.cc:906] DMA: 0
I tensorflow/core/common_runtime/gpu/gpu_device.cc:916] 0:   Y
I tensorflow/core/common_runtime/gpu/gpu_device.cc:975] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GRID K520, pci bus id: 0000:00:03.0)
Traceback (most recent call last):
  File "model_new.py", line 85, in <module>
    model.fit_generator(train_generator, steps_per_epoch= len(train_samples),validation_data=validation_generator, validation_steps=len(validation_samples),epochs=9)
  File " lib/python3.5/site-packages/keras/legacy/interfaces.py", line 88, in wrapper
    return func(*args, **kwargs)
  File " lib/python3.5/site-packages/keras/models.py", line 1110, in fit_generator
    initial_epoch=initial_epoch)
  File " lib/python3.5/site-packages/keras/legacy/interfaces.py", line 88, in wrapper
    return func(*args, **kwargs)
  File " lib/python3.5/site-packages/keras/engine/training.py", line 1890, in fit_generator
    class_weight=class_weight)
  File " lib/python3.5/site-packages/keras/engine/training.py", line 1633, in train_on_batch
    outputs = self.train_function(ins)
  File " lib/python3.5/site-packages/keras/backend/tensorflow_backend.py", line 2229, in __call__
    feed_dict=feed_dict)
  File " lib/python3.5/site-packages/tensorflow/python/client/session.py", line 766, in run
    run_metadata_ptr)
  File " lib/python3.5/site-packages/tensorflow/python/client/session.py", line 937, in _run
    np_val = np.asarray(subfeed_val, dtype=subfeed_dtype)
  File " lib/python3.5/site-packages/numpy/core/numeric.py", line 531, in asarray
    return array(a, dtype, copy=False, order=order)
MemoryError

Recommended answer

There is code which prints the message "successful NUMA node read from SysFS had negative value (-1)", and it is not a fatal error, it is just a warning. The real error is the MemoryError at File "model_new.py", line 85, in <module>. We need more source code to check this error. Try to make your model smaller, or run it on a server with more RAM.
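If the generator hands the whole dataset (or very large batches) to fit_generator at once, np.asarray inside session.run has to materialise all of it, which is exactly where this MemoryError is raised. Below is a minimal sketch of a batch-wise generator that keeps only batch_size images in RAM at a time; the helper load_image and the samples list of (path, label) pairs are assumptions for illustration, not names from the original post:

import numpy as np

def load_image(path):
    # Stand-in for a real image loader (e.g. PIL or cv2); returns a dummy frame here.
    return np.zeros((160, 320, 3), dtype=np.float32)

def batch_generator(samples, batch_size=32):
    """Yield (X, y) batches forever, materialising only batch_size images at a time."""
    num_samples = len(samples)
    while True:
        np.random.shuffle(samples)
        for offset in range(0, num_samples, batch_size):
            batch = samples[offset:offset + batch_size]
            images = np.asarray([load_image(path) for path, _ in batch])
            labels = np.asarray([label for _, label in batch])
            yield images, labels

With a batch-wise generator like this, steps_per_epoch would normally be the number of batches per epoch (roughly len(train_samples) // batch_size) rather than the number of samples.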

About the NUMA node warning:

https://github.com/tensorflow/tensorflow/blob/e4296aefff97e6edd3d7cee9a09b9dd77da4c034/tensorflow/stream_executor/cuda/cuda_gpu_executor.cc#L855

// Attempts to read the NUMA node corresponding to the GPU device's PCI bus out
// of SysFS. Returns -1 if it cannot...
static int TryToReadNumaNode(const string &pci_bus_id, int device_ordinal) 
{...
  string filename =
      port::Printf("/sys/bus/pci/devices/%s/numa_node", pci_bus_id.c_str());
  FILE *file = fopen(filename.c_str(), "r");
  if (file == nullptr) {
    LOG(ERROR) << "could not open file to read NUMA node: " << filename
               << "
Your kernel may have been built without NUMA support.";
    return kUnknownNumaNode;
  } ...
  if (port::safe_strto32(content, &value)) {
    if (value < 0) {  // See http://b/18228951 for details on this path.
      LOG(INFO) << "successful NUMA node read from SysFS had negative value ("
                << value << "), but there must be at least one NUMA node"
                            ", so returning NUMA node zero";
      fclose(file);
      return 0;
    }

TensorFlow was able to open the /sys/bus/pci/devices/%s/numa_node file, where %s is the ID of the GPU PCI card (string pci_bus_id = CUDADriver::GetPCIBusID(device_)). Your PC is not multi-socket; there is only a single CPU socket with an 8-core Xeon E5-2670 installed, so this ID should be '0' (a single NUMA node is numbered 0 in Linux), but the error message says there was a -1 value in this file!
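As a quick check of what TryToReadNumaNode will see, you can read the same sysfs file directly; a small Python sketch, assuming the PCI bus ID 0000:00:03.0 reported in the log above:

from pathlib import Path

# PCI bus ID of the GRID K520 as reported by TensorFlow in the log above.
pci_bus_id = "0000:00:03.0"

numa_file = Path("/sys/bus/pci/devices") / pci_bus_id / "numa_node"
value = int(numa_file.read_text().strip())
print(f"{numa_file}: {value}")  # -1 triggers the warning; 0 would silence it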

So, we know that sysfs is mounted at /sys, there is a numa_node special file, and CONFIG_NUMA is enabled in your Linux kernel config (zgrep NUMA /boot/config* /proc/config*). Actually it is enabled: CONFIG_NUMA=y in the deb of your x86_64 4.4.0-78-generic kernel.

The special file numa_node is documented in https://www.kernel.org/doc/Documentation/ABI/testing/sysfs-bus-pci (is the ACPI of your PC wrong?):

What:       /sys/bus/pci/devices/.../numa_node
Date:       Oct 2014
Contact:    Prarit Bhargava <prarit@redhat.com>
Description:
        This file contains the NUMA node to which the PCI device is
        attached, or -1 if the node is unknown.  The initial value
        comes from an ACPI _PXM method or a similar firmware
        source.  If that is missing or incorrect, this file can be
        written to override the node.  In that case, please report
        a firmware bug to the system vendor.  Writing to this file
        taints the kernel with TAINT_FIRMWARE_WORKAROUND, which
        reduces the supportability of your system.

There is a quick (kludge) workaround for this error: find the numa_node of your GPU and, with the root account, run this command after every boot, where NNNNN is the PCI ID of your card (search the lspci output and the /sys/bus/pci/devices/ directory):

echo 0 | sudo tee -a /sys/bus/pci/devices/NNNNN/numa_node

Or just echo it into every such file; it should be rather safe:

for a in /sys/bus/pci/devices/*; do echo 0 | sudo tee -a $a/numa_node; done
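The same kludge can also be scripted instead of typed after every boot; a Python sketch (run as root; it overwrites numa_node for every PCI device, and like the shell loop above it carries the TAINT_FIRMWARE_WORKAROUND caveat quoted from the kernel documentation):

import glob

# Overwrite numa_node with "0" for every PCI device (requires root privileges).
for path in glob.glob("/sys/bus/pci/devices/*/numa_node"):
    try:
        with open(path, "w") as f:
            f.write("0")
    except OSError as exc:  # e.g. not running as root
        print(f"could not update {path}: {exc}")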

Also, your lshw output shows that this is not a physical PC but a Xen virtual guest. Something is wrong between the Xen platform (ACPI) emulation and the Linux PCI bus NUMA-support code.

