MemoryError in TensorFlow; and "successful NUMA node read from SysFS had negative value (-1)" with xen
Question
I am using TensorFlow version:
0.12.1
The CUDA toolkit version is 8:
lrwxrwxrwx 1 root root 19 May 28 17:27 cuda -> /usr/local/cuda-8.0
Following the documentation, I have downloaded and installed cuDNN. But when the following line of my Python script executes, I get the error messages mentioned in the title:
model.fit_generator(train_generator,
steps_per_epoch= len(train_samples),
validation_data=validation_generator,
validation_steps=len(validation_samples),
epochs=9)
The detailed error output is as follows:
Using TensorFlow backend.
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcublas.so locally
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcudnn.so locally
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcufft.so locally
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcuda.so.1 locally
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcurand.so locally
Epoch 1/9
Exception in thread Thread-1:
Traceback (most recent call last):
File " lib/python3.5/threading.py", line 914, in _bootstrap_inner
self.run()
File " lib/python3.5/threading.py", line 862, in run
self._target(*self._args, **self._kwargs)
File " lib/python3.5/site-packages/keras/engine/training.py", line 612, in data_generator_task
generator_output = next(self._generator)
StopIteration
I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
I tensorflow/core/common_runtime/gpu/gpu_device.cc:885] Found device 0 with properties:
name: GRID K520
major: 3 minor: 0 memoryClockRate (GHz) 0.797
pciBusID 0000:00:03.0
Total memory: 3.94GiB
Free memory: 3.91GiB
I tensorflow/core/common_runtime/gpu/gpu_device.cc:906] DMA: 0
I tensorflow/core/common_runtime/gpu/gpu_device.cc:916] 0: Y
I tensorflow/core/common_runtime/gpu/gpu_device.cc:975] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GRID K520, pci bus id: 0000:00:03.0)
Traceback (most recent call last):
File "model_new.py", line 82, in <module>
model.fit_generator(train_generator, steps_per_epoch= len(train_samples),validation_data=validation_generator, validation_steps=len(validation_samples),epochs=9)
File " lib/python3.5/site-packages/keras/legacy/interfaces.py", line 88, in wrapper
return func(*args, **kwargs)
File " lib/python3.5/site-packages/keras/models.py", line 1110, in fit_generator
initial_epoch=initial_epoch)
File " lib/python3.5/site-packages/keras/legacy/interfaces.py", line 88, in wrapper
return func(*args, **kwargs)
File " lib/python3.5/site-packages/keras/engine/training.py", line 1890, in fit_generator
class_weight=class_weight)
File " lib/python3.5/site-packages/keras/engine/training.py", line 1633, in train_on_batch
outputs = self.train_function(ins)
File " lib/python3.5/site-packages/keras/backend/tensorflow_backend.py", line 2229, in __call__
feed_dict=feed_dict)
File " lib/python3.5/site-packages/tensorflow/python/client/session.py", line 766, in run
run_metadata_ptr)
File " lib/python3.5/site-packages/tensorflow/python/client/session.py", line 937, in _run
np_val = np.asarray(subfeed_val, dtype=subfeed_dtype)
File " lib/python3.5/site-packages/numpy/core/numeric.py", line 531, in asarray
return array(a, dtype, copy=False, order=order)
MemoryError
Any suggestions for resolving this error would be appreciated.
The problem is fatal.
uname -a
Linux ip-172-31-76-109 4.4.0-78-generic #99-Ubuntu SMP
Thu Apr 27 15:29:09 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
sudo lshw -short
[sudo] password for carnd:
H/W path Device Class Description
==========================================
system HVM domU
/0 bus Motherboard
/0/0 memory 96KiB BIOS
/0/401 processor Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz
/0/402 processor CPU
/0/403 processor CPU
/0/404 processor CPU
/0/405 processor CPU
/0/406 processor CPU
/0/407 processor CPU
/0/408 processor CPU
/0/1000 memory 15GiB System Memory
/0/1000/0 memory 15GiB DIMM RAM
/0/100 bridge 440FX - 82441FX PMC [Natoma]
/0/100/1 bridge 82371SB PIIX3 ISA [Natoma/Triton II]
/0/100/1.1 storage 82371SB PIIX3 IDE [Natoma/Triton II]
/0/100/1.3 bridge 82371AB/EB/MB PIIX4 ACPI
/0/100/2 display GD 5446
/0/100/3 display GK104GL [GRID K520]
/0/100/1f generic Xen Platform Device
/1 eth0 network Ethernet interface
Edit 2:
This is an EC2 instance in the Amazon cloud, and all the numa_node files hold the value -1.
:/sys$ find . -name numa_node -exec cat '{}' \;
find: ‘./fs/fuse/connections/39’: Permission denied
-1
-1
-1
-1
-1
-1
-1
find: ‘./kernel/debug’: Permission denied
After updating the numa_node files, the NUMA-related message disappeared. But all the other errors listed above remain, and I again got a fatal error.
Using TensorFlow backend.
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcublas.so locally
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcudnn.so locally
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcufft.so locally
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcuda.so.1 locally
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcurand.so locally
Epoch 1/9
Exception in thread Thread-1:
Traceback (most recent call last):
File " lib/python3.5/threading.py", line 914, in _bootstrap_inner
self.run()
File " lib/python3.5/threading.py", line 862, in run
self._target(*self._args, **self._kwargs)
File " lib/python3.5/site-packages/keras/engine/training.py", line 612, in data_generator_task
generator_output = next(self._generator)
StopIteration
I tensorflow/core/common_runtime/gpu/gpu_device.cc:885] Found device 0 with properties:
name: GRID K520
major: 3 minor: 0 memoryClockRate (GHz) 0.797
pciBusID 0000:00:03.0
Total memory: 3.94GiB
Free memory: 3.91GiB
I tensorflow/core/common_runtime/gpu/gpu_device.cc:906] DMA: 0
I tensorflow/core/common_runtime/gpu/gpu_device.cc:916] 0: Y
I tensorflow/core/common_runtime/gpu/gpu_device.cc:975] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GRID K520, pci bus id: 0000:00:03.0)
Traceback (most recent call last):
File "model_new.py", line 85, in <module>
model.fit_generator(train_generator, steps_per_epoch= len(train_samples),validation_data=validation_generator, validation_steps=len(validation_samples),epochs=9)
File " lib/python3.5/site-packages/keras/legacy/interfaces.py", line 88, in wrapper
return func(*args, **kwargs)
File " lib/python3.5/site-packages/keras/models.py", line 1110, in fit_generator
initial_epoch=initial_epoch)
File " lib/python3.5/site-packages/keras/legacy/interfaces.py", line 88, in wrapper
return func(*args, **kwargs)
File " lib/python3.5/site-packages/keras/engine/training.py", line 1890, in fit_generator
class_weight=class_weight)
File " lib/python3.5/site-packages/keras/engine/training.py", line 1633, in train_on_batch
outputs = self.train_function(ins)
File " lib/python3.5/site-packages/keras/backend/tensorflow_backend.py", line 2229, in __call__
feed_dict=feed_dict)
File " lib/python3.5/site-packages/tensorflow/python/client/session.py", line 766, in run
run_metadata_ptr)
File " lib/python3.5/site-packages/tensorflow/python/client/session.py", line 937, in _run
np_val = np.asarray(subfeed_val, dtype=subfeed_dtype)
File " lib/python3.5/site-packages/numpy/core/numeric.py", line 531, in asarray
return array(a, dtype, copy=False, order=order)
MemoryError
Recommended answer
The message "successful NUMA node read from SysFS had negative value (-1)" is printed by the TensorFlow code quoted below. It is not a fatal error, only a warning. The real error is the MemoryError raised from your File "model_new.py", line 85, in <module>. The traceback ends inside np.asarray, so it is host RAM that ran out while the batch was being converted for feeding. We would need to see more of your source to diagnose this error. Try making your model smaller, or run on a server with more RAM.
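The traceback also shows a StopIteration raised from next(self._generator) in the Keras worker thread, which suggests the generator ran out of data. In the Keras version shown, fit_generator expects a generator that loops indefinitely, and steps_per_epoch counts batches rather than samples. The following is a minimal sketch of a generator built along those lines, not a reconstruction of the asker's code: `samples`, `load_sample`, and `steps_for` are hypothetical names, since the question does not show how train_generator is defined. Whether it removes the MemoryError depends on how large one batch actually is.

```python
import math

import numpy as np


def batch_generator(samples, load_sample, batch_size=32):
    """Yield (inputs, targets) batches forever, one small batch at a time.

    `samples` and `load_sample` are hypothetical stand-ins for the asker's
    data list and loading routine, which the question does not show.
    """
    while True:  # never raise StopIteration; Keras decides when an epoch ends
        for start in range(0, len(samples), batch_size):
            chunk = samples[start:start + batch_size]
            pairs = [load_sample(s) for s in chunk]
            # Only one batch is held in host memory at any time.
            inputs = np.stack([x for x, _ in pairs]).astype(np.float32)
            targets = np.asarray([y for _, y in pairs], dtype=np.float32)
            yield inputs, targets


def steps_for(samples, batch_size=32):
    """Number of batches needed to cover `samples` once."""
    return math.ceil(len(samples) / batch_size)
```

With these helpers, the call in the question would pass batch_generator(train_samples, load_sample) as the generator and steps_for(train_samples) as steps_per_epoch, with the validation arguments changed the same way.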
About the NUMA node warning:
// Attempts to read the NUMA node corresponding to the GPU device's PCI bus out
// of SysFS. Returns -1 if it cannot...
static int TryToReadNumaNode(const string &pci_bus_id, int device_ordinal)
{...
  string filename =
      port::Printf("/sys/bus/pci/devices/%s/numa_node", pci_bus_id.c_str());
  FILE *file = fopen(filename.c_str(), "r");
  if (file == nullptr) {
    LOG(ERROR) << "could not open file to read NUMA node: " << filename
               << "\nYour kernel may have been built without NUMA support.";
    return kUnknownNumaNode;
  } ...
  if (port::safe_strto32(content, &value)) {
    if (value < 0) {  // See http://b/18228951 for details on this path.
      LOG(INFO) << "successful NUMA node read from SysFS had negative value ("
                << value << "), but there must be at least one NUMA node"
                   ", so returning NUMA node zero";
      fclose(file);
      return 0;
    }
TensorFlow was able to open the /sys/bus/pci/devices/%s/numa_node file, where %s is the PCI bus id of the GPU (string pci_bus_id = CUDADriver::GetPCIBusID(device_)). Your machine is not multi-socket: there is a single CPU socket with an 8-core Xeon E5-2670 installed, so the value should be 0 (a single NUMA node is numbered 0 in Linux), but the message says that the file contained -1!
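To check what TensorFlow would see for a given device, the same lookup can be sketched in Python. This is an illustrative mirror of the quoted C++ excerpt, not TensorFlow's API: `read_numa_node` and its `sysfs_root` parameter are names made up here, and the return value for unparsable content is an assumption, because the excerpt elides that branch.

```python
from pathlib import Path


def read_numa_node(pci_bus_id, sysfs_root="/sys"):
    """Return the NUMA node recorded for a PCI device, falling back to 0.

    Follows the quoted logic: a file that cannot be opened gives -1 (unknown),
    while a negative value read from the file is treated as node 0.
    `sysfs_root` is a parameter added here so the function can be tried
    against a scratch directory instead of the real /sys.
    """
    path = Path(sysfs_root) / "bus" / "pci" / "devices" / pci_bus_id / "numa_node"
    try:
        content = path.read_text().strip()
    except OSError:
        return -1  # could not open the file
    try:
        value = int(content)
    except ValueError:
        return -1  # assumption: unparsable content is also reported as unknown
    return 0 if value < 0 else value
```

On the instance from the question, read_numa_node("0000:00:03.0") would take the negative-value branch, matching the log line.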
So we know that sysfs is mounted at /sys, that the numa_node special file exists, and that CONFIG_NUMA is enabled in your Linux kernel config (check with zgrep NUMA /boot/config* /proc/config*). Indeed it is enabled: CONFIG_NUMA=y in the deb of your x86_64 4.4.0-78-generic kernel.
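The same check can be scripted. The sketch below looks up a single option in a kernel config file, handling both plain and gzip-compressed files; `config_option` is a hypothetical helper name, and the path you pass in is up to you, since which config files exist varies between systems.

```python
import gzip
from pathlib import Path


def config_option(config_path, option="CONFIG_NUMA"):
    """Return the value of `option` in a kernel config file, or None if unset.

    Files ending in .gz are read through gzip; anything else is read as
    plain text.
    """
    path = Path(config_path)
    opener = gzip.open if path.suffix == ".gz" else open
    with opener(str(path), "rt") as handle:
        for line in handle:
            line = line.strip()
            # Match "CONFIG_NUMA=..." exactly, not e.g. CONFIG_NUMA_BALANCING.
            if line.startswith(option + "="):
                return line.split("=", 1)[1]
    return None
```

A return value of "y" corresponds to the CONFIG_NUMA=y line mentioned above. Lines such as "# CONFIG_NUMA is not set" yield None.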
The special file numa_node is documented in https://www.kernel.org/doc/Documentation/ABI/testing/sysfs-bus-pci (is the ACPI of your machine wrong?)
What: /sys/bus/pci/devices/.../numa_node
Date: Oct 2014
Contact: Prarit Bhargava <prarit@redhat.com>
Description:
This file contains the NUMA node to which the PCI device is
attached, or -1 if the node is unknown. The initial value
comes from an ACPI _PXM method or a similar firmware
source. If that is missing or incorrect, this file can be
written to override the node. In that case, please report
a firmware bug to the system vendor. Writing to this file
taints the kernel with TAINT_FIRMWARE_WORKAROUND, which
reduces the supportability of your system.
There is a quick (kludge) workaround for this message: find the numa_node file of your GPU and, as root, run the following command after every boot, where NNNNN is the PCI id of your card (look it up in the lspci output and in the /sys/bus/pci/devices/ directory):
echo 0 | sudo tee -a /sys/bus/pci/devices/NNNNN/numa_node
Or just echo it into every such file; this should be fairly safe:
for a in /sys/bus/pci/devices/*; do echo 0 | sudo tee -a $a/numa_node; done
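A slightly more cautious variant of the loop above can be sketched in Python. It differs from the shell loop in that it only touches files that currently hold a negative value, and it defaults to a dry run that just reports them. `override_unknown_numa_nodes`, `sysfs_root`, and `dry_run` are hypothetical names introduced here. Writing to the real /sys still needs root and still taints the kernel, as the ABI text quoted above says.

```python
from pathlib import Path


def override_unknown_numa_nodes(sysfs_root="/sys", node=0, dry_run=True):
    """Find PCI devices whose numa_node is negative and optionally set it.

    Returns the list of numa_node paths that held a negative value.
    With dry_run=False each of those files is overwritten with `node`.
    """
    devices = Path(sysfs_root) / "bus" / "pci" / "devices"
    changed = []
    for device in sorted(devices.iterdir()):
        path = device / "numa_node"
        try:
            value = int(path.read_text().strip())
        except (OSError, ValueError):
            continue  # missing, unreadable, or malformed entry: leave it alone
        if value < 0:
            changed.append(path)
            if not dry_run:
                path.write_text(f"{node}\n")
    return changed
```

Calling override_unknown_numa_nodes() without arguments lists the affected files. Passing dry_run=False from a root-owned process applies the override.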
Also, your lshw output shows that this is not a physical PC but a Xen virtual guest. Something is wrong between the Xen platform (ACPI) emulation and the Linux PCI bus NUMA-support code.