MemoryError in TensorFlow; and "successful NUMA node read from SysFS had negative value (-1)" with Xen
Problem description
I am using TensorFlow version:
0.12.1
The CUDA toolkit version is 8:
lrwxrwxrwx 1 root root 19 May 28 17:27 cuda -> /usr/local/cuda-8.0
As documented here I have downloaded and installed cuDNN. But while executing the following line from my Python script I get the error messages mentioned in the title:
model.fit_generator(train_generator,
steps_per_epoch= len(train_samples),
validation_data=validation_generator,
validation_steps=len(validation_samples),
epochs=9)
The detailed error message is as follows:
Using TensorFlow backend.
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcublas.so locally
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcudnn.so locally
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcufft.so locally
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcuda.so.1 locally
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcurand.so locally
Epoch 1/9
Exception in thread Thread-1:
Traceback (most recent call last):
  File " lib/python3.5/threading.py", line 914, in _bootstrap_inner
    self.run()
  File " lib/python3.5/threading.py", line 862, in run
    self._target(*self._args, **self._kwargs)
  File " lib/python3.5/site-packages/keras/engine/training.py", line 612, in data_generator_task
    generator_output = next(self._generator)
StopIteration
I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1),
but there must be at least one NUMA node, so returning NUMA node zero
I tensorflow/core/common_runtime/gpu/gpu_device.cc:885] Found device 0 with properties:
name: GRID K520
major: 3 minor: 0 memoryClockRate (GHz) 0.797
pciBusID 0000:00:03.0
Total memory: 3.94GiB
Free memory: 3.91GiB
I tensorflow/core/common_runtime/gpu/gpu_device.cc:906] DMA: 0
I tensorflow/core/common_runtime/gpu/gpu_device.cc:916] 0: Y
I tensorflow/core/common_runtime/gpu/gpu_device.cc:975]
Creating TensorFlow device (/gpu:0) -> (device: 0, name: GRID K520, pci bus id: 0000:00:03.0)
Traceback (most recent call last):
  File "model_new.py", line 82, in <module>
    model.fit_generator(train_generator, steps_per_epoch= len(train_samples),validation_data=validation_generator, validation_steps=len(validation_samples),epochs=9)
  File " lib/python3.5/site-packages/keras/legacy/interfaces.py", line 88, in wrapper
    return func(*args, **kwargs)
  File " lib/python3.5/site-packages/keras/models.py", line 1110, in fit_generator
    initial_epoch=initial_epoch)
  File " lib/python3.5/site-packages/keras/legacy/interfaces.py", line 88, in wrapper
    return func(*args, **kwargs)
  File " lib/python3.5/site-packages/keras/engine/training.py", line 1890, in fit_generator
    class_weight=class_weight)
  File " lib/python3.5/site-packages/keras/engine/training.py", line 1633, in train_on_batch
    outputs = self.train_function(ins)
  File " lib/python3.5/site-packages/keras/backend/tensorflow_backend.py", line 2229, in __call__
    feed_dict=feed_dict)
  File " lib/python3.5/site-packages/tensorflow/python/client/session.py", line 766, in run
    run_metadata_ptr)
  File " lib/python3.5/site-packages/tensorflow/python/client/session.py", line 937, in _run
    np_val = np.asarray(subfeed_val, dtype=subfeed_dtype)
  File " lib/python3.5/site-packages/numpy/core/numeric.py", line 531, in asarray
    return array(a, dtype, copy=False, order=order)
MemoryError
Any suggestion to resolve this error would be appreciated.
EDIT:
The issue is fatal.
uname -a
Linux ip-172-31-76-109 4.4.0-78-generic #99-Ubuntu SMP
Thu Apr 27 15:29:09 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
sudo lshw -short
[sudo] password for carnd:
H/W path Device Class Description
==========================================
system HVM domU
/0 bus Motherboard
/0/0 memory 96KiB BIOS
/0/401 processor Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz
/0/402 processor CPU
/0/403 processor CPU
/0/404 processor CPU
/0/405 processor CPU
/0/406 processor CPU
/0/407 processor CPU
/0/408 processor CPU
/0/1000 memory 15GiB System Memory
/0/1000/0 memory 15GiB DIMM RAM
/0/100 bridge 440FX - 82441FX PMC [Natoma]
/0/100/1 bridge 82371SB PIIX3 ISA [Natoma/Triton II]
/0/100/1.1 storage 82371SB PIIX3 IDE [Natoma/Triton II]
/0/100/1.3 bridge 82371AB/EB/MB PIIX4 ACPI
/0/100/2 display GD 5446
/0/100/3 display GK104GL [GRID K520]
/0/100/1f generic Xen Platform Device
/1 eth0 network Ethernet interface
EDIT 2:
This is an EC2 instance in the Amazon cloud, and all of the numa_node files hold the value -1:
:/sys$ find . -name numa_node -exec cat '{}' \;
find: ‘./fs/fuse/connections/39’: Permission denied
-1
-1
-1
-1
-1
-1
-1
find: ‘./kernel/debug’: Permission denied
EDIT 3:
After updating the numa_node files, the NUMA-related message disappeared, but all the other errors listed above remain. And again I got a fatal error:
Using TensorFlow backend.
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcublas.so locally
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcudnn.so locally
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcufft.so locally
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcuda.so.1 locally
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcurand.so locally
Epoch 1/9
Exception in thread Thread-1:
Traceback (most recent call last):
File " lib/python3.5/threading.py", line 914, in _bootstrap_inner
self.run()
File " lib/python3.5/threading.py", line 862, in run
self._target(*self._args, **self._kwargs)
File " lib/python3.5/site-packages/keras/engine/training.py", line 612, in data_generator_task
generator_output = next(self._generator)
StopIteration
I tensorflow/core/common_runtime/gpu/gpu_device.cc:885] Found device 0 with properties:
name: GRID K520
major: 3 minor: 0 memoryClockRate (GHz) 0.797
pciBusID 0000:00:03.0
Total memory: 3.94GiB
Free memory: 3.91GiB
I tensorflow/core/common_runtime/gpu/gpu_device.cc:906] DMA: 0
I tensorflow/core/common_runtime/gpu/gpu_device.cc:916] 0: Y
I tensorflow/core/common_runtime/gpu/gpu_device.cc:975] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GRID K520, pci bus id: 0000:00:03.0)
Traceback (most recent call last):
File "model_new.py", line 85, in <module>
model.fit_generator(train_generator, steps_per_epoch= len(train_samples),validation_data=validation_generator, validation_steps=len(validation_samples),epochs=9)
File " lib/python3.5/site-packages/keras/legacy/interfaces.py", line 88, in wrapper
return func(*args, **kwargs)
File " lib/python3.5/site-packages/keras/models.py", line 1110, in fit_generator
initial_epoch=initial_epoch)
File " lib/python3.5/site-packages/keras/legacy/interfaces.py", line 88, in wrapper
return func(*args, **kwargs)
File " lib/python3.5/site-packages/keras/engine/training.py", line 1890, in fit_generator
class_weight=class_weight)
File " lib/python3.5/site-packages/keras/engine/training.py", line 1633, in train_on_batch
outputs = self.train_function(ins)
File " lib/python3.5/site-packages/keras/backend/tensorflow_backend.py", line 2229, in __call__
feed_dict=feed_dict)
File " lib/python3.5/site-packages/tensorflow/python/client/session.py", line 766, in run
run_metadata_ptr)
File " lib/python3.5/site-packages/tensorflow/python/client/session.py", line 937, in _run
np_val = np.asarray(subfeed_val, dtype=subfeed_dtype)
File " lib/python3.5/site-packages/numpy/core/numeric.py", line 531, in asarray
return array(a, dtype, copy=False, order=order)
MemoryError
Accepted answer
There is code that prints the message "successful NUMA node read from SysFS had negative value (-1)", and it is not a fatal error, just a warning. The real error is the MemoryError raised from File "model_new.py", line 85, in <module>. We need more sources to check this error. Try making your model smaller, or run it on a server with more RAM.
About the NUMA node warning:
// Attempts to read the NUMA node corresponding to the GPU device's PCI bus out
// of SysFS. Returns -1 if it cannot...
static int TryToReadNumaNode(const string &pci_bus_id, int device_ordinal)
{...
string filename =
port::Printf("/sys/bus/pci/devices/%s/numa_node", pci_bus_id.c_str());
FILE *file = fopen(filename.c_str(), "r");
if (file == nullptr) {
LOG(ERROR) << "could not open file to read NUMA node: " << filename
<< "\nYour kernel may have been built without NUMA support.";
return kUnknownNumaNode;
} ...
if (port::safe_strto32(content, &value)) {
if (value < 0) { // See http://b/18228951 for details on this path.
LOG(INFO) << "successful NUMA node read from SysFS had negative value ("
<< value << "), but there must be at least one NUMA node"
", so returning NUMA node zero";
fclose(file);
return 0;
}
TensorFlow was able to open the /sys/bus/pci/devices/%s/numa_node file, where %s is the ID of the GPU PCI card (string pci_bus_id = CUDADriver::GetPCIBusID(device_)). Your PC is not multi-socket: there is only a single CPU socket with an 8-core Xeon E5-2670 installed, so this ID should be '0' (a single NUMA node is numbered 0 in Linux), but the error message says that the value in this file was -1!
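The handling of that value in the snippet above is small enough to restate. Here is a sketch in Python of what TryToReadNumaNode does with the file's contents (the function name is mine, not TensorFlow's):

```python
def numa_node_from_sysfs(content: str) -> int:
    """Mirror TensorFlow's TryToReadNumaNode handling of the value read
    from /sys/bus/pci/devices/<id>/numa_node: unparsable input maps to
    "unknown" (-1), and a negative value is clamped to NUMA node zero,
    which is exactly what produces the warning in the log above."""
    try:
        value = int(content.strip())
    except ValueError:
        return -1  # kUnknownNumaNode in the TensorFlow source
    return 0 if value < 0 else value
```

So the warning is cosmetic: TensorFlow already substitutes node 0 on its own.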
So we know that sysfs is mounted at /sys, the numa_node special file is there, and CONFIG_NUMA is enabled in your Linux kernel config (zgrep NUMA /boot/config* /proc/config*). It is in fact enabled: CONFIG_NUMA=y in the deb of your x86_64 4.4.0-78-generic kernel.
The special file numa_node is documented at https://www.kernel.org/doc/Documentation/ABI/testing/sysfs-bus-pci (is your PC's ACPI wrong?):
What: /sys/bus/pci/devices/.../numa_node
Date: Oct 2014
Contact: Prarit Bhargava <prarit@redhat.com>
Description:
This file contains the NUMA node to which the PCI device is
attached, or -1 if the node is unknown. The initial value
comes from an ACPI _PXM method or a similar firmware
source. If that is missing or incorrect, this file can be
written to override the node. In that case, please report
a firmware bug to the system vendor. Writing to this file
taints the kernel with TAINT_FIRMWARE_WORKAROUND, which
reduces the supportability of your system.
There is a quick (kludge) workaround for this error: find the numa_node file of your GPU and, with the root account, run this command after every boot, where NNNNN is the PCI ID of your card (search for it in the lspci output and in the /sys/bus/pci/devices/ directory):
echo 0 | sudo tee -a /sys/bus/pci/devices/NNNNN/numa_node
Or just echo it into every such file; it should be rather safe:
for a in /sys/bus/pci/devices/*; do echo 0 | sudo tee -a $a/numa_node; done
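A slightly narrower variant of the same kludge, if you would rather not touch every PCI device, is to filter on each device's vendor file first (0x10de is NVIDIA's PCI vendor ID). The sketch below runs against a fake sysfs tree for demonstration; for real use, set SYSFS_PCI=/sys/bus/pci/devices and run as root (or write via sudo tee):

```shell
#!/bin/sh
# Demonstration against a fake sysfs tree; point SYSFS_PCI at
# /sys/bus/pci/devices and run as root to patch the real files.
SYSFS_PCI="${SYSFS_PCI:-./fake-sysfs}"
mkdir -p "$SYSFS_PCI/0000:00:03.0"
echo "0x10de" > "$SYSFS_PCI/0000:00:03.0/vendor"     # 0x10de = NVIDIA
echo "-1" > "$SYSFS_PCI/0000:00:03.0/numa_node"      # the bogus Xen value

for dev in "$SYSFS_PCI"/*; do
    # Patch numa_node only where the vendor file says NVIDIA.
    if [ "$(cat "$dev/vendor" 2>/dev/null)" = "0x10de" ]; then
        echo 0 > "$dev/numa_node"
    fi
done

cat "$SYSFS_PCI/0000:00:03.0/numa_node"   # prints 0
```

Note that, as the kernel documentation quoted above says, writing to this file taints the kernel with TAINT_FIRMWARE_WORKAROUND, so this is a workaround, not a fix.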
Also, your lshw output shows that this is not a physical PC but a Xen virtual guest. Something is wrong between the Xen platform (ACPI) emulation and the Linux PCI-bus NUMA-support code.