numpy.memmap: bogus memory allocation
Question
I have a python3 script that operates on numpy.memmap arrays. It writes an array to a newly generated temporary file located in /tmp:
import numpy, tempfile

size = 2 ** 37 * 10
tmp = tempfile.NamedTemporaryFile('w+')
array = numpy.memmap(tmp.name, dtype='i8', mode='w+', shape=size)
array[0] = 666
array[size-1] = 777
del array
array2 = numpy.memmap(tmp.name, dtype='i8', mode='r+', shape=size)
print('File: {}. Array size: {}. First cell value: {}. Last cell value: {}'.
      format(tmp.name, len(array2), array2[0], array2[size-1]))
while True:
    pass
The size of the HDD is only 250G. Nevertheless, it can somehow generate 10T large files in /tmp, and the corresponding array still seems to be accessible. The output of the script is the following:
File: /tmp/tmptjfwy8nr. Array size: 1374389534720. First cell value: 666. Last cell value: 777
The file really exists and is displayed as being 10T large:
$ ls -l /tmp/tmptjfwy8nr
-rw------- 1 user user 10995116277760 Dec 1 15:50 /tmp/tmptjfwy8nr
However, the whole size of /tmp is much smaller:
$ df -h /tmp
Filesystem Size Used Avail Use% Mounted on
/dev/sda1 235G 5.3G 218G 3% /
The process also appears to be using 10T of virtual memory, which should likewise be impossible. The output of the top command:
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
31622 user 20 0 10.000t 16592 4600 R 100.0 0.0 0:45.63 python3
As far as I understand, this means that during the call of numpy.memmap the memory needed for the whole array is not allocated, and therefore the displayed file size is bogus. This in turn means that when I start to gradually fill the whole array with my data, at some point my program will crash or my data will be corrupted.
Indeed, if I introduce the following in my code:
for i in range(size):
    array[i] = i
after some time I get an error:
Bus error (core dumped)
Therefore, the question: how can I check at the beginning whether there is really enough memory for the data, and then indeed reserve the space for the whole array?
Answer
There's nothing 'bogus' about the fact that you are generating 10 TB files. You are asking for arrays of size
2 ** 37 * 10 = 1374389534720 elements
A dtype of 'i8' means an 8 byte (64 bit) integer, therefore your final array will have a size of
1374389534720 * 8 = 10995116277760 bytes
or
10995116277760 / 1E12 = 10.99511627776 TB
If you only have 250 GB of free disk space then how are you able to create a "10 TB" file?
Assuming that you are using a reasonably modern filesystem, your OS will be capable of generating almost arbitrarily large sparse files, regardless of whether or not you actually have enough physical disk space to back them.
For example, on my Linux machine I'm allowed to do something like this:
# I only have about 50GB of free space...
~$ df -h /
Filesystem Type Size Used Avail Use% Mounted on
/dev/sdb1 ext4 459G 383G 53G 88% /
~$ dd if=/dev/zero of=sparsefile bs=1 count=0 seek=10T
0+0 records in
0+0 records out
0 bytes (0 B) copied, 0.000236933 s, 0.0 kB/s
# ...but I can still generate a sparse file that reports its size as 10 TB
~$ ls -lah sparsefile
-rw-rw-r-- 1 alistair alistair 10T Dec 1 21:17 sparsefile
# however, this file uses zero bytes of "actual" disk space
~$ du -h sparsefile
0 sparsefile
Try calling du -h on your np.memmap file after it has been initialized to see how much actual disk space it uses.
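You can make the same check from Python: os.stat reports both the apparent size (st_size, what ls -l shows) and the number of 512-byte blocks actually allocated on disk (st_blocks, what du counts). A small sketch (POSIX-only, since st_blocks is not available on Windows):

```python
import os
import tempfile

import numpy as np

# Create a small memmap-backed file; numpy only writes a single byte at the
# end to set the file length, so the rest of the file is a sparse hole.
tmp = tempfile.NamedTemporaryFile()
arr = np.memmap(tmp.name, dtype='i8', mode='w+', shape=2 ** 20)  # 8 MiB apparent
arr.flush()

st = os.stat(tmp.name)
apparent = st.st_size        # what ls -l reports: 8388608 bytes
actual = st.st_blocks * 512  # bytes actually backed by disk blocks

print('apparent: {} bytes, allocated: {} bytes'.format(apparent, actual))
```

On most filesystems the allocated figure stays far below the apparent one until you actually write data into the array.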
As you start actually writing data to your np.memmap file, everything will be OK until you exceed the physical capacity of your storage, at which point the process will terminate with a Bus error. This means that if you needed to write < 250GB of data to your np.memmap array then there might be no problem (in practice this would probably also depend on where you are writing within the array, and on whether it is row or column major).
When you create a memory map, the kernel allocates a new block of addresses within the virtual address space of the calling process and maps them to a file on your disk. The amount of virtual memory that your Python process is using will therefore increase by the size of the file that has just been created. Since the file can also be sparse, then not only can the virtual memory exceed the total amount of RAM available, but it can also exceed the total physical disk space on your machine.
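On Linux you can watch this happen from inside the process: the VmSize field in /proc/self/status is the same figure that top shows as VIRT. A quick sketch (Linux-only, since it reads /proc):

```python
import re
import tempfile

import numpy as np

def vm_size_kb():
    # Read the process's current virtual memory size (in kB) from /proc.
    with open('/proc/self/status') as f:
        return int(re.search(r'VmSize:\s+(\d+) kB', f.read()).group(1))

before = vm_size_kb()
tmp = tempfile.NamedTemporaryFile()
arr = np.memmap(tmp.name, dtype='i8', mode='w+', shape=2 ** 24)  # 128 MiB mapping
after = vm_size_kb()

print('virtual size grew by ~{} MiB'.format((after - before) // 1024))
```

The growth matches the size of the mapping, even though almost no RAM or disk is used yet.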
I'm assuming that you want to do this programmatically in Python.
1. Get the amount of free disk space available. There are various methods given in the answers to this previous SO question. One option is os.statvfs:
import os

def get_free_bytes(path='/'):
    st = os.statvfs(path)
    return st.f_bavail * st.f_bsize

print(get_free_bytes())
# 56224485376
2. Work out the size of your array in bytes:
import numpy as np

def check_asize_bytes(shape, dtype):
    return np.prod(shape) * np.dtype(dtype).itemsize

print(check_asize_bytes((2 ** 37 * 10,), 'i8'))
# 10995116277760
3. Check whether 2. > 1.
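Putting the three steps together (a sketch; the fits_on_disk helper name is my own):

```python
import os

import numpy as np

def get_free_bytes(path='/'):
    # Step 1: free space on the filesystem containing `path`.
    st = os.statvfs(path)
    return st.f_bavail * st.f_bsize

def fits_on_disk(shape, dtype, path='/'):
    # Steps 2 and 3: array size in bytes, compared against the free space.
    needed = int(np.prod(shape)) * np.dtype(dtype).itemsize
    return needed <= get_free_bytes(path)

print(fits_on_disk((1024, 1024), 'i8'))     # a small 8 MiB array
print(fits_on_disk((2 ** 37 * 10,), 'i8'))  # the ~11 TB array from the question
```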
Update: Is there a 'safe' way to allocate an np.memmap file, which guarantees that sufficient disk space is reserved to store the full array? One possibility might be to use fallocate to pre-allocate the disk space, e.g.:
~$ fallocate -l 1G bigfile
~$ du -h bigfile
1.1G bigfile
You could call this from Python, for example using subprocess.check_call:
import subprocess

import numpy as np

def fallocate(fname, length):
    return subprocess.check_call(['fallocate', '-l', str(length), fname])

def safe_memmap_alloc(fname, dtype, shape, *args, **kwargs):
    nbytes = np.prod(shape) * np.dtype(dtype).itemsize
    fallocate(fname, nbytes)
    return np.memmap(fname, dtype, *args, shape=shape, **kwargs)

mmap = safe_memmap_alloc('test.mmap', np.int64, (1024, 1024))
print(mmap.nbytes / 1E6)
# 8.388608
print(subprocess.check_output(['du', '-h', 'test.mmap']))
# 8.0M test.mmap
I'm not aware of a platform-independent way to do this using the standard library, but there is a fallocate Python module on PyPI that should work for any POSIX-based OS.
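On POSIX systems you can also avoid the external command: os.posix_fallocate (Python 3.3+) calls the same underlying syscall as the fallocate tool and raises OSError (e.g. ENOSPC) if the space cannot be reserved. A sketch along the lines of the safe_memmap_alloc above:

```python
import os
import tempfile

import numpy as np

def safe_memmap_alloc(fname, dtype, shape, **kwargs):
    nbytes = int(np.prod(shape)) * np.dtype(dtype).itemsize
    # Reserve the disk blocks up front; raises OSError if there is no space.
    fd = os.open(fname, os.O_RDWR | os.O_CREAT)
    try:
        os.posix_fallocate(fd, 0, nbytes)
    finally:
        os.close(fd)
    return np.memmap(fname, dtype, mode='r+', shape=shape, **kwargs)

with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, 'test.mmap')
    mmap_arr = safe_memmap_alloc(path, np.int64, (1024, 1024))
    allocated = os.stat(path).st_blocks * 512
    print(mmap_arr.nbytes, allocated)
```

Unlike the subprocess approach, a failed allocation surfaces as a normal Python exception rather than a non-zero exit code.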