numpy.memmap: bogus memory allocation


Problem description

I have a python3 script that operates with numpy.memmap arrays. It writes an array to a newly generated temporary file that is located in /tmp:

import numpy, tempfile

size = 2 ** 37 * 10
tmp = tempfile.NamedTemporaryFile('w+')
array = numpy.memmap(tmp.name, dtype = 'i8', mode = 'w+', shape = size)
array[0] = 666
array[size-1] = 777
del array
array2 = numpy.memmap(tmp.name, dtype = 'i8', mode = 'r+', shape = size)
print('File: {}. Array size: {}. First cell value: {}. Last cell value: {}'.\
      format(tmp.name, len(array2), array2[0], array2[size-1]))
while True:
    pass

The size of the HDD is only 250G. Nevertheless, it can somehow generate 10T large files in /tmp, and the corresponding array still seems to be accessible. The output of the script is the following:

File: /tmp/tmptjfwy8nr. Array size: 1374389534720. First cell value: 666. Last cell value: 777

The file really exists and is displayed as being 10T large:

$ ls -l /tmp/tmptjfwy8nr
-rw------- 1 user user 10995116277760 Dec  1 15:50 /tmp/tmptjfwy8nr

However, the total size of the filesystem that holds /tmp is much smaller:

$ df -h /tmp
Filesystem      Size  Used Avail Use% Mounted on
/dev/sda1       235G  5.3G  218G   3% /

The process also appears to be using 10T of virtual memory, which should likewise be impossible. The output of the top command:

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND 
31622 user      20   0 10.000t  16592   4600 R 100.0  0.0   0:45.63 python3

As far as I understand, this means that during the call of numpy.memmap the memory needed for the whole array is not allocated, and that the displayed file size is therefore bogus. This in turn means that when I start to gradually fill the whole array with my data, at some point my program will crash or my data will be corrupted.

Indeed, if I introduce the following in my code:

for i in range(size):
    array[i] = i

After a while, I get the following error:

Bus error (core dumped)

Therefore, the question: how to check at the beginning, if there is really enough memory for the data and then indeed reserve the space for the whole array?

Recommended answer

There's nothing 'bogus' about the fact that you are generating 10 TB files

You are asking for arrays of size

2 ** 37 * 10 = 1374389534720 elements

A dtype of 'i8' means an 8 byte (64 bit) integer, therefore your final array will have a size of

1374389534720 * 8 = 10995116277760 bytes

10995116277760 / 1E12 = 10.99511627776 TB


If you only have 250 GB of free disk space then how are you able to create a "10 TB" file?

Assuming that you are using a reasonably modern filesystem, your OS will be capable of generating almost arbitrarily large sparse files, regardless of whether or not you actually have enough physical disk space to back them.

For example, on my Linux machine I'm allowed to do something like this:

# I only have about 50GB of free space...
~$ df -h /
Filesystem     Type  Size  Used Avail Use% Mounted on
/dev/sdb1      ext4  459G  383G   53G  88% /

~$ dd if=/dev/zero of=sparsefile bs=1 count=0 seek=10T
0+0 records in
0+0 records out
0 bytes (0 B) copied, 0.000236933 s, 0.0 kB/s

# ...but I can still generate a sparse file that reports its size as 10 TB
~$ ls -lah sparsefile
-rw-rw-r-- 1 alistair alistair 10T Dec  1 21:17 sparsefile

# however, this file uses zero bytes of "actual" disk space
~$ du -h sparsefile
0       sparsefile

Try calling du -h on your np.memmap file after it has been initialized to see how much actual disk space it uses.
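
For example (a minimal sketch that reuses the setup from the question and shells out to ls and du, so it assumes a Unix-like system):

import subprocess
import numpy, tempfile

size = 2 ** 37 * 10
tmp = tempfile.NamedTemporaryFile('w+')
array = numpy.memmap(tmp.name, dtype='i8', mode='w+', shape=size)
array[0] = 666
array[size - 1] = 777

# Apparent (sparse) file size vs. disk blocks actually in use
print(subprocess.check_output(['ls', '-lh', tmp.name]).decode())
print(subprocess.check_output(['du', '-h', tmp.name]).decode())

Here ls reports the apparent 10T size, while du reports only the handful of blocks touched by the two writes.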

As you start actually writing data to your np.memmap file, everything will be OK until you exceed the physical capacity of your storage, at which point the process will terminate with a Bus error. This means that if you needed to write < 250GB of data to your np.memmap array then there might be no problem (in practice this would probably also depend on where you are writing within the array, and on whether it is row or column major).

When you create a memory map, the kernel allocates a new block of addresses within the virtual address space of the calling process and maps them to a file on your disk. The amount of virtual memory that your Python process is using will therefore increase by the size of the file that has just been created. Since the file can also be sparse, then not only can the virtual memory exceed the total amount of RAM available, but it can also exceed the total physical disk space on your machine.
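
You can observe this growth directly; a minimal sketch, assuming Linux, since it parses VmSize from /proc/self/status:

import numpy, tempfile

def vm_size_kb():
    # Virtual memory size of the current process, in kB (Linux-specific)
    with open('/proc/self/status') as f:
        for line in f:
            if line.startswith('VmSize:'):
                return int(line.split()[1])

print('VmSize before:', vm_size_kb(), 'kB')
tmp = tempfile.NamedTemporaryFile('w+')
array = numpy.memmap(tmp.name, dtype='i8', mode='w+', shape=2 ** 37 * 10)
print('VmSize after: ', vm_size_kb(), 'kB')  # grows by roughly the size of the 10 TB mapping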

I'm assuming that you want to do this programmatically in Python.

  1. Get the amount of free disk space available. There are various methods given in the answers to this previous SO question. One option is os.statvfs:

import os

def get_free_bytes(path='/'):
    st = os.statvfs(path)
    return st.f_bavail * st.f_bsize

print(get_free_bytes())
# 56224485376

  2. Work out the size of your array in bytes:

    import numpy as np
    
    def check_asize_bytes(shape, dtype):
        return np.prod(shape) * np.dtype(dtype).itemsize
    
    print(check_asize_bytes((2 ** 37 * 10,), 'i8'))
    # 10995116277760
    

  3. Check whether 2. > 1, e.g. using the sketch below.
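
Putting steps 1 and 2 together (a minimal sketch; enough_space and the example path '/tmp' are illustrative choices):

import os
import numpy as np

def get_free_bytes(path='/'):
    st = os.statvfs(path)
    return st.f_bavail * st.f_bsize

def check_asize_bytes(shape, dtype):
    return np.prod(shape) * np.dtype(dtype).itemsize

def enough_space(shape, dtype, path='/tmp'):
    # True if the filesystem holding 'path' has room for the full array
    return check_asize_bytes(shape, dtype) <= get_free_bytes(path)

print(enough_space((2 ** 37 * 10,), 'i8'))
# False (on a machine with only ~250 GB free)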


    Update: Is there a 'safe' way to allocate an np.memmap file, which guarantees that sufficient disk space is reserved to store the full array?

    One possibility might be to use fallocate to pre-allocate the disk space, e.g.:

    ~$ fallocate -l 1G bigfile
    
    ~$ du -h bigfile
    1.1G    bigfile
    

    You could call this from Python, for example using subprocess.check_call:

    import subprocess
    import numpy as np
    
    def fallocate(fname, length):
        return subprocess.check_call(['fallocate', '-l', str(length), fname])
    
    def safe_memmap_alloc(fname, dtype, shape, *args, **kwargs):
        nbytes = np.prod(shape) * np.dtype(dtype).itemsize
        fallocate(fname, nbytes)
        return np.memmap(fname, dtype, *args, shape=shape, **kwargs)
    
    mmap = safe_memmap_alloc('test.mmap', np.int64, (1024, 1024))
    
    print(mmap.nbytes / 1E6)
    # 8.388608
    
    print(subprocess.check_output(['du', '-h', 'test.mmap']))
    # 8.0M    test.mmap
    

    I'm not aware of a platform-independent way to do this using the standard library, but there is a fallocate Python module on PyPI that should work for any Posix-based OS.
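
    For reference, on POSIX systems a similar effect can also be had from the standard library via os.posix_fallocate, without shelling out (still not platform-independent; a minimal sketch, where posix_safe_memmap_alloc is just an illustrative name):

    import os
    import numpy as np

    def posix_safe_memmap_alloc(fname, dtype, shape):
        # Reserve the full file size up front; posix_fallocate raises OSError
        # (e.g. ENOSPC) if the filesystem cannot actually provide the space
        nbytes = int(np.prod(shape)) * np.dtype(dtype).itemsize
        with open(fname, 'w+b') as f:
            os.posix_fallocate(f.fileno(), 0, nbytes)
        return np.memmap(fname, dtype, mode='r+', shape=shape)

    mmap2 = posix_safe_memmap_alloc('test2.mmap', np.int64, (1024, 1024))

    print(mmap2.nbytes / 1E6)
    # 8.388608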

