Why is allocation using np.empty not O(1)


Problem description


It is said on the official numpy docs

Return a new array of given shape and type, without initializing entries.

for np.empty, which would imply that the time taken to create (allocate) this array would be O(1), but some simple testing in timeit shows this is not the case:

>>> timeit.timeit(lambda: np.empty(100000000), number=10000)
0.2733485999999914
>>> timeit.timeit(lambda: np.empty(1000000000), number=10000)
0.8293009999999867

As a side question, what are the values present in an untouched np.empty array? They're all really small values, but I would expect them to just be whatever values are present in memory at that address. (Example array: np.empty(2) = array([-6.42940774e-036, 2.07409447e-117]). These don't seem anything like what would be stored in memory)

Solution

First of all, I tried to reproduce this behaviour on my machine with various sizes. Here are the raw results:

np.empty(10**1)   # 421 ns ± 23.7 ns per loop    (on 7 runs, 1000000 loops each)
np.empty(10**2)   # 406 ns ± 1.44 ns per loop    (on 7 runs, 1000000 loops each)
np.empty(10**3)   # 471 ns ± 5.8 ns per loop     (on 7 runs, 1000000 loops each)
np.empty(10**4)   # 616 ns ± 1.56 ns per loop    (on 7 runs, 1000000 loops each)
np.empty(10**5)   # 620 ns ± 2.83 ns per loop    (on 7 runs, 1000000 loops each)
np.empty(10**6)   # 9.61 µs ± 34.2 ns per loop   (on 7 runs, 100000 loops each)
np.empty(10**7)   # 11.1 µs ± 17.6 ns per loop   (on 7 runs, 100000 loops each)
np.empty(10**8)   # 22.1 µs ± 173 ns per loop    (on 7 runs, 10000 loops each)
np.empty(10**9)   # 62.8 µs ± 220 ns per loop    (on 7 runs, 10000 loops each)
np.empty(10**10)  # => Memory Error

Thus, you are right: this is not done in O(1) (at least not on my Windows machine, and apparently not on your system either). Note that the values cannot be (eagerly) initialized in such a short time, because that would imply a RAM throughput of more than 127 TB/s, which my machine clearly does not have.

for np.empty, which would imply that the time taken to create (allocate) this array would be O(1)

The hypothesis that allocations are done in O(1) is not totally true. To check that, I built a simple C program that runs a malloc+free loop (a sketch of such a loop is shown after the results below) and measured the timings. Here are the raw results:

./malloc.exe 10           # Average time:  41.815 ns (on 1 run, 1000000 loops each)
./malloc.exe 100          # Average time:  45.295 ns (on 1 run, 1000000 loops each)
./malloc.exe 1000         # Average time:  47.400 ns (on 1 run, 1000000 loops each)
./malloc.exe 10000        # Average time: 122.457 ns (on 1 run, 1000000 loops each)
./malloc.exe 100000       # Average time: 123.032 ns (on 1 run, 1000000 loops each)
./malloc.exe 1000000      # Average time:   8.351 us (on 1 run, 1000000 loops each)
./malloc.exe 10000000     # Average time:   9.342 us (on 1 run, 100000 loops each)
./malloc.exe 100000000    # Average time:  18.972 us (on 1 run, 10000 loops each)
./malloc.exe 1000000000   # Average time:  64.527 us (on 1 run, 10000 loops each)
./malloc.exe 10000000000  # => Memory error

As you can see, the results match those of Numpy (except for the small sizes, where the difference comes from the overhead of calling a Python function in CPython). Thus, the issue does not come from Numpy but from the allocation algorithm in the standard libc, or from the OS itself.
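
For reference, here is a minimal sketch of such a malloc+free timing loop. This is my own reconstruction under assumptions (the command-line size is interpreted as bytes, and timing uses the POSIX-style clock_gettime), not the exact benchmark used above:

/*
 * malloc.c -- minimal sketch of a malloc+free timing loop (a reconstruction
 * under assumptions, not the original benchmark). The size argument is
 * assumed to be in bytes; adapt the timer for MSVC if needed.
 *
 * Build: gcc -O2 malloc.c -o malloc.exe
 * Run:   ./malloc.exe 1000000 100000
 */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

static volatile void *g_sink;  /* keeps the compiler from removing malloc+free */

int main(int argc, char **argv)
{
    if (argc < 2) {
        fprintf(stderr, "usage: %s <bytes> [loops]\n", argv[0]);
        return 1;
    }
    size_t size  = (size_t)strtoull(argv[1], NULL, 10);
    long   loops = (argc > 2) ? strtol(argv[2], NULL, 10) : 100000;

    struct timespec start, end;
    clock_gettime(CLOCK_MONOTONIC, &start);
    for (long i = 0; i < loops; i++) {
        void *p = malloc(size);
        if (p == NULL) {                 /* matches the "Memory error" case */
            fprintf(stderr, "Memory error\n");
            return 1;
        }
        g_sink = p;                      /* keep the allocation observable */
        free(p);
    }
    clock_gettime(CLOCK_MONOTONIC, &end);

    double ns = (end.tv_sec - start.tv_sec) * 1e9
              + (end.tv_nsec - start.tv_nsec);
    printf("Average time: %.3f ns (%ld loops)\n", ns / loops, loops);
    return 0;
}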

As a side question, what are the values present in an untouched np.empty array?

It is uninitialized data. In practice, it is often zero-initialized (but not always), because mainstream platforms sanitize allocated memory for security reasons (so that critical data such as passwords do not leak after having been stored in the memory of another process). You should not rely on this.
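
To make this concrete, here is a small C sketch with no guarantees whatsoever: the behaviour is allocator- and OS-dependent, and reading uninitialized memory is formally undefined. It illustrates the two cases: a small block recycled from the libc pool may still contain stale data, while a large block served by fresh OS pages usually comes back zeroed.

/*
 * uninit.c -- illustrates (with no guarantees) what "uninitialized" means.
 * The exact output depends on the libc allocator and the OS.
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void)
{
    /* Small allocation: usually recycled from the libc user-space pool. */
    char *a = malloc(64);
    strcpy(a + 16, "stale data");   /* offset 16: skip bytes the allocator
                                       reuses for its free-list metadata */
    free(a);                        /* the block stays in the pool */

    char *b = malloc(64);           /* often the very same block as 'a' */
    printf("recycled small block: %.10s\n", b + 16);  /* may show "stale data" */
    free(b);

    /* Large allocation: typically fresh pages handed out by the OS,
     * zeroed for security reasons before the process sees them. */
    size_t n = 100u * 1000u * 1000u;
    unsigned char *big = malloc(n);
    if (big != NULL) {
        printf("first byte of big block: %d\n", big[0]);  /* usually 0 */
        free(big);
    }
    return 0;
}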


Deeper explanation of the malloc timings:

As you can see, there is a gap between the allocation of 100K items and 1M items. This can be explained by the use of a fast user-space allocator (backed by the sbrk/brk system calls on Unix and Linux systems): when the requested data are small, the libc of most mainstream platforms does not directly request memory from the operating system. It rather uses a fast, pre-allocated local memory pool. Actually, on most mainstream platforms, multiple pools of different sizes are pre-allocated, and the libc chooses the "right one" depending on the allocation size, hence the timing variation for small data sizes. Note that this process is done to improve allocation speed while keeping memory fragmentation under control. This strategy is much faster because kernel calls (like mmap) are very expensive (they take at least several microseconds on my machine).
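
On glibc specifically (an assumption; other libcs behave differently), the cut-off between the user-space pool and direct mmap calls is the tunable M_MMAP_THRESHOLD (128 KiB by default), which would be consistent with the jump between the 100000-byte and 1000000-byte malloc timings above. A hedged sketch of raising that threshold:

/*
 * threshold.c -- glibc-specific sketch (assumption: glibc's malloc).
 * By default, requests above M_MMAP_THRESHOLD are served by a dedicated
 * mmap/munmap pair, which is far slower than the user-space pool.
 */
#include <malloc.h>   /* mallopt, M_MMAP_THRESHOLD (glibc only) */
#include <stdlib.h>
#include <stdio.h>

int main(void)
{
    /* Serve even ~1 MB requests from the user-space pool instead of mmap. */
    if (mallopt(M_MMAP_THRESHOLD, 64 * 1024 * 1024) == 0)
        fprintf(stderr, "mallopt failed\n");

    for (int i = 0; i < 100000; i++) {
        void *p = malloc(1000000);   /* now pool-backed: much faster */
        if (p == NULL)
            return 1;
        free(p);
    }
    puts("done");
    return 0;
}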

Moreover, most operating systems (OS) have what looks like multiple memory pools. Linux, MacOS and Windows split the virtual memory into small pages (typically 4 KB). Since working with such small pages introduces a significant overhead when dealing with GB/TB of allocated data, these OSes also provide big pages, called super-pages or huge pages (typically 2 MB up to a few GB). The path taken inside the OS can change depending on the amount of allocated memory, and most OSes are optimized for allocating small chunks of virtual memory rather than big ones.
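
As an illustration (Linux-specific, and only a best-effort hint; this is an assumption, not something the answer above depends on), a process can ask the kernel to back a large anonymous mapping with transparent huge pages:

/*
 * hugepage.c -- Linux-specific sketch: request transparent huge pages
 * (typically 2 MB) for a large anonymous mapping instead of 4 KB pages.
 */
#include <sys/mman.h>
#include <stdio.h>

int main(void)
{
    size_t len = (size_t)1 << 30;   /* 1 GiB of virtual address space */
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) {
        perror("mmap");
        return 1;
    }
    /* Best-effort hint: use huge pages for this region if THP is enabled. */
    if (madvise(p, len, MADV_HUGEPAGE) != 0)
        perror("madvise(MADV_HUGEPAGE)");

    munmap(p, len);
    return 0;
}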

Note that the size of the data structures used to manage the system memory is often bounded by the size of the RAM, which is generally constant at runtime. Moreover, the complexity of the algorithms used in a given OS to manage memory fragmentation may theoretically be O(1) (or close to it). As a result, some people argue that allocating/freeing data is done in constant time. But this is controversial, because one should take practical results into account and not just theoretical asymptotic bounds.


For more information you can look at the following posts:
