Why is numpy.any so slow over large arrays?

Question

I'm looking for the most efficient way to determine whether a large array contains at least one nonzero value. At first glance np.any seems like the obvious tool for the job, but it seems unexpectedly slow over large arrays.

Consider this extreme example:

first = np.zeros(1E3,dtype=np.bool)
last = np.zeros(1E3,dtype=np.bool)

first[0] = True
last[-1] = True

# test 1
%timeit np.any(first)
>>> 100000 loops, best of 3: 6.36 us per loop

# test 2
%timeit np.any(last)
>>> 100000 loops, best of 3: 6.95 us per loop

At least np.any seems to be doing something vaguely sensible here - if the nonzero value is the first one in the array, there should be no need to consider any others before returning True, so I would expect test 1 to be slightly quicker than test 2.

However, what happens when we make the arrays much larger?

first = np.zeros(1E9,dtype=np.bool)
last = np.zeros(1E9,dtype=np.bool)

first[0] = True
last[-1] = True

# test 3
%timeit np.any(first)
>>> 10 loops, best of 3: 21.6 ms per loop

# test 4
%timeit np.any(last)
>>> 1 loops, best of 3: 739 ms per loop

As expected, test 4 is quite a lot slower than test 3. However, in test 3 np.any should still only have to check the value of a single element in first in order to know that it contains at least one nonzero value. Why, then, is test 3 so much slower than test 1?

I'm using a development version of Numpy (1.8.0.dev-e11cd9b), but I get exactly the same timing results using Numpy 1.7.1. I'm running 64-bit Linux, Python 2.7.4. My system is basically idling (I'm running an IPython session, a browser and a text editor), and I'm definitely not hitting swap. I also replicated the result on another machine running Numpy 1.7.1.

Using Numpy 1.6.2 I get times of ~1.85us for both tests 1 and 3, so as jorgeca says there seems to have been some performance regression between Numpy 1.6.2 and 1.7.0 in this regard.

Following J.F. Sebastian and jorgeca's lead I've done some more benchmarking using np.all on an array of zeros, which ought to be equivalent to calling np.any on an array where the first element is one.
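
As a quick illustration of that equivalence (my sketch, not part of the original benchmarks): in principle both calls can stop at index 0.

import numpy as np

x = np.zeros(1000, dtype=np.bool)   # all False: np.all(x) is decided at index 0
y = np.zeros(1000, dtype=np.bool)
y[0] = True                         # np.any(y) is decided at index 0

print np.all(x), np.any(y)          # False True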

Test script:

import timeit
import numpy as np
print 'Numpy v%s' %np.version.full_version
stmt = "np.all(x)"
for ii in xrange(10):
    # Array sizes from 1E0 up to 1E9
    setup = "import numpy as np; x = np.zeros(%d,dtype=np.bool)" %(10**ii)
    timer = timeit.Timer(stmt,setup)
    n,r = 1,3
    t = np.min(timer.repeat(r,n))
    # Scale the loop count up until each timing run takes at least 0.2s
    while t < 0.2:
        n *= 10
        t = np.min(timer.repeat(r,n))
    t /= n
    # Report the per-loop time with an appropriate unit
    if t < 1E-3:
        timestr = "%1.3f us" %(t*1E6)
    elif t < 1:
        timestr = "%1.3f ms" %(t*1E3)
    else:
        timestr = "%1.3f s" %t
    print "Array size: 1E%i, %i loops, best of %i: %s/loop" %(ii,n,r,timestr)

Results:

Numpy v1.6.2
Array size: 1E0, 1000000 loops, best of 3: 1.738 us/loop
Array size: 1E1, 1000000 loops, best of 3: 1.845 us/loop
Array size: 1E2, 1000000 loops, best of 3: 1.862 us/loop
Array size: 1E3, 1000000 loops, best of 3: 1.858 us/loop
Array size: 1E4, 1000000 loops, best of 3: 1.864 us/loop
Array size: 1E5, 1000000 loops, best of 3: 1.882 us/loop
Array size: 1E6, 1000000 loops, best of 3: 1.866 us/loop
Array size: 1E7, 1000000 loops, best of 3: 1.853 us/loop
Array size: 1E8, 1000000 loops, best of 3: 1.860 us/loop
Array size: 1E9, 1000000 loops, best of 3: 1.854 us/loop

Numpy v1.7.0
Array size: 1E0, 100000 loops, best of 3: 5.881 us/loop
Array size: 1E1, 100000 loops, best of 3: 5.831 us/loop
Array size: 1E2, 100000 loops, best of 3: 5.924 us/loop
Array size: 1E3, 100000 loops, best of 3: 5.864 us/loop
Array size: 1E4, 100000 loops, best of 3: 5.997 us/loop
Array size: 1E5, 100000 loops, best of 3: 6.979 us/loop
Array size: 1E6, 100000 loops, best of 3: 17.196 us/loop
Array size: 1E7, 10000 loops, best of 3: 116.162 us/loop
Array size: 1E8, 1000 loops, best of 3: 1.112 ms/loop
Array size: 1E9, 100 loops, best of 3: 11.061 ms/loop

Numpy v1.7.1
Array size: 1E0, 100000 loops, best of 3: 6.216 us/loop
Array size: 1E1, 100000 loops, best of 3: 6.257 us/loop
Array size: 1E2, 100000 loops, best of 3: 6.318 us/loop
Array size: 1E3, 100000 loops, best of 3: 6.247 us/loop
Array size: 1E4, 100000 loops, best of 3: 6.492 us/loop
Array size: 1E5, 100000 loops, best of 3: 7.406 us/loop
Array size: 1E6, 100000 loops, best of 3: 17.426 us/loop
Array size: 1E7, 10000 loops, best of 3: 115.946 us/loop
Array size: 1E8, 1000 loops, best of 3: 1.102 ms/loop
Array size: 1E9, 100 loops, best of 3: 10.987 ms/loop

Numpy v1.8.0.dev-e11cd9b
Array size: 1E0, 100000 loops, best of 3: 6.357 us/loop
Array size: 1E1, 100000 loops, best of 3: 6.399 us/loop
Array size: 1E2, 100000 loops, best of 3: 6.425 us/loop
Array size: 1E3, 100000 loops, best of 3: 6.397 us/loop
Array size: 1E4, 100000 loops, best of 3: 6.596 us/loop
Array size: 1E5, 100000 loops, best of 3: 7.569 us/loop
Array size: 1E6, 100000 loops, best of 3: 17.445 us/loop
Array size: 1E7, 10000 loops, best of 3: 115.109 us/loop
Array size: 1E8, 1000 loops, best of 3: 1.094 ms/loop
Array size: 1E9, 100 loops, best of 3: 10.840 ms/loop

Edit 4:

Following seberg's comment I tried the same test with an np.float32 array instead of np.bool. In this case, Numpy 1.6.2 also shows a slowdown as array sizes increase:

Numpy v1.6.2
Array size: 1E0, 100000 loops, best of 3: 3.503 us/loop
Array size: 1E1, 100000 loops, best of 3: 3.597 us/loop
Array size: 1E2, 100000 loops, best of 3: 3.742 us/loop
Array size: 1E3, 100000 loops, best of 3: 4.745 us/loop
Array size: 1E4, 100000 loops, best of 3: 14.533 us/loop
Array size: 1E5, 10000 loops, best of 3: 112.463 us/loop
Array size: 1E6, 1000 loops, best of 3: 1.101 ms/loop
Array size: 1E7, 100 loops, best of 3: 11.724 ms/loop
Array size: 1E8, 10 loops, best of 3: 116.924 ms/loop
Array size: 1E9, 1 loops, best of 3: 1.168 s/loop

Numpy v1.7.1
Array size: 1E0, 100000 loops, best of 3: 6.548 us/loop
Array size: 1E1, 100000 loops, best of 3: 6.546 us/loop
Array size: 1E2, 100000 loops, best of 3: 6.804 us/loop
Array size: 1E3, 100000 loops, best of 3: 7.784 us/loop
Array size: 1E4, 100000 loops, best of 3: 17.946 us/loop
Array size: 1E5, 10000 loops, best of 3: 117.235 us/loop
Array size: 1E6, 1000 loops, best of 3: 1.096 ms/loop
Array size: 1E7, 100 loops, best of 3: 12.328 ms/loop
Array size: 1E8, 10 loops, best of 3: 118.431 ms/loop
Array size: 1E9, 1 loops, best of 3: 1.172 s/loop

Why should this happen? As with the boolean case, np.all should still only have to check the first element before returning, so times should still be constant w.r.t. array size.

Answer

As has been guessed in the comments, I can confirm that the processing of the array is being done in chunks. First, I will show you where things are in the code and then I will show you how you can change the chunk size and the effect that doing so has on your benchmark.

np.all(x) is the same as x.all(). all() really calls np.core.umath.logical_and.reduce(x).
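
You can see this chain for yourself (a quick illustrative check; note that np.core.umath is an internal module):

import numpy as np

x = np.zeros(10, dtype=np.bool)
print np.all(x)                             # False
print x.all()                               # False
print np.core.umath.logical_and.reduce(x)   # False - same machinery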

If you want to dig into the numpy source, I will try to guide you through finding that a buffer/chunk size is used. The folder with all of the code we will be looking at is numpy/core/src/umath/.

PyUFunc_Reduce() in ufunc_object.c is the C function that handles the reduce. In PyUFunc_Reduce(), the chunk, or buffer, size is found by looking up the value for reduce in some global dictionary via the PyUFunc_GetPyValues() function (ufunc_object.c). On my machine, compiling from the development branch, the chunk size is 8192. PyUFunc_ReduceWrapper() in reduction.c is called to set up the iterator (with a stride equal to the chunk size), and it calls the passed-in loop function, which is reduce_loop() in ufunc_object.c.

reduce_loop() basically just uses the iterator and calls another innerloop() function for each chunk. The innerloop function is found in loops.c.src. For a boolean array and our case of all/logical_and, the appropriate innerloop function is BOOL_logical_and. You can find the right function by searching for BOOLEAN LOOPS; it is then the second function below that (it is hard to find due to the template-like programming used there). There you will find that short-circuiting is in fact being done for each chunk.
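
Here is a rough pure-Python model of that behaviour, consistent with the timings above (a sketch of the effect, not the actual C implementation):

def chunked_all(x, bufsize=8192):
    # The array is fed through a fixed-size buffer, chunk by chunk
    result = True
    for start in xrange(0, x.size, bufsize):
        buf = x[start:start + bufsize]   # each chunk passes through the buffer
        for value in buf:                # innerloop: short-circuits...
            if not value:
                result = False
                break                    # ...but only within this one chunk
    # Every chunk is still visited, so the cost grows with x.size even
    # when the first element already decides the answer
    return result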

You can get the chunk/buffer size with np.getbufsize(). For me, that returns 8192 without manually setting it, which matches the buffer size I found by printing it out in the code. You can use np.setbufsize() to change the chunk size.
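
For example (np.getbufsize() and np.setbufsize() are public Numpy functions; the default of 8192 is what I see on my machine):

import numpy as np

print np.getbufsize()   # 8192 by default
np.setbufsize(10**6)    # reductions now work in chunks of 1E6 elements
print np.getbufsize()   # 1000000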

I changed your benchmark code to the following:

import timeit
import numpy as np
print 'Numpy v%s' %np.version.full_version
stmt = "np.all(x)"
for ii in xrange(9):
    # Buffer size: at least 8192, at most 1E7 (larger values are rejected),
    # otherwise matched to the array size so each array fits in one chunk
    setup = "import numpy as np; x = np.zeros(%d,dtype=np.bool); np.setbufsize(%d)" %(10**ii, max(8192, min(10**ii, 10**7)))
    timer = timeit.Timer(stmt,setup)
    n,r = 1,3
    t = np.min(timer.repeat(r,n))
    while t < 0.2:
        n *= 10
        t = np.min(timer.repeat(r,n))
    t /= n
    if t < 1E-3:
        timestr = "%1.3f us" %(t*1E6)
    elif t < 1:
        timestr = "%1.3f ms" %(t*1E3)
    else:
        timestr = "%1.3f s" %t
    print "Array size: 1E%i, %i loops, best of %i: %s/loop" %(ii,n,r,timestr)

Numpy doesn't like the buffer size being too small or too big, so I made sure that it didn't get smaller than 8192 or larger than 1E7, since Numpy rejected a buffer size of 1E8. Otherwise, I set the buffer size to the size of the array being processed. I only went up to 1E8 because my machine only has 4GB of memory at the moment. Here are the results:

Numpy v1.8.0.dev-2a5c2c8
Array size: 1E0, 100000 loops, best of 3: 5.351 us/loop
Array size: 1E1, 100000 loops, best of 3: 5.390 us/loop
Array size: 1E2, 100000 loops, best of 3: 5.366 us/loop
Array size: 1E3, 100000 loops, best of 3: 5.360 us/loop
Array size: 1E4, 100000 loops, best of 3: 5.433 us/loop
Array size: 1E5, 100000 loops, best of 3: 5.400 us/loop
Array size: 1E6, 100000 loops, best of 3: 5.397 us/loop
Array size: 1E7, 100000 loops, best of 3: 5.381 us/loop
Array size: 1E8, 100000 loops, best of 3: 6.126 us/loop

There is a small uptick in the last timing because multiple chunks are being processed, due to the limit on how big the buffer size can be.
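
As a practical aside (my suggestion, not part of the original answer): if you need an early-exit any/all over huge arrays on these Numpy versions, you can chunk the scan yourself and break out as soon as the answer is known:

def any_chunked(x, chunk=2**20):
    # Scan in chunks and stop as soon as one chunk contains a nonzero;
    # for an array whose first element is True this touches only one chunk
    for start in xrange(0, x.size, chunk):
        if x[start:start + chunk].any():
            return True
    return False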
