将NumPy数组转换为集合需要太长时间 [英] Converting NumPy array to a set takes too long

查看:113
本文介绍了将NumPy数组转换为集合需要太长时间的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我正在尝试执行以下操作

I'm trying to execute the following

from numpy import *
x = array([[3,2,3],[711,4,104],.........,[4,4,782,7845]])  # large nparray
for item in x:
    set(item)

与之相比需要花费很长的时间

and it takes very long compared to:

x = array([[3,2,3],[711,4,104],.........,[4,4,782,7845]])  # large nparray
for item in x:
    item.tolist()

为什么将NumPy数组转换为set比转换为list需要更长的时间? 我的意思是基本上都具有复杂性O(n)?

Why does it take much longer to convert a NumPy array to a set than to a list? I mean basically both have complexity O(n)?

推荐答案

TL; DR:set()函数使用Python的迭代协议创建一个集合.但是在NumPy数组上迭代(在Python级别上)是如此缓慢,以至于在执行迭代之前使用tolist()将数组转换为Python列表要快得多.

TL;DR: The set() function creates a set using Pythons iteration protocol. But iterating (on the Python level) over NumPy arrays is so slow that using tolist() to convert the array to a Python list before doing the iteration is (much) faster.

要了解为什么对NumPy数组进行迭代如此之慢的原因,重要的是要了解Python对象,Python列表和NumPy数组如何存储在内存中.

To understand why iterating over NumPy arrays is so slow it's important to know how Python objects, Python lists, and NumPy arrays are stored in memory.

Python对象需要一些簿记属性(例如引用计数,指向其类的链接等)以及它表示的 value .例如,整数ten = 10可能看起来像这样:

A Python object needs some bookkeeping properties (like the reference count, a link to its class, ...) and the value it represents. For example the integer ten = 10 could look like this:

蓝色圆圈是您在Python解释器中使用的变量ten的名称",而下层对象(实例)实际上代表了整数(由于簿记属性在这里并不重要,因此我在其中忽略了它们)图片).

The blue circle is the "name" you use in the Python interpreter for the variable ten and the lower object (instance) is what actually represents the integer (since the bookkeeping properties aren't imporant here I ignored them in the images).

Python list只是Python对象的集合,例如mylist = [1, 2, 3]将这样保存:

A Python list is just a collection of Python objects, for example mylist = [1, 2, 3] would be saved like this:

这次,列表引用了Python整数123,而名称mylist仅引用了list实例.

This time the list references the Python integers 1, 2 and 3 and the name mylist just references the list instance.

但是数组myarray = np.array([1, 2, 3])不会将Python对象存储为元素:

But an array myarray = np.array([1, 2, 3]) doesn't store Python objects as elements:

123直接存储在NumPy array实例中.

The values 1, 2 and 3 are stored directly in the NumPy array instance.

有了这些信息,我可以解释为什么在array上进行迭代比在list上进行迭代要慢得多:

With this information I can explain why iterating over an array is so much slower compared to an iteration over a list:

每次访问list中的下一个元素时,list只会返回一个存储的对象.这非常快,因为该元素已经作为Python对象存在(它只需要将引用计数加1)即可.

Each time you access the next element in a list the list just returns a stored object. That's very fast because the element already exists as Python object (it just needs to increment the reference count by one).

另一方面,当您需要array的元素时,需要在返回值之前为包含所有簿记内容的值创建一个新的Python框".遍历数组时,需要为数组中的每个元素创建一个Python框:

On the other hand when you want an element of an array it needs to create a new Python "box" for the value with all the bookkeeping stuff before it is returned. When you iterate over the array it needs to create one Python box for each element in your array:

创建这些框很慢,并且遍历NumPy数组的主要原因比遍历存储值及其框的Python集合(列表/元组/集合/字典)慢得多的主要原因:

Creating these boxes is slow and the main reason why iterating over NumPy arrays is much slower than iterating over Python collections (lists/tuples/sets/dictionaries) which store the values and their box:

import numpy as np
arr = np.arange(100000)
lst = list(range(100000))

def iterateover(obj):
    for item in obj:
        pass

%timeit iterateover(arr)
# 20.2 ms ± 155 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit iterateover(lst)
# 3.96 ms ± 26.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

set构造函数"只是对对象进行迭代.

The set "constructor" just does an iteration over the object.

我绝对不能回答的一件事是为什么tolist方法如此之快.最后,结果Python列表中的每个值都必须放在"Python框"中,因此tolist 可以避免很多工作.但是我可以确定的一件事是list(array)array.tolist()慢:

One thing I can't answer definitely is why the tolist method is so much faster. In the end each value in the resulting Python list needs to be in a "Python box" so there's not much work that tolist could avoid. But one thing I know for sure is that list(array) is slower than array.tolist():

arr = np.arange(100000)

%timeit list(arr)
# 20 ms ± 114 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit arr.tolist()
# 10.3 ms ± 253 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

其中的每一个都具有O(n)运行时复杂性,但是恒定因素却大不相同.

Each of these has O(n) runtime complexity but the constant factors are very different.

在您的情况下,您确实将set()tolist()进行了比较-这并不是一个很好的比较.将set(arr)list(arr)set(arr.tolist())arr.tolist()进行比较会更有意义:

In your case you did compare set() to tolist() - which isn't a particular good comparison. It would make more sense to compare set(arr) to list(arr) or set(arr.tolist()) to arr.tolist():

arr = np.random.randint(0, 1000, (10000, 3))

def tosets(arr):
    for line in arr:
        set(line)

def tolists(arr):
    for line in arr:
        list(line)

def tolists_method(arr):
    for line in arr:
        line.tolist()

def tosets_intermediatelist(arr):
    for line in arr:
        set(line.tolist())

%timeit tosets(arr)
# 72.2 ms ± 2.68 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit tolists(arr)
# 80.5 ms ± 2.18 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit tolists_method(arr)
# 16.3 ms ± 140 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit tosets_intermediatelist(arr)
# 38.5 ms ± 200 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

因此,如果需要set,则最好使用set(arr.tolist()).对于更大的数组,可以使用 np.unique ,但是因为您的行仅包含3个可能会变慢的项目(对于成千上万个元素,可能会快得多!).

So if you want sets you are better off using set(arr.tolist()). For bigger arrays it could make sense to use np.unique but because your rows only contain 3 items that will likely be slower (for thousands of elements it could be much faster!).

在您询问有关numba的评论中,是的,确实numba可以加快速度. Numba支持类型集(仅数字类型) ),但这并不意味着它总是总是更快.

In the comments you asked about numba and yes, it's true that numba could speed this up. Numba supports typed sets (only numeric types), but that doesn't mean it will be always faster.

我不确定numba是如何(重新)实现set的,但是由于键入了它们,因此很可能还避免了"Python框",并将值直接存储在set内:

I'm not sure how numba (re-)implements sets but because they are typed it's likely they also avoid the "Python boxes" and store the values directly inside the set:

集合比list更复杂,因为它们涉及到hash es和空插槽(Python对集合使用开放式寻址,因此我假设numba也会这样做).

Sets are more complicated than lists because it they involve hashes and empty slots (Python uses open-addressing for sets, so I assume numba will too).

就像NumPy array一样,numba set直接保存值.因此,当您将NumPy array转换为numba set(或反之亦然)时,它根本不需要使用"Python框",因此当您在numba 中创建set时nopython 函数,它将比set(arr.tolist())操作快得多:

Like the NumPy array the numba set saves the values directly. So when you convert a NumPy array to a numba set (or vise-versa) it won't need to use "Python boxes" at all, so when you create the sets in a numba nopython function it will be much faster even than the set(arr.tolist()) operation:

import numba as nb
@nb.njit
def tosets_numba(arr):
    for lineno in range(arr.shape[0]):
        set(arr[lineno])

tosets_numba(arr)  # warmup
%timeit tosets_numba(arr)
# 6.55 ms ± 105 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

这大约比set(arr.tolist())方法快五倍.但是重要的是要强调一点,我没有从函数中返回set.当您将set从nopython numba函数返回到Python时,Numba会创建一个python集-包括为该集中的所有值创建框"(这是numba所隐藏的东西).

That's roughly five times faster than the set(arr.tolist()) approach. But it's important to highlight that I did not return the sets from the function. When you return a set from a nopython numba function to Python Numba creates a python set - including "creating the boxes" for all values in the set (that's something numba is hiding).

仅供参考:如果将list传递给Numba nopython函数或从这些函数返回列表,则会发生相同的装箱/拆箱操作.因此,Python中的O(1)操作是Numba的O(n)操作!这就是为什么通常最好将NumPy数组传递给numba nopython函数(即O(1)).

Just FYI: The same boxing/unboxing happens if you pass lists to Numba nopython functions or return lists from these functions. So what's a O(1) operation in Python is an O(n) operation with Numba! That's why it's generally better to pass NumPy arrays to numba nopython function (which is O(1)).

我认为,如果您从函数中返回这些集合(目前无法实现,因为numba当前不支持集合列表)会更慢(因为它创建了一个numba集合稍后将其转换为python集)或仅略微加快(如果转换numbaset-> pythonset确实非常快).

I assume that if you return these sets from the function (not really possible right now because numba doesn't support lists of sets currently) it would be slower (because it creates a numba set and later converts it to a python set) or only marginally faster (if the conversion numbaset -> pythonset is really, really fast).

就我个人而言,仅当我不需要从函数返回numba且不需要在函数中对集合进行所有操作时,才使用numba在nopython模式下受支持.在任何其他情况下,我都不会在这里使用numba.

Personally I would use numba for sets only if I don't need to return them from the function and do all operations on the set inside the function and only if all the operations on the set are supported in nopython mode. In any other case I wouldn't use numba here.

仅需注意:from numpy import *应该避免,这样做会隐藏几个python内置函数(summinmax,...),并且会放置很多东西进入您的全局变量.最好使用import numpy as np.函数调用前面的np.使代码更清晰,键入的内容也不多.

Just a note: from numpy import * should be avoided, you hide several python built-in functions when you do that (sum, min, max, ...) and it puts a lot of stuff into your globals. Better to use import numpy as np. The np. in front of function calls makes the code clearer and isn't much to type.

这篇关于将NumPy数组转换为集合需要太长时间的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆