Converting NumPy array to a set takes too long


Problem Description


I'm trying to execute the following

from numpy import *
x = array([[3,2,3],[711,4,104],.........,[4,4,782,7845]])  # large nparray
for item in x:
    set(item)

and it takes very long compared to:

x = array([[3,2,3],[711,4,104],.........,[4,4,782,7845]])  # large nparray
for item in x:
    item.tolist()

Why does it take much longer to convert a NumPy array to a set than to a list? I mean basically both have complexity O(n)?

Solution

TL;DR: The set() function creates a set using Python's iteration protocol. But iterating (on the Python level) over NumPy arrays is so slow that using tolist() to convert the array to a Python list before doing the iteration is (much) faster.

To understand why iterating over NumPy arrays is so slow it's important to know how Python objects, Python lists, and NumPy arrays are stored in memory.

A Python object needs some bookkeeping properties (like the reference count, a link to its class, ...) and the value it represents. For example, the integer ten = 10 could look like this:

The blue circle is the "name" you use in the Python interpreter for the variable ten, and the lower object (instance) is what actually represents the integer (since the bookkeeping properties aren't important here I ignored them in the images).
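This per-object bookkeeping is visible from Python itself: sys.getsizeof reports the full size of the "box", which is far larger than the 8 bytes the raw machine integer would need. (A sketch; the exact number depends on the CPython version and platform.)

```python
import sys

ten = 10
# The "box" holds the reference count, the type pointer and the digits,
# so it is much bigger than a raw 8-byte machine integer.
print(sys.getsizeof(ten))  # e.g. 28 bytes on 64-bit CPython
```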

A Python list is just a collection of Python objects, for example mylist = [1, 2, 3] would be saved like this:

This time the list references the Python integers 1, 2 and 3 and the name mylist just references the list instance.

But an array myarray = np.array([1, 2, 3]) doesn't store Python objects as elements:

The values 1, 2 and 3 are stored directly in the NumPy array instance.


With this information I can explain why iterating over an array is so much slower compared to an iteration over a list:

Each time you access the next element in a list, the list just returns a stored object. That's very fast because the element already exists as a Python object (it just needs to increment the reference count by one).

On the other hand when you want an element of an array it needs to create a new Python "box" for the value with all the bookkeeping stuff before it is returned. When you iterate over the array it needs to create one Python box for each element in your array:
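You can observe this boxing directly with the `is` operator: indexing a list twice hands back the very same stored object, while indexing a NumPy array twice creates a fresh scalar box each time (the concrete scalar type, e.g. numpy.int64, is platform-dependent):

```python
import numpy as np

lst = [1, 2, 3]
arr = np.array([1, 2, 3])

# The list returns the same stored Python object on every access ...
print(lst[0] is lst[0])   # True
# ... while the array builds a brand-new scalar "box" for each access.
print(arr[0] is arr[0])   # False
```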

Creating these boxes is slow, and it's the main reason why iterating over NumPy arrays is much slower than iterating over Python collections (lists/tuples/sets/dictionaries), which store the values together with their boxes:

import numpy as np
arr = np.arange(100000)
lst = list(range(100000))

def iterateover(obj):
    for item in obj:
        pass

%timeit iterateover(arr)
# 20.2 ms ± 155 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit iterateover(lst)
# 3.96 ms ± 26.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

The set "constructor" just does an iteration over the object.

One thing I can't answer definitively is why the tolist method is so much faster. In the end each value in the resulting Python list needs to be in a "Python box", so there's not much work that tolist could avoid. But one thing I know for sure is that list(array) is slower than array.tolist():

arr = np.arange(100000)

%timeit list(arr)
# 20 ms ± 114 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit arr.tolist()
# 10.3 ms ± 253 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Each of these has O(n) runtime complexity but the constant factors are very different.
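Part of the difference shows up in the element types: list(arr) goes through the iteration protocol and yields NumPy scalar boxes one at a time, while tolist() converts everything to plain Python ints in a single C-level pass. A small check (the concrete scalar type, e.g. numpy.int64, depends on your platform):

```python
import numpy as np

arr = np.arange(5)
# list(arr) keeps NumPy scalar objects as elements ...
print(type(list(arr)[0]))     # e.g. <class 'numpy.int64'>
# ... while tolist() produces plain Python ints.
print(type(arr.tolist()[0]))  # <class 'int'>
```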

In your case you compared set() to tolist() - which isn't a particularly good comparison. It would make more sense to compare set(arr) to list(arr) or set(arr.tolist()) to arr.tolist():

arr = np.random.randint(0, 1000, (10000, 3))

def tosets(arr):
    for line in arr:
        set(line)

def tolists(arr):
    for line in arr:
        list(line)

def tolists_method(arr):
    for line in arr:
        line.tolist()

def tosets_intermediatelist(arr):
    for line in arr:
        set(line.tolist())

%timeit tosets(arr)
# 72.2 ms ± 2.68 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit tolists(arr)
# 80.5 ms ± 2.18 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit tolists_method(arr)
# 16.3 ms ± 140 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit tosets_intermediatelist(arr)
# 38.5 ms ± 200 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

So if you want sets, you are better off using set(arr.tolist()). For bigger arrays it could make sense to use np.unique, but because your rows only contain 3 items that will likely be slower (for thousands of elements it could be much faster!).
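For comparison, here is what the two approaches produce for a single row; note that np.unique returns a sorted array, not a set:

```python
import numpy as np

row = np.array([4, 4, 782, 7845])
# Converting through a Python list gives a real set ...
print(set(row.tolist()))   # {4, 782, 7845}
# ... while np.unique stays in NumPy and returns a sorted array.
print(np.unique(row))      # [   4  782 7845]
```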


In the comments you asked about numba and yes, it's true that numba could speed this up. Numba supports typed sets (numeric types only), but that doesn't mean it will always be faster.

I'm not sure how numba (re-)implements sets but because they are typed it's likely they also avoid the "Python boxes" and store the values directly inside the set:

Sets are more complicated than lists because they involve hashes and empty slots (Python uses open addressing for sets, so I assume numba will too).

Like the NumPy array, the numba set stores the values directly. So when you convert a NumPy array to a numba set (or vice versa) it won't need to use "Python boxes" at all, and when you create the sets in a numba nopython function it will be even faster than the set(arr.tolist()) operation:

import numba as nb
@nb.njit
def tosets_numba(arr):
    for lineno in range(arr.shape[0]):
        set(arr[lineno])

tosets_numba(arr)  # warmup
%timeit tosets_numba(arr)
# 6.55 ms ± 105 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

That's roughly five times faster than the set(arr.tolist()) approach. But it's important to highlight that I did not return the sets from the function. When you return a set from a nopython numba function to Python, Numba creates a Python set - including "creating the boxes" for all values in the set (that's something numba hides).

Just FYI: The same boxing/unboxing happens if you pass lists to Numba nopython functions or return lists from these functions. So what is an O(1) operation in Python becomes an O(n) operation with Numba! That's why it's generally better to pass NumPy arrays to numba nopython functions (which is O(1)).

I assume that if you return these sets from the function (not really possible right now because numba doesn't support lists of sets) it would be slower (because it creates a numba set and later converts it to a Python set) or only marginally faster (if the conversion numbaset -> pythonset is really, really fast).

Personally, I would use numba for sets only if I don't need to return them from the function, can do all operations on the set inside the function, and only if all those operations are supported in nopython mode. In any other case I wouldn't use numba here.


Just a note: from numpy import * should be avoided. When you do that you shadow several Python built-in functions (sum, min, max, ...) and it puts a lot of stuff into your globals. It's better to use import numpy as np. The np. in front of function calls makes the code clearer and isn't much extra typing.
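A small illustration of the shadowing problem: the built-in sum and numpy's sum behave differently, so from numpy import * can silently change results. (A sketch; builtins is used here only to keep both versions visible side by side.)

```python
import builtins
import numpy as np

data = [[1, 2], [3, 4]]
# The Python built-in concatenates when given a list as the start value ...
print(builtins.sum(data, []))  # [1, 2, 3, 4]
# ... but numpy's sum (which `from numpy import *` would put in its place)
# adds up all the elements instead.
print(np.sum(data))            # 10
```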

