Why does union consume more memory if the argument is a set?


Question


I'm puzzled by this behaviour of memory allocation of sets:

>>> set(range(1000)).__sizeof__()
32968
>>> set(range(1000)).union(range(1000)).__sizeof__()       # expected, set doesn't change
32968
>>> set(range(1000)).union(list(range(1000))).__sizeof__() # expected, set doesn't change
32968
>>> set(range(1000)).union(set(range(1000))).__sizeof__()  # not expected
65736


Why does using a set as the argument double the amount of memory used by the resulting set? The result in both cases is identical to the original set:

>>> set(range(1000)) == set(range(1000)).union(range(1000)) == set(range(1000)).union(set(range(1000)))
True


Note that the same happens using a normal iterator:

>>> set(range(1000)).union(iter(list(range(1000)))).__sizeof__()
32968

And the same happens with the update method:

>>> a = set(range(1000))
>>> a.update(range(1000))
>>> a.__sizeof__()
32968
>>> a.update(set(range(1000)))
>>> a.__sizeof__()
65736


At first I thought that it was because when union is called, it sees that the size of the other set is 1000 and thus decides to allocate enough memory to fit all the elements of both sets, but afterwards it uses only part of that memory; in the iterator case it simply iterates over it and adds the elements one by one (which doesn't consume more memory, since all the elements are already in the set).


But range is also a sequence, and so is the list in the first example.

>>> len(range(1000))
1000
>>> range(1000)[100]
100


So why doesn't this happen with range and list, but only with set? Is there a design decision behind this, or is it a bug?


Tested with Python 2.7.3 and Python 3.2.3 on 64-bit Linux.

Answer


In Python 2.7.3, set.union() delegates to a C function called set_update_internal(). The latter uses several different implementations depending on the Python type of its argument. This multiplicity of implementations is what explains the difference in behaviour between the tests you've conducted.


The implementation that's used when the argument is a set makes the following assumption documented in the code:

/* Do one big resize at the start, rather than
 * incrementally resizing as we insert new keys.  Expect
 * that there will be no (or few) overlapping keys.
 */


Clearly, the assumption of no (or few) overlapping keys is incorrect in your particular case. This is what results in the final set overallocating memory.
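The effect of that assumption can be seen directly (a sketch, not from the original answer; exact byte counts depend on the Python version and platform). Because the table is pre-sized for both operands up front, a union with a fully overlapping set ends up just as large as a union with a fully disjoint one, even though it holds half as many elements:

```python
# Sketch: the big upfront resize sizes the table for both operands,
# regardless of how many keys actually turn out to overlap.
base = set(range(1000))

overlap = base.union(set(range(1000)))         # same keys: half the table stays empty
disjoint = base.union(set(range(1000, 2000)))  # all new keys: the table is fully used

print(len(overlap), overlap.__sizeof__())      # 1000 elements in an oversized table
print(len(disjoint), disjoint.__sizeof__())    # 2000 elements in a similar-sized table
```

The overlapping union is no smaller than the disjoint one, which is exactly the "expect no (or few) overlapping keys" assumption at work.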


I am not sure I would call this a bug though. The implementer of set chose what to me looks like a reasonable tradeoff, and you've simply found yourself on the wrong side of that tradeoff.


The upside of the tradeoff is that in many cases the pre-allocation results in better performance:

In [20]: rhs = list(range(1000))

In [21]: %timeit set().union(rhs)
10000 loops, best of 3: 30 us per loop

In [22]: rhs = set(range(1000))

In [23]: %timeit set().union(rhs)
100000 loops, best of 3: 14 us per loop


Here, the set version is twice as fast, presumably because it doesn't repeatedly reallocate memory as it's adding elements from rhs.


If the overallocation is a deal-breaker, there are a number of ways to work around it, some of which you've already discovered.
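For instance (a sketch based on the behaviour observed above, not from the original answer): wrapping the argument in a plain iterator forces the element-by-element code path and skips the big resize, and rebuilding an oversized set also compacts it, since the rebuild is sized for the actual number of elements:

```python
import sys

base = set(range(1000))

oversized = base.union(set(range(1000)))   # set argument: big upfront resize
lean = base.union(iter(set(range(1000))))  # iterator: element-by-element path
compact = set(oversized)                   # rebuilding compacts an oversized set

# All three are equal as sets, but the first uses roughly twice the memory.
print(sys.getsizeof(oversized), sys.getsizeof(lean), sys.getsizeof(compact))
```

Both workarounds trade the pre-allocation speedup shown above for a smaller result, so they only pay off when the operands overlap heavily.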
