Python内存模型 [英] Python Memory Model

查看:22
本文介绍了Python内存模型的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有一个很大的清单假设我这样做了(是的,我知道代码非常不符合 Python 风格,但就示例而言..):

n = (2**32)**2对于我在 xrange(10**7)li[i] = n

工作正常.然而:

for i in xrange(10**7)li[i] = i**2

消耗大量内存.我不明白为什么会这样 - 存储大数需要更多位,而在 Java 中,第二个选项确实更节省内存......

有人对此有解释吗?

解决方案

Java 对一些值类型(包括整数)进行了特殊处理,以便它们按值存储(而不是像其他所有内容一样按对象引用存储).Python 不会对此类类型进行特殊处理,因此将 n 分配给列表(或其他普通 Python 容器)中的许多条目不必进行复制.

请注意,引用始终是对对象,而不是对变量"——在 Python(或 Java)中没有对变量的引用"这样的东西.例如:

<预><代码>>>>n = 23>>>a = [n,n]>>>打印 id(n), id(a[0]), id(a[1])8402048 8402048 8402048>>>n = 45>>>打印 id(n), id(a[0]), id(a[1])8401784 8402048 8402048

我们从第一次打印中看到,列表 a 中的两个条目引用的对象与 n 引用的对象完全相同——但是当 n> 被重新赋值,it 现在指的是一个不同的对象,而 a 中的两个条目仍然指的是前一个.

一个 array.array(来自 Python 标准库模块 array) 与列表非常不同:它保留同构类型的紧凑副本,每个项目只占用存储该类型值副本所需的少量位.所有普通容器都保持引用(在 C 编码的 Python 运行时内部实现为指向 PyObject 结构的指针:每个指针,在 32 位构建上,占用 4 个字节,每个 PyObject 至少 16 个左右[包括指向类型的指针,引用计数、实际值和 malloc 四舍五入]),数组不是(所以它们不能是异构的,不能有除一些基本类型之外的项目等).

例如,一个包含 1000 个项目的容器,所有项目都是不同的小整数(每个项目的值可以容纳 2 个字节),将需要大约 2,000 个字节的数据作为 array.array('h'),但作为 list 大约有 20,000 个.但是如果所有项目都是相同的数字,数组仍然需要 2,000 字节的数据,列表将只需要 20 左右 [[在每一种情况下,你必须为容器对象添加另外 16 或 32 字节正确的,除了数据的内存]].

然而,虽然问题说数组"(即使在标签中),我怀疑它的 arr 实际上是一个数组——如果是,它不能存储 (2**32)*2(数组中最大的 int 值是 32 位)并且实际上不会观察到问题中报告的内存行为.所以,问题实际上可能是关于一个列表,而不是一个数组.

编辑:@ooboo 的评论提出了许多合理的后续问题,我将其移至此处,而不是试图在评论中压缩详细解释.

<块引用>

虽然很奇怪 - 毕竟,如何对存储的整数的引用?id(variable) 给出一个整数,即引用本身就是一个整数,不是使用整数更便宜?

CPython 将引用存储为 PyObject 的指针(Jython 和 IronPython,用 Java 和 C# 编写,使用这些语言的隐式引用;PyPy,用 Python 编写,具有非常灵活的后端,可以使用许多不同的策略)

id(v) 给出(仅在 CPython 上)指针的数值(作为唯一标识对象的便捷方式).列表可以是异构的(有些项目可能是整数,其他项目可能是不同类型的对象),因此将一些项目存储为指向 PyObject 的指针和其他不同的项目并不是一个明智的选择(每个对象也需要一个类型指示,在 CPython 中,一个引用计数,至少)——array.array 是同质的且有限的,因此它确实可以(并且确实)存储项目值的副本而不是引用(这通常更便宜,但不适用于相同项目出现很多的集合,例如绝大多数项目为 0 的稀疏数组).

语言规范将完全允许 ​​Python 实现尝试更微妙的优化技巧,只要它保持语义不变,但据我所知,目前没有针对此特定问题的实现(您可以尝试破解 PyPy 后端,但如果检查 int 与非 int 的开销超过了预期的收益,请不要感到惊讶).

<块引用>

另外,如果我将 2**64 分配给每个插槽分配 n,当 n 持有一个参考 2**64?什么时候发生我只写1?

这些是完全允许每个实现做出的实现选择的示例,因为保留语义并不难(因此假设即使 3.1 和 3.2 在这方面的行为也可能不同).

当您使用 int 文字(或任何其他不可变类型的文字)或其他产生此类类型结果的表达式时,由实现决定是否无条件地创建该类型的新对象,或花一些时间检查这些对象,看看是否存在可以重用的现有对象.

在实践中,CPython(我相信其他实现,但我不太熟悉它们的内部结构)使用足够整数的单个副本(保留几个小的预定义 C 数组PyObject 形式的整数值,随时可以使用或在需要时重用),但通常不会特意寻找其他现有的可重用对象.

但例如,同一函数中相同的文字常量很容易编译为对函数常量表中单个常量对象的引用,因此这是一种非常容易完成的优化,我相信当前的每个 Python 实现都可以执行

有时很难记住 Python 是一种语言,它有几个实现可能(合法且正确地)在很多细节上有所不同——每个人,包括像我这样的学究,在谈论流行的 C 编码实现时,倾向于只说Python"而不是CPython"(在这种情况下,区分语言和实现之间的区别是最重要的;-).尽管如此,区别非常重要,值得偶尔重复一下.

I have a very large list Suppose I do that (yeah, I know the code is very unpythonic, but for the example's sake..):

n = (2**32)**2
for i in xrange(10**7)
  li[i] = n

works fine. however:

for i in xrange(10**7)
  li[i] = i**2

consumes a significantly larger amount of memory. I don't understand why that is - storing the big number takes more bits, and in Java, the the second option is indeed more memory-efficient...

Does anyone have an explanation for this?

解决方案

Java special-cases a few value types (including integers) so that they're stored by value (instead of, by object reference like everything else). Python doesn't special-case such types, so that assigning n to many entries in a list (or other normal Python container) doesn't have to make copies.

Edit: note that the references are always to objects, not "to variables" -- there's no such thing as "a reference to a variable" in Python (or Java). For example:

>>> n = 23
>>> a = [n,n]
>>> print id(n), id(a[0]), id(a[1])
8402048 8402048 8402048
>>> n = 45
>>> print id(n), id(a[0]), id(a[1])
8401784 8402048 8402048

We see from the first print that both entries in list a refer to exactly the same object as n refers to -- but when n is reassigned, it now refers to a different object, while both entries in a still refer to the previous one.

An array.array (from the Python standard library module array) is very different from a list: it keeps compact copies of a homogeneous type, taking as few bits per item as are needed to store copies of values of that type. All normal containers keep references (internally implemented in the C-coded Python runtime as pointers to PyObject structures: each pointer, on a 32-bit build, takes 4 bytes, each PyObject at least 16 or so [including pointer to type, reference count, actual value, and malloc rounding up]), arrays don't (so they can't be heterogeneous, can't have items except from a few basic types, etc).

For example, a 1000-items container, with all items being different small integers (ones whose values can fit in 2 bytes each), would take about 2,000 bytes of data as an array.array('h'), but about 20,000 as a list. But if all items were the same number, the array would still take 2,000 bytes of data, the list would take only 20 or so [[in every one of these cases you have to add about another 16 or 32 bytes for the container-object proper, in addition to the memory for the data]].

However, although the question says "array" (even in a tag), I doubt its arr is actually an array -- if it were, it could not store (2**32)*2 (largest int values in an array are 32 bits) and the memory behavior reported in the question would not actually be observed. So, the question is probably in fact about a list, not an array.

Edit: a comment by @ooboo asks lots of reasonable followup questions, and rather than trying to squish the detailed explanation in a comment I'm moving it here.

It's weird, though - after all, how is the reference to the integer stored? id(variable) gives an integer, the reference is an integer itself, isn't it cheaper to use the integer?

CPython stores references as pointers to PyObject (Jython and IronPython, written in Java and C#, use those language's implicit references; PyPy, written in Python, has a very flexible back-end and can use lots of different strategies)

id(v) gives (on CPython only) the numeric value of the pointer (just as a handy way to uniquely identify the object). A list can be heterogeneous (some items may be integers, others objects of different types) so it's just not a sensible option to store some items as pointers to PyObject and others differently (each object also needs a type indication and, in CPython, a reference count, at least) -- array.array is homogeneous and limited so it can (and does) indeed store a copy of the items' values rather than references (this is often cheaper, but not for collections where the same item appears a LOT, such as a sparse array where the vast majority of items are 0).

A Python implementation would be fully allowed by the language specs to try subtler tricks for optimization, as long as it preserves semantics untouched, but as far as I know none currently does for this specific issue (you could try hacking a PyPy backend, but don't be surprised if the overhead of checking for int vs non-int overwhelms the hoped-for gains).

Also, would it make a difference if I assigned 2**64 to every slot instead of assigning n, when n holds a reference to 2**64? What happens when I just write 1?

These are examples of implementation choices that every implementation is fully allowed to make, as it's not hard to preserve the semantics (so hypothetically even, say, 3.1 and 3.2 could behave differently in this regard).

When you use an int literal (or any other literal of an immutable type), or other expression producing a result of such a type, it's up to the implementation to decide whether to make a new object of that type unconditionally, or spend some time checking among such objects to see if there's an existing one it can reuse.

In practice, CPython (and I believe the other implementations, but I'm less familiar with their internals) uses a single copy of sufficiently small integers (keeps a predefined C array of a few small integer values in PyObject form, ready to use or reuse at need) but doesn't go out of its way in general to look for other existing reusable objects.

But for example identical literal constants within the same function are easily and readily compiled as references to a single constant object in the function's table of constants, so that's an optimization that's very easily done, and I believe every current Python implementation does perform it.

It can sometimes be hard to remember than Python is a language and it has several implementations that may (legitimately and correctly) differ in a lot of such details -- everybody, including pedants like me, tends to say just "Python" rather than "CPython" when talking about the popular C-coded implementation (excepts in contexts like this one where drawing the distinction between language and implementation is paramount;-). Nevertheless, the distinction is quite important, and well worth repeating once in a while.

这篇关于Python内存模型的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆