When does Python allocate new memory for identical strings?

Question

Two Python strings with the same characters, a == b, may share memory, id(a) == id(b), or may be in memory twice, id(a) != id(b). Try

ab = "ab"
print id( ab ), id( "a"+"b" )

Here Python recognizes that the newly created "a"+"b" is the same as the "ab" already in memory -- not bad.
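
What happens here is implementation-specific: CPython folds the constant expression "a"+"b" into "ab" at compile time and interns identifier-like string literals, so both expressions end up pointing at one cached object, while a concatenation performed at runtime does not. A small Python 3 sketch (the identity results shown are CPython behavior, not a language guarantee; Python 2 spells sys.intern as the built-in intern):

```python
import sys

ab = "ab"                # identifier-like literal: interned by CPython
folded = "a" + "b"       # folded to "ab" at compile time, same cached object
part = "a"
built = part + "b"       # concatenated at runtime: a brand-new object

print(folded is ab)             # True in CPython
print(built is ab)              # False in CPython: runtime results aren't interned
print(sys.intern(built) is ab)  # True: intern() maps it back to the cached copy
```

So the "not bad" merge above happens in the compiler, before the program runs; nothing watches for duplicates among strings created later.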

Now consider an N-long list of state names [ "Arizona", "Alaska", "Alaska", "California" ... ] (N ~ 500000 in my case).
I see 50 different id() s ⇒ each string "Arizona" ... is stored only once, fine.
BUT write the list to disk and read it back in again: the "same" list now has N different id() s, way more memory, see below.

How come -- can anyone explain Python string memory allocation ?

""" when does Python allocate new memory for identical strings ?
    ab = "ab"
    print id( ab ), id( "a"+"b" )  # same !
    list of N names from 50 states: 50 ids, mem ~ 4N + 50S, each string once
    but list > file > mem again: N ids, mem ~ N * (4 + S)
"""

from __future__ import division
from collections import defaultdict
from copy import copy
import cPickle
import random
import sys

states = dict(
AL = "Alabama",
AK = "Alaska",
AZ = "Arizona",
AR = "Arkansas",
CA = "California",
CO = "Colorado",
CT = "Connecticut",
DE = "Delaware",
FL = "Florida",
GA = "Georgia",
)

def nid(alist):
    """ nr distinct ids """
    return "%d ids  %d pickle len" % (
        len( set( map( id, alist ))),
        len( cPickle.dumps( alist, 0 )))  # rough est ?
# cf http://stackoverflow.com/questions/2117255/python-deep-getsizeof-list-with-contents

N = 10000
exec( "\n".join( sys.argv[1:] ))  # var=val ...
random.seed(1)

    # big list of random names of states --
names = []
for j in xrange(N):
    name = copy( random.choice( states.values() ))
    names.append(name)
print "%d strings in mem:  %s" % (N, nid(names) )  # 10 ids, even with copy()

    # list to a file, back again -- each string is allocated anew
joinsplit = "\n".join(names).split()  # same as > file > mem again
assert joinsplit == names
print "%d strings from a file:  %s" % (N, nid(joinsplit) )

# 10000 strings in mem:  10 ids  42149 pickle len  
# 10000 strings from a file:  10000 ids  188080 pickle len
# Python 2.6.4 mac ppc
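
The N-fresh-ids effect above is easy to reproduce without touching a file, and just as easy to undo by re-interning each parsed string. A Python 3 sketch (identities shown are CPython behavior; the dedup list comprehension is my addition, not part of the original code):

```python
import sys

names = ["Alaska", "Arizona", "Alaska"] * 2000
roundtrip = "\n".join(names).split("\n")   # same effect as list > file > mem again

print(len(set(map(id, names))))       # 2 in CPython: equal literals are shared
print(len(set(map(id, roundtrip))))   # 6000: split() builds every string anew

# dedup pass: one canonical object per distinct string value
deduped = [sys.intern(s) for s in roundtrip]
print(len(set(map(id, deduped))))     # back to 2
```

This is essentially the "constants-pool" strategy applied by hand at read time, trading one dict lookup per string for the memory savings.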

Added 25jan:
There are two kinds of strings in Python memory (or any program's):

  • Ustrings, in a Ucache of unique strings: these save memory, and can make a == b fast if both are in the Ucache
  • Ostrings, the others, which may be stored many times.

intern(astring) puts astring in the Ucache (Alex +1); other than that we know nothing at all about how Python moves Ostrings to the Ucache -- how did "a"+"b" get in, after "ab" ? ("Strings from files" is meaningless -- there's no way of knowing.)
In short, Ucaches (there may be several) remain murky.

A historical footnote: SPITBOL uniquified all strings ca. 1970.

Answer

Each implementation of the Python language is free to make its own tradeoffs in allocating immutable objects (such as strings) -- either making a new one, or finding an existing equal one and using one more reference to it, is just fine from the language's point of view. In practice, of course, real-world implementations strike reasonable compromises: one more reference to a suitable existing object when locating such an object is cheap and easy, but just make a new object if the task of locating a suitable existing one (which may or may not exist) looks like it could potentially take a long time searching.

So, for example, multiple occurrences of the same string literal within a single function will (in all implementations I know of) use the "new reference to same object" strategy, because when building that function's constants-pool it's pretty fast and easy to avoid duplicates; but doing so across separate functions could potentially be a very time-consuming task, so real-world implementations either don't do it at all, or only do it in some heuristically identified subset of cases where one can hope for a reasonable tradeoff of compilation time (slowed down by searching for identical existing constants) vs memory consumption (increased if new copies of constants keep being made).
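
That per-function constants pool is easy to observe from Python itself: duplicate literals inside one code object are collapsed into a single entry, visible through the code object's co_consts. A CPython-specific sketch (the is result is an implementation detail, not a language guarantee):

```python
def f():
    x = "spam"
    y = "spam"   # same literal in the same function: one pooled constant
    return x is y

print(f())                             # True in CPython: both names share it
print("spam" in f.__code__.co_consts)  # the single pooled entry is visible here
```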

I don't know of any implementation of Python (or for that matter other languages with constant strings, such as Java) that takes the trouble of identifying possible duplicates (to reuse a single object via multiple references) when reading data from a file -- it just doesn't seem to be a promising tradeoff (and here you'd be paying runtime, not compile time, so the tradeoff is even less attractive). Of course, if you know (thanks to application level considerations) that such immutable objects are large and quite prone to many duplications, you can implement your own "constants-pool" strategy quite easily (intern can help you do it for strings, but it's not hard to roll your own for, e.g., tuples with immutable items, huge long integers, and so forth).
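
The do-it-yourself "constants-pool" suggested here can be just a few lines: a dict mapping each value to the first copy seen works for any hashable immutable type, not only strings. A minimal sketch (uniq and _pool are hypothetical names of my choosing, not a standard API):

```python
_pool = {}

def uniq(obj):
    """Return a canonical shared copy of a hashable, immutable object."""
    # setdefault stores obj on first sight and returns the stored copy thereafter
    return _pool.setdefault(obj, obj)

t1 = uniq((1, "Alaska", 3))
t2 = uniq((1, "Alaska", 3))
print(t1 is t2)   # True: the second call reuses the pooled tuple
```

This is exactly what intern does for strings, generalized; the cost is that the pool keeps every pooled object alive until you clear it.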
