When does Python allocate new memory for identical strings?

Question

Two Python strings with the same characters, a == b, may share memory, id(a) == id(b), or may be stored in memory twice, id(a) != id(b). Try

ab = "ab"
print id( ab ), id( "a"+"b" )

Here Python recognizes that the newly created "a"+"b" is the same as the "ab" already in memory. Not bad.

Now consider an N-long list of state names [ "Arizona", "Alaska", "Alaska", "California" ... ] (N ~ 500000 in my case).
I see 50 different id()s ⇒ each string "Arizona" ... is stored only once, fine.
BUT write the list to disk and read it back in again: the "same" list now has N different id()s and uses far more memory, see below.

How come? Can anyone explain Python string memory allocation?

""" when does Python allocate new memory for identical strings ?
    ab = "ab"
    print id( ab ), id( "a"+"b" )  # same !
    list of N names from 50 states: 50 ids, mem ~ 4N + 50S, each string once
    but list > file > mem again: N ids, mem ~ N * (4 + S)
"""

from __future__ import division
from collections import defaultdict
from copy import copy
import cPickle
import random
import sys

states = dict(
AL = "Alabama",
AK = "Alaska",
AZ = "Arizona",
AR = "Arkansas",
CA = "California",
CO = "Colorado",
CT = "Connecticut",
DE = "Delaware",
FL = "Florida",
GA = "Georgia",
)

def nid(alist):
    """ nr distinct ids """
    return "%d ids  %d pickle len" % (
        len( set( map( id, alist ))),
        len( cPickle.dumps( alist, 0 )))  # rough est ?
# cf http://stackoverflow.com/questions/2117255/python-deep-getsizeof-list-with-contents

N = 10000
exec( "\n".join( sys.argv[1:] ))  # var=val ...
random.seed(1)

    # big list of random names of states --
names = []
for j in xrange(N):
    name = copy( random.choice( states.values() ))
    names.append(name)
print "%d strings in mem:  %s" % (N, nid(names) )  # 10 ids, even with copy()

    # list to a file, back again -- each string is allocated anew
joinsplit = "\n".join(names).split()  # same as > file > mem again
assert joinsplit == names
print "%d strings from a file:  %s" % (N, nid(joinsplit) )

# 10000 strings in mem:  10 ids  42149 pickle len  
# 10000 strings from a file:  10000 ids  188080 pickle len
# Python 2.6.4 mac ppc

Added 25 Jan:
There are two kinds of strings in Python memory (or any program's):

  • Ustrings, in a Ucache of unique strings: these save memory, and make a == b fast if both are in the Ucache
  • Ostrings, the others, which may be stored any number of times.

intern(astring) puts astring in the Ucache (Alex +1); other than that we know nothing at all about how Python moves Ostrings to the Ucache. How did "a"+"b" get in, after "ab"? ("Strings from files" is meaningless: there's no way of knowing.)
In short, Ucaches (there may be several) remain murky.
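
One plausible route for "a"+"b", at least in CPython, is compile-time constant folding: the compiler evaluates a concatenation of two literals before the code runs, so the result is an ordinary constant like "ab". The sketch below assumes CPython 3, where the builtin intern is spelled sys.intern; the helper runtime_concat is a name made up for illustration, and co_consts is a CPython detail rather than a language guarantee.

```python
import sys

# In CPython, "a" + "b" is folded when the source is compiled, so the
# code object carries the finished constant "ab" (implementation detail).
code = compile('x = "a" + "b"', "<example>", "exec")
folded = "ab" in code.co_consts

def runtime_concat(left, right):
    """Build a string at run time, where no compile-time folding applies."""
    return "".join([left, right])

s1 = runtime_concat("Ari", "zona")
s2 = runtime_concat("Ari", "zona")

# Equal strings built at run time may or may not share memory, but
# sys.intern maps every equal string to one canonical object.
u1 = sys.intern(s1)
u2 = sys.intern(s2)

print(folded, s1 == s2, u1 is u2)
```

In the question's terms, sys.intern is the explicit way to move a string into the "Ucache"; everything else is left to the implementation.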

A historical footnote: SPITBOL uniquified all strings ca. 1970.

Recommended answer

Each implementation of the Python language is free to make its own tradeoffs in allocating immutable objects (such as strings): either making a new one, or finding an existing equal one and adding one more reference to it, is just fine from the language's point of view. In practice, of course, real-world implementations strike a reasonable compromise: they add one more reference to a suitable existing object when locating such an object is cheap and easy, and simply make a new object if the task of locating a suitable existing one (which may or may not exist) looks like it could take a long time searching.
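
A practical consequence of the paragraph above is that code can rely on equality but not on identity. The following is a minimal sketch, assuming Python 3; the helper build is a hypothetical name used only to force the strings to be created at run time.

```python
def build(parts):
    """Assemble a string at run time from a list of pieces."""
    return "".join(parts)

a = build(["Ala", "ska"])
b = build(["Al", "aska"])

# Equality is guaranteed by the language.
same_value = (a == b)

# Identity is up to the implementation: either outcome is legal,
# so programs should not branch on it.
same_object = (a is b)

print("equal:", same_value, "| same object:", same_object)
```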

So, for example, multiple occurrences of the same string literal within a single function will (in all implementations I know of) use the "new reference to same object" strategy, because when building that function's constants pool it's pretty fast and easy to avoid duplicates. Doing so across separate functions, however, could be a very time-consuming task, so real-world implementations either don't do it at all, or do it only in some heuristically identified subset of cases where one can hope for a reasonable tradeoff of compilation time (slowed down by searching for identical existing constants) vs. memory consumption (increased if new copies of constants keep being made).
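
The per-function constants pool can be inspected directly in CPython. This is a sketch under that assumption: __code__.co_consts is CPython's representation of the pool, and other implementations may expose it differently or not at all. The function greet is invented for the example.

```python
def greet():
    # The same literal appears twice in one function body.
    first = "hello, world"
    second = "hello, world"
    return first, second

# The function's constants pool (co_consts in CPython) is expected
# to hold the literal only once.
occurrences = greet.__code__.co_consts.count("hello, world")

first, second = greet()
shared = first is second

print(occurrences, shared)
```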

I don't know of any implementation of Python (or, for that matter, of other languages with constant strings, such as Java) that takes the trouble of identifying possible duplicates (to reuse a single object via multiple references) when reading data from a file. It just doesn't seem to be a promising tradeoff, and here you'd be paying at run time, not compile time, so the tradeoff is even less attractive. Of course, if you know (thanks to application-level considerations) that such immutable objects are large and quite prone to many duplications, you can implement your own "constants pool" strategy quite easily (intern can help you do it for strings, but it's not hard to roll your own for, e.g., tuples with immutable items, huge long integers, and so forth).
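
A roll-your-own "constants pool" of the kind described above might look like the sketch below. It assumes Python 3; the class ConstantPool, the function read_names, and the sample data are all hypothetical, and io.StringIO stands in for a real file.

```python
import io

class ConstantPool:
    """Keep one canonical object per distinct hashable value."""

    def __init__(self):
        self._pool = {}

    def unique(self, value):
        # setdefault returns the object already stored for an equal key,
        # or stores and returns `value` if there is none yet.
        return self._pool.setdefault(value, value)

def read_names(stream, pool):
    """Read one name per line, routing each through the pool."""
    return [pool.unique(line.strip()) for line in stream if line.strip()]

# Stand-in for a file on disk (made-up data).
data = io.StringIO("Arizona\nAlaska\nAlaska\nCalifornia\nArizona\n")
pool = ConstantPool()
names = read_names(data, pool)

distinct_ids = len(set(map(id, names)))
distinct_values = len(set(names))

# The same pool works for other immutable values, such as tuples.
t1 = pool.unique(tuple([1, 2, 3]))
t2 = pool.unique(tuple([1, 2, 3]))

print(distinct_ids, distinct_values, t1 is t2)
```

For strings alone, sys.intern does the same job without a user-managed dict. One caveat of a dict-based pool is that it merges values that compare equal across types, for example (1,) and (1.0,), so it suits data whose types are uniform.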
