About the changing id of an immutable string


Problem description

Something about the id of objects of type str (in Python 2.7) puzzles me. The str type is immutable, so I would expect that once a string is created, it will always have the same id. I don't think I'm phrasing this well, so I'll post an example input and output sequence instead.

>>> id('so')
140614155123888
>>> id('so')
140614155123848
>>> id('so')
140614155123808

So in the meantime, it changes all the time. However, once a variable points at that string, things change:

>>> so = 'so'
>>> id('so')
140614155123728
>>> so = 'so'
>>> id(so)
140614155123728
>>> not_so = 'so'
>>> id(not_so)
140614155123728

So it looks like the id is frozen once a variable holds that value. Indeed, after del so and del not_so, the output of id('so') starts changing again.

This is not the same behaviour as with (small) integers.

I know there is no real connection between immutability and having the same id; still, I am trying to figure out the source of this behaviour. I believe that someone who's familiar with Python's internals would be less surprised than me, so I am trying to reach the same point...

Update

Trying the same with a different string gave different results...

>>> id('hello')
139978087896384
>>> id('hello')
139978087896384
>>> id('hello')
139978087896384

Now it is equal...

Solution

CPython does not promise to intern all strings by default, but in practice, a lot of places in the Python codebase do reuse already-created string objects. A lot of Python internals use (the C-equivalent of) the sys.intern() function call to explicitly intern Python strings, but unless you hit one of those special cases, two identical Python string literals will produce different strings.
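The effect of explicit interning can be sketched as follows. This is a minimal illustration, assuming Python 3 (where the function lives in sys; Python 2 exposes it as the builtin intern()). The strings are built at runtime so that neither is a compile-time constant; the names parts, a, b, ia and ib are made up for the example.

```python
import sys

# Build two equal strings at runtime, so neither comes from a code constant.
parts = ['look', ' ', 'ma']
a = ''.join(parts)
b = ''.join(parts)

print(a == b)   # the contents are equal
print(a is b)   # but on CPython these are two distinct objects

# sys.intern() maps every equal string to one canonical object.
ia = sys.intern(a)
ib = sys.intern(b)
print(ia is ib)
```

Both interned results refer to the same object, so their id() values match for as long as a reference to that object is kept.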

Python is also free to reuse memory locations, and Python will also optimize immutable literals by storing them once, at compile time, with the bytecode in code objects. The Python REPL (interactive interpreter) also stores the most recent expression result in the _ name, which muddles up things some more.

As such, you will see the same id crop up from time to time.

Running just the line id(<string literal>) in the REPL goes through several steps:

  1. The line is compiled, which includes creating a constant for the string object:

    >>> compile("id('foo')", '<stdin>', 'single').co_consts
    ('foo', None)
    

    This shows the constants stored with the compiled bytecode; in this case the string 'foo' and the None singleton. Simple expressions that produce an immutable value may be optimised at this stage; see the note on optimizers below.

  2. On execution, the string is loaded from the code constants, and id() returns the memory location. The resulting int value is bound to _, as well as printed:

    >>> import dis
    >>> dis.dis(compile("id('foo')", '<stdin>', 'single'))
      1           0 LOAD_NAME                0 (id)
                  3 LOAD_CONST               0 ('foo')
                  6 CALL_FUNCTION            1
                  9 PRINT_EXPR          
                 10 LOAD_CONST               1 (None)
                 13 RETURN_VALUE        
    

  3. The code object is not referenced by anything, reference count drops to 0 and the code object is deleted. As a consequence, so is the string object.
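The first two steps above can be reproduced outside the REPL. This is a rough sketch, assuming Python 3 and using 'eval' mode instead of 'single' so that the result is returned rather than printed; the names code, const and result are illustrative only.

```python
# Step 1: compiling creates a code object that stores 'foo' as a constant.
code = compile("id('foo')", '<example>', 'eval')
print(code.co_consts)

# Pick the string constant out of the code object (its position may vary).
const = next(c for c in code.co_consts if c == 'foo')

# Step 2: executing the code loads that very object and passes it to id().
result = eval(code)
print(result == id(const))
```

As long as the code object is alive, the value returned by id() is the address of the constant held in co_consts. Once nothing references the code object any more, the constant can be freed along with it.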

Python may then reuse the same memory location for a new string object if you re-run the same code. This usually leads to the same memory address being printed when you repeat the code, but it depends on what else you do with your Python memory.

ID reuse is not predictable; if in the meantime the garbage collector runs to clear circular references, other memory could be freed and you'll get new memory addresses.
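A small experiment makes the distinction concrete: id() values are only guaranteed to be unique among objects that are alive at the same time. The sketch below assumes CPython; make() is a hypothetical helper that builds a fresh, non-constant string on each call.

```python
def make():
    # Hypothetical helper: returns a new string object on every call.
    return ''.join(['spam', ' ', 'eggs'])

# Each temporary is freed as soon as id() returns, so its address may be
# handed out again. This often prints True on CPython, but is not guaranteed.
first_id = id(make())
second_id = id(make())
print(first_id == second_id)

# While both objects are kept alive, their ids must differ.
keep_a = make()
keep_b = make()
print(id(keep_a) != id(keep_b))
```

Only the second comparison has a dependable outcome; the first depends on the state of the allocator at that moment.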

Next, the Python compiler will also intern any Python string stored as a constant, provided it looks enough like a valid identifier. The Python code object factory function PyCode_New will intern any string object that contains only ASCII letters, digits or underscores, by calling intern_string_constants(). This function recurses through the constants structures and for any string object v found there executes:

if (all_name_chars(v)) {
    PyObject *w = v;
    PyUnicode_InternInPlace(&v);
    if (w != v) {
        PyTuple_SET_ITEM(tuple, i, v);
        modified = 1;
    }
}

where all_name_chars() is documented as

/* all_name_chars(s): true iff s matches [a-zA-Z0-9_]* */
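For experimenting from Python, the check can be mirrored with a regular expression. This is an illustrative Python equivalent of the documented pattern, not the C implementation itself; the function name is reused here only for readability.

```python
import re

# Same pattern as in the C comment: zero or more ASCII letters, digits, underscores.
_NAME_CHARS = re.compile(r'[a-zA-Z0-9_]*')

def all_name_chars(s):
    """Return True if s consists only of ASCII name characters."""
    return _NAME_CHARS.fullmatch(s) is not None

print(all_name_chars('so'))
print(all_name_chars('hello'))
print(all_name_chars('Look ma, spaces and punctuation!'))
```

Strings for which this returns True are the ones the compiler interns when they appear as constants.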

Since you created strings that fit that criterion, they are interned, which is why you see the same ID being used for the 'so' string in your second test: as long as a reference to the interned version survives, interning will cause future 'so' literals to reuse the interned string object, even in new code blocks and bound to different identifiers. In your first test, you don't save a reference to the string, so the interned strings are discarded before they can be reused.

Incidentally, your new name so = 'so' binds a string to a name that contains the same characters. In other words, you are creating a global whose name and value are equal. As Python interns both identifiers and qualifying constants, you end up using the same string object for both the identifier and its value:

>>> compile("so = 'so'", '<stdin>', 'single').co_names[0] is compile("so = 'so'", '<stdin>', 'single').co_consts[0]
True

If you create strings that are either not code object constants, or contain characters outside of the letters + numbers + underscore range, you'll see the id() value not being reused:

>>> some_var = 'Look ma, spaces and punctuation!'
>>> some_other_var = 'Look ma, spaces and punctuation!'
>>> id(some_var)
4493058384
>>> id(some_other_var)
4493058456
>>> foo = 'Concatenating_' + 'also_helps_if_long_enough'
>>> bar = 'Concatenating_' + 'also_helps_if_long_enough'
>>> foo is bar
False
>>> foo == bar
True

The Python compiler uses either the peephole optimizer (Python versions < 3.7) or the more capable AST optimizer (3.7 and newer) to pre-calculate (fold) the results of simple expressions involving constants. The peephole optimizer limits its output to sequences of length 20 or less (to prevent bloating code objects and memory use), while the AST optimizer uses a separate limit of 4096 characters for strings. This means that concatenating shorter strings consisting only of name characters can still lead to interned strings if the resulting string fits within the optimizer limits of your current Python version.

E.g. on Python 3.7, 'foo' * 20 will result in a single interned string, because constant folding turns this into a single value, while on Python 3.6 or older only 'foo' * 6 would be folded:

>>> import dis, sys
>>> sys.version_info
sys.version_info(major=3, minor=7, micro=4, releaselevel='final', serial=0)
>>> dis.dis("'foo' * 20")
  1           0 LOAD_CONST               0 ('foofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoo')
              2 RETURN_VALUE

and

>>> dis.dis("'foo' * 6")
  1           0 LOAD_CONST               2 ('foofoofoofoofoofoo')
              2 RETURN_VALUE
>>> dis.dis("'foo' * 7")
  1           0 LOAD_CONST               0 ('foo')
              2 LOAD_CONST               1 (7)
              4 BINARY_MULTIPLY
              6 RETURN_VALUE
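Instead of reading the disassembly, you can also check whether an expression was folded by looking for its value in co_consts. This is a minimal sketch, assuming Python 3.7 or newer; is_folded() is a hypothetical helper written for this example, and it should only be given trusted constant expressions because it calls eval().

```python
def is_folded(source):
    # Hypothetical helper: True if the compiler stored the final value of
    # the expression as a constant instead of computing it at run time.
    code = compile(source, '<example>', 'eval')
    expected = eval(source)
    return any(type(c) is type(expected) and c == expected
               for c in code.co_consts)

print(is_folded("'foo' * 6"))
print(is_folded("'foo' * 20"))
```

On Python 3.7 and newer both expressions are folded; on an older interpreter with the peephole optimizer, the second one would still show up as a multiplication at run time.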
