About the changing id of an immutable string

Question

Something about the id of objects of type str (in Python 2.7) puzzles me. The str type is immutable, so I would expect that once a string is created, it will always have the same id. I don't think I'm phrasing this very well, so instead I'll post an example input and output sequence.

>>> id('so')
140614155123888
>>> id('so')
140614155123848
>>> id('so')
140614155123808

So, in the meantime, it changes all the time. However, once a variable points at that string, things change:

>>> so = 'so'
>>> id('so')
140614155123728
>>> so = 'so'
>>> id(so)
140614155123728
>>> not_so = 'so'
>>> id(not_so)
140614155123728

So it looks like the id is frozen once a variable holds that value. Indeed, after del so and del not_so, the output of id('so') starts changing again.

This is not the same behaviour as with (small) integers.

I know there is no real connection between immutability and having the same id; still, I am trying to figure out the source of this behaviour. I believe that someone who's familiar with Python's internals would be less surprised than me, so I am trying to reach the same point...

Trying the same with a different string gave different results...

>>> id('hello')
139978087896384
>>> id('hello')
139978087896384
>>> id('hello')
139978087896384

Now it is equal ...

Recommended answer

CPython does not promise to intern all strings by default, but in practice, a lot of places in the Python codebase do reuse already-created string objects. A lot of Python internals use (the C-equivalent of) the sys.intern() function call to explicitly intern Python strings, but unless you hit one of those special cases, two identical Python string literals will produce different strings.
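
The effect of explicit interning can be seen with a minimal sketch like the one below. It assumes CPython 3; the helper name build is made up for illustration, and the strings are assembled at runtime so that they are not compile-time constants.

```python
import sys


def build(parts):
    """Assemble a string at runtime so it is not a compile-time constant."""
    return ' '.join(parts)


a = build(['not', 'a', 'literal'])
b = build(['not', 'a', 'literal'])

# Equal contents; whether they are the same object is an implementation detail.
print(a == b, a is b)

# sys.intern() returns one canonical object per distinct string value,
# so interning both results gives back the same object.
ia = sys.intern(a)
ib = sys.intern(b)
print(ia is ib)
```

Only the last comparison is something sys.intern() is documented to provide; the identity of the two un-interned strings should not be relied upon.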

Python is also free to reuse memory locations, and Python will also optimize immutable literals by storing them once, at compile time, with the bytecode in code objects. The Python REPL (interactive interpreter) also stores the most recent expression result in the _ name, which muddles up things some more.

As such, you will see the same id crop up from time to time.

Running just the line id(<string literal>) in the REPL goes through several steps:

  1. The line is compiled, which includes creating a constant for the string object:

>>> compile("id('foo')", '<stdin>', 'single').co_consts
('foo', None)

This shows the constants stored with the compiled bytecode; in this case a string 'foo' and the None singleton. Simple expressions consisting of constants that produce an immutable value may be optimised at this stage; see the note on optimizers below.

  2. On execution, the string is loaded from the code constants, and id() returns the memory location. The resulting int value is bound to _, as well as printed:

>>> import dis
>>> dis.dis(compile("id('foo')", '<stdin>', 'single'))
  1           0 LOAD_NAME                0 (id)
              3 LOAD_CONST               0 ('foo')
              6 CALL_FUNCTION            1
              9 PRINT_EXPR          
             10 LOAD_CONST               1 (None)
             13 RETURN_VALUE        

  3. The code object is not referenced by anything, so its reference count drops to 0 and the code object is deleted. As a consequence, so is the string object.
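
The transcripts above come from Python 2.7. As a rough sketch of the same three steps on Python 3 (assuming CPython; exact opcode names and offsets differ between versions, so only the loaded constant is inspected here):

```python
import dis

code = compile("id('foo')", '<stdin>', 'single')

# Step 1: the literal is stored as a constant on the code object.
has_constant = 'foo' in code.co_consts
print(code.co_consts)

# Step 2: the bytecode loads that constant before calling id().
loaded = [ins.argval for ins in dis.get_instructions(code)
          if ins.opname == 'LOAD_CONST']
print(loaded)

# Step 3: dropping the last reference to the code object lets CPython
# free it, along with any constants that nothing else refers to.
del code
```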

    Python can then perhaps reuse the same memory location for a new string object, if you re-run the same code. This usually leads to the same memory address being printed if you repeat this code. This does depend on what else you do with your Python memory.

    ID reuse is not predictable; if in the meantime the garbage collector runs to clear circular references, other memory could be freed and you'll get new memory addresses.
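
A small experiment makes this concrete. The sketch below assumes CPython; ids_of_temporaries is a hypothetical helper, and how many distinct ids it reports depends on the interpreter version and on what else is happening in memory, so no particular count should be expected.

```python
def ids_of_temporaries(count=5):
    """Create and discard equal strings, recording each object's id()."""
    seen = []
    for _ in range(count):
        temp = '-'.join(['short', 'lived'])  # built at runtime
        seen.append(id(temp))
        del temp  # last reference gone, so the memory may be reused
    return seen


ids = ids_of_temporaries()
print(len(set(ids)), 'distinct ids out of', len(ids))

# id() is only guaranteed unique among objects alive at the same time,
# so for two live objects, identity and id() equality always agree.
first = '-'.join(['kept', 'alive'])
second = '-'.join(['kept', 'alive'])
print(first is second, id(first) == id(second))
```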

    Next, the Python compiler will also intern any Python string stored as a constant, provided it looks enough like a valid identifier. The Python code object factory function PyCode_New will intern any string object that contains only ASCII letters, digits or underscores, by calling intern_string_constants(). This function recurses through the constants structures and for any string object v found there executes:

    if (all_name_chars(v)) {
        PyObject *w = v;
        PyUnicode_InternInPlace(&v);
        if (w != v) {
            PyTuple_SET_ITEM(tuple, i, v);
            modified = 1;
        }
    }
    

    where all_name_chars() is documented as

    /* all_name_chars(s): true iff s matches [a-zA-Z0-9_]* */
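
To check whether a given literal meets that criterion, the documented pattern can be mirrored in Python. This is only an illustrative sketch: looks_like_name is a hypothetical helper, not part of CPython, and it reproduces the regular expression from the comment rather than the C implementation.

```python
import re

# Same character class as the C comment: ASCII letters, digits, underscore.
_NAME_CHARS = re.compile(r'[a-zA-Z0-9_]*')


def looks_like_name(text):
    """Hypothetical helper mirroring the documented all_name_chars() test."""
    return _NAME_CHARS.fullmatch(text) is not None


for sample in ('so', 'hello', 'Look ma, spaces and punctuation!'):
    print(repr(sample), looks_like_name(sample))
```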
    

    Since you created strings that fit that criterion, they are interned, which is why you see the same ID being used for the 'so' string in your second test: as long as a reference to the interned version survives, interning will cause future 'so' literals to reuse the interned string object, even in new code blocks and bound to different identifiers. In your first test, you don't save a reference to the string, so the interned strings are discarded before they can be reused.
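
The sharing across code blocks can be observed directly by compiling two unrelated snippets and comparing their constants. This is a sketch that assumes CPython; find_constant is a hypothetical helper introduced only for this illustration.

```python
def find_constant(code, value):
    """Return the string constant in `code` equal to `value`."""
    for const in code.co_consts:
        if isinstance(const, str) and const == value:
            return const
    raise LookupError(value)


# Two unrelated code objects, each compiled from its own 'so' literal.
first = compile("so = 'so'", '<first>', 'exec')
second = compile("not_so = 'so'", '<second>', 'exec')

a = find_constant(first, 'so')
b = find_constant(second, 'so')

# Both code objects are still alive here, so the interned string is too.
print(a is b)
```

On CPython this is expected to print True, because both code objects end up referring to the one interned 'so' object.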

    Incidentally, your new name so = 'so' binds a string to a name that contains the same characters. In other words, you are creating a global whose name and value are equal. As Python interns both identifiers and qualifying constants, you end up using the same string object for both the identifier and its value:

    >>> compile("so = 'so'", '<stdin>', 'single').co_names[0] is compile("so = 'so'", '<stdin>', 'single').co_consts[0]
    True
    

    If you create strings that are either not code object constants, or contain characters outside of the letters + numbers + underscore range, you'll see the id() value not being reused:

    >>> some_var = 'Look ma, spaces and punctuation!'
    >>> some_other_var = 'Look ma, spaces and punctuation!'
    >>> id(some_var)
    4493058384
    >>> id(some_other_var)
    4493058456
    >>> foo = 'Concatenating_' + 'also_helps_if_long_enough'
    >>> bar = 'Concatenating_' + 'also_helps_if_long_enough'
    >>> foo is bar
    False
    >>> foo == bar
    True
    

    The Python compiler either uses the peephole optimizer (Python versions < 3.7) or the more capable AST optimizer (3.7 and newer) to pre-calculate (fold) the results of simple expressions involving constants. The peephole optimizer limits its output to a sequence of length 20 or less (to prevent bloating code objects and memory use), while the AST optimizer uses a separate limit for strings of 4096 characters. This means that concatenating shorter strings consisting only of name characters can still lead to interned strings if the resulting string fits within the optimizer limits of your current Python version.

    E.g. on Python 3.7, 'foo' * 20 will result in a single interned string, because constant folding turns this into a single value, while on Python 3.6 or older only 'foo' * 6 would be folded:

    >>> import dis, sys
    >>> sys.version_info
    sys.version_info(major=3, minor=7, micro=4, releaselevel='final', serial=0)
    >>> dis.dis("'foo' * 20")
      1           0 LOAD_CONST               0 ('foofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoo')
                  2 RETURN_VALUE

    For comparison, on Python 3.6:
    

    >>> dis.dis("'foo' * 6")
      1           0 LOAD_CONST               2 ('foofoofoofoofoofoo')
                  2 RETURN_VALUE
    >>> dis.dis("'foo' * 7")
      1           0 LOAD_CONST               0 ('foo')
                  2 LOAD_CONST               1 (7)
                  4 BINARY_MULTIPLY
                  6 RETURN_VALUE
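
To see which of these expressions your own interpreter folds without reading bytecode, you can check whether the evaluated result already sits among the code object's constants. A minimal sketch, assuming CPython; is_folded is a hypothetical helper and the printed results depend on the Python version.

```python
def is_folded(expression):
    """Report whether the compiler stored the expression's value as a constant."""
    code = compile(expression, '<demo>', 'eval')
    value = eval(code)
    return any(type(const) is type(value) and const == value
               for const in code.co_consts)


for expression in ("'foo' * 2", "'foo' * 6", "'foo' * 7", "'foo' * 20"):
    print(expression, is_folded(expression))
```

A short result such as 'foo' * 2 fits within both optimizer limits described above; the longer ones are where versions before and after 3.7 are expected to differ.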
    
