CPython: Why does += for strings change the id of the string variable?

Question

CPython optimizes string increment operations: when initializing memory for a string, the program leaves extra expansion space for it, so when incrementing, the original string is not copied to a new location. My question is why the id of the string variable changes.

>>> s = 'ab'
>>> id(s)
991736112104
>>> s += 'cd'
>>> id(s)
991736774080

Why does the id of the string variable change?

Recommended answer

The optimization you are trying to trigger is an implementation detail of CPython and is quite subtle: many details (e.g. the one you are experiencing) can prevent it.

For a detailed explanation, one needs to dive into CPython's implementation, so first I will give a hand-waving explanation, which should convey at least the gist of what is going on. The gory details are in the second part, which highlights the important parts of the code.

Let's take a look at this function, which exhibits the desired (optimized) behavior:

def add_str(str1, str2, n):
    for i in range(n):
        str1+=str2
        print(id(str1))
    return str1

Calling it leads to the following output:

>>> add_str("1","2",100)
2660336425032
... 4 times
2660336425032
2660336418608
... 6 times
2660336418608
2660336361520
... 6 times
2660336361520
2660336281800
 and so on

That is, a new string is created only on every eighth addition; otherwise the old string (or, as we will see, its memory) is reused. The first id is printed only 6 times because printing starts when the size of the unicode object is 2 modulo 8 (and not 0, as in the later cases).

The first question is: if a string is immutable in CPython, how (or better, when) can it be changed? Obviously, we can't change the string if it is bound to several variables. But we could change it if the current variable holds the only reference, which is easy to check thanks to CPython's reference counting (and this is the reason why the optimization isn't available in other implementations that don't use reference counting).

Let's change the function above by adding an additional reference:

def add_str2(str1, str2, n):
    for i in range(n):
        ref = str1
        str1+=str2
        print(id(str1))
    return str1

Calling it leads to:

>>> add_str2("1","2",20)
2660336437656
2660337149168
2660337149296
2660337149168
2660337149296
... every time a different string - there is copying!
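
The effect of the extra reference can also be measured instead of read off the printed ids. The following is a minimal sketch, assuming CPython; count_id_changes is a hypothetical helper, not part of the original answer. With the second reference kept alive, the old and the new string exist at the same time, so the id must change on every step. Without it, the count depends on the allocator and the interpreter version, so no particular number is claimed.

```python
def count_id_changes(n, keep_ref):
    """Append one character n times and count how often id(s) changes."""
    s = "1"
    changes = 0
    for _ in range(n):
        before = id(s)
        # Optionally hold a second reference to the old string object.
        ref = s if keep_ref else None
        s += "2"
        if id(s) != before:
            changes += 1
    return changes


# With a second reference every append yields a different object.
print(count_id_changes(100, keep_ref=True))
# Without it, the result is implementation- and allocator-dependent.
print(count_id_changes(100, keep_ref=False))
```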

This actually explains your observation:

import sys
s = 'ab'
print(sys.getrefcount(s))
# 9
print(id(s))
# 2660273077752
s+='a'
print(id(s))
# 2660337158664  Different

Your string s is interned (see for example this SO-answer for more information about string interning and the integer pool), so s is not the only one "using" this string, and therefore the string cannot be changed.
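
To make the sharing concrete, here is a small sketch using sys.intern, which returns the one canonical object for a given text. It only illustrates that several names can end up referring to a single interned object; the variable names are made up for the example.

```python
import sys

a = sys.intern("ab")
# Build an equal string at runtime, then ask for its interned version.
b = sys.intern("".join(["a", "b"]))

# Both names refer to the same interned object, so changing it in place
# would silently change the value seen through the other name as well.
print(a is b)
```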

If we avoid the interning, we can see that the string is reused:

import sys
s = 'ab'*21  # will not be interned
print(sys.getrefcount(s))
# 2, that means really not interned
print(id(s))
# 2660336107312
s+='a'
print(id(s))
# 2660336107312  the same id!

But how does this optimization work?

CPython uses its own memory management - the pymalloc allocator, which is optimized for small objects with short lifetimes. Memory blocks are handed out in multiples of 8 bytes (in the CPython version discussed here), which means that if the allocator is asked for only 1 byte, 8 bytes are still marked as used (more precisely, because of the 8-byte alignment of the returned pointers, the remaining 7 bytes cannot be used for other objects).

There is, however, the function PyMem_Realloc: if the allocator is asked to reallocate a 1-byte block as a 2-byte block, there is nothing to do, because some bytes were reserved anyway.

This way, if there is only one reference to the string, CPython can ask the allocator to reallocate the string with one more byte. In 7 cases out of 8 there is nothing for the allocator to do, and the additional byte becomes available almost for free.

However, if the size of the string changes by more than 7 bytes, copying becomes mandatory:

>>> add_str("1", "1"*8, 20)  # size change of 8
2660337148912
2660336695312
2660336517728
... every time another id
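
The arithmetic behind both observations can be replayed with a toy model. This is only a sketch under the assumption of 8-byte block granularity described above; block_size and count_moves are hypothetical helpers that model the rounding, not CPython functions, and the starting size of 50 bytes is an arbitrary example value.

```python
ALIGNMENT = 8  # block granularity assumed in the text


def block_size(nbytes, alignment=ALIGNMENT):
    """Round a request up to the next multiple of `alignment`."""
    return (nbytes + alignment - 1) // alignment * alignment


def count_moves(start, step, n, alignment=ALIGNMENT):
    """Count how many of n appends of `step` bytes outgrow the current block."""
    size = start
    capacity = block_size(size, alignment)
    moves = 0
    for _ in range(n):
        size += step
        if size > capacity:
            # The reserved padding is used up: a bigger block is needed.
            capacity = block_size(size, alignment)
            moves += 1
    return moves


print(count_moves(50, 1, 80))  # growing by 1 byte: one new block per 8 appends
print(count_moves(50, 8, 20))  # growing by 8 bytes: a new block on every append
```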

Furthermore, for requests larger than 512 bytes pymalloc falls back to PyMem_RawMalloc, which is usually the memory manager of the C runtime, and the above optimization for strings is no longer guaranteed:

>>> add_str("1"*512, "1", 20)  # str1 is larger than 512 bytes
2660318800256
2660318791040
2660318788736
2660318807744
2660318800256
2660318796224
... every time another id

Actually, whether the addresses differ after each reallocation depends on the memory allocator of the C runtime and its state. If memory isn't fragmented, chances are high that realloc manages to extend the block without copying (but this was not the case on my machine when I did these experiments); see also this SO-post.

For the curious, here is the whole traceback of the str1+=str2 operation, which can be easily followed in a debugger.

Here is what is going on:

In the CPython versions discussed here, += is compiled to the INPLACE_ADD opcode (a plain + gives BINARY_ADD). When either is evaluated in ceval.c, there is a hook/special handling for unicode objects (see PyUnicode_CheckExact); the BINARY_ADD case is shown here:

case TARGET(BINARY_ADD): {
    PyObject *right = POP();
    PyObject *left = TOP();
    PyObject *sum;
    ...
    if (PyUnicode_CheckExact(left) &&
             PyUnicode_CheckExact(right)) {
        sum = unicode_concatenate(left, right, f, next_instr);
        /* unicode_concatenate consumed the ref to left */
    }
    ...
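
Which opcode a given interpreter emits for += can be checked with the standard dis module. The sketch below is only an illustration (the function name concat_in_place is made up); on newer CPython releases the instruction is a generic BINARY_OP rather than a dedicated add opcode, so the exact name printed depends on the version.

```python
import dis


def concat_in_place(str1, str2):
    str1 += str2
    return str1


# Collect the opcode names of the compiled function body.
opnames = [ins.opname for ins in dis.get_instructions(concat_in_place)]
print(opnames)
```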

unicode_concatenate ends up calling PyUnicode_Append, which checks whether the left operand is modifiable (which basically means that there is only one reference, the string isn't interned, and a few further conditions hold) and then either resizes it in place or creates a new unicode object:

if (unicode_modifiable(left)
    && ...)
{
    /* append inplace */
    if (unicode_resize(p_left, new_len) != 0)
        goto error;

    /* copy 'right' into the newly allocated area of 'left' */
    _PyUnicode_FastCopyCharacters(*p_left, left_len, right, 0, right_len);
}
else {
    ...
    /* Concat the two Unicode strings */
    res = PyUnicode_New(new_len, maxchar);
    if (res == NULL)
        goto error;
    _PyUnicode_FastCopyCharacters(res, 0, left, 0, left_len);
    _PyUnicode_FastCopyCharacters(res, left_len, right, 0, right_len);
    Py_DECREF(left);
    ...
}

unicode_resize ends up calling resize_compact (mostly because in our case we have only ASCII characters), which in turn ends up calling PyObject_REALLOC:

...
new_unicode = (PyObject *)PyObject_REALLOC(unicode, new_size);
...

which basically will be calling pymalloc_realloc:

static int
pymalloc_realloc(void *ctx, void **newptr_p, void *p, size_t nbytes)
{
    ...
    /* pymalloc is in charge of this block */
    size = INDEX2SIZE(pool->szidx);
    if (nbytes <= size) {
        /* The block is staying the same or shrinking.
          ....
            *newptr_p = p;
            return 1; // 1 means success!
          ...
    }
    ...
}

Where INDEX2SIZE just rounds up to the nearest multiple of 8:

#define ALIGNMENT               8               /* must be 2^N */
#define ALIGNMENT_SHIFT         3

/* Return the number of bytes in size class I, as a uint. */
#define INDEX2SIZE(I) (((uint)(I) + 1) << ALIGNMENT_SHIFT)
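
For readers who prefer Python, the macro and the nbytes <= size check can be transcribed as follows. This is a rough sketch: index2size mirrors the macro above, while size2index and fits_in_place are hypothetical helpers added for illustration (size2index assumes the size class of a request is computed as (nbytes - 1) >> ALIGNMENT_SHIFT).

```python
ALIGNMENT_SHIFT = 3  # 8-byte size classes, as in the macro above


def index2size(i):
    """Number of bytes in size class i (transcription of INDEX2SIZE)."""
    return (i + 1) << ALIGNMENT_SHIFT


def size2index(nbytes):
    """Assumed inverse: the size class that serves a request of nbytes."""
    return (nbytes - 1) >> ALIGNMENT_SHIFT


def fits_in_place(old_nbytes, new_nbytes):
    """True if the grown request still fits into the block already held."""
    return new_nbytes <= index2size(size2index(old_nbytes))


print([index2size(i) for i in range(4)])
print(fits_in_place(50, 56), fits_in_place(50, 57))
```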

Q.E.D.
