CPython字符串添加优化失败案例 [英] CPython string addition optimisation failure case

查看:87
本文介绍了CPython字符串添加优化失败案例的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

为什么在CPython中这样做

Why, in CPython, does

def add_string(n):
    s = ''
    for _ in range(n):
        s += ' '

采用线性时间,但是

def add_string_in_list(n):
    l = ['']
    for _ in range(n):
        l[0] += ' '

需要二次时间吗?

证明:

Timer(partial(add_string, 1000000)).timeit(1)
#>>> 0.1848409200028982
Timer(partial(add_string, 10000000)).timeit(1)
#>>> 1.1123797750042286

Timer(partial(add_string_in_list, 10000)).timeit(1)
#>>> 0.0033865350123960525
Timer(partial(add_string_in_list, 100000)).timeit(1)
#>>> 0.25131178900483064


我所知道的

当要添加的字符串的引用计数为1时,CPython对字符串的添加进行了优化.


What I know

CPython has an optimisation for string addition when the string being added to has a reference count of 1.

这是因为Python中的字符串是不可变的,因此通常无法对其进行编辑.如果一个字符串存在多个引用并且已被突变,则两个引用都将看到更改后的字符串.显然,这是不希望的,因此无法在多个引用之间发生变异.

This is because strings in Python are immutable, and so normally they cannot be edited. If multiple references exist to a string and it is mutated, both references will see the changed string. This is obviously not wanted, so mutation cannot happen with multiple references.

但是,如果对该字符串只有一个引用,则对该值进行更改只会更改该引用的字符串,而该引用需要更改.您可以测试这是否是可能的原因,

If there is only one reference to the string, however, mutating the value will only change the string for that one reference, which wants it changed. You can test that this is a likely cause as so:

from timeit import Timer
from functools import partial

def add_string_two_references(n):
    s = ''
    for _ in range(n):
        s2 = s
        s += ' '

Timer(partial(add_string_two_references, 20000)).timeit(1)
#>>> 0.032532954995986074
Timer(partial(add_string_two_references, 200000)).timeit(1)
#>>> 1.0898985149979126

我不确定为什么只有30倍,而不是预期的100倍,但是我认为这是开销.

I'm unsure why the factor is only 30x, instead of the expected 100x, but I believe that it's overhead.

那么列表版本为什么要创建两个引用?这甚至是阻碍优化的原因吗?

So why is the list version creating two references? Is this even what's preventing the optimisation?

您可以检查它对普通物体的处理是否有所不同:

You can check that it's not treating normal objects any differently:

class Counter:
    def __iadd__(self, other):
        print(sys.getrefcount(self))

s = Counter()
s += None
#>>> 6

class Counter:
    def __iadd__(self, other):
        print(sys.getrefcount(self))

l = [Counter()]
l[0] += None
#>>> 6

推荐答案

在基于列表的方法中,将获取并修改来自列表索引0的字符串,然后再将其放回索引0处的列表.
暂时,解释器仍在列表中保留旧版本的字符串,并且无法执行就地修改.
如果您查看 Python的源文件,那么您将会看到不支持在适当位置修改列表的元素.因此必须从列表中检索对象(在这种情况下为字符串),进行修改然后放回原处.
换句话说,list类型完全与str类型对+=运算符的支持无关.

In the list based approach, string from index 0 of the list is taken and modified before being put back to the list at index 0.
For this short moment interpreter still has the old version of string in the list and can't perform in place modification.
If you take a look at Python's source then you'll see that there is no support for modifying the element of the list in place. So the object (string in this case) has to be retrieved from the list, modified and then put back.
In other words list type is completely agnostic of the str type support for += operator.

并考虑以下代码:

l = ['abc', 'def']
def nasty():
    global l
    l[0] = 'ghi'
    l[1] = 'jkl'
    return 'mno'
l[0] += nasty()

l的值是['abcmno', 'jkl'],这证明了'abc'是从列表中获取的,然后执行了nasty()修改列表内容的操作,字符串'abc''mno'被连接起来并得到了结果被分配给l[0].如果在访问l[0]之前对其进行评估,则结果将为'ghimno'.

The value of l is ['abcmno', 'jkl'] which proves that 'abc' was taken from the list, then nasty() got executed modifying the contents of the list, strings 'abc' and 'mno' got concatenated and result was assigned to l[0]. If nasty() was evaluated before accessing l[0] to modify it in place, then the result would be 'ghimno'.

这篇关于CPython字符串添加优化失败案例的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆