Why is 'new_file += line + string' so much faster than 'new_file = new_file + line + string'?


Question

Our code takes 10 minutes to siphon through 68,000 records when we use:

new_file = new_file + line + string

However, when we do the following, it takes just 1 second:

new_file += line + string

Code:

import time
import cmdbre

fname = "STAGE050.csv"
regions = cmdbre.regions
start_time = time.time()
with open(fname) as f:
        content = f.readlines()
        new_file_content = ""
        new_file = open("CMDB_STAGE060.csv", "w")
        row_region = ""
        i = 0
        for line in content:
                if (i==0):
                        new_file_content = line.strip() + "~region" + "\n"
                else:
                        country = line.split("~")[13]
                        try:
                                row_region = regions[country]
                        except KeyError:
                                row_region = "Undetermined"
                        new_file_content += line.strip() + "~" + row_region + "\n"
                print (row_region)
                i = i + 1
        new_file.write(new_file_content)
        new_file.close()
        end_time = time.time()
        print("total time: " + str(end_time - start_time))

All code I've ever written in Python uses the first option. This is just basic string operations... we are reading input from a file, processing it, and outputting it to a new file. I am 100% certain that the first method takes roughly 600 times longer to run than the second, but why?

The file being processed is a CSV but uses ~ instead of a comma. All we are doing here is taking this CSV, which has a column for country, and adding a column for the country's region, e.g. LAC, EMEA, NA, etc... cmdbre.regions is just a dictionary, with all ~200 countries as keys and each region as the value.

Once I changed to the append string operation... the loop completed in 1 second instead of 10 minutes... 68,000 records in the CSV.

Answer

CPython (the reference interpreter) has an optimization for in-place string concatenation (when the string being appended to has no other references). It can't apply this optimization as reliably when doing +, only += (+ involves two live references, the assignment target and the operand, and the former isn't involved in the + operation, so it's harder to optimize it).
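
As a minimal timing sketch of this effect (the names and sizes here are invented for illustration; exact numbers vary by machine and CPython version, but the quadratic-versus-roughly-linear shape should hold):

import timeit

line, string = "x" * 30, "y" * 10

def slow(n):
    new_file = ""
    for _ in range(n):
        # The left-most + runs first, producing a temporary that isn't the
        # assignment target, so the in-place fast path can't be used: O(n^2).
        new_file = new_file + line + string
    return new_file

def fast(n):
    new_file = ""
    for _ in range(n):
        # The right side is built first, then appended in place: roughly O(n).
        new_file += line + string
    return new_file

for n in (1_000, 10_000):
    print(n, timeit.timeit(lambda: slow(n), number=1),
             timeit.timeit(lambda: fast(n), number=1))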

Per PEP 8:


Code should be written in a way that does not disadvantage other implementations of Python (PyPy, Jython, IronPython, Cython, Psyco, and such).

For example, do not rely on CPython's efficient implementation of in-place string concatenation for statements in the form a += b or a = a + b. This optimization is fragile even in CPython (it only works for some types) and isn't present at all in implementations that don't use refcounting. In performance-sensitive parts of the library, the ''.join() form should be used instead. This will ensure that concatenation occurs in linear time across various implementations.

Update based on question edits: Yeah, you broke the optimization. You concatenated many strings, not just one, and Python evaluates left-to-right, so it must do the left-most concatenation first. Thus:

new_file_content += line.strip() + "~" + row_region + "\n"

is not at all the same as:

new_file_content = new_file_content + line.strip() + "~" + row_region + "\n"

because the former concatenates all the new pieces together, then appends them to the accumulator string all at once, while the latter must evaluate each addition from left to right with temporaries that don't involve new_file_content itself. Adding parens for clarity, it's as if you did:

new_file_content = (((new_file_content + line.strip()) + "~") + row_region) + "\n"

Because it doesn't actually know the types until it reaches them, it can't assume all of those are strings, so the optimization doesn't kick in.

If you changed the second bit of code to:

new_file_content = new_file_content + (line.strip() + "~" + row_region + "\n")

or slightly slower, but still many times faster than your slow code because it keeps the CPython optimization:

new_file_content = new_file_content + line.strip()
new_file_content = new_file_content + "~"
new_file_content = new_file_content + row_region
new_file_content = new_file_content + "\n"

so the accumulation was obvious to CPython, and you'd fix the performance problem. But frankly, you should just be using += any time you're performing a logical append operation like this; += exists for a reason, and it provides useful information to both the maintainer and the interpreter. Beyond that, it's good practice as far as DRY goes; why name the variable twice when you don't need to?

Of course, per PEP 8 guidelines, even using += here is bad form. In most languages with immutable strings (including most non-CPython Python interpreters), repeated string concatenation is a form of Schlemiel the Painter's Algorithm, which causes serious performance problems. The correct solution is to build a list of strings, then join them all at once, e.g.:

new_file_content = []
for i, line in enumerate(content):
    if i == 0:
        # In local tests, += with an anonymous tuple runs faster than
        # concatenating the short strings and then calling append;
        # Python caches small tuples, so creating them is cheap, and
        # using syntax over function calls is also optimized more heavily
        new_file_content += (line.strip(), "~region\n")
    else:
        country = line.split("~")[13]
        try:
            row_region = regions[country]
        except KeyError:
            row_region = "Undetermined"
        new_file_content += (line.strip(), "~", row_region, "\n")

# Finished accumulating, make the final string all at once
new_file_content = "".join(new_file_content)

which is usually faster even when the CPython string concatenation optimizations are available, and will be reliably fast on non-CPython Python interpreters as well, because it uses a mutable list to accumulate results efficiently, then allows ''.join to precompute the total length of the string, allocate the final string all at once (instead of incremental resizes along the way), and populate it exactly once.
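
A small comparison sketch backing this up (the pieces list is invented for illustration; on non-CPython interpreters the gap in favor of join is typically far larger):

import timeit

pieces = ["field~", "some value", "\n"] * 20_000

def with_join():
    # One pass: total length is precomputed, final string allocated once
    return "".join(pieces)

def with_concat():
    s = ""
    for p in pieces:
        s += p  # fast on CPython thanks to the fragile optimization above
    return s

print("join:  ", timeit.timeit(with_join, number=10))
print("concat:", timeit.timeit(with_concat, number=10))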

Side-note: For your specific case, you shouldn't be accumulating or concatenating at all. You've got an input file and an output file, and can process line by line. Every time you would append or accumulate file contents, just write them out instead (I've cleaned up the code a bit for PEP 8 compliance and other minor style improvements while I was at it):

start_time = time.monotonic()  # You're on Py3, monotonic is more reliable for timing

# Use with statements for both input and output files
with open(fname) as f, open("CMDB_STAGE060.csv", "w") as new_file:
    # Iterate input file directly; readlines just means higher peak memory use
    # Maintaining your own counter is silly when enumerate exists
    for i, line in enumerate(f):
        if not i:
            # Write to file directly, don't store
            new_file.write(line.strip() + "~region\n")
        else:
            country = line.split("~")[13]
            # .get exists to avoid try/except when you have a simple, constant default
            row_region = regions.get(country, "Undetermined")
            # Write to file directly, don't store
            new_file.write(line.strip() + "~" + row_region + "\n")
end_time = time.monotonic()
# Print will stringify arguments and separate by spaces for you
print("total time:", end_time - start_time)


Implementation details deep dive

For those curious on implementation details, the CPython string concat optimization is implemented in the byte code interpreter, not on the str type itself (technically, PyUnicode_Append does the mutation optimization, but it requires help from the interpreter to fix up reference counts so it knows it can use the optimization safely; without interpreter help, only C extension modules would ever benefit from that optimization).

When the interpreter detects that both operands are the Python-level str type (at the C layer, in Python 3, it's still referred to as PyUnicode, a legacy of 2.x days that wasn't worth changing), it calls a special unicode_concatenate function, which checks whether the very next instruction is one of three basic STORE_* instructions. If it is, and the target is the same as the left operand, it clears the target reference so PyUnicode_Append will see only a single reference to the operand, allowing it to invoke the optimized code for a str with a single reference.
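
You can watch for that instruction pairing with the dis module (a sketch; the opcode names shown are from CPython 3.8-era bytecode and differ in newer releases, e.g. 3.11 fuses the add into BINARY_OP):

import dis

def append_in_place(a, b):
    a += b  # compiles to INPLACE_ADD immediately followed by STORE_FAST 'a'
    return a

dis.dis(append_in_place)
# Expected shape of the output (CPython 3.8-ish):
#   LOAD_FAST     a
#   LOAD_FAST     b
#   INPLACE_ADD
#   STORE_FAST    a    <- the "very next instruction" the fast path checks for
#   ...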

This means that not only can you break the optimization by doing

a = a + b + c

you can also break it any time the variable in question is not a top-level (global, nested or local) name. If you're operating on an object attribute, a list index, a dict value, etc., even += won't help you; it won't see a "simple STORE", so it doesn't clear the target reference, and all of these get the ultraslow, not-in-place behavior:

foo.x += mystr
foo[0] += mystr
foo['x'] += mystr
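
Again, dis makes the difference visible (same caveat as above: opcode names vary by CPython version):

import dis

def append_to_attribute(foo, mystr):
    foo.x += mystr  # INPLACE_ADD is followed by STORE_ATTR, not a simple STORE_FAST

dis.dis(append_to_attribute)
# Expected shape of the output (CPython 3.8-ish):
#   LOAD_FAST     foo
#   DUP_TOP
#   LOAD_ATTR     x
#   LOAD_FAST     mystr
#   INPLACE_ADD
#   ROT_TWO
#   STORE_ATTR    x    <- not one of the simple STORE_* forms the fast path accepts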

It's also specific to the str type; in Python 2, the optimization doesn't help with unicode objects, and in Python 3, it doesn't help with bytes objects, and in neither version will it optimize for subclasses of str; those always take the slow path.
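
A sketch demonstrating that type-specificity on Python 3 (illustrative only; exact times vary, but the bytes loop should degrade sharply as n grows while the str loop stays roughly linear):

import timeit

def str_loop(n):
    s = ""
    for _ in range(n):
        s += "x"  # hits the in-place fast path on CPython
    return s

def bytes_loop(n):
    b = b""
    for _ in range(n):
        b += b"x"  # no fast path for bytes: a fresh copy each iteration
    return b

for n in (10_000, 40_000):
    print(n, timeit.timeit(lambda: str_loop(n), number=1),
             timeit.timeit(lambda: bytes_loop(n), number=1))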

Basically, the optimization is there to be as nice as possible in the simplest common cases for people new to Python, but it's not going to go to serious trouble for even moderately more complex cases. This just reinforces the PEP 8 recommendation: depending on implementation details of your interpreter is a bad idea when you could run faster on every interpreter, for any store target, by doing the right thing and using str.join.
