Python vs CPP:为什么速度差异如此之大? [英] Python vs CPP: Why is the difference in speed so huge?

查看:139
本文介绍了Python vs CPP:为什么速度差异如此之大?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

def main():
    i = 2
    sum = 1
    while i < 100000:
        j = 2
        while j < i:
            if i%j == 0:
                sum += 1
                break
            j += 1
        i += 1

    print(sum)


if __name__ == "__main__":
    main()



#include<iostream>

using namespace std;

int main() {
    int sum = 1;
    for (int i=2; i<100000; i++) {
        for (int j=2; j<i; j++) {
            if (i%j == 0) {
                sum++;
                break;
            }
        }
    }
    cout << sum << endl;
    return 0;
}



C ++



运行方式: g ++ -std = c ++ 11 x.cpp -ox&&时间./x

时间: ./ x 1.36s用户0.00s系统99%cpu 1.376

运行方式: python x.py

时间: python x.py 32.10s用户0.21s系统98%cpu 32.854总计

有人能解释两个程序花费的时间之间的巨大差异吗?可以采取什么措施来加快python的速度?

Can anyone explain the huge difference between the time taken by the 2 programs? And what can be done to speed up the python one?

推荐答案

下面是一个简单的区别示例:

Here's a simple example of the difference:

i ++ 编译为(在x86-64机器上)简单的 inc注册指令。只需花费一小部分周期即可执行。

i++ in C++ compiles down to (on x86-64 machines) a simple inc REGISTER instruction. Takes a fraction of a cycle to execute.

i + = 1 在Python中可以用<$分解c $ c> dis 模块通过 dis.dis('i + = 1')通知我们所涉及的字节码为:

i += 1 in Python can be disassembled with the dis module via dis.dis('i += 1') which informs us that the bytecode involved is:

  1           0 LOAD_NAME                0 (i)
              2 LOAD_CONST               0 (1)
              4 INPLACE_ADD
              6 STORE_NAME               0 (i)
              8 LOAD_CONST               1 (None)
             10 RETURN_VALUE

< a href = https://tio.run/##K6gsycjPM/7/PzO3IL@oRCEls5gLiPWAWEM9U0HbVsFQXfP/fwA rel = noreferrer title = Python 3 –在线尝试>在线尝试!

从技术上讲,所有以 _NAME 结尾的指令都变为 _FAST 在一个函数中(我们反汇编了一个孤立的语句,因此它的行为略有不同),以及 LOAD_CONST(无) / RETURN_VALUE 配对将不存在exp在实际函数中进行分解(该函数必须执行此操作,但不是针对每个表达式),但必须足够接近。实际上,函数中的实际字节码更像是:

Technically, all the instructions that end in _NAME become _FAST in a function (we disassembled an isolated statement, so it behaved slightly differently), and the LOAD_CONST (None)/RETURN_VALUE pair won't exist for the expression in a real function (the function has to do it, but not for every expression), but close enough. In practice, the real bytecode within a function would be more like:

  1           0 LOAD_FAST                0 (i)
              2 LOAD_CONST               0 (1)
              4 INPLACE_ADD
              6 STORE_FAST               0 (i)

每条指令都需要运行 switch 语句或计算出的 goto (取决于CPython的编译方式) ,加载下一条指令并更新代码位置信息(还涉及反复检查以确保没有其他线程在请求​​ the GIL )。 LOAD_FAST LOAD_CONST 指令涉及C数组查找和引用计数调整(仅单个引用计数调整等同于 i ++ ,只是它必须更改内存,而不是寄存器,因此速度较慢。 STORE_FAST 类似地涉及C数组查找,调整引用计数(以减小现有值),并经常释放内存(如果decref删除了对该值的最后一个引用)。
INPLACE_ADD 必须动态查找并调用一个函数指针来执行加法(首先是通过几层函数间接实现)本身必须提取每个Python int 的基础C值来完成工作(并且如果数字足够大,则涉及基于数组的数学运算,这会很丑陋),(通常)创建一个全新的Python int 对象,并进行更多的引用计数调整。

Each of those instructions requires either a run through a switch statement or a computed goto (depending on how CPython was compiled), loading the next instruction and updating code position information (it also involves repeatedly checking to ensure no other thread is asking for the GIL). LOAD_FAST and LOAD_CONST instructions involve a C array lookup and reference count adjustment (a single reference count adjustment alone is equivalent to the i++ from before, except it has to change memory, not a register, so it's slower). STORE_FAST similarly involves an C array lookup, reference count adjustment (to decrement the existing value), and often, freeing memory (if the decref removed the last reference to the value). INPLACE_ADD has to dynamically lookup and call a function pointer to perform the addition (and it does so through a few layers of function indirection in the first place), which itself has to extract the underlying C value of each Python int to do the work (and if the numbers are big enough, this involves array based math, which gets ugly), (usually) create a brand new Python int object, and also do more reference count adjustment.

基本上,相当于C / C ++在针对寄存器的单个廉价汇编指令中所做的工作,Python必须执行(估计)六个函数调用(包括一个通过函数指针的调用),数十个内存查找,十二个左右的引用坦白地说,最令人惊讶的是Python所花的时间比C ++长约24倍。

Basically, to get the equivalent of what C/C++ does in a single, cheap assembly instruction against a register, Python had to perform (estimating) half a dozen function calls (including one through a function pointer), dozens of memory lookups, a dozen or so reference count adjustments, etc. Frankly, the most surprising thing is that Python only takes ~24x longer than the C++.

我会注意到相对的成本最高。单个字节码执行的工作越多,解释器开销的重要性就越小。不幸的是,对于这种情况,您的代码只不过是简单的数学运算,因此Python(至少是CPython)在这里最糟糕。

I'll note that the relative cost here is highest for simple math operations; the more work a single bytecode does, the less the interpreter overhead matters. Unfortunately for this case, your code is nothing but simple math, so Python (at least, CPython) is at its worst here.

为了加快速度,主要规则是:

As for speeding it up, the main rules are:


  1. 编写Python代码,而不是C代码。当Python的 range 可以为您完成这项工作(并节省很多单个字节码指令)时,您将手动维护计数器。正如我所提到的,这是解释器开销最高的最简单,最便宜的操作,但是这些操作通常是您实际上不需要做的很多事情,因为通常有更好的方法(例如 for 范围而不是范围内循环,而带有手动计数器调整的循环)。 / li>
  2. 对于大规模数学运算,请使用可以批量完成工作的扩展模块,例如 numpy 。一次添加的所有开销都是不好的;支付1000次添加是很简单的。

  3. 尝试使用其他解释器(例如PyPy)

  4. 使用Cython从Python代码中编译C ++(要求添加适当的 cdef 声明)

  5. 使用 ctypes 调用现有的C库,并/或编写原始Python C扩展(当Cython无法处理您想要的内容时)

  1. Write Python code, not C code. You're manually maintaining your counters, when Python's range could do the job for you (and save a lot of individual bytecode instructions). As I mentioned, it's the simplest, cheapest operations where the interpreter overhead is highest, but those operations are normally stuff you don't actually need to do very much, because there is usually a better way to do them (e.g. for loops over range rather than while loops with manual counter adjustment).
  2. For mass math operations, use extension modules that can do the work in bulk, e.g. numpy. All that overhead for a single addition is bad; paying it for 1000 additions is pretty trivial.
  3. Try alternate interpreters (e.g. PyPy)
  4. Use Cython to compile C++ from your Python code (requires adding appropriate cdef declarations)
  5. Use ctypes to call existing C libraries, and/or write raw Python C extensions (when Cython can't handle what you want)

除此之外,您只需要接受使用动态类型的解释型语言总是会产生编译型静态类型化语言所没有的开销。

Aside from that, you just have to accept that interpreted languages with dynamic typing are always going to have overhead that a compiled, statically typed language won't have.

要解决第1点问题,您的代码的Python版本应如下所示:

To address point #1, a Pythonic version of your code would look like:

def main():
    sum = 1
    for i in range(2, 100000):
        for j in range(2, i):
            if i%j == 0:
                sum += 1
                break

    print(sum)

if __name__ == "__main__":
    main()

您甚至可以将内部循环替换为:

You could even replace the inner loop with:

    sum += any(i % j == 0 for j in range(2, i))

尽管这不太可能产生任何性能上的好处一点代码简化。使用 range 可以提高性能,该功能将增量和测试的所有基本数学功能捆绑到一个专用函数中,从而大大减少了开销。

though that's unlikely to yield any performance benefits, just a bit of code simplification. The performance benefits come from using range, which bundles all the basic math of incrementing and testing into a single dedicated function, reducing the overhead significantly.

为演示字节码复杂度的差异,请考虑一个仅执行 while 和手动计数器或<$ c的循环的函数$ c> for 和 range

For demonstration of the difference in bytecode complexity, consider a function that does nothing but run a loop with either while and a manual counter or for and range:

def whileloop(n):
    i = 0
    while i < n:
        i += 1

def forloop(n):
    for i in range(n):
        pass

反汇编每个函数均显示:

Disassembling each function shows:

  3           0 LOAD_CONST               1 (0)
              2 STORE_FAST               1 (i)

  4           4 SETUP_LOOP              20 (to 26)
        >>    6 LOAD_FAST                1 (i)
              8 LOAD_FAST                0 (n)
             10 COMPARE_OP               0 (<)
             12 POP_JUMP_IF_FALSE       24

  5          14 LOAD_FAST                1 (i)
             16 LOAD_CONST               2 (1)
             18 INPLACE_ADD
             20 STORE_FAST               1 (i)
             22 JUMP_ABSOLUTE            6
        >>   24 POP_BLOCK
        >>   26 LOAD_CONST               0 (None)
             28 RETURN_VALUE

对于 whileloop 和:

  8           0 SETUP_LOOP              16 (to 18)
              2 LOAD_GLOBAL              0 (range)
              4 LOAD_FAST                0 (n)
              6 CALL_FUNCTION            1
              8 GET_ITER
        >>   10 FOR_ITER                 4 (to 16)
             12 STORE_FAST               1 (i)

  9          14 JUMP_ABSOLUTE           10
        >>   16 POP_BLOCK
        >>   18 LOAD_CONST               0 (None)
             20 RETURN_VALUE

在线试玩

for forloop while 的循环主体(每次通过执行一次,包括测试终止条件)从 LOAD_FAST SETUP_LOOP 之后到 JUMP_ABSOLUTE 之后,每个循环包含9条指令;对于对于,它从 FOR_ITER JUMP_ABSOLUTE ,仅包含三个说明。由于所有这些指令的工作都很琐碎,因此很容易看出使用 while 循环的手动管理计数器,循环本身的开销将如何显着增加

for forloop. The body of the loop (the stuff executed once per pass, including testing the termination condition) for the while runs from the LOAD_FAST following the SETUP_LOOP to the JUMP_ABSOLUTE, encompassing nine instructions per loop; for the for, it runs from the FOR_ITER to the JUMP_ABSOLUTE, encompassing just three instructions. Since the work done for all of these instructions is pretty trivial, it's easy to see how the overhead of the loop itself would be significantly higher for the manually managed counter with a while loop.

这篇关于Python vs CPP:为什么速度差异如此之大?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
相关文章
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆