PyPy 17x faster than Python. Can Python be sped up?

Question

Solving a recent Advent of Code problem, I found my default Python was ~40x slower than PyPy. I was able to get that down to about 17x with this code by limiting calls to len and limiting global lookups by running it in a function.

Right now, e.py runs in 5.162 seconds on python 3.6.3 and .297 seconds on PyPy on my machine.

My question is: is this the irreducible speedup of JIT, or is there some way to speed up the CPython answer? (short of extreme means: I could go to Cython/Numba or something?) How do I convince myself that there's nothing more I can do?

See the gist for the list of numbers input file.

As described in the problem statement, they represent jump offsets. position += offsets[current], and increment the current offset by 1. You're done when a jump takes you outside the list.

Here's the example given (the full input that takes 5 seconds is much longer, and has larger numbers):

(0) 3  0  1  -3  - before we have taken any steps.
(1) 3  0  1  -3  - jump with offset 0 (that is, don't jump at all). Fortunately, the instruction is then incremented to 1.
 2 (3) 0  1  -3  - step forward because of the instruction we just modified. The first instruction is incremented again, now to 2.
 2  4  0  1 (-3) - jump all the way to the end; leave a 4 behind.
 2 (4) 0  1  -2  - go back to where we just were; increment -3 to -2.
 2  5  0  1  -2  - jump 4 steps forward, escaping the maze.
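
The rule shown in the trace above (jump by the current offset, then increment the offset you just used) can be sketched as a minimal Python function; running it on the example reproduces the five steps shown:

```python
def steps_to_escape(offsets):
    """Count jumps until one lands outside the list.

    Follows the basic rule from the trace above: after each jump,
    increment the offset that was just used.
    """
    offsets = list(offsets)  # work on a copy; the list is mutated
    pos = 0
    steps = 0
    while 0 <= pos < len(offsets):
        jump = offsets[pos]
        offsets[pos] += 1
        pos += jump
        steps += 1
    return steps

print(steps_to_escape([0, 3, 0, 1, -3]))  # prints 5, matching the trace
```

Note that the solution code below uses the harder variant of the rule (offsets of 3 or more are decremented instead of incremented), so its step counts differ from this trace.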

Code:

def run(cmds):
    location = 0
    counter = 0
    while 1:
        try:
            cmd = cmds[location]
            if cmd >= 3:
                cmds[location] -= 1
            else:
                cmds[location] += 1
            location += cmd
            counter += 1           # count this jump before testing for escape
            if location < 0:
                print(counter)
                break
        except IndexError:         # jumped past the end of the list
            print(counter)
            break

if __name__ == "__main__":
    text = open("input.txt").read().strip().split("\n")
    cmds = [int(cmd) for cmd in text]
    run(cmds)
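
As a sanity check on the logic, here is a small variant that returns the count instead of printing it; the expected value of 10 comes from stepping through the worked example by hand under the >= 3 rule:

```python
def run_count(cmds):
    # Same logic as run() above, but returns the jump count.
    cmds = list(cmds)  # work on a copy; the list is mutated
    location = 0
    counter = 0
    while 0 <= location < len(cmds):
        cmd = cmds[location]
        if cmd >= 3:
            cmds[location] -= 1
        else:
            cmds[location] += 1
        location += cmd
        counter += 1
    return counter

print(run_count([0, 3, 0, 1, -3]))  # prints 10 under the >= 3 rule
```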

edit: I compiled and ran the code with Cython; that dropped the runtime to 2.53 s, but that's still almost an order of magnitude slower than PyPy.

edit: Numba gets me within 2x of PyPy.
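
The question doesn't show its Numba code, so this is a hypothetical sketch of what that variant might look like: the same loop decorated with numba.njit, with a plain-Python fallback so the logic still runs if Numba isn't installed.

```python
try:
    from numba import njit  # assumes the numba package is available
except ImportError:
    def njit(func):          # fallback: run the plain-Python loop unchanged
        return func

@njit
def run_jit(cmds):
    location = 0
    counter = 0
    n = len(cmds)
    while 0 <= location < n:
        cmd = cmds[location]
        cmds[location] += -1 if cmd >= 3 else 1
        location += cmd
        counter += 1
    return counter
```

On the sample list [0, 3, 0, 1, -3] this returns 10, matching the plain-Python version; note the first call pays a one-time JIT-compilation cost before the loop runs at compiled speed.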

edit: The best Cython I could write got down to 1.32 s, a little over 4x PyPy.

edit: Moving the cmd variable into a cdef, as suggested by @viraptor, got the Cython down to .157 seconds! Nearly a full order of magnitude faster, and not that far from regular python. Still, I come away from this extremely impressed with the PyPy JIT, which did all this for free!

Answer

As a baseline for Python, I wrote this in C (and then decided to use C++ for more convenient array I/O). It compiles efficiently for x86-64 with clang++. This runs 82x faster than CPython 3.6.2 with the code in the question, on a Skylake x86, so even your faster Python versions are still a couple factors away from keeping up with near-optimal machine code. (Yes, I looked at the compiler's asm output to check that it did a good job.)

Letting a good JIT or ahead-of-time compiler see the loop logic is key for performance here. The problem logic is inherently serial, so there's no scope for getting Python to run already-compiled C to do something over the whole array (like NumPy), because there won't be compiled C for this specific problem unless you use Cython or something. Having each step of the problem go back to the CPython interpreter is death for performance, when its slowness isn't hidden by memory bottlenecks or anything.

Update: transforming the array of offsets into an array of pointers speeds it up by another factor of 1.5x (simple addressing mode + removing an add from the critical path loop-carried dependency chain, bringing it down to just the 4 cycle L1D load-use latency for a simple addressing mode (when the pointer comes from another load), not 6c = 5c + 1c for an indexed addressing mode + add latency).

But I think we can be generous to Python and not expect it to keep up with algorithm transformations to suit modern CPUs :P (Especially because I used 32-bit pointers even in 64-bit mode to make sure the 4585 element array was still only 18kiB so it fits in the 32kiB L1D cache. Like the Linux x32 ABI does, or the AArch64 ILP32 ABI.)
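
The same transformation can be illustrated in Python (a sketch, not the answer's actual C++): store absolute jump targets target[i] = i + offset[i] instead of relative offsets, so each step is a single array lookup with no add on the critical path.

```python
def run_targets(offsets):
    # Precompute absolute jump targets: the Python analogue of the
    # answer's array-of-pointers transformation.
    targets = [i + off for i, off in enumerate(offsets)]
    n = len(targets)
    pos = 0
    steps = 0
    while 0 <= pos < n:
        t = targets[pos]
        # The original offset is t - pos, so the >= 3 rule translates directly.
        targets[pos] += -1 if t - pos >= 3 else 1
        pos = t
        steps += 1
    return steps

print(run_targets([0, 3, 0, 1, -3]))  # prints 10, same as the offset version
```

In CPython this buys nothing (interpreter overhead dominates), but it mirrors the dependency-chain shortening the answer describes for the compiled version.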

Also, an alternate expression for the update gets gcc to compile it branchless, like clang does. (Commented out, and the original perf stat output left in this answer, because it's interesting to see the effect of branchless vs. branchy with mispredicts.)

unsigned jumps(int offset[], unsigned size) {
    unsigned location = 0;
    unsigned counter = 0;

    do {
          //location += offset[location]++;            // simple version
          // >=3 conditional version below

        int off = offset[location];

        offset[location] += (off>=3) ? -1 : 1;       // branchy with gcc
        // offset[location] = (off>=3) ? off-1 : off+1;  // branchless with gcc and clang.  

        location += off;

        counter++;
    } while (location < size);

    return counter;
}

#include <iostream>
#include <iterator>
#include <vector>

int main()
{
    std::ios::sync_with_stdio(false);     // makes cin faster
    std::istream_iterator<int> begin(std::cin), dummy;
    std::vector<int> values(begin, dummy);   // construct a dynamic array from reading stdin

    unsigned count = jumps(values.data(), values.size());
    std::cout << count << '\n';
}

With clang4.0.1 -O3 -march=skylake, the inner loop is branchless; it uses a conditional-move for the >=3 condition. I used ? : because that's what I was hoping the compiler would do. Source + asm on the Godbolt compiler explorer.

.LBB1_4:                                # =>This Inner Loop Header: Depth=1
    mov     ebx, edi               ; silly compiler: extra work inside the loop to save code outside
    mov     esi, dword ptr [rax + 4*rbx]  ; off = offset[location]
    cmp     esi, 2
    mov     ecx, 1
    cmovg   ecx, r8d               ; ecx = (off>=3) ? -1 : 1;  // r8d = -1 (set outside the loop)
    add     ecx, esi               ; off += -1 or 1
    mov     dword ptr [rax + 4*rbx], ecx  ; store back the updated off
    add     edi, esi               ; location += off  (original value)
    add     edx, 1                 ; counter++
    cmp     edi, r9d
    jb      .LBB1_4                ; unsigned compare against array size

Here's the output of perf stat ./a.out < input.txt (for the clang version), on my i7-6700k Skylake CPU:

21841249        # correct total, matches Python

 Performance counter stats for './a.out':

         36.843436      task-clock (msec)         #    0.997 CPUs utilized          
                 0      context-switches          #    0.000 K/sec                  
                 0      cpu-migrations            #    0.000 K/sec                  
               119      page-faults               #    0.003 M/sec                  
       143,680,934      cycles                    #    3.900 GHz                    
       245,059,492      instructions              #    1.71  insn per cycle         
        22,654,670      branches                  #  614.890 M/sec                  
            20,171      branch-misses             #    0.09% of all branches        

       0.036953258 seconds time elapsed

The average instructions-per-clock is much lower than 4 because of the data dependency in the loop. The load address for the next iteration depends on the load+add for this iteration, and out-of-order execution can't hide that. It can overlap all the work of updating the value of the current location, though.

Changing from int to short has no performance downside (as expected; movsx has the same latency as mov on Skylake), but halves the memory consumption so a larger array could fit in L1D cache if needed.

I tried compiling the array into the program (as int offsets[] = { file contents with commas added };) so it doesn't have to be read and parsed. It also made the size a compile-time constant. This reduced the run time to ~36.2 +/- 0.1 milliseconds, down from ~36.8 ms, so the version that reads from a file was still spending most of its time on the actual problem, not on parsing the input. (Unlike Python, C++ has negligible startup overhead, and my Skylake CPU ramps up to max clock speed in well under a millisecond thanks to hardware P-state management in Skylake.)

As described earlier, pointer-chasing with a simple addressing mode like [rdi] instead of [rdi + rdx*4] has 1c lower latency, and avoids the add (index += offset becomes current = target). Intel since IvyBridge has zero-latency integer mov between registers so that doesn't lengthen the critical path. Here's the source (with comments) + asm for this hacky optimization. A typical run (with text parsing into a std::vector): 23.26 +- 0.05 ms, 90.725 M cycles (3.900 GHz), 288.724 M instructions (3.18 IPC). Interestingly it's actually more total instructions, but runs much faster due to the lower latency of the loop-carried dependency chain.

gcc uses a branch and it's about 2x slower. (14% of branches are mispredicted according to perf stat on the whole program. It's only branching as part of updating the values, but branch misses stall the pipeline so they affect the critical path too, in a way that data dependencies don't here. This seems like a poor decision by the optimizer.)

Rewriting the conditional as offset[location] = (off>=3) ? off-1 : off+1; convinces gcc to go branchless with asm that looks good.

gcc7.1.1 -O3 -march=skylake (for the original source, which compiles with a branch for the (off>=3) ? -1 : 1 update):

Performance counter stats for './ec-gcc':

     70.032162      task-clock (msec)         #    0.998 CPUs utilized          
             0      context-switches          #    0.000 K/sec                  
             0      cpu-migrations            #    0.000 K/sec                  
           118      page-faults               #    0.002 M/sec                  
   273,115,485      cycles                    #    3.900 GHz                    
   255,088,412      instructions              #    0.93  insn per cycle         
    44,382,466      branches                  #  633.744 M/sec                  
     6,230,137      branch-misses             #   14.04% of all branches        

   0.070181924 seconds time elapsed

vs. CPython (Python 3.6.2 on Arch Linux):

perf stat python ./orig-v2.e.py
21841249

 Performance counter stats for 'python ./orig-v2.e.py':

       3046.703831      task-clock (msec)         #    1.000 CPUs utilized          
                10      context-switches          #    0.003 K/sec                  
                 0      cpu-migrations            #    0.000 K/sec                  
               923      page-faults               #    0.303 K/sec                  
    11,880,130,860      cycles                    #    3.899 GHz                    
    38,731,286,195      instructions              #    3.26  insn per cycle         
     8,489,399,768      branches                  # 2786.421 M/sec                  
        18,666,459      branch-misses             #    0.22% of all branches        

       3.046819579 seconds time elapsed

I don't have PyPy or other Python implementations installed, sorry.
