numpy:除以0.5有什么特别之处? [英] Numpy: What is special about division by 0.5?

查看:133
本文介绍了numpy:除以0.5有什么特别之处?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

答案 @Dunes指出,由于流水线化,浮点乘法和除法之间几乎没有区别.但是,从我使用其他语言的经验来看,我希望这种划分速度会变慢.

This answer of @Dunes states, that due to pipeline-ing there is (almost) no difference between floating-point multiplication and division. However, from my expience with other languages I would expect the division to be slower.

我的小测试如下:

A=np.random.rand(size)
command(A)

对于不同的命令和size=1e8,我在计算机上得到以下时间:

For different commands and size=1e8 I get the following times on my machine:

Command:    Time[in sec]:
A/=0.5      2.88435101509
A/=0.51     5.22591209412
A*=2.0      1.1831600666
A*2.0       3.44263911247  //not in-place, more cache misses?
A+=A        1.2827270031

最有趣的部分:除以0.5几乎是除以0.51的两倍.可以假设,这是由于某种智能优化,例如用A+A替换除法.但是A*2A+A的计时时间太遥远,无法支持这一主张.

The most interesting part: dividing by 0.5 is almost twice as fast as dividing by 0.51. One could assume, it is due to some smart optimization, e.g. replacing division by A+A. However the timings of A*2 and A+A are too far off to support this claim.

通常,用值(1/2)^n的浮点数除法更快:

In general, the division by floats with values (1/2)^n is faster:

Size: 1e8
    Command:    Time[in sec]:
    A/=0.5      2.85750007629
    A/=0.25     2.91607499123
    A/=0.125    2.89376401901
    A/=2.0      2.84901714325
    A/=4.0      2.84493684769
    A/=3.0      5.00480890274
    A/=0.75     5.0354950428
    A/=0.51     5.05687212944

如果我们查看size=1e4,它会变得更加有趣:

It gets even more interesting, if we look at size=1e4:

Command:    1e4*Time[in sec]:
A/=0.5      3.37723994255
A/=0.51     3.42854404449
A*=2.0      1.1587908268
A*2.0       1.19793796539
A+=A        1.11329007149

现在,除以.5和除以.51之间没有区别!

Now, there is no difference between division by .5 and by .51!

我尝试了不同的numpy版本和不同的机器.在某些计算机(例如Intel Xeon E5-2620)上,可以看到这种效果,但在其他一些计算机上则看不到-而且不依赖于numpy版本.

I tried it out for different numpy versions and different machines. On some machines (e.g. Intel Xeon E5-2620) one can see this effect, but not on some other machines - and this does not depend on the numpy version.

使用@Ralph Versteegen的脚本(请参见他的出色答案!),我得到以下结果:

With the script of @Ralph Versteegen (see his great answer!) I get the following results:

  • 使用i5-2620(Haswell,2x6内核,但是非常老的numpy版本不使用SIMD)的计时:

  • i7-5500U(Broadwell,2核,numpy 1.11.2)的计时:

问题是:如果阵列大小较大(> 10 ^ 6,则对于某些处理器而言,与按0.5进行除法相比,0.51进行除法的成本较高的原因是什么? ).

The question is: What is the reason for higher cost of the division by 0.51 compared to division by 0.5 for some processors, if the array sizes are large (>10^6).

@nneonneo的答案指出,对于某些intel处理器,除以2的幂后有一个优化,但这并不能解释为什么我们只能在大型阵列上看到它的好处.

The @nneonneo's answer states, that for some intel processors there is an optimization when divided by powers of two, but this does not explain, why we can see the benefit of it only for large arrays.

最初的问题是如何解释这些不同的行为(被0.5除以0.51除)?"

The original question was "How can these different behaviors (division by 0.5 vs. division by 0.51) be explained?"

在这里,我的原始测试脚本产生了计时:

Here also, my original testing script, which produced the timings:

import numpy as np
import timeit

def timeit_command( command, rep):
    print "\t"+command+"\t\t", min(timeit.repeat("for i in xrange(%d):"
        %rep+command, "from __main__ import A", number=7))    

sizes=[1e8,  1e4]
reps=[1,  1e4]
commands=["A/=0.5", "A/=0.51", "A*=2.2", "A*=2.0", "A*2.2", "A*2.0",
          "A+=A", "A+A"]

for size, rep in zip(sizes, reps):
    A=np.random.rand(size)
    print "Size:",size
    for command in commands:
        timeit_command(command, rep)


推荐答案

起初,我怀疑numpy正在调用BLAS,但至少在我的机器上(python 2.7.13,numpy 1.11.2,OpenBLAS),它没有这样做. t,通过对gdb的快速检查可以发现:

At first I suspected that numpy is invoking BLAS, but at least on my machine (python 2.7.13, numpy 1.11.2, OpenBLAS), it doesn't, as a quick check with gdb reveals:

> gdb --args python timing.py
...
Size: 100000000.0
^C
Thread 1 "python" received signal SIGINT, Interrupt.
sse2_binary_scalar2_divide_DOUBLE (op=0x7fffb3aee010, ip1=0x7fffb3aee010, ip2=0x6fe2c0, n=100000000)
    at numpy/core/src/umath/simd.inc.src:491
491 numpy/core/src/umath/simd.inc.src: No such file or directory.
(gdb) disass
   ...
   0x00007fffe6ea6228 <+392>:   movapd (%rsi,%rax,8),%xmm0
   0x00007fffe6ea622d <+397>:   divpd  %xmm1,%xmm0
=> 0x00007fffe6ea6231 <+401>:   movapd %xmm0,(%rdi,%rax,8)
   ...
(gdb) p $xmm1
$1 = {..., v2_double = {0.5, 0.5}, ...}

实际上,无论使用什么常量,numpy都运行完全相同的通用循环.因此,所有时序差异都完全是由于CPU造成的.

In fact, numpy is running exactly the same generic loop regardless of the constant used. So all timing differences are purely due to the CPU.

实际上,除法是一条执行时间高度可变的指令.要做的工作量取决于操作数的位模式,还可以检测和加快特殊情况.根据E5-2620上的这些表格(我不知道其准确性)( (Sandy Bridge)DIVPD具有10-22个周期的延迟和反向吞吐量,而MULPS具有10个周期的延迟和5个周期的反向吞吐量.

Actually, division is an instruction with a highly variable execution time. The amount of work to be done depends on the bit patterns of the operands, and special cases can also be detected and sped up. According to these tables (whose accuracy I do not know), on your E5-2620 (a Sandy Bridge) DIVPD has a latency and an inverse throughput of 10-22 cycles, and MULPS has latency 10 cycles and inverse throughput of 5 cycles.

现在,关于A*2.0A*=2.0慢. gdb显示出正好使用相同的函数进行乘法运算,只是现在输出op与第一个输入ip1不同.因此,它纯粹是将多余的内存提取到缓存中的伪像,从而减慢了大输入的非就地操作(即使MULPS每个周期仅产生2 * 8/5 = 3.2字节的输出!).使用1e4大小的缓冲区时,所有内容都适合缓存,因此不会产生明显影响,并且其他开销通常会淹没A/=0.5A/=0.51之间的差异.

Now, as for A*2.0 being slower than A*=2.0. gdb shows that exactly the same function is being used for multiplication, except now the output op differs from first input ip1. So it has to be purely an artifact of the extra memory being drawn into cache slowing down the non-inplace operation for the large input (even though MULPS is producing only 2*8/5 = 3.2 bytes of output per cycle!). When using the 1e4-sized buffers, everything fits in cache, so that doesn't have a significant effect, and other overheads mostly drown out the difference between A/=0.5 and A/=0.51.

仍然,在那些时间有很多奇怪的效果,所以我画了一些图(下面是生成此图的代码)

Still, there are lots of weird effects in those timings, so I plotted some a graph (code to generate this is below)

我已经针对每个DIVPD/MULPD/ADDPD指令的CPU周期数绘制了A数组的大小.我在3.3GHz AMD FX-6100上运行了该程序.黄色和红色垂直线是L2和L3缓存大小.蓝线是根据这些表所假定的DIVPD的最大吞吐量,即1/4.5个周期(这似乎是可疑的).如您所见,即使执行numpy操作的开销"接近于零,甚至A+=2.0都无法接近该值.因此,仅循环和从L2缓存读写16个字节就需要大约24个周期的开销!令人震惊的是,也许内存访问没有对齐.

I've plotted size of the A array against the number of CPU cycles per DIVPD/MULPD/ADDPD instruction. I ran this on a 3.3GHz AMD FX-6100. Yellow and red vertical lines are L2 and L3 cache size. The blue line is the supposed maximum throughput of DIVPD according to those tables, 1/4.5cycles (which seems dubious). As you can see, not even A+=2.0 gets anywhere near this, even when the "Overhead" of performing an numpy operation falls close to zero. So there is about 24 cycles of overhead just looping and reading and writing 16 bytes to/from L2 cache! Pretty shocking, maybe the memory accesses aren't aligned.

许多有趣的效果要注意:

Lots of interesting effects to note:

  • 在30KB数组以下,大多数时间在python/numpy中是开销
  • 乘法和加法的速度相同(如Agner表格所示)
  • A/=0.5A/=0.51之间的速度差朝图的右侧下降;这是因为当读/写存储器的时间增加时,它会重叠并掩盖执行划分所花费的一些时间.因此,A/=0.5A*=2.0A+=2.0变为相同的速度.
  • 比较A/=0.51A/=0.5A+=2.0之间的最大差异,表明除法的吞吐量为4.5-44个周期,这与Agner表中的4.5-11不匹配.
  • 但是,当numpy开销变大时,A/= 0.5和A/= 0.51之间的差异大部分消失了,尽管仍然存在一些周期差异.这很难解释,因为numpy的开销无法掩盖进行除法的时间.
  • 非本地操作(虚线)比L3高速缓存大得多时,会变得异常慢,但就地操作却不会.它们需要两倍于RAM的内存带宽,但我无法解释为什么它们会慢20倍!
  • 虚线在左侧发散.大概是因为除法和乘法是由具有不同开销量的不同numpy函数处理的.
  • Below arrays of 30KB the majority of time is overhead in python/numpy
  • Multiplication and addition are the same speed (as given in Agner's tables)
  • The difference in speed between A/=0.5 and A/=0.51 drops towards the right of graph; this is because when the time to read/write memory increases, it overlaps and masks some of the time taken to do the division. For that reason, A/=0.5, A*=2.0 and A+=2.0 become the same speed.
  • Comparing the maximum difference between A/=0.51, A/=0.5 and A+=2.0 suggests that division has a throughput of 4.5-44 cycles, which fails to match the 4.5-11 in Agner's table.
  • However, the difference between A/=0.5 and A/=0.51 mostly disappears when the numpy overhead gets large, although there's still a few cycles difference. This is hard to explain, because numpy overhead can't mask time to do the division.
  • Operations which are not in-place (dashed lines) become incredibly slow when much larger than the L3 cache size, but in-place operations don't. They require double the memory bandwidth to RAM, but I can't explain why they would be 20x slower!
  • The dashed lines diverge on the left. This is assumably because division and multiplication are handled by different numpy functions with different amounts of overhead.

不幸的是,在另一台CPU的FPU速度,缓存大小,内存带宽,numpy版本等不同的计算机上,这些曲线看起来可能完全不同.

Unfortunately, on another machine with a CPU with different FPU speed, cache size, memory bandwidth, numpy version, etc, these curves could look quite different.

我从中得到的好处是:将多个算术运算与numpy链接在一起比在Cython中进行相同的运算(同时仅对输入进行一次迭代)要慢很多倍,因为没有最佳位置"算术运算的成本支配着其他成本.

My take-away from this is: chaining together multiple arithmetic operations with numpy is going to be many times slower than doing the same in Cython while iterating over the inputs just once, because there is no "sweet spot" at which the cost of the arithmetic operations dominates the other costs.

import numpy as np
import timeit
import matplotlib.pyplot as plt

CPUHz = 3.3e9
divpd_cycles = 4.5
L2cachesize = 2*2**20
L3cachesize = 8*2**20

def timeit_command(command, pieces, size):
    return min(timeit.repeat("for i in xrange(%d): %s" % (pieces, command),
                             "import numpy; A = numpy.random.rand(%d)" % size, number = 6))

def run():
    totaliterations = 1e7

    commands=["A/=0.5", "A/=0.51", "A/0.5", "A*=2.0", "A*2.0", "A+=2.0"]
    styles=['-', '-', '--', '-', '--', '-']

    def draw_graph(command, style, compute_overhead = False):
        sizes = []
        y = []
        for pieces in np.logspace(0, 5, 11):
            size = int(totaliterations / pieces)
            sizes.append(size * 8)  # 8 bytes per double
            time = timeit_command(command, pieces, (4 if compute_overhead else size))
            # Divide by 2 because SSE instructions process two doubles each
            cycles = time * CPUHz / (size * pieces / 2)
            y.append(cycles)
        if compute_overhead:
            command = "numpy overhead"
        plt.semilogx(sizes, y, style, label = command, linewidth = 2, basex = 10)

    plt.figure()
    for command, style in zip(commands, styles):
        print command
        draw_graph(command, style)
    # Plot overhead
    draw_graph("A+=1.0", '-', compute_overhead=True)

    plt.legend(loc = 'best', prop = {'size':9}, handlelength = 3)
    plt.xlabel('Array size in bytes')
    plt.ylabel('CPU cycles per SSE instruction')

    # Draw vertical and horizontal lines
    ymin, ymax = plt.ylim()
    plt.vlines(L2cachesize, ymin, ymax, color = 'orange', linewidth = 2)
    plt.vlines(L3cachesize, ymin, ymax, color = 'red', linewidth = 2)
    xmin, xmax = plt.xlim()
    plt.hlines(divpd_cycles, xmin, xmax, color = 'blue', linewidth = 2)

这篇关于numpy:除以0.5有什么特别之处?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆