Speed up numpy nested loop


Problem description


      I am writing a simulation of a wireless network in Python using numpy and cython. Suppose there are no_nodes transmitting nodes scattered randomly on the 2D plane, each sending out a waveform, and their respective receivers, again scattered randomly on the 2D plane. Each transmitting node produces a waveform I call output (each one can produce an output of a different length).

      What I want to do is to sum these outputs from each node to one big waveform that will be the input to each receiver for demodulation etc. Now two key-points:

      • the transmitters send asynchronously, and therefore a start_clock and an end_clock have to be maintained per transmitting node so that the waveforms are summed properly
      • the output of transmitting node j will be attenuated before being received by node i, according to a function attenuate(i,j)

      So here is the code:

      #create empty 2d array (no_rx_nodes x no_samples for each waveform)
      waveforms = np.zeros((no_nodes, max(end_clock))) 
      
      for i in range(no_nodes): #calculate the waveform for each receiver
          for j in range(no_nodes): #sum the waveforms produced by each transmitter
              waveforms[i, start_clock[j]:end_clock[j]] += output[j,:] * attenuate(i,j)
      return waveforms
      

      Some comments on the above:

      • output[j, :] is the output waveform of transmitter j
      • waveforms[i,:] is the waveform received by receiver i

      I hope it is fairly clear what I am trying to accomplish here. Because the waveforms produced are very large (about 10^6 samples), I also tried turning this code into Cython, but did not notice any particular speedup (maybe 5-10x better, but no more). I was wondering whether there is anything else I can resort to in order to get a speedup, because this is the real bottleneck of the whole simulation (it takes almost as much time to compute as the rest of the code, which is in fact quite a bit more complicated than this).
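To make the snippet above self-contained, here is a minimal runnable sketch of the same setup. The sizes, the random clocks, and the attenuate function are all illustrative stand-ins (the real path-loss function is not given in the question):

```python
import numpy as np

rng = np.random.default_rng(0)
no_nodes = 4          # assumed toy size
length = 100          # samples per output (equal lengths here, for simplicity)

# each transmitter starts at a random offset; its output ends length samples later
start_clock = rng.integers(0, 50, size=no_nodes)
end_clock = start_clock + length
output = rng.standard_normal((no_nodes, length))

def attenuate(i, j):
    # hypothetical stand-in for the real path-loss function
    return 1.0 / (1 + abs(i - j))

# sum the attenuated, time-shifted outputs into one waveform per receiver
waveforms = np.zeros((no_nodes, end_clock.max()))
for i in range(no_nodes):        # waveform arriving at receiver i
    for j in range(no_nodes):    # contribution of transmitter j
        waveforms[i, start_clock[j]:end_clock[j]] += output[j, :] * attenuate(i, j)
```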

      Solution

      This is a memory-bandwidth-bound problem: with about 3 GB/s of memory bandwidth, the best you can get out of the inner loop is around 2-4 ms. To reach that bound you need to block your inner loop so that it utilizes the CPU caches better (numexpr does this for you):

      # block length: chosen so all operands fit in the (next-to-)last level
      # cache; the first level is normally too small to be usable due to
      # Python overhead
      s = 15000
      for i in range(no_nodes):
          for j in range(no_nodes):
              a = attenuation[i, j]
              o = output[j]
              w = waveforms[i]
              for k in range(0, w.size, s):
                  u = min(k + s, w.size)
                  w[k:u] += o[k:u] * a
              # or: numexpr.evaluate("w + o * a", out=w)
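Wrapped into a function, the blocking idea above can be checked against the unblocked result. This sketch assumes the outputs are already aligned to a common time axis (the start/end clocks are omitted) and that attenuation is a precomputed no_nodes x no_nodes matrix:

```python
import numpy as np

def blocked_sum(output, attenuation, block=15000):
    """Cache-blocked accumulation: for each receiver i, add the attenuated
    output of every transmitter j, `block` samples at a time, so the three
    operand slices stay resident in cache."""
    no_nodes, n_samples = output.shape
    waveforms = np.zeros((no_nodes, n_samples), dtype=output.dtype)
    for i in range(no_nodes):
        w = waveforms[i]
        for j in range(no_nodes):
            a = attenuation[i, j]
            o = output[j]
            for k in range(0, n_samples, block):
                u = min(k + block, n_samples)
                w[k:u] += o[k:u] * a
    return waveforms

rng = np.random.default_rng(1)
out = rng.standard_normal((3, 40_000))
att = rng.random((3, 3))
w = blocked_sum(out, att, block=7000)
```

Because each receiver's waveform is just a weighted sum of the transmitter outputs, the result equals the matrix product att @ out, which is a convenient correctness check.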
      

      Using float32 data instead of float64 should also halve the memory-bandwidth requirements and double the performance.

      To get larger speedups you have to redesign your full algorithm to achieve better data locality.
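One illustrative direction such a redesign could take (an assumption on my part, not part of the original answer): if all outputs are first resampled onto a common time axis so the start/end offsets disappear, the whole double loop collapses into a single matrix product, which BLAS already executes with cache-blocked data locality:

```python
import numpy as np

rng = np.random.default_rng(2)
no_nodes, n_samples = 8, 50_000           # assumed sizes
output = rng.standard_normal((no_nodes, n_samples)).astype(np.float32)
attenuation = rng.random((no_nodes, no_nodes)).astype(np.float32)

# waveforms[i] = sum_j attenuation[i, j] * output[j]  -- one cache-blocked GEMM
waveforms = attenuation @ output
```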

