Sum array of arrays in Cython


Question

I am attempting to find the fastest way to sum horizontally a numpy array of arrays using Cython. To get things started, let's assume I have a single 2D array of random floats 10 x 100,000. I can create an object array with each column as a value in the array as follows:

import numpy as np

n = 10 ** 5
a = np.random.rand(10, n)
a_obj = np.empty(n, dtype='O')
for i in range(n):
    a_obj[i] = a[:, i]

All I would like to do is find the sum of each row. They can both be trivially computed as such:

%timeit a.sum(1)
414 µs ± 24.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

%timeit a_obj.sum()
113 ms ± 7.01 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

The object array is 250x slower.

Before even trying to sum things up, I wanted to time access to each element. Cython is unable to speed up access to each member of the object array when iterating through each item directly:

from numpy cimport ndarray

def access_obj(ndarray[object] a):
    cdef int i, j
    cdef int nc = len(a)
    cdef int nr = len(a[0])
    for i in range(nc):
        for j in range(nr):
            a[i][j]

%timeit access_obj(a_obj)
42.1 ms ± 665 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

But, I believe I can access the underlying C arrays with the data attribute which is a memoryview object and get much faster access:

from numpy cimport ndarray

def access_obj_data(ndarray[object] a):
    cdef int i, j
    cdef int nc = len(a)
    cdef int nr = len(a[0])
    cdef double **data = <double**>a.data
    for i in range(nc):
        for j in range(nr):
            data[i][j]

I think the results get cached when timing repeatedly, so I had to time a single run:

%timeit -n 1 -r 1 access_obj_data(a_obj)
8.17 µs ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)

The problem is that you can't just cast a.data into a C array of doubles. If I print out the first column, I get the following.

5e-324
2.1821467023e-314
2.2428810855e-314
2.1219957915e-314
6.94809615162997e-310
6.94809615163037e-310
2.2772194067e-314
2.182150145e-314
2.1219964234e-314
0.0

There are similar questions, but nothing explicitly answers a way to do this operation fast with code that works.

Answer

I guess there are two different questions:

  1. Why is a_obj.sum() so slow?
  2. Why is the cython-function code slow?

We will see that the slowness of a_obj.sum() can be traced to the overhead of Python dispatch plus cache misses. However, this is the result of a quite superficial analysis; there might be more subtle reasons.

The second question is less interesting: we will see that Cython doesn't know enough about the objects in the array to be able to speed the function up.

There is also the question "what is the fastest way?". The answer is "it depends on your data and your hardware", as it almost always is for this kind of question.

However, we will use the gained insights to improve slightly on the existing versions.

Let's start with "a_obj.sum() is slow".

As a baseline for your example on my machine (yep, a factor of about 200):

>>> %timeit a.sum(1)
510 µs ± 14.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
>>> %timeit a_obj.sum()
90.3 ms ± 494 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

First, what is happening in a_obj.sum()? The reduction algorithm goes like this:

result=a_obj[0]+a_obj[1]+a_obj[2]+...

However, the algorithm doesn't know that the a_obj[i] are numpy arrays, so it has to use Python dispatch to figure out that a_obj[i]+a_obj[i+1] is actually the addition of two numpy arrays. This is quite an overhead for summing up 10 doubles, and we have to pay it 10**5 times. I answered a question about the cost of Python dispatch some time ago, if you are interested in more details.
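
This is roughly equivalent to the following pure-Python reduction (a sketch for illustration only; numpy's actual implementation differs in detail):

import operator
from functools import reduce

# one Python-level dispatch and one temporary 10-element array per "+"
result = reduce(operator.add, a_obj)

Every addition goes through Python's dispatch machinery and allocates a new temporary array, which is where the overhead comes from.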

If the shape of our array were 10**5 x 10, we would pay the overhead only 10 times, which is not large compared with the work of 10**5 additions:

>>> b=np.random.rand(n,10)
>>> b_obj=to_obj_array(b) # see listing at the end for code
>>> %timeit b.sum(1)
2.43 ms ± 13.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)    
>>> %timeit b_obj.sum()
5.53 ms ± 17.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

As you can see, the factor (about 2) is much smaller now. However, the Python-dispatch-overhead theory cannot explain everything.

Let's take a look at the following array sizes:

>>> c=np.random.rand(10**4,10**4)
>>> c_obj=to_obj_array(c)
>>> %timeit c.sum(1)
52.5 ms ± 354 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
>>> %timeit c_obj.sum()
369 ms ± 2.53 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

This time the factor is about 8! The reason is cache misses in the object version. This is a memory-bound task, so basically the memory speed is the bottleneck.

In the numpy array the elements are stored contiguously in memory, one row after another. Every time a double is fetched from memory, its 7 neighboring doubles are fetched along with it, and these 8 doubles are saved in the cache (one cache line).

So when c.sum(1) runs, the information is needed row-wise: every time a double is fetched from memory into the cache, the next 7 numbers come with it, so they can be summed up right away. This is the most efficient memory-access pattern.

The object version is different: its information is needed column-wise. So every time we fetch 7 additional doubles into the cache, we don't need them yet, and by the time we could process them they have already been evicted from the cache (because the cache isn't big enough for all the numbers of a column) and have to be fetched from slow memory again.

That means that with the second approach, every double is read 8 times from memory, i.e. there are 8 times as many reads from memory / cache misses.
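
You can see this access pattern directly in the strides of the C-ordered array c (a small illustrative check; the byte counts assume 8-byte doubles):

import numpy as np

c = np.random.rand(10**4, 10**4)   # C-ordered by default
print(c[0, :].strides)             # (8,): consecutive doubles, 8 per 64-byte cache line
print(c[:, 0].strides)             # (80000,): every element of a column sits on its own cache line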

We can easily verify that by using valgrind's cachegrind to measure the cache misses for both variants (see the listing of test_cache.py at the end):

valgrind --tool=cachegrind python test_cache.py array
...
D1  misses:       144,246,083 
...

valgrind --tool=cachegrind python test_cache.py object
...
D1  misses:     1,285,664,578
...

i.e. the c_obj version has about 8 times more D1-cache misses.
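
These numbers roughly match a back-of-envelope estimate (assuming 64-byte cache lines, i.e. 8 doubles per line, and the 10 repetitions in test_cache.py): reading the 10**8 doubles of c row-wise costs about 10**8 / 8 = 1.25 * 10**7 misses per pass, i.e. about 1.25 * 10**8 over 10 passes, close to the measured 1.44 * 10**8; reading them column-wise costs roughly one miss per double, i.e. about 10**9 over 10 passes, close to the measured 1.29 * 10**9.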

But there is more fun: the object version can even be faster!

>>> d=np.random.rand(800,6)
>>> d_obj=to_obj_array(d)
>>> %timeit d.sum(1)
20.2 µs ± 217 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
>>> %timeit d_obj.sum()
12.4 µs ± 155 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

How could that be? They could be similarly fast, because

  1. the whole array can be kept in the cache, so there are no additional cache misses, and
  2. the Python-dispatch overhead is bearable.

But why is the otherwise slower version now two times faster?

I'm not 100% sure, but here is my theory: for the array version, we calculate the sum of row i like this:

...
sum_i=sum_i+a[i][j]
sum_i=sum_i+a[i][j+1]
sum_i=sum_i+a[i][j+2]
...

so the hardware cannot execute the instructions in parallel, because the result of the previous operation is needed to calculate the next one.

However, for the object version, the calculation looks as follows:

...
sum_i=sum_i+a[i][j]
sum_k=sum_k+a[k][j]
sum_m=sum_m+a[m][j]
...

These steps can be done in parallel by the CPU because there is no dependency between them.

So, why is Cython so slow? Because Cython doesn't know anything about the objects in the array, it is not really faster than the usual Python code, which is very slow!

Let's break down the cython-function access_obj:

  1. Cython knows nothing about a[i] and assumes it is a generic Python object.
  2. That means that for a[i][j] it must call __getitem__ of a[i] with the help of the Python infrastructure. Btw., for that, j must be converted to a full-fledged Python integer.
  3. __getitem__ constructs a full-fledged Python float from the stored C double and returns it to Cython.
  4. The returned Python float is cast to a C float.

The bottom line: under the hood we use Python dispatch and create two temporary Python objects per element access; you can see what that costs in the timing above (and in the sketch below).
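
A rough pure-Python sketch of what the loop body effectively does for an object array (an illustration of the steps above, not the actual generated C code; access_obj_python is a hypothetical name):

def access_obj_python(a_obj):
    nc = len(a_obj)
    nr = len(a_obj[0])
    for i in range(nc):
        row = a_obj[i]        # a generic Python object as far as Cython is concerned
        for j in range(nr):
            row[j]            # __getitem__ builds a temporary Python float

Timing this should land in the same ballpark as the Cython version above, which is exactly the point: typing the array as ndarray[object] buys almost nothing.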

Speeding up the summation

We have seen that the summation is a memory-bound task (at least on my machine), so the most important thing is to avoid cache misses (which also means, for example, that we need different algorithms for C-order and F-order numpy arrays).

However, if our task has to share the CPU with a CPU-heavy task, it might de facto become CPU-bound again, so there is some merit in focusing on more than just the cache misses.

In the C version, which is introduced below, we would like to combine the good properties of both approaches:

  1. Similar to the object version, we want to calculate several row sums in parallel, in order to exploit the inherent parallelism of the CPU.
  2. Similar to the numpy array sum, we want to have the minimal number of cache misses, so we don't calculate all rows simultaneously but only a small block (e.g. 8) at a time. The advantage of this block-wise calculation is that data, once cached, isn't evicted from the cache before it is really needed.

Here is a simplified version, which processes rows in bunches of 8:

//fast.c
#define MAX 8
void sum_rows_blocks(double *orig, double *out, int n_rows, int n_cols){
   int n_blocks=n_rows/MAX;
   for (int b=0; b<n_blocks; b++){   //for every block
       double res[MAX]={0.0};        //initialize to 0.0
       for(int j=0;j<n_cols;j++){
           for(int i=0;i<MAX;i++){   //calculate sum for MAX rows simultaneously
              res[i]+=orig[(b*MAX+i)*n_cols+j];
           }
       }
       for(int i=0;i<MAX;i++){
             out[b*MAX+i]=res[i];
       }
   }  
   //left_overs:
   int left=n_rows-n_blocks*MAX;
   double res[MAX]={0.0};         //initialize to 0.0
   for(int j=0;j<n_cols;j++){
       for(int i=0;i<left;i++){   //calculate sum for left rows simultaneously
              res[i]+=orig[(n_blocks*MAX+i)*n_cols+j];
       }
   }
   for(int i=0;i<left;i++){
         out[n_blocks*MAX+i]=res[i];
   }
}

Now we can use cython to wrap this function:

#c_sum.pyx
cimport numpy as np    
import numpy as np

cdef extern from "fast.c":
    void sum_rows_blocks(double *orig, double *out, int n_rows, int n_cols)

def sum_rows_using_blocks(double[:,:] arr):
    cdef int n_rows
    cdef int n_cols
    cdef np.ndarray[np.double_t] res
    n_rows=arr.shape[0] 
    n_cols=arr.shape[1]
    res=np.empty((n_rows, ),'d')
    sum_rows_blocks(&arr[0,0], &res[0], n_rows, n_cols)
    return res

And after building the extension (see listing for setup.py), we can do the comparison:

import c_sum
%timeit c_sum.sum_rows_using_blocks(d) #shape=(800,6)
4.26 µs ± 95.8 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
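
As a quick sanity check that the wrapper computes the same result (assuming the extension was built as described further down):

import numpy as np
import c_sum

d = np.random.rand(800, 6)
assert np.allclose(c_sum.sum_rows_using_blocks(d), d.sum(1))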

That is a factor of 3 faster than the fastest version so far (12 µs). How does it fare for large arrays:

%timeit c_sum.sum_rows_using_blocks(c) #shape=(10**4,10**4)
49.7 ms ± 600 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

Good, it is slightly faster than c.sum(1), which takes 52 ms; that means we don't have additional cache misses.

Can it be improved? I think so: for example, not using the default compile flag -fwrapv (by including extra_compile_args = ["-fno-wrapv"] in the setup) will improve the resulting assembly and lead to a 10% speedup of the summation for the shape (800, 6).
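
A minimal sketch of how the setup.py listing below could be adapted to pass that flag (same extension definition, just with extra_compile_args added):

#setup.py, variant with -fno-wrapv
from distutils.core import setup, Extension
from Cython.Build import cythonize
import numpy as np

setup(ext_modules=cythonize(Extension(
            name='c_sum',
            sources = ["c_sum.pyx"],
            include_dirs=[np.get_include()],
            extra_compile_args = ["-fno-wrapv"],  # overrides the default -fwrapv
    )))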

Listings:

import numpy as np

def to_obj_array(a):
     n=a.shape[1]
     obj=np.empty((n,),dtype='O')
     for i in range(n):
          obj[i]=a[:,i]
     return obj

test_cache.py:

import sys
import numpy as np

def to_obj_array(a):   
     n=a.shape[1]
     obj=np.empty((n,),dtype='O')
     for i in range(n):
          obj[i]=a[:,i]
     return obj

c=np.random.rand(10**4,10**4)
c_obj=to_obj_array(c)


for i in range(10):
  if sys.argv[1]=="array":
     c.sum(1)
  elif sys.argv[1]=="object":
     c_obj.sum()
  else:
     raise Exception("unknown parameter")

setup.py:

#setup.py
from distutils.core import setup, Extension
from Cython.Build import cythonize
import numpy as np

setup(ext_modules=cythonize(Extension(
            name='c_sum',
            sources = ["c_sum.pyx"],
            #extra_link_args = ["fast.o"],
            include_dirs=[np.get_include()]

    )))
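
With this setup.py, the extension is built in the usual way, e.g. via python setup.py build_ext --inplace, after which it can be imported as import c_sum (the exact build invocation may differ depending on your environment).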
