Large performance difference when summing ints vs floats in Cython vs NumPy
Question
I am summing each element of a 1D array using either Cython or NumPy. When summing integers, Cython is ~20% faster. When summing floats, Cython is ~2.5x slower. Below are the two simple functions used.
#cython: boundscheck=False
#cython: wraparound=False
cimport numpy as np
from numpy cimport ndarray

def sum_int(ndarray[np.int64_t] a):
    cdef:
        Py_ssize_t i, n = len(a)
        np.int64_t total = 0
    for i in range(n):
        total += a[i]
    return total

def sum_float(ndarray[np.float64_t] a):
    cdef:
        Py_ssize_t i, n = len(a)
        np.float64_t total = 0
    for i in range(n):
        total += a[i]
    return total
Timings

Create two arrays of 1 million elements each:
a_int = np.random.randint(0, 100, 10**6)
a_float = np.random.rand(10**6)
%timeit sum_int(a_int)
394 µs ± 30 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit a_int.sum()
490 µs ± 34.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit sum_float(a_float)
982 µs ± 10.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit a_float.sum()
383 µs ± 4.42 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Additional points

- NumPy is outperforming (by quite a large margin) with floats, and even beats its own integer sum.
- The performance difference for sum_float is the same with the boundscheck and wraparound directives missing. Why?
- Converting the integer numpy array in sum_int to a C pointer (np.int64_t *arr = <np.int64_t *> a.data) improves performance by an additional 25%. Doing so for the floats does nothing.
How can I get the same performance in Cython with floats that I get with integers?
I wrote an even simpler function that just counts the number of iterations. The first stores the count as an int, the second as a double.
def count_int():
    cdef:
        Py_ssize_t i, n = 1000000
        int ct = 0
    for i in range(n):
        ct += 1
    return ct

def count_double():
    cdef:
        Py_ssize_t i, n = 1000000
        double ct = 0
    for i in range(n):
        ct += 1
    return ct
Timings of counting

I ran these just once each (wary of caching effects). No idea whether the loop is actually executed in the integer case, but count_double has the same performance as sum_float above. This is crazy...
%timeit -n 1 -r 1 count_int()
1.1 µs ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
%timeit -n 1 -r 1 count_double()
971 µs ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
Answer
I'm not going to answer all of your questions, only (in my eyes) the most interesting ones.
Let's start with your counting example:
- The compiler is able to optimize away the for-loop in the integer case: the resulting binary doesn't calculate anything; it only has to return a value precomputed during the compilation phase.
- This is not the case for the double version, because due to rounding errors the result will not necessarily be 1.0*10**6, and because Cython compiles in IEEE 754 mode (not -ffast-math) by default.
You must keep this in mind when looking at your Cython code: the compiler is not allowed to rearrange the summations (IEEE 754), and because the result of each summation is needed for the next, there is only one long dependency chain in which every operation waits on the previous one.
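That IEEE 754 addition is not associative, and therefore may not be reordered by a standards-compliant compiler, can be seen directly in plain Python (a minimal illustration):

```python
# The same three values summed in a different order give a different
# result, so a compiler honoring IEEE 754 cannot turn the sequential
# chain ((((a1+a2)+a3)+a4)+... into independent parallel partial sums.
left = (0.1 + 0.2) + 0.3
right = 0.1 + (0.2 + 0.3)

print(left == right)  # False
print(left, right)    # 0.6000000000000001 0.6
```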
But the most crucial insight: NumPy is not doing the same thing as your Cython code:
>>> sum_float(a_float)-a_float.sum()
2.9103830456733704e-08
Yep, nobody told NumPy (differently from your Cython code) that the sum has to be calculated like this:

((((a1+a2)+a3)+a4)+...
And NumPy takes advantage of this in two ways:

- it performs pairwise summation (kind of), which leads to a smaller rounding error.
- it calculates the sum in chunks (the code is somewhat hard to understand; here is the corresponding template, and further down a listing of the used function pairwise_sum_DOUBLE).
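The pairwise idea itself is easy to sketch in plain Python (a minimal illustration with my own function names, not NumPy's actual code):

```python
import math

def pairwise_sum(a, lo, hi, block=8):
    """Sum a[lo:hi] recursively by halves; the rounding error grows
    like O(log n) instead of the O(n) of a left-to-right loop."""
    n = hi - lo
    if n <= block:                  # small base case: sequential loop
        total = 0.0
        for i in range(lo, hi):
            total += a[i]
        return total
    mid = lo + n // 2
    return pairwise_sum(a, lo, mid, block) + pairwise_sum(a, mid, hi, block)

# 0.1 is not exactly representable, so errors accumulate in a naive sum;
# math.fsum gives a correctly rounded reference value to compare against.
data = [0.1] * 10**5
naive = 0.0
for x in data:
    naive += x
exact = math.fsum(data)
print(abs(naive - exact) > abs(pairwise_sum(data, 0, len(data)) - exact))
# prints True: the pairwise error is smaller
```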
The second point is the reason for the speed-up you are observing; the calculation happens according to a schema similar to the following (at least as I understand it from the source code below):
a1 + a9  + ..... = r1
a2 + a10 + ..... = r2
..
a8 + a16 + ..... = r8
----> sum = r1 + .... + r8
The advantage of this kind of summation: the result of a2+a10 doesn't depend on a1+a9, so these two values can be calculated simultaneously (e.g. pipelined) on modern CPUs, which leads to the speed-up you are observing.
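The dependency-breaking can be sketched in plain Python (illustrative only: Python itself gains nothing from this, but the access pattern matches the schema above):

```python
def sum_8_accumulators(a):
    """Sum with 8 independent partial sums, mirroring the schema:
    r1 = a1 + a9 + ..., r2 = a2 + a10 + ..., finally r1 + ... + r8.
    The 8 accumulators have no data dependencies between them, so a CPU
    can overlap their additions instead of waiting on one long chain."""
    n = len(a)
    r = [0.0] * 8
    i = 0
    while i + 8 <= n:
        for k in range(8):          # in the C version this is unrolled
            r[k] += a[i + k]
        i += 8
    res = ((r[0] + r[1]) + (r[2] + r[3])) + ((r[4] + r[5]) + (r[6] + r[7]))
    for j in range(i, n):           # handle the non-multiple-of-8 tail
        res += a[j]
    return res

print(sum_8_accumulators([1.0] * 20))  # 20.0
```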
For what it's worth, on my machine the Cython integer sum is slower than NumPy's.
The need to take the stride of the numpy array into account (which is known only at run time; see also this question about vectorization) prevents some optimizations. A workaround is to use memory views, with which you can make clear that the data is contiguous, i.e.:
def sum_int_cont(np.int64_t[::1] a):
    ...  # body identical to sum_int above
On my machine this leads to a significant speed-up (a factor of 2):
%timeit sum_int(a_int)
2.64 ms ± 46.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit sum_int_cont(a_int)
1.31 ms ± 19 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit a_int.sum()
2.1 ms ± 105 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
It's true that in this case using memory views for doubles doesn't bring any speed-up (I don't know why), but in general it simplifies the life of the optimizer. For example, combining the memory-view variant with the -ffast-math compile option, which allows associativity, leads to performance comparable with NumPy:
%%cython -c=-ffast-math
cimport numpy as np

def sum_float_cont(np.float64_t[::1] a):
    cdef:
        Py_ssize_t i, n = len(a)
        np.float64_t total = 0
    for i in range(n):
        total += a[i]
    return total
And now:
>>> %timeit sum_float(a_float)
3.46 ms ± 226 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
>>> %timeit sum_float_cont(a_float)
1.87 ms ± 44 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
>>> %timeit a_float.sum()
1.41 ms ± 88.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Listing of pairwise_sum_DOUBLE:
/*
 * Pairwise summation, rounding error O(lg n) instead of O(n).
 * The recursion depth is O(lg n) as well.
 * when updating also update similar complex floats summation
 */
static npy_double
pairwise_sum_DOUBLE(npy_double *a, npy_uintp n, npy_intp stride)
{
    if (n < 8) {
        npy_intp i;
        npy_double res = 0.;
        for (i = 0; i < n; i++) {
            res += (a[i * stride]);
        }
        return res;
    }
    else if (n <= PW_BLOCKSIZE) {
        npy_intp i;
        npy_double r[8], res;

        /*
         * sum a block with 8 accumulators
         * 8 times unroll reduces blocksize to 16 and allows vectorization with
         * avx without changing summation ordering
         */
        r[0] = (a[0 * stride]);
        r[1] = (a[1 * stride]);
        r[2] = (a[2 * stride]);
        r[3] = (a[3 * stride]);
        r[4] = (a[4 * stride]);
        r[5] = (a[5 * stride]);
        r[6] = (a[6 * stride]);
        r[7] = (a[7 * stride]);

        for (i = 8; i < n - (n % 8); i += 8) {
            r[0] += (a[(i + 0) * stride]);
            r[1] += (a[(i + 1) * stride]);
            r[2] += (a[(i + 2) * stride]);
            r[3] += (a[(i + 3) * stride]);
            r[4] += (a[(i + 4) * stride]);
            r[5] += (a[(i + 5) * stride]);
            r[6] += (a[(i + 6) * stride]);
            r[7] += (a[(i + 7) * stride]);
        }

        /* accumulate now to avoid stack spills for single peel loop */
        res = ((r[0] + r[1]) + (r[2] + r[3])) +
              ((r[4] + r[5]) + (r[6] + r[7]));

        /* do non multiple of 8 rest */
        for (; i < n; i++) {
            res += (a[i * stride]);
        }
        return res;
    }
    else {
        /* divide by two but avoid non-multiples of unroll factor */
        npy_uintp n2 = n / 2;
        n2 -= n2 % 8;
        return pairwise_sum_DOUBLE(a, n2, stride) +
               pairwise_sum_DOUBLE(a + n2 * stride, n - n2, stride);
    }
}
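For readability, here is a direct Python transliteration of the listing (a sketch only; the PW_BLOCKSIZE value of 128 is an assumption on my part):

```python
PW_BLOCKSIZE = 128  # NumPy's unroll block size; 128 is assumed here

def pairwise_sum_double(a, n, stride):
    """Python port of pairwise_sum_DOUBLE above: sum n elements of the
    flat sequence a, taking every stride-th element."""
    if n < 8:                                   # tiny case: sequential loop
        res = 0.0
        for i in range(n):
            res += a[i * stride]
        return res
    if n <= PW_BLOCKSIZE:                       # one block, 8 accumulators
        r = [a[k * stride] for k in range(8)]
        i = 8
        while i < n - (n % 8):
            for k in range(8):
                r[k] += a[(i + k) * stride]
            i += 8
        res = ((r[0] + r[1]) + (r[2] + r[3])) + ((r[4] + r[5]) + (r[6] + r[7]))
        for j in range(i, n):                   # non-multiple-of-8 rest
            res += a[j * stride]
        return res
    # divide by two, but keep the split a multiple of the unroll factor
    n2 = (n // 2) - ((n // 2) % 8)
    return (pairwise_sum_double(a, n2, stride)
            + pairwise_sum_double(a[n2 * stride:], n - n2, stride))
```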