different results with and without SSE (float arrays multiplication)
Problem description

I have two functions that multiply 2D arrays. One of them uses SSE. The other doesn't use any optimization. Both functions work well, but the results are slightly different. For example 20.333334 and 20.333332.

Can you explain why the results are different? And what can I do with the functions to get the same result?
function with SSE
float** sse_multiplication(float** array1, float** array2, float** arraycheck)
{
    int i, j, k;
    float *ms1, *ms2, result;
    float *end_loop;

    for (i = 0; i < rows1; i++)
    {
        for (j = 0; j < columns2; j++)
        {
            result = 0;
            ms1 = array1[i];
            ms2 = array2[j];
            end_loop = &array1[i][columns1];

            __asm{
                mov rax, ms1
                mov rbx, ms2
                mov rdx, end_loop
                xorps xmm2, xmm2
            loop:
                movups xmm0, [rax]
                movups xmm1, [rbx]
                movups xmm3, [rax+16]
                movups xmm4, [rbx+16]
                mulps xmm0, xmm1
                mulps xmm3, xmm4
                addps xmm2, xmm0
                add rax, 32
                add rbx, 32
                cmp rdx, rax
                jne loop
                haddps xmm2, xmm2
                haddps xmm2, xmm2
                movups result, xmm2
            }
            arraycheck[i][j] = result;
        }
    }
    return arraycheck;
}
function without any optimization
float** multiplication(float** array1, float** array2, float** arraycheck)
{
    for (int i = 0; i < rows1; i++)
        for (int j = 0; j < columns2; j++)
            for (int k = 0; k < rows1; k++)
                arraycheck[i][j] += array1[i][k] * array2[k][j];
    return arraycheck;
}
Solution
FP addition is not perfectly associative, so a different order of operations produces slightly different rounding errors.

Your C sums elements in order. (Unless you use -ffast-math to allow the compiler to make the same assumption you did, that FP operations are close enough to associative.)
Your asm sums up every 4th element at 4 different offsets, then horizontally sums those. The sum in each vector element is rounded differently at each point.
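For illustration, here is a small standalone C program (my own deliberately exaggerated example, not from the question's data) that sums the same eight floats sequentially and then in four interleaved partial sums, the way a vector accumulator works, and gets different answers:

#include <stdio.h>

int main(void)
{
    /* values chosen to make the rounding difference obvious */
    float data[8] = { 1e8f, 1.0f, -1e8f, 1.0f, 1e8f, 1.0f, -1e8f, 1.0f };

    float seq = 0.0f;               /* scalar C order: strictly in sequence */
    for (int i = 0; i < 8; i++)
        seq += data[i];

    float lane[4] = { 0.0f, 0.0f, 0.0f, 0.0f };   /* 4 running sums, like xmm2 */
    for (int i = 0; i < 8; i += 4)
        for (int l = 0; l < 4; l++)
            lane[l] += data[i + l];

    /* the same pairing the two haddps instructions produce */
    float vec = (lane[0] + lane[1]) + (lane[2] + lane[3]);

    /* on a typical SSE build this prints 1.000000 and 0.000000 */
    printf("sequential: %f   lane-wise: %f\n", seq, vec);
    return 0;
}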
Your vectorized version doesn't seem to match the C version. The indexing looks different. AFAICT, the only sane way to vectorize arraycheck[i][j] += array1[i][k] * array2[k][j]; is over j. Looping over k would require strided loads from array2, and looping over i would require strided loads from array1.
Am I missing something about your asm? It's loading contiguous values from both arrays. It's also throwing away the mulps result in xmm3 every iteration of loop, so I think it's just buggy.
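For reference, here is my guess at what the loop body was meant to do (purely hypothetical; this addresses the discarded product, not the rounding-order difference): accumulate the second product into a second register and combine the two accumulators at the end.

            __asm{
                mov rax, ms1
                mov rbx, ms2
                mov rdx, end_loop
                xorps xmm2, xmm2
                xorps xmm5, xmm5    // second accumulator (added)
            loop:
                movups xmm0, [rax]
                movups xmm1, [rbx]
                movups xmm3, [rax+16]
                movups xmm4, [rbx+16]
                mulps xmm0, xmm1
                mulps xmm3, xmm4
                addps xmm2, xmm0
                addps xmm5, xmm3    // keep the second product instead of dropping it
                add rax, 32
                add rbx, 32
                cmp rdx, rax
                jne loop
                addps xmm2, xmm5    // combine the two accumulators
                haddps xmm2, xmm2
                haddps xmm2, xmm2
                movups result, xmm2
            }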
Since looping over j in the inner vector loop doesn't change array1[i][k], just broadcast-load it once outside the loop (_mm256_set1_ps).

However, that means doing a read-modify-write of arraycheck[i][j] for every different j value, i.e. ac[i][j + 0..3] = fma(a1[i][k], a2[k][j + 0..3], ac[i][j + 0..3]). To avoid this, you'd have to transpose one of the arrays first. (But that's O(N^2) for an NxN matrix, which is still cheaper than the multiply.)

This way doesn't use a horizontal sum, but see that link if you want better code for that. It does operations in the same order as the scalar C, so results should match exactly.
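Here is a minimal sketch of that j-vectorized loop in C with SSE intrinsics (my own illustration under the assumptions of the question: row-major float** arrays, the globals rows1/columns1/columns2, columns2 a multiple of 4, and arraycheck zeroed on entry; the function name is hypothetical). It uses the 128-bit _mm_set1_ps since the question uses SSE; the AVX version would use _mm256_set1_ps as mentioned above:

#include <xmmintrin.h>      /* SSE intrinsics */

float** sse_multiplication_v2(float** array1, float** array2, float** arraycheck)
{
    for (int i = 0; i < rows1; i++)
        for (int k = 0; k < columns1; k++)
        {
            /* array1[i][k] doesn't change in the j loop: broadcast it once */
            __m128 a1ik = _mm_set1_ps(array1[i][k]);
            for (int j = 0; j < columns2; j += 4)
            {
                /* read-modify-write: ac[i][j+0..3] += a1[i][k] * a2[k][j+0..3] */
                __m128 prod = _mm_mul_ps(a1ik, _mm_loadu_ps(&array2[k][j]));
                __m128 acc  = _mm_loadu_ps(&arraycheck[i][j]);
                _mm_storeu_ps(&arraycheck[i][j], _mm_add_ps(acc, prod));
            }
        }
    return arraycheck;
}

Each output element sees its k terms multiplied and added in exactly the same order as the scalar loop, which is why the results can match bit-for-bit.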
Also note that you need to use more than one accumulator to saturate the execution units of a CPU. I'd suggest 8, to saturate Skylake's 4c latency, one per 0.5c throughput addps. Haswell has 3c latency, one per 1c addps, but Skylake dropped the separate FP add unit and does it in the FMA unit. (See the x86 tag wiki, esp. Agner Fog's guides.)

Actually, since my suggested change doesn't use a single accumulator at all, every loop iteration accesses independent memory. You'll need a bit of loop unrolling to saturate the FP execution units with two loads and a store in the loop (even though you only need two pointers, since the store is back into the same location as one of the loads). But anyway, if your data fits in L1 cache, out-of-order execution should keep the execution units pretty well supplied with work from separate iterations.
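To make the multiple-accumulator point concrete, here is a hypothetical sketch for the case where you do keep running vector sums (a plain dot product like the question's inner loop; the function name and parameters are my own). It uses 4 accumulators for brevity where I suggested 8 above, and assumes n is a multiple of 16:

#include <xmmintrin.h>

float dot_multi_acc(const float *x, const float *y, int n)
{
    __m128 acc0 = _mm_setzero_ps(), acc1 = _mm_setzero_ps();
    __m128 acc2 = _mm_setzero_ps(), acc3 = _mm_setzero_ps();
    for (int i = 0; i < n; i += 16)
    {
        /* 4 independent dependency chains hide the addps latency */
        acc0 = _mm_add_ps(acc0, _mm_mul_ps(_mm_loadu_ps(x + i),      _mm_loadu_ps(y + i)));
        acc1 = _mm_add_ps(acc1, _mm_mul_ps(_mm_loadu_ps(x + i + 4),  _mm_loadu_ps(y + i + 4)));
        acc2 = _mm_add_ps(acc2, _mm_mul_ps(_mm_loadu_ps(x + i + 8),  _mm_loadu_ps(y + i + 8)));
        acc3 = _mm_add_ps(acc3, _mm_mul_ps(_mm_loadu_ps(x + i + 12), _mm_loadu_ps(y + i + 12)));
    }
    __m128 s = _mm_add_ps(_mm_add_ps(acc0, acc1), _mm_add_ps(acc2, acc3));
    /* horizontal sum of the 4 lanes of s */
    s = _mm_add_ps(s, _mm_movehl_ps(s, s));
    s = _mm_add_ss(s, _mm_shuffle_ps(s, s, 1));
    return _mm_cvtss_f32(s);
}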
If you really care about performance, you'll make an FMA version, and maybe an AVX-without-FMA version for Sandybridge. You could be doing two 256b FMAs per clock instead of one 128b add and mul per clock. (And of course you're not even getting that, because you bottleneck on latency unless the loop is short enough for the out-of-order window to see independent instructions from the next iteration.)

You're going to need "loop tiling", aka "cache blocking", to make this not suck for big problem sizes. This is a matrix multiply, right? There are very good libraries for this, which are tuned for cache sizes, and will beat the pants off a simple attempt like this. e.g. ATLAS was good last time I checked, but that was several years ago.

Use intrinsics, unless you write the whole function in asm. Compilers "understand" what they do, so can make nice optimizations, like loop unrolling when appropriate.
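And, going back to the FMA suggestion, a hypothetical AVX+FMA variant of the j-vectorized loop sketched earlier (compile with AVX and FMA enabled; assumes columns2 is a multiple of 8; the function name is my own). Note that FMA rounds the multiply-add once, so its results will differ slightly from both the separate mul+add version and the scalar C:

#include <immintrin.h>      /* AVX + FMA intrinsics */

float** fma_multiplication(float** array1, float** array2, float** arraycheck)
{
    for (int i = 0; i < rows1; i++)
        for (int k = 0; k < columns1; k++)
        {
            __m256 a1ik = _mm256_set1_ps(array1[i][k]);   /* broadcast once */
            for (int j = 0; j < columns2; j += 8)
            {
                /* ac[i][j+0..7] = fma(a1[i][k], a2[k][j+0..7], ac[i][j+0..7]) */
                __m256 acc = _mm256_loadu_ps(&arraycheck[i][j]);
                acc = _mm256_fmadd_ps(a1ik, _mm256_loadu_ps(&array2[k][j]), acc);
                _mm256_storeu_ps(&arraycheck[i][j], acc);
            }
        }
    return arraycheck;
}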