different results with and without SSE (float arrays multiplication)


Problem description


I have two functions that multiply 2D arrays. One of them uses SSE; the other has no optimization. Both functions work well, but the results are slightly different: for example, 20.333334 vs. 20.333332.

Can you explain why the results are different, and what I can do so that both functions produce the same result?

function with SSE

float** sse_multiplication(float** array1, float** array2, float** arraycheck)
{
    int i, j, k;
    float *ms1, *ms2, result;
    float *end_loop;

    for( i = 0; i < rows1; i++)
    {
        for( j = 0; j < columns2; j++)
        {
            result = 0;
            ms1 = array1[i];
            ms2 = array2[j];
            end_loop = &array1[i][columns1];

            __asm{
                     mov rax, ms1
                     mov rbx, ms2
                     mov rdx, end_loop
                     xorps xmm2, xmm2

                loop:
                     movups xmm0, [rax]
                     movups xmm1, [rbx]
                     movups xmm3, [rax+16]
                     movups xmm4, [rbx+16]

                     mulps xmm0, xmm1
                     mulps xmm3, xmm4

                     addps xmm2, xmm0

                     add rax, 32
                     add rbx, 32

                     cmp rdx, rax
                     jne loop

                     haddps xmm2, xmm2
                     haddps xmm2, xmm2

                     movups result, xmm2
               }

             arraycheck[i][j] = result;
        }
    }
    return arraycheck;
}

function without any optimization

float** multiplication(float** array1, float** array2, float** arraycheck)
{
    for (int i = 0; i < rows1; i++)
        for (int j = 0; j < columns2; j++)
            for (int k = 0; k < rows1; k++)
                arraycheck[i][j] += array1[i][k] * array2[k][j];

    return arraycheck;
}

Solution

FP addition is not perfectly associative, so a different order of operations produces slightly different rounding errors.

Your C sums elements in order. (Unless you use -ffast-math to allow the compiler to make the same assumption you did, that FP operations are close enough to associative).

Your asm sums up every 4th element at 4 different offsets, then horizontally sums those. The sum in each vector element is rounded differently at each point.
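This is easy to demonstrate in plain C (a minimal illustration, not the question's code): summing the same four values left-to-right versus pairwise, the way a horizontal sum combines independent lane accumulators, can give different answers.

```c
#include <assert.h>

/* Sum four floats left-to-right, the order the scalar C loop uses. */
static float sum_sequential(const float *v) {
    float s = 0.0f;
    for (int i = 0; i < 4; i++)
        s += v[i];
    return s;
}

/* Sum the same four floats pairwise, the order a haddps-style
 * horizontal sum of independent lane accumulators uses. */
static float sum_pairwise(const float *v) {
    return (v[0] + v[1]) + (v[2] + v[3]);
}
```

With `v = {1e8f, 1.0f, -1e8f, 1.0f}`, the sequential sum gives 1.0f while the pairwise sum gives 0.0f: each 1.0f is absorbed or preserved depending on whether it is added before or after the big values cancel.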


Your vectorized version doesn't seem to match the C version. The indexing looks different. AFAICT, the only sane way to vectorize arraycheck[i][j] += array1[i][k] * array2[k][j]; is over j. Looping over k would require strided loads from array2, and looping over i would require strided loads from array1.

Am I missing something about your asm? It's loading contiguous values from both arrays. It's also throwing away the mulps result in xmm3 every iteration of loop, so I think it's just buggy.

Since looping over j in the inner vector loop doesn't change array1[i][k], just broadcast-load it once outside the loop (_mm256_set1_ps).

However, that means doing a read-modify-write of arraycheck[i][j] for every different j value. i.e. ac[i][j + 0..3] = fma(a1[i][k], a2[k][j + 0..3], ac[i][j + 0..3]). To avoid this, you'd have to transpose one of the arrays first. (But that's O(N^2) for an NxN matrix, which is still cheaper than multiply).

This way doesn't use horizontal sums at all. (If you do need one, there are better patterns than two back-to-back haddps instructions.)

It does operations in the same order as the scalar C, so results should match exactly.
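A minimal SSE (128-bit) sketch of that suggestion, using hypothetical flat row-major arrays rather than the question's pointer-to-pointer layout:

```c
#include <assert.h>
#include <immintrin.h>

/* Hypothetical sketch (not the question's code): vectorize over j.
 * a1, a2, out are row-major n x n matrices, n a multiple of 4, and
 * out must be zero-initialized.  Broadcast a1[i][k] once, then do a
 * read-modify-write of four elements of out per step.  Every output
 * element sees its multiplies and adds in the same k order as the
 * scalar loop, so the results match it exactly. */
static void matmul_sse(const float *a1, const float *a2,
                       float *out, int n)
{
    for (int i = 0; i < n; i++) {
        for (int k = 0; k < n; k++) {
            __m128 a = _mm_set1_ps(a1[i * n + k]);  /* broadcast a1[i][k] */
            for (int j = 0; j < n; j += 4) {
                __m128 b = _mm_loadu_ps(&a2[k * n + j]);   /* a2[k][j..j+3] */
                __m128 c = _mm_loadu_ps(&out[i * n + j]);  /* current sums  */
                c = _mm_add_ps(c, _mm_mul_ps(a, b));
                _mm_storeu_ps(&out[i * n + j], c);
            }
        }
    }
}
```

The inner loop is two loads, one mul, one add, and one store, with no horizontal operations and no shuffles.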


Also note that you need to use more than one accumulator to saturate the execution units of a CPU. I'd suggest 8, to saturate Skylake's addps: 4c latency, one per 0.5c throughput. Haswell has 3c latency, one per 1c addps, but Skylake dropped the separate FP add unit and does FP adds in the FMA unit. (See the x86 tag wiki, esp. Agner Fog's guides.)
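A scalar sketch of the multiple-accumulator idea (a hypothetical helper, assuming the length is a multiple of 4); the same structure applies with vector accumulators:

```c
#include <assert.h>

/* Hypothetical sketch: a dot product with four independent
 * accumulators (assumes n is a multiple of 4).  Each accumulator's
 * add depends only on its own previous value, so four add chains can
 * be in flight at once instead of serializing on one chain's latency.
 * A vector version would use several xmm/ymm accumulators the same way. */
static float dot_4acc(const float *a, const float *b, int n)
{
    float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
    for (int i = 0; i < n; i += 4) {
        s0 += a[i]     * b[i];
        s1 += a[i + 1] * b[i + 1];
        s2 += a[i + 2] * b[i + 2];
        s3 += a[i + 3] * b[i + 3];
    }
    return (s0 + s1) + (s2 + s3);  /* combine once, outside the hot loop */
}
```

Note that the final combine changes the summation order relative to a single-accumulator loop, so this trades bit-exact reproducibility for throughput.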

Actually, since my suggested change doesn't use a single accumulator at all, every loop iteration accesses independent memory. You'll need a bit of loop unrolling to saturate the FP execution units with two loads and store in the loop (even though you only need two pointers, since the store is back into the same location as one of the loads). But anyway, if your data fits in L1 cache, out-of-order execution should keep the execution units pretty well supplied with work from separate iterations.

If you really care about performance, you'll make an FMA version, and maybe an AVX-without-FMA version for Sandybridge. You could be doing two 256b FMAs per clock instead of one 128b add and mul per clock. (And of course you're not even getting that, because you bottleneck on latency unless the loop is short enough for the out-of-order window to see independent instructions from the next iteration).

You're going to need "loop tiling", aka "cache blocking" to make this not suck for big problem sizes. This is a matrix multiply, right? There are very good libraries for this, which are tuned for cache sizes, and will beat the pants off a simple attempt like this. e.g. ATLAS was good last time I checked, but that was several years ago.
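A rough sketch of what cache blocking looks like (the tile size and the flat row-major layout are assumptions, not tuned values):

```c
#include <assert.h>

/* Hypothetical sketch of cache blocking (TILE is a placeholder; real
 * code tunes it to the L1/L2 cache size).  Working on TILE x TILE
 * sub-blocks keeps slices of a2 and out hot in cache instead of
 * streaming the whole matrices through for every row.  Because k still
 * increases monotonically for each output element, the accumulation
 * order matches the plain i/k/j loop.  Assumes row-major n x n
 * matrices and zero-initialized out. */
enum { TILE = 32 };

static void matmul_tiled(const float *a1, const float *a2,
                         float *out, int n)
{
    for (int ii = 0; ii < n; ii += TILE)
        for (int kk = 0; kk < n; kk += TILE)
            for (int jj = 0; jj < n; jj += TILE)
                for (int i = ii; i < ii + TILE && i < n; i++)
                    for (int k = kk; k < kk + TILE && k < n; k++) {
                        float a = a1[i * n + k];
                        for (int j = jj; j < jj + TILE && j < n; j++)
                            out[i * n + j] += a * a2[k * n + j];
                    }
}
```

Tuned BLAS libraries layer more on top of this (packing, register blocking, prefetch), which is why they win so decisively.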


Use intrinsics, unless you write the whole function in asm. Compilers "understand" what they do, so they can make nice optimizations, like loop unrolling when appropriate.
