Need some constructive criticism on my SSE/Assembly attempt


Question

I'm working on converting a bit of code to SSE, and while I get the correct output, it turns out to be slower than the standard C++ code.

The bit of code that I need to do this for is:

float ox = p2x - (px * c - py * s)*m;
float oy = p2y - (px * s - py * c)*m;

What I've got for SSE code is:

void assemblycalc(vector4 &p, vector4 &sc, float &m, vector4 &xy)
{
    vector4 r;
    __m128 scale = _mm_set1_ps(m);

__asm
{
    mov     eax,    p       //Load into CPU reg
    mov     ebx,    sc
    movups  xmm0,   [eax]   //move vectors to SSE regs
    movups  xmm1,   [ebx]

    mulps   xmm0,   xmm1    //Multiply the Elements

    movaps  xmm2,   xmm0    //make a copy of the array  
    shufps  xmm2,   xmm0,  0x1B //shuffle the array     

    subps   xmm0,   xmm2    //subtract the elements

    mulps   xmm0,   scale   //multiply the vector by the scale

    mov     ecx,    xy      //load the variable into cpu reg
    movups  xmm3,   [ecx]   //move the vector to the SSE regs

    subps   xmm3,   xmm0    //subtract xmm3 - xmm0

    movups  [r],    xmm3    //Save the return vector, and use elements 0 and 3
    }
}

Since it's very difficult to read the code, I'll explain what I did:

loaded vector4,   xmm0 = p  = [px, py, px, py]
mult. by vector4, xmm1 = cs = [c, c, s, s]
---------------- mult ----------------
result,           xmm0 = [px*c, py*c, px*s, py*s]

reuse result,   xmm0 = [px*c, py*c, px*s, py*s]
shuffle result, xmm2 = [py*s, px*s, py*c, px*c]
---------------- subtract ----------------
result,         xmm0 = [px*c-py*s, py*c-px*s, px*s-py*c, py*s-px*c]

reuse result,   xmm0  = [px*c-py*s, py*c-px*s, px*s-py*c, py*s-px*c]
load m vector4, scale = [m, m, m, m]
---------------- mult ----------------
result,         xmm0  = [(px*c-py*s)*m, (py*c-px*s)*m, (px*s-py*c)*m, (py*s-px*c)*m]

load xy vector4, xmm3 = [p2x, p2x, p2y, p2y]
reuse,           xmm0 = [(px*c-py*s)*m, (py*c-px*s)*m, (px*s-py*c)*m, (py*s-px*c)*m]
---------------- subtract ----------------
result,          xmm3 = [p2x-(px*c-py*s)*m, p2x-(py*c-px*s)*m, p2y-(px*s-py*c)*m, p2y-(py*s-px*c)*m]

then ox = xmm3[0] and oy = xmm3[3], so I essentially don't use xmm3[1] or xmm3[2]

I apologize for the difficulty reading this, but I'm hoping someone might be able to provide some guidance, as the standard C++ code runs in 0.001444 ms and the SSE code runs in 0.00198 ms.

Let me know if there is anything I can do to further explain/clean this up a bit. The reason I'm trying to use SSE is because I run this calculation millions of times, and it is a part of what is slowing down my current code.

Thanks in advance for any help! Brett

Answer

The usual way to do this sort of vectorization is to turn the problem "on its side". Instead of computing a single value of ox and oy, you compute four ox values and four oy values simultaneously. This minimizes wasted computation and shuffles.

In order to do this, you bundle up several x, y, p2x and p2y values into contiguous arrays (i.e. you might have an array of four values of x, an array of four values of y, etc). Then you can just do:

movups  %xmm0,  [x]
movups  %xmm1,  [y]
movaps  %xmm2,  %xmm0
mulps   %xmm0,  [c]    // cx
movaps  %xmm3,  %xmm1
mulps   %xmm1,  [s]    // sy
mulps   %xmm2,  [s]    // sx
mulps   %xmm3,  [c]    // cy
subps   %xmm0,  %xmm1  // cx - sy
subps   %xmm2,  %xmm3  // sx - cy
mulps   %xmm0,  scale  // (cx - sy)*m
mulps   %xmm2,  scale  // (sx - cy)*m
movaps  %xmm1,  [p2x]
movaps  %xmm3,  [p2y]
subps   %xmm1,  %xmm0  // p2x - (cx - sy)*m
subps   %xmm3,  %xmm2  // p2y - (sx - cy)*m
movups  [ox],   %xmm1
movups  [oy],   %xmm3

Using this approach, we compute 4 results simultaneously in 18 instructions, vs. a single result in 13 instructions with your approach. We're also not wasting any results.
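For readers who would rather not hand-write assembly, the same four-at-a-time idea can be sketched with SSE intrinsics. This is only an illustration under stated assumptions: the function name `rotate4_soa`, the separate `x`, `y`, `p2x`, `p2y`, `ox`, `oy` arrays, and the requirement that `n` be a multiple of 4 are all hypothetical and do not come from the original code. It also broadcasts scalar `c`, `s` and `m` rather than loading them from memory.

```cpp
#include <xmmintrin.h>  // SSE intrinsics (__m128, _mm_*_ps)
#include <cassert>
#include <cstddef>

// Hypothetical helper: computes
//   ox[i] = p2x[i] - (x[i]*c - y[i]*s)*m
//   oy[i] = p2y[i] - (x[i]*s - y[i]*c)*m
// for four values of i per loop iteration.
// Assumption: n is a multiple of 4. Unaligned loads/stores are used,
// so the arrays need no particular alignment.
void rotate4_soa(const float* x, const float* y,
                 const float* p2x, const float* p2y,
                 float c, float s, float m,
                 float* ox, float* oy, std::size_t n)
{
    assert(n % 4 == 0);

    // Broadcast the scalars once, outside the loop.
    const __m128 vc = _mm_set1_ps(c);
    const __m128 vs = _mm_set1_ps(s);
    const __m128 vm = _mm_set1_ps(m);

    for (std::size_t i = 0; i < n; i += 4) {
        const __m128 vx = _mm_loadu_ps(x + i);
        const __m128 vy = _mm_loadu_ps(y + i);

        // (cx - sy)*m and (sx - cy)*m, four lanes each
        const __m128 tx = _mm_mul_ps(
            _mm_sub_ps(_mm_mul_ps(vx, vc), _mm_mul_ps(vy, vs)), vm);
        const __m128 ty = _mm_mul_ps(
            _mm_sub_ps(_mm_mul_ps(vx, vs), _mm_mul_ps(vy, vc)), vm);

        _mm_storeu_ps(ox + i, _mm_sub_ps(_mm_loadu_ps(p2x + i), tx));
        _mm_storeu_ps(oy + i, _mm_sub_ps(_mm_loadu_ps(p2y + i), ty));
    }
}
```

Every lane of every register carries a useful result here, and no shuffle is needed, which is the point of turning the problem on its side.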

It could still be improved; since you would have to rearrange your data structures anyway to use this approach, you should align the arrays and use aligned loads and stores instead of unaligned ones. You should load c and s into registers and use them to process many vectors of x and y, instead of reloading them for each vector. For the best performance, two or more vectors' worth of computation should be interleaved to make sure the processor has enough work to do and to prevent pipeline stalls.
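A rough sketch of those three suggestions, written with SSE intrinsics rather than assembly, might look like the following. The function name `rotate8_aligned`, the array parameters, the 16-byte alignment requirement and the "n is a multiple of 8" requirement are assumptions made for illustration, not part of the original code. Whether the two-vector interleaving actually helps depends on the compiler and the CPU, so it should be measured rather than assumed.

```cpp
#include <xmmintrin.h>  // SSE intrinsics (__m128, _mm_*_ps)
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper: same formula as the scalar code,
//   ox = p2x - (x*c - y*s)*m,  oy = p2y - (x*s - y*c)*m,
// processing two independent 4-wide vectors (a and b) per iteration.
// Assumptions: all arrays are 16-byte aligned and n is a multiple of 8.
void rotate8_aligned(const float* x, const float* y,
                     const float* p2x, const float* p2y,
                     float c, float s, float m,
                     float* ox, float* oy, std::size_t n)
{
    assert(n % 8 == 0);
    assert(reinterpret_cast<std::uintptr_t>(x) % 16 == 0);

    // c, s and m are loaded into registers once and reused for every vector.
    const __m128 vc = _mm_set1_ps(c);
    const __m128 vs = _mm_set1_ps(s);
    const __m128 vm = _mm_set1_ps(m);

    for (std::size_t i = 0; i < n; i += 8) {
        // Aligned loads for two independent vectors.
        const __m128 xa = _mm_load_ps(x + i);
        const __m128 xb = _mm_load_ps(x + i + 4);
        const __m128 ya = _mm_load_ps(y + i);
        const __m128 yb = _mm_load_ps(y + i + 4);

        // The a and b chains do not depend on each other, so the CPU
        // can overlap their multiply/subtract latencies.
        const __m128 txa = _mm_mul_ps(
            _mm_sub_ps(_mm_mul_ps(xa, vc), _mm_mul_ps(ya, vs)), vm);
        const __m128 txb = _mm_mul_ps(
            _mm_sub_ps(_mm_mul_ps(xb, vc), _mm_mul_ps(yb, vs)), vm);
        const __m128 tya = _mm_mul_ps(
            _mm_sub_ps(_mm_mul_ps(xa, vs), _mm_mul_ps(ya, vc)), vm);
        const __m128 tyb = _mm_mul_ps(
            _mm_sub_ps(_mm_mul_ps(xb, vs), _mm_mul_ps(yb, vc)), vm);

        // Aligned stores of the results.
        _mm_store_ps(ox + i,     _mm_sub_ps(_mm_load_ps(p2x + i),     txa));
        _mm_store_ps(ox + i + 4, _mm_sub_ps(_mm_load_ps(p2x + i + 4), txb));
        _mm_store_ps(oy + i,     _mm_sub_ps(_mm_load_ps(p2y + i),     tya));
        _mm_store_ps(oy + i + 4, _mm_sub_ps(_mm_load_ps(p2y + i + 4), tyb));
    }
}
```

Callers would need to allocate the arrays with 16-byte alignment, for example with `alignas(16)` on static or stack arrays; passing a misaligned pointer to `_mm_load_ps` or `_mm_store_ps` can fault.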

(On a side note: should it be cx + sy instead of cx - sy? That would give you a standard rotation matrix)

Edit

Your comment on what hardware you're doing your timings on pretty much clears everything up: "Pentium 4 HT, 2.79GHz". That's a very old microarchitecture, on which unaligned moves and shuffles are quite slow; you don't have enough work in the pipeline to hide the latency of the arithmetic operations, and the reorder engine isn't nearly as clever as it is on newer microarchitectures.

I expect that your vector code would prove to be faster than the scalar code on i7, and probably on Core2 as well. On the other hand, doing four at a time, if you could, would be much faster still.
