Emulating variable bit-shift using only constant shifts?


Question

I'm trying to find a way to perform an indirect shift-left/right operation without actually using the variable shift op or any branches.

The particular PowerPC processor I'm working on has the quirk that a shift-by-constant-immediate, like

int ShiftByConstant( int x ) { return x << 3 ; }

is fast, single-op, and superscalar, whereas a shift-by-variable, like

int ShiftByVar( int x, int y ) { return x << y ; }

is a microcoded operation (see http://www.cellperformance.com/articles/2006/04/avoiding_microcoded_instructio.html) that takes 7-11 cycles to execute while the entire rest of the pipeline stops dead.

What I'd like to do is figure out which non-microcoded integer PPC ops the sraw (http://publib.boulder.ibm.com/infocenter/systems/index.jsp?topic=/com.ibm.aix.aixassem/doc/alangref/sraw.htm) decodes into and then issue them individually. This won't help with the latency of the sraw itself (it'll replace one op with six), but in between those six ops I can dual-dispatch some work to the other execution units and get a net gain.

I can't seem to find anywhere what μops sraw decodes into — does anyone know how I can replace a variable bit-shift with a sequence of constant shifts and basic integer operations? (A for loop or a switch or anything with a branch in it won't work because the branch penalty is even bigger than the microcode penalty.)

This needn't be answered in assembly; I'm hoping to learn the algorithm rather than the particular code, so an answer in C or a high-level language or even pseudocode would be perfectly helpful.

edit: A couple of clarifications that I should add:

  1. I'm not even a little bit worried about portability
  2. PPC has a conditional-move, so we can assume the existence of a branchless intrinsic function

    int isel( int a, int b, int c ) { return a >= 0 ? b : c; }

    (if you write out a ternary that does the same thing I'll get what you mean)

  3. integer multiply is also microcoded and even slower than sraw. :-(
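
Given the isel clarification above, one possible shape for the algorithm is to peel off the shift count one bit at a time and let isel choose between the shifted and the unshifted value. This is only a sketch under assumptions: `isel` below is a plain C++ stand-in for the PPC instruction, `ShiftLeftViaIsel` is a hypothetical name that does not appear in the question, and operands are assumed to be 32 bits wide.

```cpp
#include <cstdint>

// Stand-in for the isel intrinsic described above: a >= 0 ? b : c.
// (Assumption: on the real target this compiles to a branchless instruction.)
constexpr uint32_t isel(int32_t a, uint32_t b, uint32_t c) {
    return a >= 0 ? b : c;
}

// Hypothetical helper: shift x left by n (0..31) using only constant shifts.
// For each bit k of n, -(bit) is negative when the bit is set, so isel
// picks the copy shifted by 2^k; otherwise it keeps x unchanged.
constexpr uint32_t ShiftLeftViaIsel(uint32_t x, int n) {
    x = isel(-(n & 1),        x, x << 1);
    x = isel(-((n >> 1) & 1), x, x << 2);
    x = isel(-((n >> 2) & 1), x, x << 4);
    x = isel(-((n >> 3) & 1), x, x << 8);
    x = isel(-((n >> 4) & 1), x, x << 16);
    return x;
}

// Compile-time checks of the sketch.
static_assert(ShiftLeftViaIsel(1u, 0) == 1u, "shift by 0 leaves x unchanged");
static_assert(ShiftLeftViaIsel(1u, 5) == 32u, "1 << 5");
static_assert(ShiftLeftViaIsel(0x80000001u, 31) == 0x80000000u, "high bits fall off");
```

Every shift amount in the sketch is a compile-time constant; the variable count only ever feeds the selects. Whether this beats the microcoded op depends on how much other work can be interleaved with it.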

Solution

Here you go...

I decided to try these out as well, since Mike Acton claimed on his CellPerformance site (http://cellperformance.beyond3d.com/articles/2006/04/avoiding-microcoded-instructions-on-the-ppu.html), where he suggests avoiding the indirect shift, that this would be faster than using the CELL/PS3 microcoded shift. However, in all my tests, the microcoded version was not only faster than a full generic branch-free replacement for the indirect shift, it also took far less memory for the code (1 instruction).

The only reason I did these as templates was to get the right output for both signed (usually arithmetic) and unsigned (logical) shifts.

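// FORCEINLINE is assumed to be a project-defined force-inline macro.
template <typename T> FORCEINLINE T VariableShiftLeft(T nVal, int nShift)
{   // Handles shift counts 0-31; only the low 5 bits of nShift are used, so the count wraps at 32
    // Each mask is all ones (-1) if the corresponding bit of nShift is set, otherwise 0
    const int bMask1 = -(1 & nShift);
    const int bMask2 = -(1 & (nShift >> 1));
    const int bMask3 = -(1 & (nShift >> 2));
    const int bMask4 = -(1 & (nShift >> 3));
    const int bMask5 = -(1 & (nShift >> 4));
    // Each step keeps the shifted value where the mask is set and the original value otherwise
    nVal = (nVal & bMask1) + nVal;   // same as: nVal = ((nVal << 1) & bMask1) | (nVal & (~bMask1));
    nVal = ((nVal << (1 << 1)) & bMask2) | (nVal & (~bMask2));
    nVal = ((nVal << (1 << 2)) & bMask3) | (nVal & (~bMask3));
    nVal = ((nVal << (1 << 3)) & bMask4) | (nVal & (~bMask4));
    nVal = ((nVal << (1 << 4)) & bMask5) | (nVal & (~bMask5));
    return nVal;
}

template <typename T> FORCEINLINE T VariableShiftRight(T nVal, int nShift)
{   // Handles shift counts 0-31; only the low 5 bits of nShift are used, so the count wraps at 32
    // Signed T gives an arithmetic shift (usually), unsigned T a logical shift
    const int bMask1 = -(1 & nShift);
    const int bMask2 = -(1 & (nShift >> 1));
    const int bMask3 = -(1 & (nShift >> 2));
    const int bMask4 = -(1 & (nShift >> 3));
    const int bMask5 = -(1 & (nShift >> 4));
    nVal = ((nVal >> 1) & bMask1) | (nVal & (~bMask1));
    nVal = ((nVal >> (1 << 1)) & bMask2) | (nVal & (~bMask2));
    nVal = ((nVal >> (1 << 2)) & bMask3) | (nVal & (~bMask3));
    nVal = ((nVal >> (1 << 3)) & bMask4) | (nVal & (~bMask4));
    nVal = ((nVal >> (1 << 4)) & bMask5) | (nVal & (~bMask5));
    return nVal;
}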
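
To make the mask trick and the signed/unsigned distinction concrete, here is a compact, self-contained sketch of the same idea. `BitMask`, `DoubleIf` and `ShiftRightSketch` are hypothetical names introduced only for illustration. The sketch assumes 32-bit operands and a compiler that implements `>>` on negative signed values as an arithmetic shift (the usual behaviour, but only guaranteed since C++20).

```cpp
#include <cstdint>

// Hypothetical helper: all ones (-1) if bit k of nShift is set, else 0.
constexpr int BitMask(int nShift, int k) {
    return -(1 & (nShift >> k));
}

// The "shift left by one" step written as an add:
// x + (x & mask) is 2*x when mask is all ones and x when mask is 0.
constexpr uint32_t DoubleIf(uint32_t x, int mask) {
    return (x & mask) + x;
}

// Same structure as VariableShiftRight, written with the helper above.
// Each step keeps either the shifted or the unshifted value, chosen by mask.
template <typename T>
constexpr T ShiftRightSketch(T nVal, int nShift) {
    const int m1 = BitMask(nShift, 0);
    const int m2 = BitMask(nShift, 1);
    const int m3 = BitMask(nShift, 2);
    const int m4 = BitMask(nShift, 3);
    const int m5 = BitMask(nShift, 4);
    nVal = ((nVal >> 1)  & m1) | (nVal & ~m1);
    nVal = ((nVal >> 2)  & m2) | (nVal & ~m2);
    nVal = ((nVal >> 4)  & m3) | (nVal & ~m3);
    nVal = ((nVal >> 8)  & m4) | (nVal & ~m4);
    nVal = ((nVal >> 16) & m5) | (nVal & ~m5);
    return nVal;
}

static_assert(BitMask(5, 0) == -1 && BitMask(5, 1) == 0, "mask follows the count bit");
static_assert(DoubleIf(6u, -1) == 12u && DoubleIf(6u, 0) == 6u, "conditional doubling");
// Unsigned T: logical shift, zeros come in from the top.
static_assert(ShiftRightSketch<uint32_t>(0x80000000u, 4) == 0x08000000u, "logical");
// Signed T: arithmetic shift, the sign bit is replicated.
static_assert(ShiftRightSketch<int32_t>(-16, 2) == -4, "arithmetic");
// Only the low 5 bits of the count are used, so 36 behaves like 4.
static_assert(ShiftRightSketch<uint32_t>(0xF0u, 36) == 0x0Fu, "count wraps at 32");
```

The same bit pattern shifted by the same count gives different results depending on T, which is the reason the functions are templates rather than plain int functions.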

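EDIT: Note on isel()

I saw your isel() code on your website (http://assemblyrequired.crashworks.org/2009/01/04/fcmp-conditional-moves-for-branchless-math/comment-page-1/#comment-5026).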

// if a >= 0, return x, else y
int isel( int a, int x, int y )
{
    int mask = a >> 31; // arithmetic shift right, splat out the sign bit
    // mask is 0xFFFFFFFF if (a < 0) and 0x00 otherwise.
    return x + ((y - x) & mask);
}

FWIW, if you rewrite your isel() to do a mask and mask complement, it will be faster on your PowerPC target since the compiler is smart enough to generate an 'andc' opcode. It's the same number of opcodes but there is one fewer result-to-input-register dependency in the opcodes. The two mask operations can also be issued in parallel on a superscalar processor. It can be 2-3 cycles faster if everything is lined up correctly. You just need to change the return to this for the PowerPC versions:

return (x & (~mask)) + (y & mask);
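
As a quick sanity check that the two formulations select the same value, the sketch below places them side by side. `IselSubAdd` and `IselMaskPair` are hypothetical names used only here. It assumes 32-bit int and an arithmetic `>>` on negative values, and it does the subtract in unsigned arithmetic so that wraparound is well defined.

```cpp
#include <cstdint>

// Subtract/and/add form: each operation needs the result of the previous one.
constexpr int32_t IselSubAdd(int32_t a, int32_t x, int32_t y) {
    const uint32_t mask = static_cast<uint32_t>(a >> 31);  // all ones if a < 0, else 0
    const uint32_t ux = static_cast<uint32_t>(x);
    const uint32_t uy = static_cast<uint32_t>(y);
    return static_cast<int32_t>(ux + ((uy - ux) & mask));
}

// Mask / mask-complement form: the two ANDs do not depend on each other,
// so they can issue in parallel, and (x & ~mask) can map to a single andc.
constexpr int32_t IselMaskPair(int32_t a, int32_t x, int32_t y) {
    const int32_t mask = a >> 31;                           // all ones if a < 0, else 0
    return (x & ~mask) + (y & mask);
}

// Both forms return x when a >= 0 and y when a < 0.
static_assert(IselSubAdd(5, 10, 20) == 10 && IselMaskPair(5, 10, 20) == 10, "a > 0");
static_assert(IselSubAdd(0, 10, 20) == 10 && IselMaskPair(0, 10, 20) == 10, "a == 0");
static_assert(IselSubAdd(-5, 10, 20) == 20 && IselMaskPair(-5, 10, 20) == 20, "a < 0");
```

In the second form one of the two AND results is always zero, so the final add cannot overflow and could equally be an OR.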
