指令数量最少的最快整数平方根 [英] Fastest Integer Square Root in the least amount of instructions

查看:76
本文介绍了指令数量最少的最快整数平方根的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我需要不涉及任何明确除法的快速整数平方根.目标RISC体系结构可以在一个周期内执行 add mul sub shift 之类的操作(嗯-运算的结果确实是在第三个周期中写的-但存在交织),因此使用这些操作且速度很快的任何Integer算法都将受到赞赏.

这就是我现在所拥有的,并且我认为二进制搜索应该更快,因为下面的循环每次执行16次(无论值如何).我尚未对其进行广泛的调试(但很快),所以也许有可能在那早退:

  unsigned short int int_sqrt32(unsigned int x){无符号short int res = 0;无符号short int add = 0x8000;我for(i = 0; i< 16; i ++){unsigned short int temp = res |添加;无符号整数g2 = temp * temp;如果(x> = g2){res = temp;}加>> = 1;}返回资源;} 

类似于[在目标RISC的上下文中]以上的当前性能成本是5条指令(位集,多路,比较,存储,移位)的循环.缓存中可能没有空间可以完全展开(但这肯定是部分展开的主要候选对象(例如,肯定是4而不是16的循环)).因此,成本为16 * 5 = 80条指令(如果不展开,则会增加循环开销).如果完全交错,则只需花费80个周期(最后一条指令为+2)周期.

我能否在82个周期内获得其他一些sqrt实现(仅使用add,mul,bitshift,store/cmp)?

常见问题解答:

  • 您为什么不依靠编译器来生成良好的快速代码?

    该平台没有可用的C→RISC编译器.我将把当前的参考C代码移植到手写的RISC ASM中.

  • 您是否配置了代码以查看 sqrt 是否实际上是瓶颈?

    不,没有必要.目标RISC芯片约为20 MHz,因此每一个指令都在计数.使用此 sqrt 的核心循环(计算射击者和接收者补丁之间的能量传递形状因数)将在每个渲染帧中运行约1000次(当然,假设它足够快)),最高每秒60,000次,整个演示大约1,000,000次.

  • 您是否尝试过优化算法以删除 sqrt ?

    是的,我已经做到了.实际上,我已经摆脱了2个 sqrt 和许多除法(通过移位删除或替换).即使在我的千兆赫笔记本电脑上,我也可以看到性能上的巨大提升(与参考浮动版本相比).

  • 什么是应用程序?

    这是compo演示的实时渐进细化光能传递渲染器.想法是每帧有一个拍摄周期,因此它在每个渲染帧上都可以收敛并看起来更好(例如,每秒60次,尽管SW光栅化器可能不会那么快[但是至少它可以运行在另一块与RISC并行的芯片上-因此,如果需要2-3帧渲染场景,则RISC将并行处理2-3帧的光能传递数据]).

  • 您为什么不直接在目标ASM中工作?

    由于光能传递是一种涉及程度很小的算法,因此我需要即时编辑&继续Visual Studio的调试功能.上周末我在VS中所做的工作(几百次代码更改,将浮点数学转换为仅整数)将使我在目标平台上花费6个月的时间,仅进行打印调试."

  • 为什么不能使用除法?

    因为在目标RISC上它比以下任何一项都要慢16倍:多,加,减,移位,比较,加载/存储(仅需1个周期).因此,只有在绝对需要时才使用它(不幸的是,已经有几次不能使用变速功能了.).

  • 可以使用查询表吗?

    引擎已经需要其他LUT,并且从主RAM复制到RISC的小缓存非常昂贵(而且绝对不是每一帧).但是,如果它可以使我至少提高100-200%的 sqrt .

  • ,我也许可以保留128-256字节.
  • sqrt 的值范围是什么?

    我设法将其简化为无符号的32位int(4,294,967,295)

解决方案

此处.

例如,在3(a)处存在此方法,该方法可以轻松适应64-> 32位平方根,也可以轻松转录为汇编程序:

 /*作者Jim Ulery */静态无符号julery_isqrt(无符号长值){无符号长时间,g = 0,b = 0x8000,bshft = 15;做 {如果(val> =(temp =(((g<< 1)+ b)<< bshft--))){g + = b;val-= temp;}} while(b>> = 1);返回g;} 

无除法,无乘法,仅移位.但是,所花费的时间将是无法预测的,特别是如果您使用分支(在ARM RISC上有条件的指令会起作用).

通常,此页面列出了计算平方根的方法.如果您碰巧想产生一个快速的 inverse 平方根(即 x **(-0.5)),或者只是对以惊人的方式优化代码感兴趣,看看.

I am in a need of fast integer square root that does not involve any explicit division. The target RISC architecture can do operations like add, mul, sub, shift in one cycle (well - the operation's result is written in third cycle, really - but there's interleaving), so any Integer algorithm that uses these ops and is fast would be very appreciated.

This is what I have right now and I'm thinking that a binary search should be faster, since the following loop executes 16 times every single time (regardless of the value). I haven't debugged it extensively yet (but soon), so perhaps it's possible to have an early exit there:

unsigned short int int_sqrt32(unsigned int x)
{
    unsigned short int res=0;
    unsigned short int add= 0x8000;   
    int i;
    for(i=0;i<16;i++)
    {
        unsigned short int temp=res | add;
        unsigned int g2=temp*temp;      
        if (x>=g2)
        {
            res=temp;           
        }
        add>>=1;
    }
    return res;
}

Looks like the current performance cost of the above [in the context of the target RISC] is a loop of 5 instructions (bitset, mul, compare, store, shift). Probably no space in cache to unroll fully (but this will be the prime candidate for a partial unroll [e.g. A loop of 4 rather than 16], for sure). So, the cost is 16*5 = 80 instructions (plus loop overhead, if not unrolled). Which, if fully interleaved, would cost only 80 (+2 for last instruction) cycles.

Can I get some other sqrt implementation (using only add, mul, bitshift, store/cmp) under 82 cycles?

FAQ:

  • Why don't you rely on the compiler to produce a good fast code?

    There is no working C → RISC compiler for the platform. I will be porting the current reference C code into hand-written RISC ASM.

  • Did you profile the code to see if sqrt is actually a bottleneck?

    No, there is no need for that. The target RISC chip is about twenty MHz, so every single instruction counts. The core loop (calculating the energy transfer form factor between the shooter and receiver patch), where this sqrt is used, will be run ~1,000 times each rendering frame (assuming it will be fast enough, of course), up to 60,000 per second, and roughly 1,000,000 times for whole demo.

  • Have you tried to optimize the algorithm to perhaps remove the sqrt?

    Yes, I did that already. In fact, I got rid of 2 sqrts already and lots of divisions (removed or replaced by shifting). I can see a huge performance boost (compared to the reference float version) even on my gigahertz notebook.

  • What is the application?

    It's a real-time progressive-refinement radiosity renderer for the compo demo. The idea is to have one shooting cycle each frame, so it would visibly converge and look better with each rendered frame (e.g. Up 60-times per second, though the SW rasterizer won't probably be that fast [but at least it can run on the other chip in parallel with the RISC - so if it takes 2-3 frames to render the scene, the RISC will have worked through 2-3 frames of radiosity data, in parallel]).

  • Why don't you work directly in target ASM?

    Because radiosity is a slightly involved algorithm and I need the instant edit & continue debugging capability of Visual Studio. What I've done over the weekend in VS (couple hundred code changes to convert the floating-point math to integer-only) would take me 6 months on the target platform with only printing debugging".

  • Why can't you use a division?

    Because it's 16-times slower on the target RISC than any of the following: mul, add, sub, shift, compare, load/store (which take just 1 cycle). So, it's used only when absolutely required (a couple times already, unfortunately, when shifting could not be used).

  • Can you use look-up tables?

    The engine needs other LUTs already and copying from main RAM to RISC's little cache is prohibitively expensive (and definitely not each and every frame). But, I could perhaps spare 128-256 Bytes if it gave me at least a 100-200% boost for sqrt.

  • What's the range of the values for sqrt?

    I managed to reduce it to mere unsigned 32-bit int (4,294,967,295)

解决方案

Have a look here.

For instance, at 3(a) there is this method, which is trivially adaptable to do a 64->32 bit square root, and also trivially transcribable to assembler:

/* by Jim Ulery */
static unsigned julery_isqrt(unsigned long val) {
    unsigned long temp, g=0, b = 0x8000, bshft = 15;
    do {
        if (val >= (temp = (((g << 1) + b)<<bshft--))) {
           g += b;
           val -= temp;
        }
    } while (b >>= 1);
    return g;
}

No divisions, no multiplications, bit shifts only. However, the time taken will be somewhat unpredictable particularly if you use a branch (on ARM RISC conditional instructions would work).

In general, this page lists ways to calculate square roots. If you happen to want to produce a fast inverse square root (i.e. x**(-0.5) ), or are just interested in amazing ways to optimise code, take a look at this, this and this.

这篇关于指令数量最少的最快整数平方根的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆