如何仅在一个时钟周期内获得 32 位输入的平方根? [英] How to get a square root for 32 bit input in one clock cycle only?

查看:18
本文介绍了如何仅在一个时钟周期内获得 32 位输入的平方根?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我想用 Verilog 设计一个可综合的模块,它只需要一个周期来计算给定的 32 位输入的平方根.

解决方案

[Edit1] 修复代码

最近发现结果不正常,即使测试确定一切正常,所以我深入挖掘并发现我的方程中有一个愚蠢的错误,并且由于与我的 pgm 环境的名称冲突,测试得到了误报,所以我忽略了它前.现在它可以在所有情况下正常工作.

我能想到的最好的事情(除了近似或大型LUT)是没有乘法的二分搜索,这里是C++ 代码:

//---------------------------------------------------------------------字 u32_sqrt(DWORD xx)//16 T{双字 x,m,a0,a1,i;const DWORD lut[16]={//    毫米0x40000000,0x10000000,0x04000000,0x01000000,0x00400000,0x00100000,0x00040000,0x00010000,0x00004000,0x00001000,0x00000400,0x00000100,0x00000040,0x00000010,0x00000004,0x00000001,};对于 (x=0,a0=0,m=0x8000,i=0;m;m>>=1,i++){a1=a0+lut[i]+(x<<(16-i));如果 (a1<=xx) { a0=a1;x|=m;}}返回 x;}//---------------------------------------------------------------------------

标准二分查找sqrt(xx) 是将x 的位从MSB 设置为LSB 以便结果x*x <= xx.幸运的是,我们可以通过简单地将事物重写为递增乘法来避免乘法……在每次迭代中,旧的 x*x 结果可以这样使用:

x1 = x0+mx1*x1 = (x0+m)*(x0+m) = (x0*x0) + (2*m*x0) + (m*m)

其中 x0 是上次迭代的 x 值,x1 是实际值.m 是实际处理位的权重.(2*m)(m*m) 是常数,可以用作 LUT 和位移,所以不需要相乘.只需要添加.可悲的是,迭代绑定到顺序计算,禁止并行化,所以结果最好是 16T.

在代码中a0代表最后的x*xa1代表实际迭代的x*x

如您所见,sqrt 是在 16 x (BitShiftLeft,BitShiftRight,OR,Plus,Compare) 中完成的,其中位移和 LUT 可以是硬接线.

如果与其他门相比,您有超快的门,您可以将输入时钟乘以 16 并将其用作 SQRT 模块的内部时序.类似于过去在旧的 Intel CPU/MCU s 中有 MC 时钟作为源 CPU 时钟的分工的东西......这种方式您可以获得 1T 计时(或它的倍数取决于乘法比).

I want to design a synthesizable module in Verilog which will take only one cycle in calculating square root of given input of 32 bit.

解决方案

[Edit1] repaired code

Recently found the results where off even if tests determine all was OK so I dig deeper and found out that I had a silly bug in my equation and due to name conflicts with my pgm environment the tests got false positives so I overlooked it before. Now it work in all cases as it should.

The best thing I can think of (except approximation or large LUT) is binary search without multiplication, here C++ code:

//---------------------------------------------------------------------------
WORD u32_sqrt(DWORD xx) // 16 T
    {
    DWORD x,m,a0,a1,i;
    const DWORD lut[16]=
        {
        //     m*m
        0x40000000,
        0x10000000,
        0x04000000,
        0x01000000,
        0x00400000,
        0x00100000,
        0x00040000,
        0x00010000,
        0x00004000,
        0x00001000,
        0x00000400,
        0x00000100,
        0x00000040,
        0x00000010,
        0x00000004,
        0x00000001,
        };
    for (x=0,a0=0,m=0x8000,i=0;m;m>>=1,i++)
        {
        a1=a0+lut[i]+(x<<(16-i));
        if (a1<=xx) { a0=a1; x|=m; }
        }
    return x;
    }
//---------------------------------------------------------------------------

Standard binary search sqrt(xx) is setting bits of x from MSB to LSB so that result of x*x <= xx. Luckily we can avoid the multiplication by simply rewrite the thing as incrementing multiplicant... in each iteration the older x*x result can be used like this:

x1 = x0+m
x1*x1 = (x0+m)*(x0+m) = (x0*x0) + (2*m*x0) + (m*m)

Where x0 is value of x from last iteration and x1 is actual value. The m is weight of actual processed bit. The (2*m) and (m*m) are constant and can be used as LUT and bit-shift so no need to multiply. Only addition is needed. Sadly the iteration is bound to sequential computation forbid paralelisation so the result is 16T at best.

In the code a0 represents last x*x and a1 represents actual iterated x*x

As you can see the sqrt is done in 16 x (BitShiftLeft,BitShiftRight,OR,Plus,Compare) where the bit shift and LUT can be hardwired.

If you got super fast gates for this in comparison to the rest you can multiply the input clock by 16 and use that as internal timing for SQRT module. Something similar to the old days when there was MC clock as Division of source CPU clock in old Intel CPU/MCUs ... This way you can get 1T timing (or multiple of it depends on the multiplication ratio).

这篇关于如何仅在一个时钟周期内获得 32 位输入的平方根?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆