Converting SIGNED fractions to UNSIGNED fixed point for addition and multiplication


Problem description

How can we convert floating-point numbers to their fixed-point representations, and use those representations in fixed-point operations such as addition and multiplication? The result of a fixed-point operation must yield the correct answer when converted back to floating point.

Say:

(double)(xa_double) + (double)(xb_double) = ?

Then we convert both addends to a fixed point representation (integer),

(int)(xa_fixed) + (int)(xb_fixed) = (int) (xsum_fixed)

To get (double)(xsum_double), we convert (int)(xsum_fixed) back to floating point, which should yield the same answer,

FixedToDouble(xsum_fixed) => xsum_double

Specifically, if the values of xa_double and xb_double lie between -1.65 and 1.65, I want to convert xa_double and xb_double to their respective 10-bit fixed-point representations (0x0000 to 0x03FF).

WHAT I HAVE TRIED

#include <math.h> // round

int fixed_MAX = 1023;
int fixed_MIN = 0;
double Value_MAX = 1.65;
double Value_MIN = -1.65;

double slope = ((fixed_MAX) - (fixed_MIN))/((Value_MAX) - (Value_MIN));

int DoubleToFixed(double x)
{
    return (int)round(((x) - Value_MIN)*slope + fixed_MIN); //via interpolation method
}

double FixedToDouble(int x)
{
    return (double)((((x) - fixed_MIN)/slope) + Value_MIN); //inverse of DoubleToFixed
}

int sum_fixed(int x, int y)
{
    return (x + y - (1.65*slope)); //analysis, just basic math
}

int subtract_fixed(int x, int y)
{
    return (x - y + (1.65*slope));
}

int product_fixed(int x, int y)
{
    return (((x * y) - (slope*slope*((1.65*FixedToDouble(x)) + (1.65*FixedToDouble(y)) + (1.65*1.65))) + (slope*slope*1.65)) / slope);
}

And if I want to add (double)(1.00) + (double)(2.00), which should yield (double)(3.00),

With my code,

xsum_fixed = sum_fixed(DoubleToFixed(1.00), DoubleToFixed(2.00));
xsum_double = FixedToDouble(xsum_fixed);

I get the answer:

xsum_double = 3.001613

which is very close to the correct answer, (double)(3.00).

Also, if I perform multiplication and subtraction I get 2.004839 and -1.001613, respectively.

HERE'S THE CATCH:

So I know my code is working, but how can I perform addition, multiplication and subtraction on these fixed-point representations without having INTERNAL FLOATING POINT OPERATIONS AND NUMBERS?

So in the code above, the functions sum_fixed, product_fixed, and subtract_fixed have internal floating point numbers (slope and 1.65, 1.65 being the MAX float input). I derived my code by basic math, really.

So I want to implement add, subtract, and product functions without any internal floating point operations or numbers.
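One way to remove the floating-point constants is sketched below. It assumes the mapping stays exactly as defined above, so that slope = 1023/3.3 = 310 and the zero offset is 1.65*310 = 511.5. Because the offset is a half-integer, the sketch keeps it doubled (1023) and works with doubled values. The names sum_fixed_int, subtract_fixed_int, product_fixed_int and div_round are hypothetical and do not come from the original post.

```c
/* Integer-only arithmetic on the offset representation
 * f = round((x + 1.65) * 310), i.e. x = (f - 511.5) / 310.
 * Assumption: slope is exactly 310 and the offset exactly 511.5,
 * which holds for fixed_MAX = 1023 and a range of -1.65..1.65.
 * No clamping to 0..1023 is done here. */

#define SLOPE        310L   /* 1023 / 3.3 */
#define TWICE_OFFSET 1023L  /* 2 * 511.5, kept doubled so it is an integer */

/* Divide n by a positive d, rounding to nearest (halves away from zero). */
static long div_round(long n, long d)
{
    return (n >= 0) ? (n + d / 2) / d : -((-n + d / 2) / d);
}

/* Exact result is x + y - 511.5; truncates toward zero like the original. */
int sum_fixed_int(int x, int y)
{
    return (int)((2L * x + 2L * y - TWICE_OFFSET) / 2);
}

/* Exact result is x - y + 511.5; truncates toward zero like the original. */
int subtract_fixed_int(int x, int y)
{
    return (int)((2L * x - 2L * y + TWICE_OFFSET) / 2);
}

/* Exact result is (x - 511.5) * (y - 511.5) / 310 + 511.5.
 * Multiplying numerator and denominator by 4 removes the halves. */
int product_fixed_int(int x, int y)
{
    long a = 2L * x - TWICE_OFFSET;  /* 2 * (x - 511.5) */
    long b = 2L * y - TWICE_OFFSET;  /* 2 * (y - 511.5) */
    return (int)div_round(a * b + TWICE_OFFSET * 2L * SLOPE, 4L * SLOPE);
}
```

For the codes 822 and 1132 (the values the mapping above appears to give for 1.00 and 2.00, judging by the reported results), these functions return 1442, 201 and 1133, which FixedToDouble turns back into 3.001613, -1.001613 and 2.004839.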

UPDATE:

I also found simpler code for converting fractional numbers to fixed point:

const int scale = 16; // number of fraction bits: resolution is 1/2^16 in 32 bits

#define DoubleToFixed(x) ((int)((x) * (double)(1<<scale)))
#define FixedToDouble(x) ((double)(x) / (double)(1<<scale))
#define FractionMask ((1<<scale) - 1) // assumed definition; not shown in the original snippet
#define FractionPart(x) ((x) & FractionMask)

#define MUL(x,y) (((long long)(x)*(long long)(y)) >> scale)
#define DIV(x, y) (((long long)(x)<<scale)/(y))

However, this converts only UNSIGNED fractions to UNSIGNED fixed point, and I want to convert SIGNED fractions (-1.65 to 1.65) to UNSIGNED fixed point (0x0000 to 0x03FF). How can I do this using the code above? Do the range or the number of bits have anything to do with the conversion process? Is this code only for positive fractions?

credits to @chux
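The range and the bit width do determine the format. As a hedged sketch of one possible answer to the questions above: values in -1.65..1.65 need 2 integer bits (sign included), which leaves 8 fraction bits in a 10-bit word. A power-of-two scale of 256 can then replace the slope of 310, and adding a bias of 512 turns the signed value into an unsigned code in 0x0000..0x03FF. The choice of format (Q2.8 with bias 512) and all names below are assumptions for illustration. The resolution becomes 1/256 and the representable range is -2.0 to just under 2.0.

```c
#define FRAC_BITS 8                 /* fraction bits (assumed Q2.8 format) */
#define ONE       (1 << FRAC_BITS)  /* signed code of 1.0 before biasing: 256 */
#define BIAS      512               /* maps signed -512..511 to unsigned 0..1023 */

/* Conversions may use floating point; the arithmetic below does not. */
unsigned q28_encode(double x)
{
    int q = (int)(x * ONE + (x >= 0 ? 0.5 : -0.5));  /* round to nearest */
    return (unsigned)(q + BIAS);
}

double q28_decode(unsigned u)
{
    return (double)((int)u - BIAS) / ONE;
}

/* The bias appears twice in a + b, so remove one copy. */
unsigned q28_add(unsigned a, unsigned b) { return a + b - BIAS; }

/* The bias cancels in a - b, so add it back. */
unsigned q28_sub(unsigned a, unsigned b) { return a - b + BIAS; }

/* Remove the bias, multiply as signed integers, rescale, re-bias.
 * No overflow check or saturation is done in this sketch. */
unsigned q28_mul(unsigned a, unsigned b)
{
    int p = ((int)a - BIAS) * ((int)b - BIAS);
    return (unsigned)(p / ONE + BIAS);  /* division truncates toward zero */
}
```

With this layout 1.0 encodes to 768 and -1.65 to 90. Results that fall outside -2.0..2.0 wrap around, so a real implementation would probably clamp them.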

Solution

You can make the mantissa of the floating-point representation of your number equal to its fixed-point representation. Because floating-point addition shifts the smaller operand's mantissa until both operands have the same exponent, adding a suitable 'magic number' forces that alignment. For double, it is 1<<(52-precision), where 52 is the size of double's mantissa and 'precision' is the required number of binary fraction digits. The conversion would look like this:

union { double f; long long i; } u = { xfloat + (double)(1ll << (52 - precision)) }; // shift x's mantissa
long long xfixed = u.i & ((1ll << 52) - 1); // extract the mantissa

After that you can use xfixed in integer math (for multiplication, you have to shift the result right by 'precision' bits). To convert it back to double, simply multiply it by 1.0/(1 << precision).

Note that this does not handle negative numbers. If you need them, you have to convert them to two's-complement representation manually: first take fabs of the double, then negate the integer result if the input was negative.
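The conversion and the sign handling described above can be sketched as follows. This is a minimal sketch that assumes IEEE-754 binary64 doubles, the default round-to-nearest mode, and |x| < 2^(52 - precision). It uses memcpy in place of the union to read the bit pattern, and the function names are hypothetical.

```c
#include <stdint.h>
#include <string.h>

/* Returns x * 2^precision rounded to the nearest integer,
 * obtained from the mantissa bits via the magic-number trick. */
int64_t double_to_fixed_magic(double x, int precision)
{
    double ax = (x < 0) ? -x : x;                         /* same effect as fabs */
    double shifted = ax + (double)(1LL << (52 - precision));
    uint64_t bits;
    memcpy(&bits, &shifted, sizeof bits);                 /* read the raw bit pattern */
    int64_t mant = (int64_t)(bits & ((1ULL << 52) - 1));  /* low 52 bits = mantissa */
    return (x < 0) ? -mant : mant;                        /* restore the sign */
}

/* Convert back by multiplying with 1.0 / 2^precision. */
double fixed_to_double(int64_t f, int precision)
{
    return (double)f * (1.0 / (double)(1LL << precision));
}

/* Product of two fixed-point values. Division is used instead of a right
 * shift so that negative products behave the same on every compiler;
 * it truncates toward zero. */
int64_t fixed_mul(int64_t a, int64_t b, int precision)
{
    return (a * b) / ((int64_t)1 << precision);
}
```

With precision = 8, the inputs 1.5 and -0.75 become 384 and -192, and their fixed-point product is -288, which converts back to -1.125.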
