Compute the doubleword product (signed) of two words given the lower word product

Question

In Hacker's Delight there is an algorithm to calculate the doubleword product of two (signed) words.

The function muldws1 uses four multiplications and five additions to calculate the doubleword product of two words.

Towards the end of that code there is a commented-out line:

/* w[1] = u*v;                  // Alternative. */

This alternative uses five multiplications and four additions, i.e. it exchanges an addition for a multiplication.

But I think this alternative method can be improved. I have not said anything about hardware yet. Let's assume a hypothetical CPU which can calculate the lower word of the product of two words but not the upper word (e.g. for 32-bit words, 32x32 to the lower 32 bits). In this case it seems to me that the algorithm can be improved. Here is what I have come up with, assuming 32-bit words (the same concept would work for 64-bit words).
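To make the hardware assumption concrete, here is a minimal sketch that models the two multiply primitives in C and shows how the low word relates to the half-word products. The names mul_lo32, mul16x16 and lo_from_halves are hypothetical and chosen only for illustration; they are not from Hacker's Delight.

```c
#include <stdint.h>

/* Hypothetical primitive: 32x32 -> low 32 bits of the product. */
uint32_t mul_lo32(uint32_t a, uint32_t b)
{
    return a * b;                       /* unsigned multiply wraps mod 2^32 */
}

/* Hypothetical primitive: unsigned 16x16 -> full 32-bit product. */
uint32_t mul16x16(uint16_t a, uint16_t b)
{
    return (uint32_t)a * (uint32_t)b;   /* casts avoid promotion to signed int */
}

/* Rebuild the low word from half-word products, using
   x*y = xh*yh*2^32 + (xl*yh + xh*yl)*2^16 + xl*yl.
   Only the last two terms reach the low word. */
uint32_t lo_from_halves(uint32_t x, uint32_t y)
{
    uint16_t xl = (uint16_t)x, xh = (uint16_t)(x >> 16);
    uint16_t yl = (uint16_t)y, yh = (uint16_t)(y >> 16);
    uint32_t cross = mul16x16(xl, yh) + mul16x16(xh, yl); /* may wrap; harmless here */
    return mul16x16(xl, yl) + (cross << 16);
}
```

With a single mul_lo32 available, the low word costs one multiplication, so the remaining work is to recover the carries into the high word.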

void muldws1_improved(int w[], int32_t x, int32_t y) {
    uint16_t xl = x; int16_t xh = x >> 16;
    uint16_t yl = y; int16_t yh = y >> 16;

    uint32_t lo = (uint32_t)x * (uint32_t)y; // low 32 bits of the product
    int32_t t = xl*yh + xh*yl;

    uint16_t tl = t; int16_t th = t >>16;
    uint16_t loh = lo >> 16;

    int32_t cy = loh<tl; //carry
    int32_t hi = xh*yh + th + cy;
    w[0] = hi; w[1] = lo;
}

This uses four multiplications, three additions, and one comparison. This is a smaller improvement than I had hoped for.

Can this be improved? Is there a better way to determine the carry? I should point out that I am also assuming the hardware has no carry flag (e.g. no ADDC instruction), but words can be compared (e.g. word1<word).
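For readers unfamiliar with the comparison trick, the following is a small sketch of how a carry can be recovered without a carry flag. The helper names carry_out and add_doubleword are hypothetical; the word order w[0] = high, w[1] = low follows the functions in this question.

```c
#include <stdint.h>

/* Carry out of a 32-bit addition, found with a comparison instead of a
   carry flag: the wrapped sum is smaller than an operand exactly when
   the true sum did not fit in 32 bits. */
uint32_t carry_out(uint32_t a, uint32_t b)
{
    uint32_t sum = a + b;               /* wraps modulo 2^32 */
    return (uint32_t)(sum < a);
}

/* Two-word addition built from it: (w[0]:w[1]) += (add_hi:add_lo). */
void add_doubleword(uint32_t w[2], uint32_t add_hi, uint32_t add_lo)
{
    uint32_t cy = carry_out(w[1], add_lo);
    w[1] += add_lo;
    w[0] += add_hi + cy;
}
```

Each recovered carry costs one comparison, which is why the operation counts in this question list comparisons separately from additions.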

Edit: As Sander De Dycker pointed out, my function fails the unit tests. Here is a version which passes the unit tests, but it is less efficient. I think it can be improved.

void muldws1_improved_v2(int w[], int32_t x, int32_t y) {
    uint16_t xl = x; int16_t xh = x >> 16;
    uint16_t yl = y; int16_t yh = y >> 16;

    uint32_t lo = (uint32_t)x * (uint32_t)y; // low 32 bits of the product
    int32_t  t2 = xl*yh;
    int32_t  t3 = xh*yl;
    int32_t  t4 = xh*yh;

    uint16_t t2l = t2; int16_t t2h = t2 >>16;
    uint16_t t3l = t3; int16_t t3h = t3 >>16;
    uint16_t loh = lo >> 16;

    uint16_t t = t2l + t3l;
    int32_t carry = (t<t2l) + (loh<t);
    int32_t hi = t4 + t2h + t3h + carry;
    w[0] = hi; w[1] = lo;
}

This uses four multiplications, five additions, and two comparisons, which is worse than the original function.
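The unit tests themselves are not shown here. As a stand-in, here is a sketch of a checker that compares a candidate against a reference built on a 64-bit type. Everything in it (mulhs_ref, mullo_ref, muldws_ref, count_mismatches, the list of test values) is hypothetical, and it assumes int is 32 bits wide so that the `int w[]` signature used above can hold both words.

```c
#include <stddef.h>
#include <stdint.h>

/* Same signature as the muldws functions above. */
typedef void (*muldws_fn)(int w[], int32_t x, int32_t y);

/* High word of the signed product, via a 64-bit type. */
int32_t mulhs_ref(int32_t x, int32_t y)
{
    uint64_t p = (uint64_t)((int64_t)x * (int64_t)y);
    return (int32_t)(uint32_t)(p >> 32);
}

/* Low word of the product (identical for signed and unsigned). */
uint32_t mullo_ref(int32_t x, int32_t y)
{
    return (uint32_t)x * (uint32_t)y;
}

/* Reference with the muldws signature: w[0] = high, w[1] = low. */
void muldws_ref(int w[], int32_t x, int32_t y)
{
    w[0] = mulhs_ref(x, y);
    w[1] = (int)mullo_ref(x, y);
}

/* Run fn over all pairs of edge-case values; return the number of mismatches. */
int count_mismatches(muldws_fn fn)
{
    static const int32_t cases[] = {
        0, 1, -1, 2, -2, 0x7FFF, 0x8000, 0xFFFF, 0x10000,
        -0x8000, -0x8001, -0x10000, 0x12345678, -0x12345678,
        INT32_MAX, INT32_MIN
    };
    const size_t n = sizeof cases / sizeof cases[0];
    int bad = 0;
    for (size_t i = 0; i < n; i++) {
        for (size_t j = 0; j < n; j++) {
            int w[2] = {0, 0};
            fn(w, cases[i], cases[j]);
            if ((uint32_t)w[0] != (uint32_t)mulhs_ref(cases[i], cases[j]) ||
                (uint32_t)w[1] != mullo_ref(cases[i], cases[j]))
                bad++;
        }
    }
    return bad;
}
```

A candidate such as muldws1_improved_v2 can be passed to count_mismatches directly; a return value of zero only means that no mismatch was found on this small grid, not that the function is proven correct.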

Recommended answer

There were two problems with the muldws1_improved function in my question. One is that it missed a carry when I computed xl*yh + xh*yl; this is why it failed the unit tests. The other is that it contains signed*unsigned products, which require a lot more machine logic than is visible in the C code (see my edit below). I found a better solution (https://stackoverflow.com/questions/22845801/32-bit-signed-multiplication-without-using-64-bit-data-type/22847373#22847373), which is to optimize the unsigned product function muldwu1 first and then do

muldwu1(w,x,y);
w[0] -= ((x<0) ? y : 0)  + ((y<0) ? x : 0);

to correct the sign.
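Why the correction works: reading a negative x as unsigned adds 2^32 to it, so the unsigned product carries an extra 2^32*y when x < 0 and an extra 2^32*x when y < 0 (plus a 2^64 term that falls outside the doubleword). Subtracting those from the high word modulo 2^32 removes the excess. The sketch below illustrates this with hypothetical helpers; mulhu_ref uses a 64-bit type only as a stand-in for muldwu1, and the final conversion to int32_t assumes the usual two's complement behaviour.

```c
#include <stdint.h>

/* Unsigned high word; a 64-bit type stands in for muldwu1 here. */
uint32_t mulhu_ref(uint32_t x, uint32_t y)
{
    return (uint32_t)(((uint64_t)x * (uint64_t)y) >> 32);
}

/* Signed high word derived from the unsigned one. */
int32_t mulhs_from_mulhu(int32_t x, int32_t y)
{
    uint32_t ux = (uint32_t)x, uy = (uint32_t)y;
    uint32_t hi = mulhu_ref(ux, uy);
    /* Same correction as in the answer, done in unsigned arithmetic
       so that it wraps modulo 2^32. */
    hi -= (x < 0 ? uy : 0u) + (y < 0 ? ux : 0u);
    return (int32_t)hi;
}
```

The low word needs no correction, because the low 32 bits of the signed and unsigned products are identical.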

Here is my attempt at improving muldwu1 using the lower word lo = x*y (yes, this function passes the unit tests from Hacker's Delight).

void muldwu1_improved(uint32_t w[], uint32_t x, uint32_t y) {
    uint16_t xl = x; uint16_t xh = x >> 16;
    uint16_t yl = y; uint16_t yh = y >> 16;

    uint32_t lo   = x*y;    //32x32 to 32
    uint32_t t1   = (uint32_t)xl*yh;  //16x16 to 32 (cast avoids signed overflow after integer promotion)
    uint32_t t2   = (uint32_t)xh*yl;  //16x16 to 32
    uint32_t t3   = (uint32_t)xh*yh;  //16x16 to 32

    uint32_t t    = t1 + t2;
    uint32_t tl   = 0xFFFF & t;
    uint32_t th   = t >> 16;
    uint32_t loh  = lo >> 16;

    uint32_t cy   = ((t<t1) << 16) + (loh<tl); //carry
             w[1] = lo;
             w[0] = t3 + th + cy;
}

This uses one fewer addition than the original function from Hacker's Delight, but it has to do two comparisons:

 1 mul32x32 to 32
 3 mul16x16 to 32
 4 add32
 5 shift logical (or shuffles)
 1 and
 2 compare32
***********
16 operations

Edit:

I was bothered by a statement in Hacker's Delight (2nd Edition) regarding the mulhs and mulhu algorithms:

The algorithm requires 16 basic RISC instructions in either the signed or unsigned version, four of which are multiplications.

I implemented the unsigned algorithm in only 16 SSE instructions (https://stackoverflow.com/questions/28807341/simd-signed-with-unsigned-multiplication-for-64-bit-64-bit-to-128-bit/28827013#28827013), but my signed version required more instructions. I figured out why, and I can now answer my own question.

The reason I failed to find a better version than the one in Hacker's Delight is that their hypothetical RISC processor has an instruction which calculates the lower word of the product of two words. In other words, their algorithm is already optimized for this case, so it is unlikely that there is a better version than the one they already have.

The reason they list an alternative is that they assume multiplication (and division) may be more expensive than other instructions, so they left the alternative as a case to optimize for.

So the C code does not hide significant machine logic. It assumes the machine can do word * word to lower word.

Why does this matter? In their algorithm they first do

u0 = u >> 16;

and later

t = u0*v1 + k;

If u = 0x80000000 then u0 = 0xffff8000. However, if your CPU can only take half-word products to get a full word, the upper half word of u0 is ignored and you get the wrong signed result.
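The difference can be seen in a small sketch that models both kinds of multiply. The function names are hypothetical, and high_half_signed assumes that >> on a negative int32_t is an arithmetic shift, which is implementation-defined in C but is what common compilers do.

```c
#include <stdint.h>

/* Signed upper half word, as in u0 = u >> 16 (assumes arithmetic shift). */
int32_t high_half_signed(int32_t u)
{
    return u >> 16;
}

/* Word multiply keeping the low 32 bits: all 32 bits of u0 take part,
   including the sign-extension bits. */
uint32_t mul_word_lo(uint32_t u0, uint32_t v1)
{
    return u0 * v1;
}

/* Half-word multiply (16x16 -> 32): only the low 16 bits of each
   operand are seen, so the sign extension of u0 is lost. */
uint32_t mul_half(uint32_t u0, uint32_t v1)
{
    return (uint32_t)(uint16_t)u0 * (uint32_t)(uint16_t)v1;
}
```

For u0 = 0xffff8000 and v1 = 3, the word multiply yields 0xfffe8000 (that is, -32768*3 as a signed value), while the half-word multiply yields 0x00018000 (32768*3), so the two disagree whenever v1 is nonzero.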

In this case you should calculate the unsigned upper word and then correct it using hi -= ((x<0) ? y : 0) + ((y<0) ? x : 0); as I already stated.

The reason I am interested in this is that Intel's SIMD instruction sets (SSE2 through AVX2) do not have an instruction which does 64x64 to 64; they only have 32x32 to 64. That is why my signed version requires more instructions.

But AVX512 has a 64x64 to 64 instruction. Therefore with AVX512 the signed version should take the same number of instructions as the unsigned. However, since the 64x64 to 64 instruction may be much slower than the 32x32 to 64 instruction it may make more sense to do the unsigned version anyway and then correct.
