高效乘法/双128位的整数x86上的鸿沟(无64位) [英] Efficient Multiply/Divide of two 128-bit Integers on x86 (no 64-bit)

查看:346
本文介绍了高效乘法/双128位的整数x86上的鸿沟(无64位)的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

编译器:的MinGW / GCC
问题:允许没有GPL / LGPL code(GMP或与此有关的任何BIGNUM库,是大材小用对于这个问题,因为我已经有该类实现)

Compiler: MinGW/GCC
Issues: No GPL/LGPL code allowed (GMP or any bignum library for that matter, is overkill for this problem, as I already have the class implemented).

我已经构建了自己的 128位固定大小的大整数类(拟用于游戏引擎,但可以推广到任何使用情况下),我觉得目前的乘法的性能并划分操作是相当糟糕的(是的,我已经超时他们,见下文),和我想提高(或更改)做低层次的数字运算的算法。

I have constructed my own 128-bit fixed-size big integer class (intended for use in a game engine but may be generalized to any usage case) and I find the performance of the current multiply and divide operations to be quite abysmal (yes, I have timed them, see below), and I'd like to improve on (or change) the algorithms that do the low-level number crunching.

当谈到乘法和除法运算符,它们相比,只是一切在课堂上都不能忍受缓慢。

When it comes to the multiply and divide operators, they are unbearably slow compared to just about everything else in the class.

这些都是大致测量相对于我自己的电脑:

These are the approximate measurements relative to my own computer:

Raw times as defined by QueryPerformanceFrequency:
1/60sec          31080833u
Addition:              ~8u
Subtraction:           ~8u
Multiplication:      ~546u
Division:           ~4760u (with maximum bit count)

正如你所看到的,只是在做乘法是很多,很多时候慢于增加或减少。司比乘法慢10倍左右。

As you can see, just doing the multiplication is many, many times slower than add or subtract. Division is about 10 times slower than multiplication.

我想提高这两个运营商的速度,因为有可能是非常高的量的计算的每帧正在做(点产品,各种碰撞检测方法,等等)。

I'd like to improve the speed of these two operators because there may be a very high amount of computations being done per frame (dot products, various collision detection methods, etc).

结构(方法略)看起来有点像:

The structure (methods omitted) looks somewhat like:

class uint128_t
{
    public:
        unsigned long int dw3, dw2, dw1, dw0;
  //...
}

目前使用典型的长乘方式(组装,这样我能赶上做的是 EDX 输出),而忽略了走出去的范围(即,我只是做了的话10 '相比,16 S)。

Multiplication is currently done using the typical long-multiplication method (in assembly so that I can catch the EDX output) while ignoring the words that go out of range (that is, I'm only doing 10 mull's compared to 16).

使用的移减去算法(速度取决于操作数的位计数)。然而,这不是在组装完成。我发现有点太难鼓起,并决定让编译器优化。

Division uses the shift-subtract algorithm (speed depends on bit counts of the operands). However, it is not done in assembly. I found that a little too difficult to muster and decided to let the compiler optimize it.

我已经Google'd各地数天在看描述算法,如 Karatsuba的乘页,高基数师和<一href="http://en.wikipedia.org/wiki/Division_%28digital%29#Newton.E2.80.93Raphson_division">Newton-Rapson司但数学符号是有点过分了我的头。我想使用其中的一些先进的方法,以加快我的code,但我不得不翻译希腊到的东西融为一体prehensible第一。

I have Google'd around for several days looking at pages describing algorithms such as Karatsuba Multiplication, high-radix division, and Newton-Rapson Division but the math symbols are a little too far over my head. I'd like to use some of these advanced methods to speed up my code, but I'd have to translate the "Greek" into something comprehensible first.

对于那些可能认为我的努力premature优化;我认为这是code是一个瓶颈,因为在非常初级的,数学运算本身变得缓慢。我可以忽略这种类型的更高级别的code优化,但是这code将被调用/使用足以为它没关系。

For those that may deem my efforts "premature optimization"; I consider this code to be a bottleneck because the very elementary-math operations themselves become slow. I can ignore such types of optimization on higher-level code, but this code will be called/used enough for it to matter.

我想上的算法,我应该用它来改善乘法和除法(如果可能的话)的建议,并就如何建议的算法的工作基础(希望很容易理解)的解释是的的AP preciated。

I'd like suggestions on which algorithm I should use to improve multiply and divide (if possible), and a basic (hopefully easy to understand) explanation on how the suggested algorithm works would be highly appreciated.

我能够通过内联code到符* =提高了乘法运算,似乎尽可能快。

I was able to improve the multiply operation by inlining code into operator*= and it seems as fast as possible.

Updated raw times:
1/60sec          31080833u
Addition:              ~8u
Subtraction:           ~8u
Multiplication:      ~100u (lowest ~86u, highest around ~256u)
Division:           ~4760u (with maximum bit count)

下面是一些裸机code为你检查(请注意,我喜欢的类型名称实际上是不同的,这是编辑为简单起见):

Here's some bare-bones code for you to examine (note that my type names are actually different, this was edited for simplicity):

//File: "int128_t.h"
class int128_t
{
    uint32_t dw3, dw2, dw1, dw0;

    // Various constrctors, operators, etc...

    int128_t& operator*=(const int128_t&  rhs) __attribute__((always_inline))
    {
        int128_t Urhs(rhs);
        uint32_t lhs_xor_mask = (int32_t(dw3) >> 31);
        uint32_t rhs_xor_mask = (int32_t(Urhs.dw3) >> 31);
        uint32_t result_xor_mask = (lhs_xor_mask ^ rhs_xor_mask);
        dw0 ^= lhs_xor_mask;
        dw1 ^= lhs_xor_mask;
        dw2 ^= lhs_xor_mask;
        dw3 ^= lhs_xor_mask;
        Urhs.dw0 ^= rhs_xor_mask;
        Urhs.dw1 ^= rhs_xor_mask;
        Urhs.dw2 ^= rhs_xor_mask;
        Urhs.dw3 ^= rhs_xor_mask;
        *this += (lhs_xor_mask & 1);
        Urhs += (rhs_xor_mask & 1);

        struct mul128_t
        {
            int128_t dqw1, dqw0;
            mul128_t(const int128_t& dqw1, const int128_t& dqw0): dqw1(dqw1), dqw0(dqw0){}
        };

        mul128_t data(Urhs,*this);
        asm volatile(
        "push      %%ebp                            \n\
        movl       %%eax,   %%ebp                   \n\
        movl       $0x00,   %%ebx                   \n\
        movl       $0x00,   %%ecx                   \n\
        movl       $0x00,   %%esi                   \n\
        movl       $0x00,   %%edi                   \n\
        movl   28(%%ebp),   %%eax #Calc: (dw0*dw0)  \n\
        mull             12(%%ebp)                  \n\
        addl       %%eax,   %%ebx                   \n\
        adcl       %%edx,   %%ecx                   \n\
        adcl       $0x00,   %%esi                   \n\
        adcl       $0x00,   %%edi                   \n\
        movl   24(%%ebp),   %%eax #Calc: (dw1*dw0)  \n\
        mull             12(%%ebp)                  \n\
        addl       %%eax,   %%ecx                   \n\
        adcl       %%edx,   %%esi                   \n\
        adcl       $0x00,   %%edi                   \n\
        movl   20(%%ebp),   %%eax #Calc: (dw2*dw0)  \n\
        mull             12(%%ebp)                  \n\
        addl       %%eax,   %%esi                   \n\
        adcl       %%edx,   %%edi                   \n\
        movl   16(%%ebp),   %%eax #Calc: (dw3*dw0)  \n\
        mull             12(%%ebp)                  \n\
        addl       %%eax,   %%edi                   \n\
        movl   28(%%ebp),   %%eax #Calc: (dw0*dw1)  \n\
        mull              8(%%ebp)                  \n\
        addl       %%eax,   %%ecx                   \n\
        adcl       %%edx,   %%esi                   \n\
        adcl       $0x00,   %%edi                   \n\
        movl   24(%%ebp),   %%eax #Calc: (dw1*dw1)  \n\
        mull              8(%%ebp)                  \n\
        addl       %%eax,   %%esi                   \n\
        adcl       %%edx,   %%edi                   \n\
        movl   20(%%ebp),   %%eax #Calc: (dw2*dw1)  \n\
        mull              8(%%ebp)                  \n\
        addl       %%eax,   %%edi                   \n\
        movl   28(%%ebp),   %%eax #Calc: (dw0*dw2)  \n\
        mull              4(%%ebp)                  \n\
        addl       %%eax,   %%esi                   \n\
        adcl       %%edx,   %%edi                   \n\
        movl   24(%%ebp),  %%eax #Calc: (dw1*dw2)   \n\
        mull              4(%%ebp)                  \n\
        addl       %%eax,   %%edi                   \n\
        movl   28(%%ebp),   %%eax #Calc: (dw0*dw3)  \n\
        mull               (%%ebp)                  \n\
        addl       %%eax,   %%edi                   \n\
        pop        %%ebp                            \n"
        :"=b"(this->dw0),"=c"(this->dw1),"=S"(this->dw2),"=D"(this->dw3)
        :"a"(&data):"%ebp");

        dw0 ^= result_xor_mask;
        dw1 ^= result_xor_mask;
        dw2 ^= result_xor_mask;
        dw3 ^= result_xor_mask;
        return (*this += (result_xor_mask & 1));
    }
};

至于分工,检查code是相当没有意义的,因为我需要改变的数学算法来看到任何实质性的好处。唯一可行的选择似乎是高基数分裂,但我还没有化解(在我心中)刚刚的如何的它会工作。

推荐答案

我也不太担心倍增。你在做什么似乎是非常有效的。我并没有真正按照希腊的Karatsuba的乘法,但我的感觉是,这将是更有效地只比你处理更大的数字。

I wouldn't worry much about multiplication. What you're doing seems quite efficient. I didn't really follow the Greek on the Karatsuba Multiplication, but my feeling is that it would be more efficient only with much larger numbers than you're dealing with.

一个建议,我确实有就是尽量使用内联汇编的最小的块,而不是编码的逻辑组件。你可以写一个函数:

One suggestion I do have is to try to use the smallest blocks of inline assembly, rather than coding your logic in assembly. You could write a function:

struct div_result { u_int x[2]; };
static inline void mul_add(int a, int b, struct div_result *res);

该函数将内联汇编实现,你会从C + + code调用它。它应该是高效的纯汇编,并更容易code。

The function would be implemented in inline assembly, and you'll call it from C++ code. It should be as efficient as pure assembly, and much easier to code.

关于分工,我不知道。大多数算法只见谈渐近效率,这可能意味着它们是有效的仅适用于非常高的数字位。

About division, I don't know. Most algorithms I saw talk about asymptotic efficiency, which may mean they're efficient only for very high numbers of bits.

这篇关于高效乘法/双128位的整数x86上的鸿沟(无64位)的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆