Can I change this macro to an inline function without a performance hit?


Question


(EDIT: Let's title this, "Lessons in how measurements can go wrong." I still haven't figured out exactly what's causing the discrepancy though.)

I found a very fast integer square root function here by Mark Crowne. At least with GCC on my machine, it's clearly the fastest integer square root function I've tested (including the functions in Hacker's Delight, this page, and floor(sqrt()) from the standard library).

After cleaning up the formatting a bit, renaming a variable, and using fixed-width types, it looks like this:

static uint32_t mcrowne_isqrt(uint32_t val)
{
    uint32_t temp, root = 0;

    if (val >= 0x40000000)
    {
        root = 0x8000;
        val -= 0x40000000;
    }

    #define INNER_ISQRT(s)                              \
    do                                                  \
    {                                                   \
        temp = (root << (s)) + (1 << ((s) * 2 - 2));    \
        if (val >= temp)                                \
        {                                               \
            root += 1 << ((s)-1);                       \
            val -= temp;                                \
        }                                               \
    } while(0)

    INNER_ISQRT(15);
    INNER_ISQRT(14);
    INNER_ISQRT(13);
    INNER_ISQRT(12);
    INNER_ISQRT(11);
    INNER_ISQRT(10);
    INNER_ISQRT( 9);
    INNER_ISQRT( 8);
    INNER_ISQRT( 7);
    INNER_ISQRT( 6);
    INNER_ISQRT( 5);
    INNER_ISQRT( 4);
    INNER_ISQRT( 3);
    INNER_ISQRT( 2);

    #undef INNER_ISQRT

    temp = root + root + 1;
    if (val >= temp)
        root++;
    return root;
}

The INNER_ISQRT macro isn't too evil, since it's local and immediately undefined after it's no longer needed. Nevertheless, I'd still like to convert it to an inline function as a matter of principle. I've read assertions in several places (including the GCC documentation) that inline functions are "just as fast" as macros, but I've had trouble converting it without a speed hit.

My current iteration looks like this (note the always_inline attribute, which I threw in for good measure):

static inline void inner_isqrt(const uint32_t s, uint32_t& val, uint32_t& root) __attribute__((always_inline));
static inline void inner_isqrt(const uint32_t s, uint32_t& val, uint32_t& root)
{
    const uint32_t temp = (root << s) + (1 << ((s << 1) - 2));
    if(val >= temp)
    {
        root += 1 << (s - 1);
        val -= temp;
    }
}

//  Note that I just now changed the name to mcrowne_inline_isqrt, so people can compile my full test.
static uint32_t mcrowne_inline_isqrt(uint32_t val)
{
    uint32_t root = 0;

    if(val >= 0x40000000)
    {
        root = 0x8000; 
        val -= 0x40000000;
    }

    inner_isqrt(15, val, root);
    inner_isqrt(14, val, root);
    inner_isqrt(13, val, root);
    inner_isqrt(12, val, root);
    inner_isqrt(11, val, root);
    inner_isqrt(10, val, root);
    inner_isqrt(9, val, root);
    inner_isqrt(8, val, root);
    inner_isqrt(7, val, root);
    inner_isqrt(6, val, root);
    inner_isqrt(5, val, root);
    inner_isqrt(4, val, root);
    inner_isqrt(3, val, root);
    inner_isqrt(2, val, root);

    const uint32_t temp = root + root + 1;
    if (val >= temp)
        root++;
    return root;
}

No matter what I do, the inline function is always slower than the macro. The macro version commonly times at around 2.92s for (2^28 - 1) iterations with an -O2 build, whereas the inline version commonly times at around 3.25s. EDIT: I said 2^32 - 1 iterations before, but I forgot that I had changed it. They take quite a bit longer for the full gamut.

It's possible that the compiler is just being stupid and refusing to inline it (note again the always_inline attribute!), but if so, that would make the macro version generally preferable anyway. (I tried checking the assembly to see, but it was too complicated as part of a program. The optimizer omitted everything when I tried compiling just the functions of course, and I'm having issues compiling it as a library due to noobishness with GCC.)
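(For anyone who wants to look at the generated code without wading through a whole program: one way that should keep the optimizer from discarding the functions when they're compiled in isolation is to wrap them in non-static functions and compile with -S. This is just a sketch with made-up wrapper names, not something I've verified myself:)

//  isqrt_asm_check.cpp -- compile with something like:  g++ -O2 -S isqrt_asm_check.cpp
#include <stdint.h>

/*  mcrowne_isqrt() and mcrowne_inline_isqrt() definitions go here  */

//  These wrappers have external linkage, so even at -O2 the compiler has to
//  emit code for them (with the static helpers inlined into their bodies).
uint32_t macro_version_for_asm(uint32_t val)   { return mcrowne_isqrt(val); }
uint32_t inline_version_for_asm(uint32_t val)  { return mcrowne_inline_isqrt(val); }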

In short, is there a way to write this as an inline without a speed hit? (I haven't profiled, but sqrt is one of those fundamental operations that should always be made fast, since I may be using it in many other programs than just the one I'm currently interested in. Besides, I'm just curious.)

I've even tried using templates to "bake in" the constant value, but I get the feeling the other two parameters are more likely to be causing the hit (and the macro can avoid that, since it uses local variables directly)...well, either that or the compiler is stubbornly refusing to inline.
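For reference, the template attempt looked roughly like this (a sketch; inner_isqrt_tpl is just an illustrative name). The shift amount is baked in as a compile-time constant, but val and root are still passed by reference:

template <uint32_t S>
static inline void inner_isqrt_tpl(uint32_t& val, uint32_t& root)
{
    const uint32_t temp = (root << S) + (1 << (S * 2 - 2));
    if (val >= temp)
    {
        root += 1 << (S - 1);
        val -= temp;
    }
}

//  ...called as inner_isqrt_tpl<15>(val, root); inner_isqrt_tpl<14>(val, root); and so on.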


UPDATE: user1034749 below is getting the same assembly output from both functions when he puts them in separate files and compiles them. I tried his exact command line, and I'm getting the same result as him. For all intents and purposes, this question is solved.

However, I'd still like to know why my measurements are coming out differently. Obviously, my measurement code or original build process was causing things to be different. I'll post the code below. Does anyone know what the deal was? Maybe my compiler is actually inlining the whole mcrowne_isqrt() function in the loop of my main() function, but it's not inlining the entirety of the other version?

UPDATE 2 (squeezed in before testing code): Note that if I swap the order of the tests and make the inline version come first, the inline version comes out faster than the macro version by the same amount. Is this a caching issue, or is the compiler inlining one call but not the other, or what?

#include <iostream>
#include <time.h>      //  Linux high-resolution timer
#include <stdint.h>

/*  Functions go here */

timespec timespecdiff(const timespec& start, const timespec& end)
{
    timespec elapsed;
    timespec endmod = end;
    if(endmod.tv_nsec < start.tv_nsec)
    {
        endmod.tv_sec -= 1;
        endmod.tv_nsec += 1000000000;
    }

    elapsed.tv_sec = endmod.tv_sec - start.tv_sec;
    elapsed.tv_nsec = endmod.tv_nsec - start.tv_nsec;
    return elapsed;
}


int main()
{
    uint64_t inputlimit = 4294967295;
    //  Test a wide range of values
    uint64_t widestep = 16;

    timespec start, end;

    //  Time macro version:
    uint32_t sum = 0;
    clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &start);
    for(uint64_t num = (widestep - 1); num <= inputlimit; num += widestep)
    {
        sum += mcrowne_isqrt(uint32_t(num));
    }
    clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &end);
    timespec markcrowntime = timespecdiff(start, end);
    std::cout << "Done timing Mark Crowne's sqrt variant.  Sum of results = " << sum << " (to avoid over-optimization)." << std::endl;


    //  Time inline version:
    sum = 0;
    clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &start);
    for(uint64_t num = (widestep - 1); num <= inputlimit; num += widestep)
    {
        sum += mcrowne_inline_isqrt(uint32_t(num));
    }
    clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &end);
    timespec markcrowninlinetime = timespecdiff(start, end);
    std::cout << "Done timing Mark Crowne's inline sqrt variant.  Sum of results = " << sum << " (to avoid over-optimization)." << std::endl;

    //  Results:
    std::cout << "Mark Crowne sqrt variant time:\t" << markcrowntime.tv_sec << "s, " << markcrowntime.tv_nsec << "ns" << std::endl;
    std::cout << "Mark Crowne inline sqrt variant time:\t" << markcrowninlinetime.tv_sec << "s, " << markcrowninlinetime.tv_nsec << "ns" << std::endl;
    std::cout << std::endl;
}


UPDATE 3: I still have no idea how to reliably compare the timing of different functions without the timing depending on the order of the tests. I'd greatly appreciate any tips!
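One mitigation I've been pointed toward (a sketch only; I haven't confirmed it explains the discrepancy here) is to run an untimed warm-up pass over each function before its timed loop, so whichever version runs first doesn't pay the cache and branch-predictor warm-up cost alone:

//  Hypothetical warm-up helper: call the function over a small range before
//  timing it, and return the sum so the loop can't be optimized away.
template <typename SqrtFunc>
uint32_t warm_up(SqrtFunc f)
{
    uint32_t sink = 0;
    for (uint32_t num = 0; num < 100000; ++num)
        sink += f(num);
    return sink;
}

//  Usage inside main(), before each timed loop:
//      volatile uint32_t dummy = warm_up(mcrowne_isqrt);
//      clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &start);
//      ...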

However, if anyone else reading this is interested in fast sqrt implementations, I should mention: Mark Crowne's code tests faster than any other pure C/C++ version I've tried by a decent margin (despite reliability issues with testing), but the following SSE code seems like it might be a little bit faster still for a scalar 32-bit integer sqrt. It can't be generalized for full-blown 64-bit unsigned integer inputs without losing precision though (and the first signed conversion would also have to be replaced by a load intrinsic to handle values >= 2^63):

#include <emmintrin.h>   //  for the _mm_* SSE2 intrinsics used below

uint32_t sse_sqrt(uint64_t num)
{
    //  Uses 64-bit input, because SSE conversion functions treat all
    //  integers as signed (so conversion from a 32-bit value >= 2^31
    //  will be interpreted as negative).  As it stands, this function
    //  will similarly fail for values >= 2^63.
    //  It can also probably be made faster, since it generates a strange/
    //  useless movsd %xmm0,%xmm0 instruction before the sqrtsd.  It clears
    //  xmm0 first too with xorpd (seems unnecessary, but I could be wrong).
    __m128d result;
    __m128d num_as_sse_double = _mm_cvtsi64_sd(result, num);
    result = _mm_sqrt_sd(num_as_sse_double, num_as_sse_double);
    return _mm_cvttsd_si32(result);
}

Solution

I tried your code with gcc 4.5.3. I modified your second version of the code to match the first one; for example:

(1 << ((s) * 2 - 2))

vs

(1 << ((s << 1) - 1))

Yes, s * 2 == s << 1, but "-2" vs. "-1"?

Also, I modified your types, replacing uint32_t with "unsigned long", because on my 64-bit machine "long" is not a 32-bit number.

And then I ran:

g++ -ggdb -O2 -march=native -c -pipe inline.cpp
g++ -ggdb -O2 -march=native -c -pipe macros.cpp
objdump -d inline.o > inline.s
objdump -d macros.o > macros.s

I could use "-S" instead of "-c" to assembler, but I would like to see assembler without additional info.

And you know what?
The assembler is exactly the same in the first and in the second version. So I think your time measurements are just wrong.
