C++ / VS2008: Performance of Macros vs. Inline functions


Question

All,

I'm writing some performance sensitive code, including a 3d vector class that will be doing lots of cross-products. As a long-time C++ programmer, I know all about the evils of macros and the various benefits of inline functions. I've long been under the impression that inline functions should be approximately the same speed as macros. However, in performance testing macro vs inline functions, I've come to an interesting discovery that I hope is the result of me making a stupid mistake somewhere: the macro version of my function appears to be over 8 times as fast as the inline version!

First, a ridiculously trimmed down version of a simple vector class:


class Vector3d
{
public:
    double m_tX, m_tY, m_tZ;

    Vector3d() : m_tX(0), m_tY(0), m_tZ(0) {}
    Vector3d(const double &tX, const double &tY, const double &tZ):
        m_tX(tX), m_tY(tY), m_tZ(tZ) {}

    static inline void CrossAndAssign ( const Vector3d& cV1, const Vector3d& cV2, Vector3d& cV )
    {
        cV.m_tX = cV1.m_tY * cV2.m_tZ - cV1.m_tZ * cV2.m_tY;
        cV.m_tY = cV1.m_tZ * cV2.m_tX - cV1.m_tX * cV2.m_tZ;
        cV.m_tZ = cV1.m_tX * cV2.m_tY - cV1.m_tY * cV2.m_tX;
    }

#define FastVectorCrossAndAssign(cV1,cV2,cVOut) { \
    cVOut.m_tX = cV1.m_tY * cV2.m_tZ - cV1.m_tZ * cV2.m_tY; \
    cVOut.m_tY = cV1.m_tZ * cV2.m_tX - cV1.m_tX * cV2.m_tZ; \
    cVOut.m_tZ = cV1.m_tX * cV2.m_tY - cV1.m_tY * cV2.m_tX; }
};

Here's my sample benchmarking code:


#include <ctime>     // clock, clock_t
#include <iostream>

Vector3d right;
Vector3d forward(1.0,2.2,3.6);
Vector3d up(3.2,1.4,23.6);

clock_t start = clock();
for (long l=0; l < 100000000; l++)
{
    Vector3d::CrossAndAssign(forward, up, right); // static inline version
}

clock_t end = clock();
std::cout << end - start << std::endl;


clock_t start2 = clock();
for (long l=0; l<100000000; l++)
{
    FastVectorCrossAndAssign(forward, up, right); // macro version
}
clock_t end2 = clock();

std::cout << end2 - start2 << std::endl;

The end result: With optimizations turned completely off, the inline version takes 3200 ticks, and the macro version 500 ticks... With optimization turned on (/O2, maximize speed, and other speed tweaks), I can get the inline version down to 1100 ticks, which is better but still not the same.

So I appeal to all of you: is this really true? Have I made a stupid mistake somewhere? Or are inline functions really this much slower -- and if so, why?

Answer

NOTE: After posting this answer, the original question was edited to remove this problem. I'll leave the answer as it is instructive on several levels.

The loops differ in what they do!

If we manually expand the macro, we get:

for (long l=0; l<100000000; l++) 
    right.m_tX = forward.m_tY * up.m_tZ - forward.m_tZ * up.m_tY;
    right.m_tY = forward.m_tZ * up.m_tX - forward.m_tX * up.m_tZ;
    right.m_tZ = forward.m_tX * up.m_tY - forward.m_tY * up.m_tX;

Note the absence of curly brackets. So the compiler sees this as:

for (long l=0; l<100000000; l++)
{
    right.m_tX = forward.m_tY * up.m_tZ - forward.m_tZ * up.m_tY;
}
right.m_tY = forward.m_tZ * up.m_tX - forward.m_tX * up.m_tZ;
right.m_tZ = forward.m_tX * up.m_tY - forward.m_tY * up.m_tX;

Which makes it obvious why the second loop is so much faster.

Update: This is also a good example of why macros are evil :)
