C++ Optimize if/else condition


Problem description


I have a single line of code that consumes 25%-30% of the runtime of my application. It is a less-than comparator for a std::set (the set is implemented as a Red-Black tree). It is called about 180 million times within 28 seconds.

struct Entry {
  const float _cost;
  const long _id;

  // some other vars

    Entry(float cost, long id) : _cost(cost), _id(id) {
    } 
};



template<class T>
struct lt_entry: public binary_function <T, T, bool>
{
    bool operator()(const T &l, const T &r) const
    {
        // Most readable shape
        if(l._cost != r._cost) {
            return r._cost < l._cost;
        } else {
            return l._id < r._id;
        }
    }
};

The entries should be sorted by cost and, if the cost is the same, by their id. I have many insertions for each extraction of the minimum. I thought about using Fibonacci heaps, but I have been told that they are theoretically nice but suffer from high constants and are pretty complicated to implement. And since insert is in O(log(n)), the runtime increase is nearly constant for large n. So I think it's okay to stick to the set.

To improve performance I tried to express it in different shapes:

return r._cost < l._cost || (!(l._cost < r._cost) && l._id < r._id);

return r._cost < l._cost || (l._cost == r._cost && l._id < r._id);

Even this:

typedef union {
    float _f;
    int _i;
} flint;

//...

flint diff;
diff._f = (l._cost - r._cost);
return (diff._i && diff._i >> 31) || l._id < r._id;

But the compiler seems to be smart enough already, because I haven't been able to improve the runtime.

I also thought about SSE but this problem is really not very applicable for SSE...

The assembly looks somewhat like this:

movss  (%rbx),%xmm1
mov    $0x1,%r8d
movss  0x20(%rdx),%xmm0
ucomiss %xmm1,%xmm0
ja     0x410600 <_ZNSt8_Rb_tree[..]+96>
ucomiss %xmm0,%xmm1
jp     0x4105fd <_ZNSt8_Rb_[..]_+93>
jne    0x4105fd <_ZNSt8_Rb_[..]_+93>
mov    0x28(%rdx),%rax
cmp    %rax,0x8(%rbx)
jb     0x410600 <_ZNSt8_Rb_[..]_+96>
xor    %r8d,%r8d

I have a very tiny bit of experience with assembly language, but not really much.

I thought it would be the best (only?) point to squeeze out some performance, but is it really worth the effort? Can you see any shortcuts that could save some cycles?

The platform the code will run on is Ubuntu 12 with gcc 4.6 (-std=c++0x) on a many-core Intel machine. The only libraries available are Boost, OpenMP and TBB. The 30-second benchmark was performed on my 4-year-old laptop (Core 2 Duo).

I am really stuck on this one; it seems so simple, but takes that much time. I have been racking my brain for days thinking about how I could improve this line...

Can you give me a suggestion how to improve this part, or is it already at its best?

EDIT 1: After using Jerry's suggestion I achieved a speedup of ~4.5 seconds.

EDIT 2: After trying Boost Fibonacci heaps, the comparison went to 174 million calls to the less-than function.

Solution

Let me preface this with the fact that what I'm going to outline here is fragile and not entirely portable -- but under the right circumstances (which are pretty much what you've specified) I'm reasonably certain that it should work correctly.

One point it depends upon is the fact that IEEE floating point numbers are carefully designed so that if you treat their bit pattern as an integer, they'll still sort into the correct order (modulo a few things like NaNs, for which there really is no "correct order").

To make use of that, what we do is pack the Entry so there's no padding between the two pieces that make up our key. Then we ensure the structure as a whole is aligned to an 8-byte boundary. I've also changed the _id to int32_t to ensure that it stays 32 bits, even on a 64-bit system/compiler (which will almost certainly produce the best code for this comparison).

Then, we cast the address of the structure so we can view the floating point number and the integer together as a single 64-bit integer. Since you're using a little-endian processor, to support that we need to put the less significant part (the id) first, and the more significant part (the cost) second, so when we treat them as a 64-bit integer, the floating point part will become the most significant bits, and the integer part the less significant bits:

struct __attribute__((__packed__)) __attribute__((aligned(8))) Entry {
  // Do *not* reorder the following two fields or comparison will break.
  const int32_t _id;
  const float _cost;

  // some other vars

    Entry(long id, float cost) : _id(id), _cost(cost) {} 
};

Then we have our ugly little comparison function:

bool operator<(Entry const &a, Entry const &b) { 
   return *(int64_t const *)&a < *(int64_t const *)&b;
}

Once we've defined the struct correctly, the comparison becomes fairly straightforward: just take the first 64 bits of each struct, and compare them as if they were 64-bit integers.

Finally a bit of test code to give at least a little assurance that it works correctly for some values:

#include <iostream>

int main() { 
    Entry a(1236, 1.234f), b(1234, 1.235f), c(1235, 1.235f);

    std::cout << std::boolalpha;

    std::cout << (b<a) << "\n";
    std::cout << (a<b) << "\n";
    std::cout << (b<c) << "\n";
    std::cout << (c<b) << "\n";
    return 0;
}

At least for me, that produces the expected results:

false
true
true
false

Now, some of the possible problems: if the two items get rearranged either between themselves, or any other part of the struct gets put before or between them, comparison will definitely break. Second, we're completely dependent on the sizes of the items remaining 32 bits apiece, so when they're concatenated they'll be 64 bits. Third, if somebody removes the __packed__ attribute from the struct definition, we could end up with padding between _id and _cost, again breaking the comparison. Likewise, if somebody removes the aligned(8), the code may lose some speed, because it's trying to load 8-byte quantities that aren't aligned to 8-byte boundaries (and on another processor, this might fail completely). [Edit: Oops. @rici reminded me of something I intended to list here, but forgot: this only works correctly when both the _id and cost are positive. If _cost is negative, comparisons will be messed up by the fact that IEEE floating point uses a sign-magnitude representation. If an _id is negative, its sign bit will be treated just like a normal bit in the middle of a number, so a negative _id will show up as larger than a positive _id.]

To summarize: this is fragile. No question at all about that. Nonetheless, it should be pretty fast -- especially if you're using a 64-bit compiler, in which case I'd expect the comparison to come out to two loads and one comparison. To make a long story short, you're at the point that you probably can't make the comparison itself any faster at all -- all you can do is try to do more in parallel, optimize memory usage patterns, etc.
