Bit Aligning for Space and Performance Boosts


Problem description

In the book Game Coding Complete, 3rd Edition, the author mentions a technique to both reduce data structure size and increase access performance. In essence it relies on the fact that you gain performance when member variables are memory aligned. This is an obvious potential optimization that compilers would take advantage of, but by making sure each variable is aligned they end up bloating the size of the data structure.

Or that was his claim at least.

The real performance increase, he states, comes from using your brain and ensuring that your structure is properly designed to take advantage of speed increases while preventing the compiler bloat. He provides the following code snippet:

#pragma pack( push, 1 )

struct SlowStruct
{
    char c;
    __int64 a;
    int b;
    char d;
};

struct FastStruct
{
    __int64 a;
    int b;
    char c;
    char d;  
    char unused[ 2 ]; // fill to 8-byte boundary for array use
};

#pragma pack( pop )

Using the above struct objects in an unspecified test he reports a performance increase of 15.6% (222ms compared to 192ms) and a smaller size for the FastStruct. This all makes sense on paper to me, but it fails to hold up under my testing:

The timings are the same, and so are the sizes (once the char unused[ 2 ] is accounted for)!
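The size half of that observation can be checked without a benchmark. The following is a minimal sketch, not the book's or my original test: it assumes a compiler that honours #pragma pack (MSVC, GCC and Clang do) and a 4-byte int, it uses std::int64_t in place of the MSVC-specific __int64, and the names PackedSlow, PackedFast and Layout are made up for illustration.

```cpp
#include <cassert>
#include <cstddef>  // offsetof, std::size_t
#include <cstdint>  // std::int64_t

// Hypothetical copies of the two structs, both under 1-byte packing.
#pragma pack(push, 1)
struct PackedSlow { char c; std::int64_t a; int b; char d; };
struct PackedFast { std::int64_t a; int b; char c; char d; char unused[2]; };
#pragma pack(pop)

// Size of a struct and the byte offset of its 64-bit member.
struct Layout {
    std::size_t size;
    std::size_t offset_of_a;
};

inline Layout packed_slow_layout() {
    return Layout{sizeof(PackedSlow), offsetof(PackedSlow, a)};
}

inline Layout packed_fast_layout() {
    return Layout{sizeof(PackedFast), offsetof(PackedFast, a)};
}

// With 1-byte packing no padding is inserted, so each size is just the sum
// of the member sizes (assuming a 4-byte int): 1+8+4+1 and 8+4+1+1+2.
inline void check_packed_layouts() {
    assert(packed_slow_layout().size == 14);
    assert(packed_slow_layout().offset_of_a == 1);  // 64-bit member at an odd offset
    assert(packed_fast_layout().size == 16);
    assert(packed_fast_layout().offset_of_a == 0);  // 8-byte aligned whenever the array base is
}
```

Under these assumptions the two structs differ by exactly the two unused bytes; the difference that could matter for speed is where a lands, not the total size.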

Now if the #pragma pack( push, 1 ) is isolated only to FastStruct (or removed completely) we do see a difference:

So, finally, here lies the question: Do modern compilers (VS2010 specifically) already optimize for the bit alignment, hence the lack of performance increase (but increase the structure size as a side effect, like Mike McShaffry stated)? Or is my test not intensive enough, or too inconclusive, to return any significant results?

For the tests I did a variety of tasks from math operations, column-major multi-dimensional array traversing/checking, matrix operations, etc. on the unaligned __int64 member. None of which produced different results for either structure.

In the end, even if there was no performance increase, this is still a useful tidbit to keep in mind for keeping memory usage to a minimum. But I would love it if there was a performance boost (no matter how minor) that I am just not seeing.

Solution

It is highly dependent on the hardware.

Let me demonstrate:

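#include <cstring>   // memset
#include <ctime>     // clock, clock_t, CLOCKS_PER_SEC
#include <iostream>  // cout, endl
using namespace std;

#pragma pack( push, 1 )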

struct SlowStruct
{
    char c;
    __int64 a;
    int b;
    char d;
};

struct FastStruct
{
    __int64 a;
    int b;
    char c;
    char d;  
    char unused[ 2 ]; // fill to 8-byte boundary for array use
};

#pragma pack( pop )

int main (void){

    int x = 1000;
    int iterations = 10000000;

    SlowStruct *slow = new SlowStruct[x];
    FastStruct *fast = new FastStruct[x];



    //  Warm the cache.
    memset(slow,0,x * sizeof(SlowStruct));
    clock_t time0 = clock();
    for (int c = 0; c < iterations; c++){
        for (int i = 0; i < x; i++){
            slow[i].a += c;
        }
    }
    clock_t time1 = clock();
    cout << "slow = " << (double)(time1 - time0) / CLOCKS_PER_SEC << endl;

    //  Warm the cache.
    memset(fast,0,x * sizeof(FastStruct));
    time1 = clock();
    for (int c = 0; c < iterations; c++){
        for (int i = 0; i < x; i++){
            fast[i].a += c;
        }
    }
    clock_t time2 = clock();
    cout << "fast = " << (double)(time2 - time1) / CLOCKS_PER_SEC << endl;



    //  Print to avoid Dead Code Elimination
    __int64 sum = 0;
    for (int c = 0; c < x; c++){
        sum += slow[c].a;
        sum += fast[c].a;
    }
    cout << "sum = " << sum << endl;


    delete[] slow;
    delete[] fast;
    return 0;
}
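A note on the timing itself: the benchmark above uses clock(), which the C standard defines as processor time but which the Microsoft runtime implements as elapsed wall-clock time. If you want to rerun the comparison with a clock that means the same thing on every platform, something like the following sketch could replace the clock() pairs. It assumes a C++11 compiler, and the helper name seconds_taken is made up for illustration.

```cpp
#include <cassert>
#include <chrono>

// Runs `work` once and returns the elapsed wall-clock time in seconds.
template <typename Work>
double seconds_taken(Work&& work) {
    const auto start = std::chrono::steady_clock::now();
    work();
    const auto stop = std::chrono::steady_clock::now();
    const double seconds = std::chrono::duration<double>(stop - start).count();
    assert(seconds >= 0.0);  // steady_clock is monotonic, so this cannot be negative
    return seconds;
}
```

Each of the two timed loops would then be wrapped in a lambda and passed to seconds_taken, with the returned value printed in place of the clock() difference.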


Core i7 920 @ 3.5 GHz

slow = 4.578
fast = 4.434
sum = 99999990000000000

Okay, not much difference. But it's still consistent over multiple runs.
So the alignment makes a small difference on Nehalem Core i7.


Intel Xeon X5482 Harpertown @ 3.2 GHz (Core 2 generation Xeon)

slow = 22.803
fast = 3.669
sum = 99999990000000000

Now take a look...

6.2x faster!!!


Conclusion:

You see the results. You decide whether or not it's worth your time to do these optimizations.


EDIT:

Same benchmarks but without the #pragma pack:

Core i7 920 @ 3.5 GHz

slow = 4.49
fast = 4.442
sum = 99999990000000000

Intel Xeon X5482 Harpertown @ 3.2 GHz

slow = 3.684
fast = 3.717
sum = 99999990000000000

  • The Core i7 numbers didn't change. Apparently it can handle misalignment without trouble for this benchmark.
  • The Core 2 Xeon now shows the same times for both versions. This confirms that misalignment is a problem on the Core 2 architecture.
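The measurements do not show why the Core 2 is hit so hard. One plausible contributor, offered here as an assumption rather than something I measured, is that a misaligned 8-byte field sometimes straddles two cache lines. The sketch below only does the arithmetic: it assumes 64-byte cache lines and an array whose first element starts on a line boundary, and the function name count_line_splits is made up for illustration.

```cpp
#include <cassert>
#include <cstddef>  // std::size_t

// Counts how many of `count` array elements have a field of `field_size`
// bytes at byte offset `field_offset` that straddles a cache-line boundary.
// Assumes element 0 starts exactly on a cache-line boundary.
inline std::size_t count_line_splits(std::size_t stride,
                                     std::size_t field_offset,
                                     std::size_t field_size,
                                     std::size_t count,
                                     std::size_t line_size = 64) {
    assert(line_size > 0 && field_size > 0);
    std::size_t splits = 0;
    for (std::size_t i = 0; i < count; ++i) {
        const std::size_t first = i * stride + field_offset;  // first byte of the field
        const std::size_t last = first + field_size - 1;      // last byte of the field
        if (first / line_size != last / line_size) {
            ++splits;  // the field's bytes live in two different cache lines
        }
    }
    return splits;
}
```

For the packed SlowStruct (14-byte stride, a at offset 1), count_line_splits(14, 1, 8, 1000) returns 125, so one access in eight crosses a line. For FastStruct (16-byte stride, a at offset 0), count_line_splits(16, 0, 8, 1000) returns 0.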

Taken from my comment:

If you leave out the #pragma pack, the compiler will keep everything aligned so you don't see this issue. So this is actually an example of what could happen if you misuse #pragma pack.
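To see what "keep everything aligned" means in practice, here is a minimal sketch of the same two member orders with default packing. It assumes a 4-byte int and uses std::int64_t in place of __int64; the names NaturalSlow and NaturalFast are made up for illustration.

```cpp
#include <cassert>
#include <cstddef>  // offsetof, std::size_t
#include <cstdint>  // std::int64_t

// Same members as SlowStruct and FastStruct, but with default packing.
struct NaturalSlow { char c; std::int64_t a; int b; char d; };
struct NaturalFast { std::int64_t a; int b; char c; char d; char unused[2]; };

// Byte offset of `a` modulo the alignment the platform requires for it.
// Zero means the member is naturally aligned within the struct.
inline std::size_t slow_misalignment() {
    return offsetof(NaturalSlow, a) % alignof(std::int64_t);
}

inline std::size_t fast_misalignment() {
    return offsetof(NaturalFast, a) % alignof(std::int64_t);
}

inline void check_natural_layouts() {
    assert(slow_misalignment() == 0);  // the compiler padded after `c`
    assert(fast_misalignment() == 0);  // no padding was needed
    // The padding is what makes the badly ordered struct the larger one.
    assert(sizeof(NaturalSlow) > sizeof(NaturalFast));
}
```

On common x86-64 ABIs NaturalSlow comes out at 24 bytes and NaturalFast at 16, so without #pragma pack the member order costs space rather than speed.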
