Alignment of data members and member functions for performance

Question

Is it true that aligning the data members of a struct/class no longer yields the benefits it used to, especially on Nehalem, because of hardware improvements? If so, does alignment still always improve performance, just with much smaller gains than on past CPUs?

Does alignment of member variables extend to member functions? I believe I once read (possibly in the Wikibooks "C++ performance" book) that there are rules for "packing" member functions into various "units" (i.e. source files) for optimum loading into the instruction cache. (If I have got my terminology wrong here, please correct me.)

Recommended answer

Processors are still much faster than what the RAM can deliver, so they still need caches. Caches still consist of fixed-size cache lines. Also, main memory is managed in pages, and pages are accessed through a translation lookaside buffer (TLB). The TLB is, again, a cache with a fixed number of entries.

This means that both spatial and temporal locality matter a lot (i.e. how you pack things, and how you access them). Packing structures well (members sorted by padding/alignment requirements), as opposed to packing them in some haphazard order, usually results in smaller structure sizes.
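As a rough illustration of the packing point, here is a minimal sketch. The struct names and members are hypothetical, and the sizes mentioned in the comments are what a typical x86-64 ABI produces; other platforms may differ.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical record with members in a haphazard order.
struct Haphazard {
    std::uint8_t flag;   // 1 byte, typically followed by 7 bytes of padding
    double       value;  // 8 bytes
    std::uint8_t kind;   // 1 byte, typically followed by 3 bytes of padding
    std::int32_t count;  // 4 bytes
};                       // typically 24 bytes on x86-64

// The same members, sorted by descending alignment requirement.
struct Sorted {
    double       value;  // 8 bytes
    std::int32_t count;  // 4 bytes
    std::uint8_t flag;   // 1 byte
    std::uint8_t kind;   // 1 byte, typically followed by 2 bytes of tail padding
};                       // typically 16 bytes on x86-64

// Not guaranteed by the C++ standard, but expected on common ABIs:
// sorting members by alignment should never make the struct larger.
static_assert(sizeof(Sorted) <= sizeof(Haphazard),
              "sorted layout is unexpectedly larger on this platform");
```

Printing `sizeof(Haphazard)` and `sizeof(Sorted)` on your own target is the quickest way to see what the compiler actually does there.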

If you have loads of data, smaller structure sizes mean:

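  • more structures fit into one cache line (cache miss = 50-200 cycles)
  • fewer pages are needed (page fault = 10-20 million CPU cycles)
  • fewer TLB entries are needed, so fewer TLB misses (TLB miss = 50-500 cycles)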
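To put numbers on the list above, the following sketch does the arithmetic for two hypothetical structure sizes (24 and 16 bytes). The 64-byte cache line and 4096-byte page are assumptions that match many current x86 systems but are not universal, and the helper names are made up for this example.

```cpp
#include <cstddef>

// Assumed hardware parameters; check your own platform.
constexpr std::size_t kCacheLineBytes = 64;
constexpr std::size_t kPageBytes      = 4096;

// Whole structures that fit into one cache line (struct_size must be non-zero).
constexpr std::size_t per_cache_line(std::size_t struct_size) {
    return kCacheLineBytes / struct_size;
}

// Pages touched by `count` contiguous structures, rounded up.
constexpr std::size_t pages_needed(std::size_t struct_size, std::size_t count) {
    return (struct_size * count + kPageBytes - 1) / kPageBytes;
}

// A 24-byte layout versus a 16-byte layout of the same data:
static_assert(per_cache_line(24) == 2, "two 24-byte structs per line");
static_assert(per_cache_line(16) == 4, "four 16-byte structs per line");
static_assert(pages_needed(24, 1000000) == 5860, "pages for 1M 24-byte structs");
static_assert(pages_needed(16, 1000000) == 3907, "pages for 1M 16-byte structs");
```

Under these assumptions, shrinking the structure by a third doubles the number of elements per cache line and cuts the page count (and with it the pressure on the TLB) by about a third.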

Going linearly over a few gigabytes of tightly packed SoA (structure-of-arrays) data can be 3 orders of magnitude faster (or 8-10 orders of magnitude, if page faults are involved) than doing the same thing in a naive way with bad layout/packing.
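For readers unfamiliar with the term, the sketch below contrasts an array-of-structures (AoS) layout with a structure-of-arrays (SoA) layout. The particle types and function names are hypothetical and only illustrate the access pattern; no timing is claimed here.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// AoS: each element carries every field, so a loop over `mass`
// also drags x, y, z and id through the cache.
struct ParticleAoS {
    double x, y, z;
    double mass;
    int    id;
};

// SoA: one contiguous array per field.
struct ParticlesSoA {
    std::vector<double> x, y, z, mass;
    std::vector<int>    id;
};

// Convert an AoS container into the SoA layout.
ParticlesSoA to_soa(const std::vector<ParticleAoS>& ps) {
    ParticlesSoA out;
    for (const ParticleAoS& p : ps) {
        out.x.push_back(p.x);
        out.y.push_back(p.y);
        out.z.push_back(p.z);
        out.mass.push_back(p.mass);
        out.id.push_back(p.id);
    }
    return out;
}

// Strided access: only 8 useful bytes out of every sizeof(ParticleAoS).
double total_mass_aos(const std::vector<ParticleAoS>& ps) {
    double sum = 0.0;
    for (const ParticleAoS& p : ps) sum += p.mass;
    return sum;
}

// Linear access: every byte fetched into the cache is used.
double total_mass_soa(const ParticlesSoA& ps) {
    assert(ps.mass.size() == ps.id.size());  // columns must stay in sync
    double sum = 0.0;
    for (double m : ps.mass) sum += m;
    return sum;
}
```

Both functions compute the same result; the difference is that the SoA loop walks a dense array of doubles, which is the linear, tightly packed traversal described above.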

Whether or not you hand-align individual 4-byte or 2-byte values (say, a typical int or short) to 2 or 4 bytes makes a very small difference on recent Intel CPUs (hardly noticeable). Given that, it may seem tempting to "optimize" on that, but I strongly advise against doing so.
This is usually something you are better off not worrying about and leaving to the compiler to figure out. If for no other reason, then because the gains are marginal at best, while some other processor architectures will raise an exception if you get it wrong. Therefore, if you try to be too smart, you'll suddenly have unexplainable crashes once you compile on some other architecture. When that happens, you'll feel sorry.
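A minimal sketch of what "leaving it to the compiler" can look like in practice: let the compiler place members, and when a value has to be read from an arbitrary byte offset, copy it out with `std::memcpy` instead of casting the pointer. `Header` and `load_u32` are hypothetical names introduced for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// With default layout the compiler inserts whatever padding is needed,
// so every member ends up suitably aligned on the target architecture.
struct Header {
    std::uint16_t tag;
    std::uint32_t length;
};
static_assert(offsetof(Header, length) % alignof(std::uint32_t) == 0,
              "compiler-chosen layout keeps members aligned");

// Reads a 32-bit value from any byte offset in a buffer. Casting
// `buf + offset` to `const std::uint32_t*` and dereferencing it would be
// undefined behaviour when misaligned and can trap on stricter
// architectures; std::memcpy is well defined everywhere.
std::uint32_t load_u32(const unsigned char* buf, std::size_t size,
                       std::size_t offset) {
    assert(offset <= size && size - offset >= sizeof(std::uint32_t));
    std::uint32_t value;
    std::memcpy(&value, buf + offset, sizeof value);
    return value;
}
```

Mainstream optimizing compilers usually turn a small fixed-size `std::memcpy` like this into a single load where the target allows it, so the portable version rarely costs anything.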

Of course, if you don't have at least several dozen megabytes of data to process, you need not care at all.
