C++: optimizing member variable order?


Question

I was reading a blog post by a game coder for Introversion and he is busily trying to squeeze every CPU tick he can out of the code. One trick he mentions off-hand is to



"re-order the member variables of a class into most used and least used."

I'm not familiar with C++, nor with how it compiles, but I was wondering if


  1. Is this statement accurate?
  2. How/why?
  3. Does it apply to other (compiled/scripting) languages?

I'm aware that the amount of (CPU) time saved by this trick would be minimal, it's not a deal-breaker. But on the other hand, in most functions it would be fairly easy to identify which variables are going to be the most commonly used, and just start coding this way by default.

Recommended answer

Two issues here:

  • Whether and when keeping certain fields together is an optimization.
  • How to actually do it.

The reason that it might help, is that memory is loaded into the CPU cache in chunks called "cache lines". This takes time, and generally speaking the more cache lines loaded for your object, the longer it takes. Also, the more other stuff gets thrown out of the cache to make room, which slows down other code in an unpredictable way.

The size of a cache line depends on the processor. If it is large compared with the size of your objects, then very few objects are going to span a cache line boundary, so the whole optimization is pretty irrelevant. Otherwise, you might get away with sometimes only having part of your object in cache, and the rest in main memory (or L2 cache, perhaps). It's a good thing if your most common operations (the ones which access the commonly-used fields) use as little cache as possible for the object, so grouping those fields together gives you a better chance of this happening.
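As a rough sketch of what "grouping those fields together" can look like, here is a hypothetical struct (the Particle type, its field names and the 64-byte line size are all assumptions made up for illustration, not taken from the blog post). The frequently used fields are declared first, and a compile-time check confirms that they fit within one assumed cache line, provided the object itself starts on a cache line boundary.

```cpp
#include <cstddef>

// Hypothetical game entity; every name here is invented for illustration.
struct Particle {
    // Hot fields: read and written every frame.
    float x, y, z;
    float vx, vy, vz;
    // Cold fields: touched only on spawn or by debug tools.
    char debug_name[64];
    unsigned spawn_frame;
};

// Assumed cache line size. 64 bytes is common, but it is processor-specific.
constexpr std::size_t kAssumedCacheLine = 64;

// True if the last hot field ends within the first assumed cache line,
// counting from the start of the object.
constexpr bool hot_fields_fit_one_line() {
    return offsetof(Particle, vz) + sizeof(float) <= kAssumedCacheLine;
}

static_assert(hot_fields_fit_one_line(),
              "hot fields spill past the first assumed cache line");
```

Had debug_name been declared first, the hot fields would start at offset 64 and the per-frame update would touch a second cache line for every object. Whether that matters in practice is something only measurement can tell.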

The general principle is called "locality of reference". The closer together the different memory addresses are that your program accesses, the better your chances of getting good cache behaviour. It's often difficult to predict performance in advance: different processor models of the same architecture can behave differently, multi-threading means you often don't know what's going to be in the cache, etc. But it's possible to talk about what's likely to happen, most of the time. If you want to know anything, you generally have to measure it.

Please note that there are some gotchas here. If you are using CPU-based atomic operations (which the atomic types in C++11, then still called C++0x, generally will), then you may find that the CPU locks the entire cache line in order to lock the field. Then, if you have several atomic fields close together, with different threads running on different cores and operating on different fields at the same time, you will find that all those atomic operations are serialised, because they all lock the same cache line even though they're operating on different fields. Had they been operating on different cache lines then they would have worked in parallel, and run faster. In fact, as Glen (via Herb Sutter) points out in his answer, on a coherent-cache architecture this happens even without atomic operations, and can utterly ruin your day; the effect is usually called "false sharing". So locality of reference is not necessarily a good thing where multiple cores are involved, even if they share cache. You can expect it to be, on the grounds that cache misses usually are a source of lost speed, but be horribly wrong in your particular case.
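One common way to keep independently updated fields on different cache lines is to over-align them. The following is a minimal sketch under the assumption of a 64-byte cache line; PaddedCounter, Counters and bump are hypothetical names. C++17 also offers std::hardware_destructive_interference_size in <new> for this purpose, though compiler support for it varies.

```cpp
#include <atomic>
#include <cstddef>

// Assumed cache line size. 64 bytes is common, but it is processor-specific.
constexpr std::size_t kAssumedCacheLine = 64;

// Over-aligning the wrapper rounds its size up to a full line, so two
// adjacent counters can never share a cache line.
struct alignas(kAssumedCacheLine) PaddedCounter {
    std::atomic<long> value{0};
};

struct Counters {
    PaddedCounter produced;  // updated by one thread
    PaddedCounter consumed;  // updated by another thread
};

// Increment one counter. Relaxed ordering is enough for a plain tally.
inline void bump(PaddedCounter& counter) {
    counter.value.fetch_add(1, std::memory_order_relaxed);
}

static_assert(sizeof(PaddedCounter) == kAssumedCacheLine,
              "each counter should occupy exactly one assumed cache line");
```

The price is memory: each counter now takes 64 bytes instead of 8, which is the opposite trade-off from packing fields tightly. It is probably only worth it for fields that profiling shows to be contended.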

Now, quite aside from distinguishing between commonly-used and less-used fields, the smaller an object is, the less memory (and hence less cache) it occupies. This is pretty much good news all around, at least where you don't have heavy contention. The size of an object depends on the fields in it, and on any padding which has to be inserted between fields in order to ensure they are correctly aligned for the architecture. C++ (sometimes) puts constraints on the order in which fields must appear in an object, based on the order they are declared. This is to make low-level programming easier. So, if your object contains:


  • an int (4 bytes, 4-aligned)
  • followed by a char (1 byte, any alignment)
  • followed by an int (4 bytes, 4-aligned)
  • followed by a char (1 byte, any alignment)

then chances are this will occupy 16 bytes in memory. The size and alignment of int isn't the same on every platform, by the way, but 4 is very common and this is just an example.

In this case, the compiler will insert 3 bytes of padding before the second int, to correctly align it, and 3 bytes of padding at the end. An object's size has to be a multiple of its alignment, so that objects of the same type can be placed adjacent in memory. That's all an array is in C/C++, adjacent objects in memory. Had the struct been int, int, char, char, then the same object could have been 12 bytes, because char has no alignment requirement.
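The arithmetic above can be checked at compile time. This sketch assumes a platform where int is 4 bytes and 4-aligned, which the first assertion makes explicit; Interleaved and Grouped are hypothetical names for the two layouts discussed.

```cpp
#include <cstddef>

// int, char, int, char: the layout described in the text.
struct Interleaved {
    int a;
    char b;   // followed by 3 bytes of padding so that c is 4-aligned
    int c;
    char d;   // followed by 3 bytes of tail padding
};

// The same fields reordered as int, int, char, char.
struct Grouped {
    int a;
    int c;
    char b;
    char d;   // followed by 2 bytes of tail padding
};

static_assert(sizeof(int) == 4 && alignof(int) == 4,
              "this example assumes a 4-byte, 4-aligned int");
static_assert(offsetof(Interleaved, c) == 8, "3 padding bytes precede c");
static_assert(sizeof(Interleaved) == 16, "4 + 1 + 3 + 4 + 1 + 3");
static_assert(sizeof(Grouped) == 12, "4 + 4 + 1 + 1 + 2");
```

If any of these assertions fails on your compiler, the platform's size or alignment rules differ from the example's assumptions, which is itself worth knowing before reordering anything.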

I said that whether int is 4-aligned is platform-dependent: on ARM it traditionally has to be, since on many ARM cores an unaligned access raises a hardware exception. On x86 you can access ints unaligned, but it's generally slower and IIRC non-atomic. So compilers usually (always?) 4-align ints on x86.

The rule of thumb when writing code, if you care about packing, is to look at the alignment requirement of each member of the struct. Then order the fields with the biggest-aligned types first, then the next biggest, and so on down to members with no alignment requirement. For example, if I'm trying to write portable code I might come up with this:

#include <stdint.h>  // uint64_t, uint32_t, int32_t

struct some_stuff {
    double d;   // I expect double is 64bit IEEE, it might not be
    uint64_t l; // 8 bytes, could be 8-aligned or 4-aligned, I don't know
    uint32_t i; // 4 bytes, usually 4-aligned
    int32_t j;  // same
    short s;    // usually 2 bytes, could be 2-aligned or unaligned, I don't know
    char c[4];  // array 4 chars, 4 bytes big but "never" needs 4-alignment
    char e;     // 1 byte, any alignment
};

If you don't know the alignment of a field, or you're writing portable code but want to do the best you can without major trickery, then you assume that the alignment requirement is the largest requirement of any fundamental type in the structure, and that the alignment requirement of fundamental types is their size. So, if your struct contains a uint64_t, or a long long, then the best guess is it's 8-aligned. Sometimes you'll be wrong, but you'll be right a lot of the time.
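Rather than guessing, you can ask the compiler. The sketch below uses a copy of the some_stuff struct, with the last member named e so that it does not clash with the double d. The names payload_bytes, padding_bytes and aligned_to_size are hypothetical helpers; they compute how much padding the compiler added and test the "alignment equals size" guess for a given type.

```cpp
#include <cstddef>
#include <cstdint>

struct some_stuff {
    double d;
    std::uint64_t l;
    std::uint32_t i;
    std::int32_t j;
    short s;
    char c[4];
    char e;
};

// Sum of the member sizes: the size the struct would have with no padding.
constexpr std::size_t payload_bytes =
    sizeof(double) + sizeof(std::uint64_t) + sizeof(std::uint32_t) +
    sizeof(std::int32_t) + sizeof(short) + sizeof(char[4]) + sizeof(char);

// Anything the compiler added on top of the payload is padding.
constexpr std::size_t padding_bytes = sizeof(some_stuff) - payload_bytes;

// The guess from the text: a fundamental type's alignment equals its size.
template <typename T>
constexpr bool aligned_to_size() {
    return alignof(T) == sizeof(T);
}

// With members sorted by descending alignment, only tail padding should
// remain, and tail padding is always smaller than the struct's alignment.
static_assert(padding_bytes < alignof(some_stuff),
              "unexpected padding between members on this platform");
```

On a typical 64-bit desktop compiler I would expect padding_bytes to be 1 and aligned_to_size<double>() to hold, but both are platform-dependent, so treat them as things to check, not as facts.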

Note that games programmers like your blogger often know everything about their processor and hardware, and thus they don't have to guess. They know the cache line size, they know the size and alignment of every type, and they know the struct layout rules used by their compiler (for POD and non-POD types). If they support multiple platforms, then they can special-case for each one if necessary. They also spend a lot of time thinking about which objects in their game will benefit from performance improvements, and using profilers to find out where the real bottlenecks are. But even so, it's not such a bad idea to have a few rules of thumb that you apply whether the object needs it or not. As long as it won't make the code unclear, "put commonly-used fields at the start of the object" and "sort by alignment requirement" are two good rules.
