C++: optimizing member variable order?

Question

I was reading a blog post by a game coder for Introversion and he is busily trying to squeeze every CPU tick he can out of the code. One trick he mentions off-hand is to



"re-order the member variables of a class into most used and least used."

I'm not familiar with C++, nor with how it compiles, but I was wondering if


  1. This statement is accurate?

  2. How/why?

  3. Does it apply to other (compiled/scripting) languages?

I'm aware that the amount of (CPU) time saved by this trick would be minimal, it's not a deal-breaker. But on the other hand, in most functions it would be fairly easy to identify which variables are going to be the most commonly used, and just start coding this way by default.

Recommended answer

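There are two issues here:

  • Whether and when keeping certain fields together is an optimization.

  • How to actually do it.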

The reason that it might help, is that memory is loaded into the CPU cache in chunks called "cache lines". This takes time, and generally speaking the more cache lines loaded for your object, the longer it takes. Also, the more other stuff gets thrown out of the cache to make room, which slows down other code in an unpredictable way.

The size of a cache line depends on the processor. If it is large compared with the size of your objects, then very few objects are going to span a cache line boundary, so the whole optimization is pretty irrelevant. Otherwise, you might get away with sometimes only having part of your object in cache, and the rest in main memory (or L2 cache, perhaps). It's a good thing if your most common operations (the ones which access the commonly-used fields) use as little cache as possible for the object, so grouping those fields together gives you a better chance of this happening.

The general principle is called "locality of reference". The closer together the different memory addresses are that your program accesses, the better your chances of getting good cache behaviour. It's often difficult to predict performance in advance: different processor models of the same architecture can behave differently, multi-threading means you often don't know what's going to be in the cache, etc. But it's possible to talk about what's likely to happen, most of the time. If you want to know anything, you generally have to measure it.

Please note that there are some gotchas here. If you are using CPU-based atomic operations (which the atomic types in C++0x generally will), then you may find that the CPU locks the entire cache line in order to lock the field. Then, if you have several atomic fields close together, with different threads running on different cores and operating on different fields at the same time, you will find that all those atomic operations are serialised because they all lock the same memory location even though they're operating on different fields. Had they been operating on different cache lines then they would have worked in parallel, and run faster. In fact, as Glen (via Herb Sutter) points out in his answer, on a coherent-cache architecture this happens even without atomic operations, and can utterly ruin your day. So locality of reference is not necessarily a good thing where multiple cores are involved, even if they share cache. You can expect it to be, on grounds that cache misses usually are a source of lost speed, but be horribly wrong in your particular case.

Now, quite aside from distinguishing between commonly-used and less-used fields, the smaller an object is, the less memory (and hence less cache) it occupies. This is pretty much good news all around, at least where you don't have heavy contention. The size of an object depends on the fields in it, and on any padding which has to be inserted between fields in order to ensure they are correctly aligned for the architecture. C++ (sometimes) puts constraints on the order in which fields must appear in an object, based on the order they are declared. This is to make low-level programming easier. So, if your object contains:


  • an int (4 bytes, 4-aligned)
  • followed by a char (1 byte, any alignment)
  • followed by an int (4 bytes, 4-aligned)
  • followed by a char (1 byte, any alignment)

then chances are this will occupy 16 bytes in memory. The size and alignment of int isn't the same on every platform, by the way, but 4 is very common and this is just an example.

In this case, the compiler will insert 3 bytes of padding before the second int, to correctly align it, and 3 bytes of padding at the end. An object's size has to be a multiple of its alignment, so that objects of the same type can be placed adjacent in memory. That's all an array is in C/C++, adjacent objects in memory. Had the struct been int, int, char, char, then the same object could have been 12 bytes, because char has no alignment requirement.

I said that whether int is 4-aligned is platform-dependent: on ARM (at least on older cores) it absolutely has to be, since unaligned access throws a hardware exception. On x86 you can access ints unaligned, but it's generally slower and IIRC non-atomic. So compilers usually (always?) 4-align ints on x86.

The rule of thumb when writing code, if you care about packing, is to look at the alignment requirement of each member of the struct. Then order the fields with the biggest-aligned types first, then the next smallest, and so on down to members with no alignment requirement. For example if I'm trying to write portable code I might come up with this:

#include <stdint.h>  // uint64_t, uint32_t, int32_t

struct some_stuff {
    double d;   // I expect double is 64bit IEEE, it might not be
    uint64_t l; // 8 bytes, could be 8-aligned or 4-aligned, I don't know
    uint32_t i; // 4 bytes, usually 4-aligned
    int32_t j;  // same
    short s;    // usually 2 bytes, could be 2-aligned or unaligned, I don't know
    char c[4];  // array 4 chars, 4 bytes big but "never" needs 4-alignment
    char e;     // 1 byte, any alignment
};

If you don't know the alignment of a field, or you're writing portable code but want to do the best you can without major trickery, then you assume that the alignment requirement is the largest requirement of any fundamental type in the structure, and that the alignment requirement of fundamental types is their size. So, if your struct contains a uint64_t, or a long long, then the best guess is it's 8-aligned. Sometimes you'll be wrong, but you'll be right a lot of the time.

Note that games programmers like your blogger often know everything about their processor and hardware, and thus they don't have to guess. They know the cache line size, they know the size and alignment of every type, and they know the struct layout rules used by their compiler (for POD and non-POD types). If they support multiple platforms, then they can special-case for each one if necessary. They also spend a lot of time thinking about which objects in their game will benefit from performance improvements, and using profilers to find out where the real bottlenecks are. But even so, it's not such a bad idea to have a few rules of thumb that you apply whether the object needs it or not. As long as it won't make the code unclear, "put commonly-used fields at the start of the object" and "sort by alignment requirement" are two good rules.
