Optimizing member variable order in C++

Question

I was reading a blog post by a game coder for Introversion and he is busily trying to squeeze every CPU tick he can out of the code. One trick he mentions off-hand is to

"re-order the member variables of a class into most used and least used."

I'm not familiar with C++, nor with how it compiles, but I was wondering if

  1. Is this statement correct?
  2. How/why?
  3. Does it apply to other (compiled/scripting) languages?

I'm aware that the amount of (CPU) time saved by this trick would be minimal, it's not a deal-breaker. But on the other hand, in most functions it would be fairly easy to identify which variables are going to be the most commonly used, and just start coding this way by default.

Answer

There are two issues here:

  • Whether, and when, keeping certain fields together is an optimization.
  • How to actually do it.

The reason that it might help, is that memory is loaded into the CPU cache in chunks called "cache lines". This takes time, and generally speaking the more cache lines loaded for your object, the longer it takes. Also, the more other stuff gets thrown out of the cache to make room, which slows down other code in an unpredictable way.

The size of a cache line depends on the processor. If it is large compared with the size of your objects, then very few objects are going to span a cache line boundary, so the whole optimization is pretty irrelevant. Otherwise, you might get away with sometimes only having part of your object in cache, and the rest in main memory (or L2 cache, perhaps). It's a good thing if your most common operations (the ones which access the commonly-used fields) use as little cache as possible for the object, so grouping those fields together gives you a better chance of this happening.
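
As a rough illustration of the "how many lines does my object pull in" question, the sketch below counts lines from sizeof. The 64-byte line size and the Hypothetical struct are assumptions for illustration, not facts about any particular target.

#include <cstddef>

// Assumed cache-line size: 64 bytes is common on current x86-64 and many ARM
// cores, but it is a property of the CPU, not of the C++ language.
constexpr std::size_t kAssumedCacheLine = 64;

// Made-up object, purely for illustration.
struct Hypothetical {
    double pos[3];   // 24 bytes
    double vel[3];   // 24 bytes
    int    flags;    // 4 bytes, plus tail padding
};

// Lines one object occupies if its first byte lands on a line boundary;
// an unluckily placed object can touch one line more than this.
constexpr std::size_t kLinesPerObject =
    (sizeof(Hypothetical) + kAssumedCacheLine - 1) / kAssumedCacheLine;

static_assert(kLinesPerObject == 1, "expected to fit in one line under these assumptions");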

The general principle is called "locality of reference". The closer together the different memory addresses are that your program accesses, the better your chances of getting good cache behaviour. It's often difficult to predict performance in advance: different processor models of the same architecture can behave differently, multi-threading means you often don't know what's going to be in the cache, etc. But it's possible to talk about what's likely to happen, most of the time. If you want to know anything, you generally have to measure it.
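
To make the "group the commonly-used fields together" advice concrete, here is a sketch with a made-up Particle type whose per-frame update only touches the first few members. The names and layout are illustrative, not taken from the blog post.

#include <cstdint>
#include <vector>

struct Particle {
    // "Hot" fields, read and written every frame; declared first so they sit
    // together at the start of the object.
    float x, y, z;
    float vx, vy, vz;

    // "Cold" fields, only touched when a particle spawns or dies.
    std::uint32_t spawn_frame;
    std::uint32_t flags;
    char          debug_name[32];
};

void integrate(std::vector<Particle>& particles, float dt) {
    for (Particle& p : particles) {  // sequential walk: good spatial locality
        p.x += p.vx * dt;            // only the hot front of each object needs
        p.y += p.vy * dt;            // to be in cache for this loop
        p.z += p.vz * dt;
    }
}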

Please note that there are some gotchas here. If you are using CPU-based atomic operations (which the atomic types in C++0x generally will), then you may find that the CPU locks the entire cache line in order to lock the field. Then, if you have several atomic fields close together, with different threads running on different cores and operating on different fields at the same time, you will find that all those atomic operations are serialised because they all lock the same memory location even though they're operating on different fields. Had they been operating on different cache lines then they would have worked in parallel, and run faster. In fact, as Glen (via Herb Sutter) points out in his answer, on a coherent-cache architecture this happens even without atomic operations, and can utterly ruin your day. So locality of reference is not necessarily a good thing where multiple cores are involved, even if they share cache. You can expect it to be, on grounds that cache misses usually are a source of lost speed, but be horribly wrong in your particular case.
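
Here is a minimal sketch of that gotcha and of one common mitigation, assuming a 64-byte cache line (newer standard libraries also expose std::hardware_destructive_interference_size as a hint, where available):

#include <atomic>

// Two counters bumped by two different threads. Packed together they almost
// certainly share a cache line, so the cores ping-pong ownership of that line
// even though neither thread ever touches the other's field.
struct PackedCounters {
    std::atomic<long> a{0};
    std::atomic<long> b{0};
};

// Padding each counter out to its own (assumed 64-byte) cache line lets the
// two threads proceed independently, at the cost of extra memory.
struct PaddedCounters {
    alignas(64) std::atomic<long> a{0};
    alignas(64) std::atomic<long> b{0};
};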

Now, quite aside from distinguishing between commonly-used and less-used fields, the smaller an object is, the less memory (and hence less cache) it occupies. This is pretty much good news all around, at least where you don't have heavy contention. The size of an object depends on the fields in it, and on any padding which has to be inserted between fields in order to ensure they are correctly aligned for the architecture. C++ (sometimes) puts constraints on the order in which fields must appear in an object, based on the order in which they are declared. This is to make low-level programming easier. So, if your object contains:

  • an int (4 bytes, 4-aligned)
  • followed by a char (1 byte, any alignment)
  • followed by an int (4 bytes, 4-aligned)
  • followed by a char (1 byte, any alignment)

then chances are this will occupy 16 bytes in memory. The size and alignment of int isn't the same on every platform, by the way, but 4 is very common and this is just an example.

In this case, the compiler will insert 3 bytes of padding before the second int, to correctly align it, and 3 bytes of padding at the end. An object's size has to be a multiple of its alignment, so that objects of the same type can be placed adjacent in memory. That's all an array is in C/C++, adjacent objects in memory. Had the struct been int, int, char, char, then the same object could have been 12 bytes, because char has no alignment requirement.
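
A small sketch of that difference; the offsets and totals in the comments assume the 4-byte, 4-aligned int from the example above:

#include <cstdio>

struct Interleaved {   // int, char, int, char -- the order described above
    int  a;            // offset 0
    char b;            // offset 4, then 3 bytes of padding
    int  c;            // offset 8
    char d;            // offset 12, then 3 bytes of tail padding
};                     // typically sizeof == 16

struct Grouped {       // int, int, char, char
    int  a;            // offset 0
    int  c;            // offset 4
    char b;            // offset 8
    char d;            // offset 9, then 2 bytes of tail padding
};                     // typically sizeof == 12

int main() {
    std::printf("%zu %zu\n", sizeof(Interleaved), sizeof(Grouped));  // e.g. "16 12"
}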

I said that whether int is 4-aligned is platform-dependent: on ARM it absolutely has to be, since unaligned access throws a hardware exception. On x86 you can access ints unaligned, but it's generally slower and IIRC non-atomic. So compilers usually (always?) 4-align ints on x86.
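
Rather than guessing per platform, you can also ask the compiler directly with alignof (C++11); a tiny sketch:

#include <cstdio>

int main() {
    // Reports the alignment the compiler actually requires on the target you
    // are building for -- e.g. 4 for int on typical x86 and ARM ABIs.
    std::printf("sizeof(int) = %zu, alignof(int) = %zu\n", sizeof(int), alignof(int));
    std::printf("sizeof(long long) = %zu, alignof(long long) = %zu\n",
                sizeof(long long), alignof(long long));
}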

The rule of thumb when writing code, if you care about packing, is to look at the alignment requirement of each member of the struct. Then order the fields with the biggest-aligned types first, then the next largest, and so on down to members with no alignment requirement. For example, if I'm trying to write portable code, I might come up with this:

#include <cstdint>

struct some_stuff {
    double d;   // I expect double is 64bit IEEE, it might not be
    uint64_t l; // 8 bytes, could be 8-aligned or 4-aligned, I don't know
    uint32_t i; // 4 bytes, usually 4-aligned
    int32_t j;  // same
    short s;    // usually 2 bytes, could be 2-aligned or unaligned, I don't know
    char c[4];  // array of 4 chars, 4 bytes big but "never" needs 4-alignment
    char e;     // 1 byte, any alignment
};
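
The layout guesses in those comments can also be written down as compile-time checks. This sketch assumes the struct above is in scope; the concrete numbers are typical 64-bit assumptions, so adjust or drop them where they don't hold:

#include <cstddef>   // offsetof

// Always true: a struct's size is a multiple of its alignment, which in turn
// is the alignment of its most-aligned member.
static_assert(sizeof(some_stuff) % alignof(some_stuff) == 0,
              "size is a multiple of alignment");

// Platform guesses -- expected to hold on a typical 64-bit target.
static_assert(alignof(some_stuff) == 8, "expected an 8-aligned struct");
static_assert(offsetof(some_stuff, l) == 8, "no padding expected between d and l");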

If you don't know the alignment of a field, or you're writing portable code but want to do the best you can without major trickery, then you assume that the alignment requirement is the largest requirement of any fundamental type in the structure, and that the alignment requirement of fundamental types is their size. So, if your struct contains a uint64_t, or a long long, then the best guess is it's 8-aligned. Sometimes you'll be wrong, but you'll be right a lot of the time.

Note that games programmers like your blogger often know everything about their processor and hardware, and thus they don't have to guess. They know the cache line size, they know the size and alignment of every type, and they know the struct layout rules used by their compiler (for POD and non-POD types). If they support multiple platforms, then they can special-case for each one if necessary. They also spend a lot of time thinking about which objects in their game will benefit from performance improvements, and using profilers to find out where the real bottlenecks are. But even so, it's not such a bad idea to have a few rules of thumb that you apply whether the object needs it or not. As long as it won't make the code unclear, "put commonly-used fields at the start of the object" and "sort by alignment requirement" are two good rules.
