如何组织结构中的成员以在对齐时浪费最少的空间? [英] How do I organize members in a struct to waste the least space on alignment?

查看:107
本文介绍了如何组织结构中的成员以在对齐时浪费最少的空间?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

[不是结构填充和打包的副本.这个问题是关于填充的方式和时间.这是关于如何处理它的内容.]

我刚刚意识到C ++中的对齐会浪费多少内存.考虑下面的简单示例:

struct X
{
    int a;
    double b;
    int c;
};

int main()
{
    cout << "sizeof(int) = "                      << sizeof(int)                      << '\n';
    cout << "sizeof(double) = "                   << sizeof(double)                   << '\n';
    cout << "2 * sizeof(int) + sizeof(double) = " << 2 * sizeof(int) + sizeof(double) << '\n';
    cout << "but sizeof(X) = "                    << sizeof(X)                        << '\n';
}

使用g ++时,程序将提供以下输出:

sizeof(int) = 4
sizeof(double) = 8
2 * sizeof(int) + sizeof(double) = 16
but sizeof(X) = 24

那是50%的内存开销!在134'217'728 X的3 GB数组中,1 GB将是纯填充.

幸运的是,该问题的解决方案非常简单-我们只需在以下位置交换double bint c即可:

struct X
{
    int a;
    int c;
    double b;
};

现在的结果更加令人满意:

sizeof(int) = 4
sizeof(double) = 8
2 * sizeof(int) + sizeof(double) = 16
but sizeof(X) = 16

但是有一个问题:这不是交叉兼容的.是的,在g ++下,int是4个字节,而double是8个字节,但这不一定总是正确的(它们的对齐方式也不必相同),因此在不同的环境下,此修复"可以不仅无用,而且还可能通过增加所需的填充量来使情况变得更糟.

是否存在一种可靠的跨平台方法来解决此问题(在不因未对齐而导致性能下降的情况下,尽量减少所需填充的数量 )? 编译器为什么不执行此类优化(交换结构/类成员以减少填充)?

澄清

由于误解和困惑,我想强调我不想打包"我的struct .也就是说,我不希望其成员不结盟,因此访问速度较慢.取而代之的是,我仍然希望所有成员都是自我对齐的,但是要在填充时使用最少的内存.可以通过使用此处和迷失的包装艺术中所述的手动重新布置来解决此问题. .我正在寻找一种自动且尽可能跨平台的方法来实现此目的,类似于解决方案

(请勿无视地应用这些规则.请参阅ESR关于您一起使用的成员的缓存局部性的观点.在多线程程序中,请注意错误共享由不同线程编写的成员.通常,出于这个原因,您根本不希望每个线程的数据都在单个结构中,除非您要用大的alignas(128)来控制分离.这适用于atomic和非原子var;重要的是线程写入缓存行,而不管它们如何执行.)


经验法则:从大到小alignof() .您不可能在任何地方都能做到完美,但是到目前为止,当今最常见的情况是针对正常的32位或64位CPU的合理的正常" C ++实现.所有原始类型都具有2的幂.

大多数类型的alignof(T) = sizeof(T)alignof(T)都以实现的寄存器宽度为上限.因此,较大的类型通常比较小的类型更对齐.

大多数ABI中的结构打包规则赋予结构成员相对于结构开始位置的绝对alignof(T)对齐方式,并且结构本身继承了其任何成员中最大的alignof().

(对于无符号整数类型,请在列表中找到对应的有符号类型.)

如果需要的话,可以将8个字节的倍数较小的数组 排在较早的位置.但是,如果您不知道类型的确切大小,则不能保证int i + char buf[4]将填充两个double之间的8字节对齐插槽.但这不是一个糟糕的假设,因此,如果出于某种原因(例如,一起访问的成员的空间位置)将他们放在一起而不是放在最后,我还是会这么做.

奇异类型:x86-64 System V具有alignof(long double) = 16,而i386 System V仅具有alignof(long double) = 4sizeof(long double) = 12.这是x87 80位类型,实际上是10个字节,但填充为12或16,因此是其alignof的倍数,从而可以在不违反对齐保证的情况下进行阵列.

通常,当您的结构成员本身是具有sizeof(x) != alignof(x)的聚合(结构或联合)时,会变得更加棘手.

另一种扭曲是,在某些ABI中(例如,如果我没记错的话,是32位Windows),结构成员相对于结构的起始位置对齐其大小(最多8个字节),甚至尽管alignof(T)对于doubleint64_t仍然仅为4.
这是针对单个结构的8字节对齐内存的单独分配的常见情况的优化,而不提供对齐 guarantee .对于大多数基本类型,i386 System V也具有相同的alignof(T) = 4(但是malloc仍会为您提供8字节对齐的内存,因为alignof(maxalign_t) = 8).但是无论如何,i386 System V没有该结构打包规则,因此(如果您不按从最大到最小的顺序排列结构)最终可能会导致8字节成员相对于结构的开头未对齐


大多数CPU具有寻址模式,给定寄存器中的指针,该寻址模式允许访问任何字节偏移量.最大偏移量通常非常大,但是在x86上,如果字节偏移量适合带符号的字节([-128 .. +127]),它将节省代码大小.因此,如果您有各种类型的大型数组,则希望稍后将其放置在结构中,该列表位于经常使用的成员之后.即使这需要一些填充.

您的编译器将几乎总是使具有结构地址的代码在寄存器中,而不是在结构中间的地址来利用短负位移.


Eric S. Raymond写了一篇文章《结构包装的迷失的艺术》 .具体来说,关于结构重新排序的部分基本上是针对此问题的答案.

他还提出了另一个重要观点:

9.可读性和缓存位置

虽然按大小重新排序是消除倾斜的最简单方法,但不一定是正确的事情.还有两个问题:可读性和缓存局部性.

在一个结构中,可以很容易地将其拆分成一条缓存行边界,如果它们总是一起使用,则将两件事放在附近是有意义的.甚至是连续的以允许加载/存储合并,例如使用一个(不需要的)整数或SIMD加载/存储来复制8或16个字节,而不是分别加载较小的成员.

在现代CPU上,高速缓存行通常为32或64字节. (在现代的x86上,始终为64个字节.Sandybridge系列在L2高速缓存中具有一个相邻行空间预取器,它试图完成128字节的行对,与主要的L2流媒体硬件预取模式检测器和L1d预取分开). /p>


有趣的事实:Rust允许编译器为了更好的打包或其他原因而对结构进行重新排序.但是,如果有任何编译器实际这样做,则可以使用IDK.如果希望基于结构的实际使用方式进行选择,则只有在链接时整个程序优化中才有可能.否则,程序的单独编译部分将无法在布局上达成共识.


(@ alexis发布了仅链接的答案,链接到ESR的文章,因此感谢您的起点.)

[Not a duplicate of Structure padding and packing. That question is about how and when padding occurs. This one is about how to deal with it.]

I have just realized how much memory is wasted as a result of alignment in C++. Consider the following simple example:

struct X
{
    int a;
    double b;
    int c;
};

int main()
{
    cout << "sizeof(int) = "                      << sizeof(int)                      << '\n';
    cout << "sizeof(double) = "                   << sizeof(double)                   << '\n';
    cout << "2 * sizeof(int) + sizeof(double) = " << 2 * sizeof(int) + sizeof(double) << '\n';
    cout << "but sizeof(X) = "                    << sizeof(X)                        << '\n';
}

When using g++ the program gives the following output:

sizeof(int) = 4
sizeof(double) = 8
2 * sizeof(int) + sizeof(double) = 16
but sizeof(X) = 24

That's 50% memory overhead! In a 3-gigabyte array of 134'217'728 Xs 1 gigabyte would be pure padding.

Fortunately, the solution to the problem is very simple - we simply have to swap double b and int c around:

struct X
{
    int a;
    int c;
    double b;
};

Now the result is much more satisfying:

sizeof(int) = 4
sizeof(double) = 8
2 * sizeof(int) + sizeof(double) = 16
but sizeof(X) = 16

There is however a problem: this isn't cross-compatible. Yes, under g++ an int is 4 bytes and a double is 8 bytes, but that's not necessarily always true (their alignment doesn't have to be the same either), so under a different environment this "fix" could not only be useless, but it could also potentially make things worse by increasing the amount of padding needed.

Is there a reliable cross-platform way to solve this problem (minimize the amount of needed padding without suffering from decreased performance caused by misalignment)? Why doesn't the compiler perform such optimizations (swap struct/class members around to decrease padding)?

Clarification

Due to misunderstanding and confusion, I'd like to emphasize that I don't want to "pack" my struct. That is, I don't want its members to be unaligned and thus slower to access. Instead, I still want all members to be self-aligned, but in a way that uses the least memory on padding. This could be solved by using, for example, manual rearrangement as described here and in The Lost Art of Packing by Eric Raymond. I am looking for an automated and as much cross-platform as possible way to do this, similar to what is described in proposal P1112 for the upcoming C++20 standard.

解决方案

(Don't apply these rules without thinking. See ESR's point about cache locality for members you use together. And in multi-threaded programs, beware false sharing of members written by different threads. Generally you don't want per-thread data in a single struct at all for this reason, unless you're doing it to control the separation with a large alignas(128). This applies to atomic and non-atomic vars; what matters is threads writing to cache lines regardless of how they do it.)


Rule of thumb: largest to smallest alignof(). There's nothing you can do that's perfect everywhere, but by far the most common case these days is a sane "normal" C++ implementation for a normal 32 or 64-bit CPU. All primitive types have power-of-2 sizes.

Most types have alignof(T) = sizeof(T), or alignof(T) capped at the register width of the implementation. So larger types are usually more-aligned than smaller types.

Struct-packing rules in most ABIs give struct members their absolute alignof(T) alignment relative to the start of the struct, and the struct itself inherits the largest alignof() of any of its members.

  • Put always-64-bit members first (like double, long long, and int64_t). ISO C++ of course doesn't fix these types at 64 bits / 8 bytes, but in practice on all CPUs you care about they are. People porting your code to exotic CPUs can tweak struct layouts to optimize if necessary.
  • then pointers and pointer-width integers: size_t, intptr_t, and ptrdiff_t (which may be 32 or 64-bit). These are all the same width on normal modern C++ implementations for CPUs with a flat memory model.

    Consider putting linked-list and tree left/right pointers first if you care about x86 and Intel CPUs. Pointer-chasing through nodes in a tree or linked list has penalties when the struct start address is in a different 4k page than the member you're accessing. Putting them first guarantees that can't be the case.

  • then long (which is sometimes 32-bit even when pointers are 64-bit, in LLP64 ABIs like Windows x64). But it's guaranteed at least as wide as int.

  • then 32-bit int32_t, int, float, enum. (Optionally separate int32_t and float ahead of int if you care about possible 8 / 16-bit systems that still pad those types to 32-bit, or do better with them naturally aligned. Most such systems don't have wider loads (FPU or SIMD) so wider types have to be handled as multiple separate chunks all the time anyway).

    ISO C++ allows int to be as narrow as 16 bits, or arbitrarily wide, but in practice it's a 32-bit type even on 64-bit CPUs. ABI designers found that programs designed to work with 32-bit int just waste memory (and cache footprint) if int was wider. Don't make assumptions that would cause correctness problems, but for "portable performance" you just have to be right in the normal case.

    People tuning your code for exotic platforms can tweak if necessary. If a certain struct layout is perf-critical, perhaps comment on your assumptions and reasoning in the header.

  • then short / int16_t
  • then char / int8_t / bool
  • (for multiple bool flags, especially if read-mostly or if they're all modified together, consider packing them with 1-bit bitfields.)

(For unsigned integer types, find the corresponding signed type in my list.)

A multiple-of-8 byte array of narrower types can go earlier if you want it to. But if you don't know the exact sizes of types, you can't guarantee that int i + char buf[4] will fill an 8-byte aligned slot between two doubles. But it's not a bad assumption, so I'd do it anyway if there was some reason (like spatial locality of members accessed together) for putting them together instead of at the end.

Exotic types: x86-64 System V has alignof(long double) = 16, but i386 System V has only alignof(long double) = 4, sizeof(long double) = 12. It's the x87 80-bit type, which is actually 10 bytes but padded to 12 or 16 so it's a multiple of its alignof, making arrays possible without violating the alignment guarantee.

And in general it gets trickier when your struct members themselves are aggregates (struct or union) with a sizeof(x) != alignof(x).

Another twist is that in some ABIs (e.g. 32-bit Windows if I recall correctly) struct members are aligned to their size (up to 8 bytes) relative to the start of the struct, even though alignof(T) is still only 4 for double and int64_t.
This is to optimize for the common case of separate allocation of 8-byte aligned memory for a single struct, without giving an alignment guarantee. i386 System V also has the same alignof(T) = 4 for most primitive types (but malloc still gives you 8-byte aligned memory because alignof(maxalign_t) = 8). But anyway, i386 System V doesn't have that struct-packing rule, so (if you don't arrange your struct from largest to smallest) you can end up with 8-byte members under-aligned relative to the start of the struct.


Most CPUs have addressing modes that, given a pointer in a register, allow access to any byte offset. The max offset is usually very large, but on x86 it saves code size if the byte offset fits in a signed byte ([-128 .. +127]). So if you have a large array of any kind, prefer putting it later in the struct after the frequently used members. Even if this costs a bit of padding.

Your compiler will pretty much always make code that has the struct address in a register, not some address in the middle of the struct to take advantage of short negative displacements.


Eric S. Raymond wrote an article The Lost Art of Structure Packing. Specifically the section on Structure reordering is basically an answer to this question.

He also makes another important point:

9. Readability and cache locality

While reordering by size is the simplest way to eliminate slop, it’s not necessarily the right thing. There are two more issues: readability and cache locality.

In a large struct that can easily be split across a cache-line boundary, it makes sense to put 2 things nearby if they're always used together. Or even contiguous to allow load/store coalescing, e.g. copying 8 or 16 bytes with one (unaliged) integer or SIMD load/store instead of separately loading smaller members.

Cache lines are typically 32 or 64 bytes on modern CPUs. (On modern x86, always 64 bytes. And Sandybridge-family has an adjacent-line spatial prefetcher in L2 cache that tries to complete 128-byte pairs of lines, separate from the main L2 streamer HW prefetch pattern detector and L1d prefetching).


Fun fact: Rust allows the compiler to reorder structs for better packing, or other reasons. IDK if any compilers actually do that, though. Probably only possible with link-time whole-program optimization if you want the choice to be based on how the struct is actually used. Otherwise separately-compiled parts of the program couldn't agree on a layout.


(@alexis posted a link-only answer linking to ESR's article, so thanks for that starting point.)

这篇关于如何组织结构中的成员以在对齐时浪费最少的空间?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆