Custom data size for memory alignment


Question

Each data type has a certain range, based on the hardware. For example, on a 32-bit machine an int has the range -2147483648 to 2147483647.

C++ compilers "pad" object memory to fit into certain sizes. I'm pretty sure it's 2, 4, 8, 16, 32, 64, etc. This also probably depends on the machine.

I want to manually align my objects to meet padding requirements. Is there a way to:

  • Determine what machine a program is running on
  • Determine the padding size
  • Set a custom data type based on bit size

I've used bitsets before in Java, but I'm not familiar with C++. As for machine requirements, I know that programs for different hardware are usually compiled differently in C++, so I'm wondering if it's even possible at all.

Example ->

/*getHardwarePack size obviously doesn't exist, just here to explain. What I'm trying to get  
here would be the minimum alignment size for the machine the program is running on*/

#define PACK_SIZE = getHardwarePackSize();
#define MONTHS = 12;

class date{

    private:

           //Pseudo code that represents making a custom type
           customType monthType = MONTH/PACK_SIZE; 

           monthType.remainder  = MONTH % PACK_SIZE;

           monthType months = 12;
};

The idea is to be able to fit every variable into the minimum bit size and track how many bits are left over.

Theoretically, it would be possible to make use of every unused bit and improve memory efficiency. Obviously it would never work exactly like this; the example is just to explain the concept.

Recommended answer

This is a lot more complex than what you are describing, because there are alignment requirements both for objects and for the items within objects. For example, if the compiler decides that an integer item is 16 bytes into a struct or class, it may well decide that "ah, I can use an aligned SSE instruction to load this data, because it is aligned at 16 bytes" (or something similar on ARM, PowerPC, etc.). So if you do not satisfy AT LEAST that alignment in your code, you will cause the program to go wrong (crash or misread the data, depending on the architecture).

Typically, the alignment used and given by the compiler will be "right" for whatever architecture the compiler is targeting. Changing it will often lead to worse performance. Not always, of course, but you'd better know exactly what you are doing before you fiddle with it. Measure the performance before and after, and test thoroughly that nothing has been broken.

The padding is typically just up to the "minimum alignment for the largest type". For example, if a struct contains only an int and a couple of char variables, it will be padded to a multiple of 4 bytes [inside the struct and at the end, as required], assuming a 4-byte int. A double is padded to 8 bytes to ensure its alignment, but three doubles will typically take up 8 * 3 bytes with no further padding.
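The padding rule above can be observed directly with sizeof, alignof and offsetof. The following is a minimal sketch; the struct names IntAndChars and ThreeDoubles and the constant kIntAndCharsPadding are made up for illustration, and the exact sizes depend on the compiler and target.

```cpp
#include <cstddef>   // offsetof, std::size_t

// Hypothetical example types, not from the original post.
struct IntAndChars {
    char a;   // usually followed by padding so that `i` lands on an int boundary
    int  i;
    char b;   // usually followed by tail padding up to a multiple of alignof(int)
};

struct ThreeDoubles {
    double x, y, z;   // all members have the same type, so no padding is expected
};

// Bytes in IntAndChars that hold no member data (6 on a typical 4-byte-int target).
constexpr std::size_t kIntAndCharsPadding =
    sizeof(IntAndChars) - (sizeof(int) + 2 * sizeof(char));

// Checks that are expected to hold on mainstream compilers and ABIs.
static_assert(sizeof(IntAndChars) % alignof(IntAndChars) == 0,
              "the size of a type is a multiple of its alignment");
static_assert(alignof(IntAndChars) >= alignof(int),
              "a struct is at least as strictly aligned as its members");
static_assert(offsetof(IntAndChars, i) % alignof(int) == 0,
              "the int member sits on an int boundary");
static_assert(sizeof(ThreeDoubles) == 3 * sizeof(double),
              "three doubles need no extra padding");
```

Reordering the members of IntAndChars so that the two char members sit next to each other typically reduces the padding, which is usually a safer way to save space than overriding the compiler's alignment.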

Also, determining what hardware you are executing on (or will execute on) is probably better done at compile time than at runtime. At runtime, your code has already been generated and loaded. You can't really change the offsets and alignments of things at that point.
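As a rough illustration of detecting the target at compile time, the sketch below uses predefined macros that GCC and Clang provide for common architectures (plus the MSVC equivalents for x86). The constant names kTargetArch, kPointerBits and kMaxAlign are made up for this example, and the list of architectures is deliberately incomplete.

```cpp
#include <climits>   // CHAR_BIT
#include <cstddef>   // std::size_t, std::max_align_t

// The preprocessor picks exactly one branch while compiling; nothing is
// decided at runtime.
#if defined(__x86_64__) || defined(_M_X64)
constexpr const char* kTargetArch = "x86-64";
#elif defined(__i386__) || defined(_M_IX86)
constexpr const char* kTargetArch = "x86 (32-bit)";
#elif defined(__aarch64__)
constexpr const char* kTargetArch = "AArch64";
#elif defined(__arm__)
constexpr const char* kTargetArch = "ARM (32-bit)";
#else
constexpr const char* kTargetArch = "unknown";
#endif

// Width of a pointer on the target, in bits.
constexpr std::size_t kPointerBits = sizeof(void*) * CHAR_BIT;

// Strictest alignment required by any scalar type on the target.
constexpr std::size_t kMaxAlign = alignof(std::max_align_t);
```

All three constants are fixed when the program is compiled, so they describe the target the compiler was told to build for, not necessarily the exact CPU model the binary later runs on.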

If you are using the gcc or clang compilers, you can use __attribute__((aligned(n))). For example, int x[4] __attribute__((aligned(32))); creates a 16-byte array (with a 4-byte int) that is aligned to 32 bytes. This can be done inside structures or classes as well as for any variable you are using. But it is a compile-time option and cannot be used at runtime.
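A small sketch of the attribute in use, assuming gcc or clang. It also shows alignas, which is the standard C++11 way to make the same request. The names x_attr, x_std, Vec4 and on_32_byte_boundary are made up for this example.

```cpp
#include <cassert>
#include <cstdint>   // std::uintptr_t

// GCC/Clang spelling, as in the text above.
int x_attr[4] __attribute__((aligned(32)));

// Standard C++11 spelling of the same request.
alignas(32) int x_std[4];

// Alignment can also be requested for a member; this raises the alignment
// of the enclosing struct as well.
struct Vec4 {
    alignas(16) float v[4];
};
static_assert(alignof(Vec4) == 16, "member alignment propagates to the struct");

// Hypothetical helper: true if the address is a multiple of 32.
inline bool on_32_byte_boundary(const void* p) {
    return reinterpret_cast<std::uintptr_t>(p) % 32 == 0;
}

// Runtime confirmation that both arrays were placed as requested.
inline void check_alignment() {
    assert(on_32_byte_boundary(x_attr));
    assert(on_32_byte_boundary(x_std));
}
```

Note that the alignment belongs to the variable here, not to the type int[4], so alignof applied to the array type still reports the alignment of int.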

From C++11 onwards, it is also possible to find out the alignment of a type or variable with alignof.

Note that it gives the alignment required for the type, so if you do something daft like:

 char buf[4 * sizeof(int)];
 int *p = (int *)(buf + 7);   // deliberately misaligned; never dereference this
 std::cout << alignof(decltype(*p)) << std::endl;   // alignment of the type int

the code will print 4 (where int requires 4-byte alignment), although the address buf+7 itself is probably 3 bytes past a 4-byte boundary (7 modulo 4).
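Since alignof only reports what the type requires, checking what an actual address provides has to be done on the pointer value. The sketch below shows one way to do that, and uses std::align from <memory> to find the next usable position in a buffer. The helper names is_aligned and first_int_slot are made up for this example.

```cpp
#include <cassert>
#include <cstddef>   // std::size_t
#include <cstdint>   // std::uintptr_t
#include <memory>    // std::align

// True if the address p is a multiple of `alignment`.
inline bool is_aligned(const void* p, std::size_t alignment) {
    return reinterpret_cast<std::uintptr_t>(p) % alignment == 0;
}

// Returns the first address at or after `start`, within `space` bytes,
// where an int could be placed, or nullptr if no int fits.
inline void* first_int_slot(void* start, std::size_t space) {
    // std::align adjusts its pointer and space arguments in place; here they
    // are local copies, so the caller's values are untouched.
    return std::align(alignof(int), sizeof(int), start, space);
}

// Example use on a buffer that is itself aligned for int.
inline void demo() {
    alignas(int) char buf[4 * sizeof(int)];
    assert(is_aligned(buf, alignof(int)));
    void* slot = first_int_slot(buf + 1, sizeof(buf) - 1);
    assert(slot != nullptr && is_aligned(slot, alignof(int)));
}
```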

Types cannot be chosen at runtime. C++ is a statically typed language: the type of something is determined at compile time. Sure, objects of classes that derive from a base class can be created at runtime, but any given object has ONE TYPE, always and forever, until it is no longer allocated.

It is better to make such choices at compile time, as it makes the code much more straightforward for the compiler and allows better optimisation than if the choices are made at runtime, since you would then have to make a runtime decision to use branch A or branch B of some piece of code.
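One way to make this kind of choice at compile time, close in spirit to the "custom type for months" idea in the question, is to let a template pick the smallest standard integer type that can hold a given maximum value. This is only a sketch under that assumption; smallest_uint and Date are names made up for this example.

```cpp
#include <cstdint>       // std::uint8_t ... std::uint64_t, UINT8_MAX ...
#include <type_traits>   // std::conditional, std::is_same

// Hypothetical helper: the smallest fixed-width unsigned type that can
// represent every value from 0 to Max. Resolved entirely by the compiler.
template <unsigned long long Max>
using smallest_uint = typename std::conditional<
    (Max <= UINT8_MAX), std::uint8_t,
    typename std::conditional<
        (Max <= UINT16_MAX), std::uint16_t,
        typename std::conditional<
            (Max <= UINT32_MAX), std::uint32_t, std::uint64_t
        >::type
    >::type
>::type;

// Each member gets a type just wide enough for its range.
struct Date {
    smallest_uint<12>   month;   // std::uint8_t
    smallest_uint<31>   day;     // std::uint8_t
    smallest_uint<9999> year;    // std::uint16_t
};

static_assert(std::is_same<smallest_uint<12>, std::uint8_t>::value,
              "12 fits in 8 bits");
static_assert(std::is_same<smallest_uint<9999>, std::uint16_t>::value,
              "9999 needs 16 bits");
```

On a typical target this Date occupies 4 bytes. The granularity is whole bytes, not individual bits, and the compiler still applies its normal padding rules to the chosen types.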

As an example of aligned vs. unaligned access:

#include <cstdio>
#include <cstdlib>
#include <vector>

#define LOOP_COUNT 1000

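// Reads the CPU time-stamp counter. x86/x86-64 only, and relies on
// GCC/Clang inline assembly syntax.
unsigned long long rdtscl(void)
{
    unsigned int lo, hi;
    __asm__ __volatile__ ("rdtsc" : "=a"(lo), "=d"(hi));
    return ( (unsigned long long)lo)|( ((unsigned long long)hi)<<32 );
}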

struct A
{
    long a;
    long b;
    long d;
    char c;
};

struct B 
{
    long a;
    long b;
    long d;
    char c;
} __attribute__((packed));

std::vector<A> arr1(LOOP_COUNT);
std::vector<B> arr2(LOOP_COUNT);


int main()
{
    // Fill both arrays with the same pseudo-random values.
    for (int i = 0; i < LOOP_COUNT; i++)
    {
        arr1[i].a = arr2[i].a = rand();
        arr1[i].b = arr2[i].b = rand();
        arr1[i].c = arr2[i].c = rand();
        arr1[i].d = arr2[i].d = rand();
    }

    // %zu is the conversion for size_t, %llu for unsigned long long.
    printf("align A %zu, size %zu\n", alignof(A), sizeof(A));
    printf("align B %zu, size %zu\n", alignof(B), sizeof(B));
    for (int loops = 0; loops < 10; loops++)
    {
        printf("Run %d\n", loops);
        size_t sum = 0;
        size_t sum2 = 0;

        // Time the naturally aligned layout.
        unsigned long long before = rdtscl();
        for (int i = 0; i < LOOP_COUNT; i++)
            sum += arr1[i].a + arr1[i].b + arr1[i].c + arr1[i].d;
        unsigned long long after = rdtscl();
        printf("ARR1 %llu sum=%zu\n", (after - before), sum);

        // Time the packed (unaligned) layout.
        before = rdtscl();
        for (int i = 0; i < LOOP_COUNT; i++)
            sum2 += arr2[i].a + arr2[i].b + arr2[i].c + arr2[i].d;
        after = rdtscl();
        printf("ARR2 %llu sum=%zu\n", (after - before), sum2);
    }
}

[Part of that code is taken from another project, so it's perhaps not the neatest C++ code ever written, but it saved me from writing, from scratch, code that isn't relevant to the project.]

Then the results:

$ ./a.out
align A 8, size 32
align B 1, size 25
Run 0
ARR1 5091 sum=3218410893518
ARR2 5051 sum=3218410893518
Run 1
ARR1 3922 sum=3218410893518
ARR2 4258 sum=3218410893518
Run 2
ARR1 3898 sum=3218410893518
ARR2 4241 sum=3218410893518
Run 3
ARR1 3876 sum=3218410893518
ARR2 4184 sum=3218410893518
Run 4
ARR1 3875 sum=3218410893518
ARR2 4191 sum=3218410893518
Run 5
ARR1 3876 sum=3218410893518
ARR2 4186 sum=3218410893518
Run 6
ARR1 3875 sum=3218410893518
ARR2 4189 sum=3218410893518
Run 7
ARR1 3925 sum=3218410893518
ARR2 4229 sum=3218410893518
Run 8
ARR1 3884 sum=3218410893518
ARR2 4210 sum=3218410893518
Run 9
ARR1 3876 sum=3218410893518
ARR2 4186 sum=3218410893518

As you can see, the aligned code, using arr1, takes around 3900 clock cycles, and the code using arr2 takes around 4200 cycles. So that is 300 cycles in roughly 4000, some 7.5% if my "menthol arithmetic" works correctly.

Of course, like so many things, it really depends on the exact situation: how the objects are used, what the cache size is, exactly what processor it is, and how much other code and data around it is also using cache space. The only way to be certain is to experiment with YOUR code.

[I ran the code several times, and although I didn't always get the same results, I always got proportionally similar results.]
