Can GCC be coerced to generate efficient constructors for memory-aligned objects?

Problem description

I'm optimizing a constructor that is called in one of our app's innermost loops. The class in question is about 100 bytes wide, consists of a bunch of ints, floats, bools, and trivial structs, and should be trivially copyable (it has a nontrivial default constructor, but no destructor or virtual functions). It is constructed often enough that every nanosecond of time spent in this ctor works out to around $6,000 of extra server hardware we need to buy.

However, I find that GCC is not emitting very efficient code for this constructor (even with -O3 -march etc set). GCC's implementation of the constructor, filling out default values via an initializer list, takes about 34ns to run. If instead of this default constructor I use a hand-written function that writes directly to the object's memory space with a variety of SIMD intrinsics and pointer math, construction takes about 8ns.

Can I get GCC to emit an efficient constructor for such objects when I __attribute__ them to be memory-aligned on SIMD boundaries? Or must I resort to old-school techniques like writing my own memory initializers in assembly?

This object is only ever constructed as a local on the stack, so any new/malloc overhead doesn't apply.

Context:

This class is used by constructing it on the stack as a local variable, selectively writing a few fields with non-default values, and then passing it (by reference) to a function, which passes its reference to another and so on.

struct Trivial {
  float x,y,z;
  Trivial () : x(0), y(0), z(0) {};
};

struct Frobozz
{
   int na,nb,nc,nd;
   bool ba,bb,bc;
   char ca,cb,cc;
   float fa,fb;
   Trivial va, vb; // in the real class there's several different kinds of these
   // and so on
   Frobozz() : na(0), nb(1), nc(-1), nd(0),
               ba(false), bb(true), bc(false),
               ca('a'), cb('b'), cc('c'),
               fa(-1), fb(1.0) // etc
    {}
} __attribute__(( aligned(16) ));

// a pointer to a func that takes the struct by reference
typedef int (*FrobozzSink_t)( Frobozz& );

// example of how a function might construct one of the param objects and send it
// to a sink. Imagine this is one of thousands of event sources:
int OversimplifiedExample( int a, float b )
{
   Frobozz params; 
   params.na = a; params.fb = b; // other fields use their default values
   FrobozzSink_t funcptr = AssumeAConstantTimeOperationHere();
   return (*funcptr)(params);
}

The optimal constructor here would work by copying from a static "template" instance into the freshly constructed instance, ideally using SIMD operators to work 16 bytes at a time. Instead GCC does exactly the wrong thing for OversimplifiedExample(): a series of immediate mov ops to fill out the struct byte-by-byte.

// from objdump -dS
int OversimplifiedExample( int a, float b )
{
     a42:55                   push   %ebp
     a43:89 e5                mov    %esp,%ebp
     a45:53                   push   %ebx
     a46:e8 00 00 00 00       call   a4b <_Z21OversimplifiedExampleif+0xb>
     a4b:5b                   pop    %ebx
     a4c:81 c3 03 00 00 00    add    $0x3,%ebx
     a52:83 ec 54             sub    $0x54,%esp
     // calling the 'Trivial()' constructors which move zero, word by word...
     a55:89 45 e0             mov    %eax,-0x20(%ebp)
     a58:89 45 e4             mov    %eax,-0x1c(%ebp)
     a5b:89 45 e8             mov    %eax,-0x18(%ebp)
     a5e:89 45 ec             mov    %eax,-0x14(%ebp)
     a61:89 45 f0             mov    %eax,-0x10(%ebp)
     a64:89 45 f4             mov    %eax,-0xc(%ebp)
     // filling out na/nb/nc/nd..
     a67:c7 45 c4 01 00 00 00 movl   $0x1,-0x3c(%ebp)
     a71:c7 45 c8 ff ff ff ff movl   $0xffffffff,-0x38(%ebp)
     a78:89 45 c0             mov    %eax,-0x40(%ebp)
     a7b:c7 45 cc 00 00 00 00 movl   $0x0,-0x34(%ebp)
     a82:8b 45 0c             mov    0xc(%ebp),%eax
     // doing the bools and chars by moving one immediate byte at a time!
     a85:c6 45 d0 00          movb   $0x0,-0x30(%ebp)
     a89:c6 45 d1 01          movb   $0x1,-0x2f(%ebp)
     a8d:c6 45 d2 00          movb   $0x0,-0x2e(%ebp)
     a91:c6 45 d3 61          movb   $0x61,-0x2d(%ebp)
     a95:c6 45 d4 62          movb   $0x62,-0x2c(%ebp)
     a99:c6 45 d5 63          movb   $0x63,-0x2b(%ebp)
     // now the floats...
     a9d:c7 45 d8 00 00 80 bf movl   $0xbf800000,-0x28(%ebp)
     aa4:89 45 dc             mov    %eax,-0x24(%ebp)
     // FrobozzSink_t funcptr = GetFrobozz();
     aa7:e8 fc ff ff ff       call   aa8 <_Z21OversimplifiedExampleif+0x68>
     // return (*funcptr)(params);
     aac:8d 55 c0             lea    -0x40(%ebp),%edx
     aaf:89 14 24             mov    %edx,(%esp)
     ab2:ff d0                call   *%eax
     ab4:83 c4 54             add    $0x54,%esp
     ab7:5b                   pop    %ebx
     ab8:c9                   leave 
     ab9:c3                   ret   
}

I tried to encourage GCC to construct a single 'default template' of this object, and then bulk-copy it in the default constructor, by doing a bit of trickery with a hidden 'dummy' constructor that made the base exemplar and then having the default just copy it:

struct Frobozz
{
     int na,nb,nc,nd;
     bool ba,bb,bc;
     char ca,cb,cc;
     float fa,fb;
     Trivial va, vb;
     inline Frobozz();
private:
     // and so on
     inline Frobozz( int dummy ) : na(0), /* etc etc */     {}
} __attribute__( ( aligned( 16 ) ) );

Frobozz::Frobozz( )
{
     const static Frobozz DefaultExemplar( 69105 );
     // analogous to copy-on-write idiom
     *this = DefaultExemplar;
     // or:
     // memcpy( this, &DefaultExemplar, sizeof(Frobozz) );
}

But this generated even slower code than the basic default with initializer list, due to some redundant stack copying.

Finally I resorted to writing an inlined free function to do the *this = DefaultExemplar step, using compiler intrinsics and assumptions about memory alignment to issue pipelined MOVDQA SSE2 opcodes that copy the struct efficiently. This got me the performance I need, but it's icky. I thought my days of writing initializers in assembly were behind me, and I'd really rather just have GCC's optimizer emit the right code in the first place.
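The helper described above is not shown in the question, so the following is only a sketch of what such a function might look like, assuming an x86 target with SSE2. The names kDefaultExemplar and InitFrobozz are hypothetical, the struct is declared without constructors so the exemplar can be brace-initialized, and static_assert assumes C++11 or later. _mm_load_si128 and _mm_store_si128 are the aligned 128-bit load and store intrinsics, which compilers normally lower to MOVDQA.

```cpp
#include <emmintrin.h>  // SSE2: _mm_load_si128, _mm_store_si128
#include <cassert>
#include <cstddef>
#include <stdint.h>

struct Trivial { float x, y, z; };

// Same field layout as the question's Frobozz, but with no constructors.
struct Frobozz {
    int na, nb, nc, nd;
    bool ba, bb, bc;
    char ca, cb, cc;
    float fa, fb;
    Trivial va, vb;
} __attribute__((aligned(16)));

// The aligned attribute pads the struct to a multiple of 16 bytes,
// which is what lets the loop below copy whole 128-bit chunks.
static_assert(sizeof(Frobozz) % 16 == 0, "Frobozz must be a multiple of 16 bytes");

// Hypothetical exemplar holding the default values.
static const Frobozz kDefaultExemplar = {
    0, 1, -1, 0,
    false, true, false,
    'a', 'b', 'c',
    -1.0f, 1.0f,
    {0.0f, 0.0f, 0.0f}, {0.0f, 0.0f, 0.0f}
};

// Hypothetical helper: copy the exemplar into *dst 16 bytes at a time.
// Both source and destination must be 16-byte aligned.
inline void InitFrobozz(Frobozz* dst) {
    assert(reinterpret_cast<uintptr_t>(dst) % 16 == 0);
    const __m128i* src = reinterpret_cast<const __m128i*>(&kDefaultExemplar);
    __m128i* out = reinterpret_cast<__m128i*>(dst);
    for (std::size_t i = 0; i < sizeof(Frobozz) / sizeof(__m128i); ++i) {
        _mm_store_si128(out + i, _mm_load_si128(src + i));
    }
}
```

With a fixed, small trip count the loop is a natural candidate for full unrolling at -O3, though whether a given GCC version does so is worth checking in the disassembly.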

Is there some way I can get GCC to generate optimal code for my constructor, some compiler setting or additional __attribute__ I've missed?

This is GCC 4.4 running on Ubuntu. Compiler flags include -m32 -march=core2 -O3 -fno-strict-aliasing -fPIC (among others). Portability is not a consideration, and I'm thoroughly willing to sacrifice standards-compliance for performance here.

Timings were performed by directly reading the time stamp counter with rdtsc, e.g. measuring a loop of N OversimplifiedExample() calls between samples, with due attention to timer resolution, cache effects, statistical significance, and so on.
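The measurement harness is not shown, so here is a minimal sketch of that kind of loop, assuming GCC or Clang on x86, where the __rdtsc() intrinsic is available from <x86intrin.h>. AverageTicksPerCall and SampleWork are hypothetical names. The result is in TSC ticks, not nanoseconds; converting requires the TSC frequency of the machine, and a careful benchmark would also repeat the measurement and discard outliers.

```cpp
#include <x86intrin.h>  // __rdtsc() on GCC and Clang, x86 only
#include <stdint.h>

// Hypothetical helper: average time stamp counter ticks per call of fn,
// measured over one loop of `iterations` calls.
template <typename Fn>
double AverageTicksPerCall(Fn fn, uint64_t iterations) {
    const uint64_t start = __rdtsc();
    for (uint64_t i = 0; i < iterations; ++i) {
        fn();
    }
    const uint64_t stop = __rdtsc();
    return static_cast<double>(stop - start) / static_cast<double>(iterations);
}

// Hypothetical workload. The volatile store keeps the compiler from
// optimizing the loop body away.
volatile int g_sink = 0;
void SampleWork() { g_sink = g_sink + 1; }
```

A call such as AverageTicksPerCall(SampleWork, 1000000) gives one sample; comparing samples across the two constructor variants is what yields the relative cost.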

I've also optimized this by reducing the number of call sites as much as possible, of course, but I'd still like to know how to generally get better ctors out of GCC.

Solution

Here's how I would do it. Don't declare any constructor; instead, declare a fixed Frobozz that contains default values:

const Frobozz DefaultFrobozz =
  {
  0, 1, -1, 0,        // int na,nb,nc,nd;
  false, true, false, // bool ba,bb,bc;
  'a', 'b', 'c',      // char ca,cb,cc;
  -1, 1.0             // float fa,fb;
  } ;

Then in OversimplifiedExample:

Frobozz params (DefaultFrobozz) ;

With gcc -O3 (version 4.5.2), the initialisation of params reduces to:

leal    -72(%ebp), %edi
movl    $_DefaultFrobozz, %esi
movl    $16, %ecx
rep movsl

which is about as good as it gets in a 32-bit environment.
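Putting the pieces of this approach together, a complete translation unit might look like the sketch below. It assumes C++11 or later for static_assert and <type_traits>. ExampleSink and the extra sink parameter are hypothetical stand-ins for the question's AssumeAConstantTimeOperationHere(), and Trivial is also written without a constructor so the whole object can be statically initialized.

```cpp
#include <type_traits>

struct Trivial { float x, y, z; };  // no constructor, so it stays an aggregate

struct Frobozz {
    int na, nb, nc, nd;
    bool ba, bb, bc;
    char ca, cb, cc;
    float fa, fb;
    Trivial va, vb;
} __attribute__((aligned(16)));

// With no user-declared constructors, copying is a plain block copy.
static_assert(std::is_trivially_copyable<Frobozz>::value,
              "Frobozz should be trivially copyable");

const Frobozz DefaultFrobozz = {
    0, 1, -1, 0,         // int na,nb,nc,nd;
    false, true, false,  // bool ba,bb,bc;
    'a', 'b', 'c',       // char ca,cb,cc;
    -1.0f, 1.0f          // float fa,fb;
    // va and vb are omitted, so their floats are zero-initialized
};

typedef int (*FrobozzSink_t)(Frobozz&);

// Hypothetical sink, for illustration only.
int ExampleSink(Frobozz& f) { return f.na + f.nb; }

int OversimplifiedExample(int a, float b, FrobozzSink_t sink) {
    Frobozz params(DefaultFrobozz);  // implicit copy constructor
    params.na = a;
    params.fb = b;  // other fields keep their default values
    return sink(params);
}
```

Which instructions the copy turns into depends on the compiler version and flags, so it is worth checking the generated assembly rather than assuming a block move.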

Warning: I tried this with the 64-bit g++ version 4.7.0 20110827 (experimental), and it generated an explicit sequence of 64-bit copies instead of a block move. The processor doesn't allow rep movsq, but I would expect rep movsl to be faster than a sequence of 64-bit loads and stores. Perhaps not. (But the -Os switch -- optimise for space -- does use a rep movsl instruction.) Anyway, try this and let us know what happens.

Edited to add: I was wrong about the processor not allowing rep movsq. Intel's documentation says "The MOVS, MOVSB, MOVSW, and MOVSD instructions can be preceded by the REP prefix", but it seems that this is just a documentation glitch. In any case, if I make Frobozz big enough, then the 64-bit compiler generates rep movsq instructions; so it probably knows what it's doing.
