Fastest implementation of simple, virtual, observer-sort of, pattern in c++?


Question

I'm working my arse off trying to implement an alternative for vtables using enums and a ton of macro magic that's really starting to mess with my brain. I'm starting to think i'm not walking the right path since the code is getting uglier and uglier, and will not be fit for production by any means.

How can the pattern of the following code be implemented with the least amount of redirection/operations?

It has to be done in standard C++, up to C++17.

#include <vector>

class A {
public:
    virtual void Update() = 0; // A is so pure *¬*
};

class B : public A
{
public:
    void Update() override final
    {
        // DO B STUFF
    }
};

class C : public A
{
public:
    void Update() override final
    {
        // DO C STUFF
    }
};

// class...

int main()
{
    std::vector<A*> vecA{};

    // Insert instances of B, C, ..., into vecA

    for(auto a: vecA) // This for will be inside a main loop
        a->Update(); // Ridiculous amount of calls per unit of time

    // Free memory
}

PS: If enum, switch and macros are indeed the best option, I guess I'll simply try to freshen up my caches and come up with a better design.

PSS: I know this is micro-optimization... Heck, I need to nano or even pico optimize this (figuratively speaking), so I will simply ignore any utilitarian responses that might come up.

Solution

As the first comment said, you have an XY problem here. Sorting / reordering is ok, and you have many objects, not a huge number of different classes, and no need to support types that your code doesn't know about at compile time. Polymorphism + virtual inheritance is the wrong choice.

Instead, use N different containers, one for each type of object, with no indirection. Letting the compiler inline B::Update() into a loop over all the B objects is much better. (For the trivial example below of incrementing one member int, my static performance analysis from looking at the asm puts it at about 24x faster on Skylake with the data hot in L1D cache. AVX2 auto-vectorization vs. call in a loop is really that huge.)
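
For example, a minimal sketch of that layout (the member variables here are hypothetical placeholders for your real per-type state):

#include <vector>

struct B { int m_b = 0; void Update() { m_b++; } };  // plain, non-virtual types
struct C { int m_c = 0; void Update() { m_c++; } };

std::vector<B> allB;   // one homogeneous container per type, objects stored inline
std::vector<C> allC;

void UpdateAll() {
    for (auto &b : allB)  b.Update();   // B::Update() inlines; can auto-vectorize
    for (auto &c : allC)  c.Update();   // same for C::Update()
}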

If there was some required order between the objects, including between different types of objects, then some kind of polymorphism or manual dispatching would be appropriate. (e.g. if it mattered what order you processed vecA in, keeping all the B objects separate from all the C objects wouldn't be equivalent.)


If you care about performance, you have to realize that making the source larger might simplify things for the compiler / in the asm output. Checking / dispatching based on the type of each object inside the inner loop is expensive. Using any kind of function pointer or enum to dispatch on a per-object basis can easily suffer from branch mispredicts when you have a mix of different objects.

Looping separately over multiple containers effectively hoists that type check out of the inner loop and lets the compiler devirtualize. (Or better, shrinks each object to not need a vtable pointer, enum, or function pointer in the first place, because its type is statically known.)

Writing out a separate loop for each container with a different type is sort of like fully unrolling a loop over different types after hoisting the type dispatching out of the inner loop. This is necessary for the compiler to inline the calls, which you want if there are a lot of objects of each type. Inlining lets it keep constants in registers across objects, enables SIMD auto-vectorization across multiple objects, and simply avoids the overhead of an actual function call. (Both the call itself and spill/reload of registers.)


You were right that if you did need per-object dispatch, C++ virtual functions are an expensive way to get it when you're using final overrides. You're paying the same runtime cost that would let your code support new derived classes of arbitrary size that it didn't know about at compile time, but not gaining any benefit from that.

Virtual dispatch only works with a level of indirection (e.g. a vector of pointers like you're using), which means you need to manage the pointed-to objects somehow, e.g. by allocating them from vector<B> poolB and vector<C> poolC. Although I'm not sure most implementations of vector<> use realloc() when they need to grow; the new/delete API doesn't have a realloc, so vector may actually copy every time it grows, instead of trying to extend the existing allocation in place. Check what your C++ implementation does, since it might suck compared to what you can do with malloc/realloc.
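
For example, a minimal sketch of that pooling scheme, assuming you know the counts up front so reserve() prevents any reallocation (which would invalidate every pointer stored in vecA):

#include <cstddef>
#include <vector>

// assumes the A, B, C hierarchy from the question
std::vector<B> poolB;
std::vector<C> poolC;
std::vector<A*> vecA;

void build_pools(std::size_t nb, std::size_t nc) {
    poolB.reserve(nb);   // must never grow after this: vecA holds raw
    poolC.reserve(nc);   // pointers into the pools
    for (std::size_t i = 0; i < nb; ++i) {
        poolB.emplace_back();
        vecA.push_back(&poolB.back());
    }
    for (std::size_t i = 0; i < nc; ++i) {
        poolC.emplace_back();
        vecA.push_back(&poolC.back());
    }
}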

And BTW, it should be possible to handle the new/delete with RAII, with no extra overhead for allocation/deallocation, as long as all your classes are trivially destructible. (But note that unique_ptr may defeat other optimizations for using the vector of pointers.) std::unique_ptr warns that it's UB to destroy it via a pointer to the base class, so you might have to roll your own. Still, on gcc on x86-64, sizeof(unique_ptr<class C>) is only 8, so it only has a single pointer member. But whatever, separately allocating zillions of tiny objects sucks so don't do it that way in the first place.


If you did need some kind of dispatch like the title asks for

If the objects are all similar sizes, then you really want to loop over objects, not pointers-to-objects. That would avoid the extra cache footprint of a vector of pointers, and it avoids the extra pointer-chasing latency that out-of-order execution has to hide to keep the execution units busy. But C++ virtual inheritance doesn't provide any standards-compliant way to get polymorphism for union upoly { B b; C c; } poly_array[1024]; You can hack this up yourself with reinterpret_cast<> in a way that probably works on x86-64 gcc, but you probably shouldn't. See @BeeOnRope's followup: Contiguous storage of polymorphic types. (Also an older Q&A: C++ polymorphism of a object in an array).

If you need that, the highest-performance way would probably be to build it yourself with an enum to index a table of function pointers (or use a switch() if your functions can inline). If your functions don't inline, switch() to a bunch of function-call cases doesn't usually optimize down to a table of function pointers even if they all have the same args (or no args). You usually get a jump table to a block of call instructions, rather than actually doing an indirect call. So there's an extra jump in every dispatch.
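
As a sketch of that hand-rolled dispatch (all names here are hypothetical): each object carries a small type tag, and the update loop indexes a table of free functions with it.

#include <cstddef>
#include <cstdint>

enum class Kind : std::uint8_t { B, C };

struct Obj {                      // flattened object: tag + union of per-type state
    Kind kind;
    union { int b_state; int c_state; };
};

void update_b(Obj &o) { o.b_state++; }   // stand-ins for "DO B STUFF"
void update_c(Obj &o) { o.c_state++; }   // and "DO C STUFF"

using UpdateFn = void (*)(Obj &);
constexpr UpdateFn table[] = { update_b, update_c };   // indexed by Kind

void update_all(Obj *objs, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)
        table[static_cast<std::size_t>(objs[i].kind)](objs[i]);  // one indirect call each
}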

C++17 std::visit with std::variant<B, C> (using non-virtual inheritance for B and C) seems to give you dispatch based on an internal enum. std::visit uses its own jump table to dispatch, even with only 2 possible types, instead of inlining them both and using a conditional branch. It also has to check for the "uninitialized" state all the time. You can get good code if you manually work around that with B *tmp = std::get_if<B>(&my_variant), and a __builtin_unreachable() to tell gcc that nullptr isn't a possibility. But at that point you might as well just roll your own struct polymorph { enum type; union { B b; C c; }; }; (with non-virtual functions) if you don't need an "uninitialized" state. Related: C++ polymorphism of a object in an array.
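
Here's a sketch of both versions, using non-virtual stand-ins for B and C (C++17; __builtin_unreachable() is a gcc/clang builtin, not standard):

#include <variant>
#include <vector>

struct B { int m_b = 0; void Update() { m_b++; } };  // non-virtual stand-ins
struct C { int m_c = 0; void Update() { m_c++; } };

std::vector<std::variant<B, C>> vec;

void update_visit() {
    for (auto &v : vec)
        std::visit([](auto &obj) { obj.Update(); }, v);   // library jump-table dispatch
}

void update_get_if() {            // manual two-type dispatch, no valueless check
    for (auto &v : vec) {
        if (B *b = std::get_if<B>(&v))       b->Update();
        else if (C *c = std::get_if<C>(&v))  c->Update();
        else __builtin_unreachable();        // promise there's no valueless state
    }
}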

In this case you only have one function, so you can put a function pointer inside each object as a member. Like void (*m_update)(A* this_object). In the caller, pass a pointer to the object as a void* or A*, since it's a non-member function. The implementation of the function will reinterpret_cast<C*>(this_object). (Not dynamic_cast: we're doing our own dispatching, not using C++'s.)
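
A sketch of that, with hypothetical non-virtual stand-ins for A and C (the reinterpret_cast is our own dispatch contract, exactly as described above):

#include <cstddef>

struct A { void (*m_update)(A *this_object); };  // base holds only the dispatch pointer

struct C : A { int m_c = 0; };                   // non-virtual inheritance

void c_update(A *this_object) {
    C *self = reinterpret_cast<C *>(this_object);  // not dynamic_cast: our own dispatch
    self->m_c++;
}
// setup would be e.g.:  C obj;  obj.m_update = c_update;

void update_all(A **objs, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)
        objs[i]->m_update(objs[i]);              // pass the object as the explicit "this"
}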

If you want to use B and C in other contexts where the function-pointer member would be taking up space for no benefit, you can keep the function pointers in a separate container instead of in the base class. So it would be for(i=0..n) funcptrs[i]( &objects[i] );. As long as your containers don't get out of sync, you're always passing a pointer to a function that knows what to do with it. Use that with union {B b; C c} objects[] (or a vector<union>).
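
A sketch of that parallel-containers variant, assuming non-virtual, trivially-destructible B and C (the union needs an explicit constructor, like the upoly union further down):

#include <cstddef>
#include <vector>

struct B { int m_b = 0; void Update() { m_b++; } };
struct C { int m_c = 0; void Update() { m_c++; } };

union Obj {
    Obj() {}   // explicit ctor so the union compiles
    B b;
    C c;
};

void update_b(Obj &o) { o.b.Update(); }
void update_c(Obj &o) { o.c.Update(); }

std::vector<Obj> objects;                 // kept in sync with funcptrs:
std::vector<void (*)(Obj &)> funcptrs;    // funcptrs[i] matches the real type of objects[i]

void update_all() {
    for (std::size_t i = 0; i < objects.size(); ++i)
        funcptrs[i](objects[i]);
}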

You can use void* if you want, especially if you make a separate array of function pointers. Then the union members don't need to inherit from a common base.

You could use std::function<> to store pointers to instance member functions, but on x86-64 gcc that's a 32-byte object. It's better for your cache footprint to only use 8-byte regular function pointers and write code that knows to pass an explicit pointer equivalent to the this pointer.

Putting a function pointer in each object may take more space than an enum or uint8_t, depending on current size/alignment. A small integer index into a table of function pointers might reduce the size of each instance of your objects vs. a pointer member, especially for 64-bit targets. Smaller objects could easily be worth the couple extra instructions to index an array of function pointers, and the possibly higher mispredict penalty from an extra pointer dereference. Memory / cache misses are often a bottleneck.

I'm assuming you do have some per-instance state, even though you don't show any. If not, then a vector of ordinary function pointers to non-member functions will be much cheaper!


Overhead comparison:

I had a look at the compiler-generated asm (gcc and clang targeting x86-64) for a few ways of doing this.

Source for multiple ways of doing this + asm from x86-64 clang 5.0 on the Godbolt compiler explorer. You can flip it over to gcc, or non-x86 architectures.

class A{
    public:
    virtual void Update() = 0; // A is so pure *¬*
};

struct C : public A {
    int m_c = 0;
    public:
    void Update() override final
    {  m_c++;  }
};
int SC = sizeof(C);  // 16 bytes because of the vtable pointer

C global_c;  // to instantiate a definition for C::Update();

// not inheriting at all gives equivalent asm to making Update non-virtual
struct nonvirt_B //: public A
{
    int m_b = 0;
    void Update() //override final
    {  m_b++;  }
};
int SB = sizeof(nonvirt_B);  // only 4 bytes per object with no vtable pointer

void separate_containers(std::vector<nonvirt_B> &vecB, std::vector<C> &vecC)
{
    for(auto &b: vecB)        b.Update();
    for(auto &c: vecC)        c.Update();   
}

clang and gcc auto-vectorize the loop over vecB with AVX2 to process 8 int elements in parallel, so if you don't bottleneck on memory bandwidth (i.e. hot in L1D cache), this loop can increment 8 elements per clock cycle. This loop runs as fast as a loop over a vector<int>; everything inlines and optimizes away and it's just a pointer increment.

The loop over vecC can only do 1 element per clock cycle, because each object is 16 bytes (8 byte vtable pointer, 4 byte int m_c, 4 bytes of padding to the next alignment boundary because the pointer has an 8B alignment requirement.) Without final, the compiler also checks the vtable pointer to see if it's actually a C before using the inlined C::Update(), otherwise it dispatches. It's like what you'd get for a loop over struct { int a,b,c,d; } vecC[SIZE]; doing vecC[i].c++;

final allowed full devirtualization, but our data is mixed with vtable pointers, so compilers just do scalar add [mem], 1 which can only run at 1 per clock (bottlenecked on 1 per clock store throughput, regardless of the size of the store if it's hot in L1D cache). This mostly defeats SIMD for this example. (With -march=skylake-avx512, gcc and clang do some ridiculous shuffling or gather/scatter that's even slower than scalar, instead of just loading/storing the whole object and adding with a vector that only changes the int member. That's allowed because it doesn't contain any volatile or atomic members, and would run at 2 per clock with AVX2, or 4 per clock with AVX512.) Having your objects up to 12 bytes larger is a major downside if they're small and you have a lot of them.

With multiple members per object, this doesn't necessarily defeat SIMD, but it still costs space in each object, just like an enum or function pointer would.

Since you mentioned the separating axis theorem, I hope you're not planning on storing float x,y pairs in each object. Array-of-structs basically sucks for SIMD, because it needs a lot of shuffling to use the x with the y for the same object. What you want is std::vector<float> x, y or similar, so your CPU can load 4 x values into a register and 4 y values into another register. (Or 8 at a time with AVX).
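
A minimal sketch of that struct-of-arrays layout (integrate() and dt are hypothetical, just to show the shape of the loop):

#include <cstddef>
#include <vector>

struct Positions {             // SoA: one contiguous array per component
    std::vector<float> x;
    std::vector<float> y;
};

void integrate(Positions &pos, const Positions &vel, float dt) {
    for (std::size_t i = 0; i < pos.x.size(); ++i) {
        pos.x[i] += vel.x[i] * dt;   // all-x then all-y: the compiler can load 8 lanes
        pos.y[i] += vel.y[i] * dt;   // at a time with AVX and auto-vectorize
    }
}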

See Slides: SIMD at Insomniac Games (GDC 2015) for an intro to how to structure your data for SIMD, and some more advanced stuff. See also the sse tag wiki for more guides. Also, the x86 tag wiki has lots of low-level x86 optimization material. Even if you don't manually vectorize anything, with separate arrays for x and y there's a good chance that the compiler can auto-vectorize for you. (Look at the asm output, or benchmark gcc -O3 -march=native vs. gcc -O3 -march=native -fno-tree-vectorize). You may need -ffast-math for some kinds of FP vectorization.


C++ virtual dispatch:

Writing it the way you do in the question, with virtual inheritance and

std::vector<A*> vecA{};

void vec_virtual_pointers() {
    for(auto a: vecA)
        a->Update();
}

We get this loop from clang5.0 -O3 -march=skylake

   # rbx = &vecA[0]
.LBB2_1:                         # do{
    mov     rdi, qword ptr [rbx]   # load a pointer from the vector (will be the this pointer for Update())
    mov     rax, qword ptr [rdi]   # load the vtable pointer
    call    qword ptr [rax]        # memory-indirect call using the first entry in the vtable
    add     rbx, 8                 # pointers are 8 bytes
    cmp     r14, rbx
    jne     .LBB2_1              # }while(p != vecA.end())

So the final function pointer is at the end of a chain of three dependent loads. Out-of-order execution lets this overlap between iterations (if the branch predicts correctly), but that's a lot of overhead just in total instructions for the front-end, as well as in mispredict penalty. (call [m] is 3 uops, so just the loop itself is 8 uops, and can only issue one iteration per 2 cycles on Skylake. Call/return has overhead too. If the callee is not totally trivial, we probably don't bottleneck on store-forwarding for pushing / popping the return address; see Loop with function call faster than an empty loop. I'm not sure about the throughput of independent store/reload operations on the same address; not bottlenecking on that with a tiny callee like this would normally require memory renaming, which Skylake doesn't do.)

Clang's definition for C::Update() is

C::Update():                         # @C::Update()
    inc     dword ptr [rdi + 8]
    ret

If this needed to set up any constants before calculating something, it would be even more expensive to not have it inlined. So, with virtual dispatch, this probably runs at about one per 3 to 5 clocks, instead of about 1 member per clock, on Skylake. (Or 8 members per clock with AVX2 for non-virtual class B which doesn't waste space, and makes auto-vectorization work well.) http://agner.org/optimize/ says Skylake has one per 3 clock call throughput, so let's say 24x performance loss when the data is hot in L1D cache. Different microarchitectures will be different, of course. See the x86 tag wiki for more x86 perf info.


Union hack:

Probably you should never use this, but you can see from the asm that it will work on x86-64 with clang and gcc. I made an array of unions, and looped over it:

union upoly {
    upoly() {}   // needs an explicit constructor for compilers not to choke
     B b;
     C c;
} poly_array[1024];

void union_polymorph() {
    upoly *p = &poly_array[0];
    upoly *endp = &poly_array[1024];
    for ( ; p != endp ; p++) {
        A *base = reinterpret_cast<A*>(p);
        base->Update();           // virtual dispatch
    }
}

A, B, and C all have their vtable pointer at the start, so I think this will generally work. We get asm that's basically the same, with one less step of pointer-chasing. (I used a static array instead of a vector, since I was keeping things simple and C-like while sorting out what to cast.)

    lea     rdi, [rbx + poly_array]       ; this pointer
    mov     rax, qword ptr [rbx + poly_array]   ; load it too, first "member" is the vtable pointer
    call    qword ptr [rax]
    add     rbx, 16                       ; stride is 16 bytes per object
    cmp     rbx, 16384                    ; 16 * 1024
    jne     .LBB4_1

This is better, and touches less memory, but it's only slightly better for overhead.


std::function from #include <functional>

It can hold any kind of callable thing. But it has even more overhead than vtable dispatch, because it's allowed to be in an error-if-used state. So the inner loop has to check every instance for that, and trap if it is. Also, sizeof(std::function<void()>); is 32 bytes (on x86-64 System V ABI).

#include <functional>
// pretty crappy: checks for being possibly unset to see if it should throw().
std::vector<std::function<void()>> vecF{};
void vec_functional() {
    for(auto f: vecF)     f();
}

                                # do {
.LBB6_2:                                # =>This Inner Loop Header: Depth=1
    mov     qword ptr [rsp + 16], 0       # store a 0 to a local on the stack?
    mov     rax, qword ptr [rbx + 16]
    test    rax, rax
    je      .LBB6_5           # throw on pointer==0  (nullptr)
    mov     edx, 2            # third arg:  2
    mov     rdi, r14          # first arg: pointer to local stack memory (r14 = rsp outside the loop)
    mov     rsi, rbx          # second arg: point to current object in the vector
    call    rax               # otherwise call into it with 2 args
    mov     rax, qword ptr [rbx + 24]    # another pointer from the std::function<>
    mov     qword ptr [rsp + 24], rax    # store it to a local
    mov     rcx, qword ptr [rbx + 16]    # load the first pointer again
    mov     qword ptr [rsp + 16], rcx
    test    rcx, rcx
    je      .LBB6_5           # check the first pointer for null again (and throw if null)
    mov     rdi, r14
    call    rax               # call through the 2nd pointer
    mov     rax, qword ptr [rsp + 16]
    test    rax, rax
    je      .LBB6_12          # optionally skip a final call
    mov     edx, 3
    mov     rdi, r14
    mov     rsi, r14
    call    rax
.LBB6_12:                               #   in Loop: Header=BB6_2 Depth=1
    add     rbx, 32
    cmp     r15, rbx
    jne     .LBB6_2

.LBB6_13:                       # return
    add     rsp, 32
    pop     rbx
    pop     r14
    pop     r15
    ret

.LBB6_5:
    call    std::__throw_bad_function_call()
    jmp     .LBB6_16
    mov     rdi, rax
    call    __clang_call_terminate

So there are up to three call instructions unless the pointer is nullptr. This looks far worse than virtual dispatch.

It looks a bit different with clang -stdlib=libc++, instead of the default libstdc++. (https://libcxx.llvm.org/). But still three call instructions in the inner loop, with conditionals to skip them or throw.

Unless the code-gen is very different for different kinds of function<T>, it's probably not even worth looking at it for pointers to member functions if you can write a more efficient alternative.
