在C V表查询的性能损失++ [英] performance hit of vtable lookup in c++

查看:129
本文介绍了在C V表查询的性能损失++的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我评估改写了一块从C / ASM到C ++ / ASM实时软件(用于向code的问题,部分不相关的原因,是绝对必要的装配做)。中断带有一个3kHz的频率,并为每一个中断大约200不同的事情是在一个顺序进行。该处理器具有300MHz的运行,给我们100.000周期来完成这项工作。这已经解决了下与函数指针数组:

I'm evaluating to rewrite a piece of real-time software from C/asm to C++/asm (for reasons not relevant to the question parts of the code are absolutely necessary to do in assembly). An interrupt comes with a 3kHz frequency, and for each interrupt around 200 different things are to be done in a sequence. The processor runs with 300Mhz, giving us 100.000 cycles to do the job. This has been solved in C with an array of function pointers:

// each function does a different thing, all take one parameter being a pointer 
// to a struct, each struct also being different.
void (*todolist[200])(void *parameters);

// array of pointers to structs containing each function's parameters.
void *paramlist[200];

void realtime(void)
{
  int i;
  for (i = 0; i < 200; i++)
    (*todolist[i])(paramlist[i]);
}

速度是重要的。以上200次迭代完成每秒3000次,所以实际上我们做每秒600.000迭代。以上循环编译为每次迭代5个周期,产生每秒3,000,000个周期,即1%的CPU负载的总成本。汇编优化的可能的带来下来4条指令,但是我担心我们可能会得到一些额外的延迟是由于存储器访问彼此靠近等。总之,我相信这些5个周期是pretty最佳

Speed is important. The above 200 iterations are done 3000 times per second, so practically we do 600.000 iterations per second. The above for loop compiles to 5 cycles per iteration, yielding a total cost of 3.000.000 cycles per second, i.e. 1% CPU load. Assembler optimization might bring that down to 4 instructions, however I fear we might get some extra delay due to memory accesses close to each other etc. In short, I believe those 5 cycles are pretty optimal.

现在对C ++重写。这些事情200我们做的是那种彼此相关。有的,它们都需要和使用,并在它们各自的结构参数的子集。在C ++实现他们可能由此整齐地被视为从一个共同的基类继承

Now to the C++ rewrite. Those 200 things we do are sort of related to each other. There is a subset of parameters that they all need and use, and have in their respective structs. In a C++ implementation they could thus neatly be regarded as inheriting from a common base class:

class Base
{
  virtual void Execute();
  int something_all_things_need;
}
class Derived1 : Base
{
  void Execute() { /* do something */ }
  int own_parameter;
  // other own parameters
}
class Derived2 : Base { /* etc */ }

Base *todolist[200];

void realtime(void)
{
  for (int i = 0; i < 200; i++)
    todolist[i]->Execute(); // vtable look-up! 20+ cycles.
}

我的问题是虚函数表查找。我不能做每秒600.000查询,这将占到浪费CPU负载超过4%。此外,该todolist的在运行时不会改变,只在启动时设置一次,所以什么功能调用仰视的努力是真正的浪费。在问自己这个问题:什么是最优化的最终结果可能,我看由C解决方案给出的汇编code和refind函数指针数组...

My problem is the vtable lookup. I cannot do 600.000 lookups per second, this would account for more than 4% of wasted CPU load. Moreover the todolist never changes during run-time, it is only set up once at start-up, so the effort of looking up what function to call is truly wasted. Upon asking myself the question "what is the most optimal end result possible", I look at the assembler code given by the C solution, and refind an array of function pointers...

什么是清洁和适当的方式在C ++中做到这一点?制作一个漂亮的基类,派生类等感觉pretty毫无意义到底什么时候人再挑选出函数指针的表现的原因。

What is the clean and proper way to do this in C++? Making a nice base class, derived classes and so on feels pretty pointless when in the end one again picks out function pointers for performance reasons.

更新(包括循环开始的地方校正):

Update (including correction of where the loop starts) :

该处理器是ADSP-214xx,和编译器的VisualDSP ++ 5.0。当启用的#pragma optimize_for_speed中,C环为9次。汇编优化它在我的脑海收益率4次,但是我没有测试它,所以它不能保证。 C ++的循环是14个周期。我知道的编译器可以做一个更好的工作,但是我并不想解雇这是一个编译器的问题 - 让受无多态性仍是一个嵌入式环境preferable和设计选择仍然吸引我。作为参考,在这里所得到的组件:

The processor is a ADSP-214xx, and the compiler is VisualDSP++ 5.0. When enabling #pragma optimize_for_speed, the C loop is 9 cycles. Assembly-optimizing it in my mind yields 4 cycles, however I didn't test it so it's not guaranteed. The C++ loop is 14 cycles. I'm aware of the compiler could do a better job, however I did not want to dismiss this as a compiler issue - getting by without polymorphism is still preferable in an embedded context, and the design choice still interests me. For reference, here the resulting assembly:

C:

i3=0xb27ba;
i5=0xb28e6;
r15=0xc8;

下面是实际的循环:

r4=dm(i5,m6);
i12=dm(i3,m6);
r2=i6;
i6=i7;
jump (m13,i12) (db);
dm(i7,m7)=r2;
dm(i7,m7)=0x1279de;
r15=r15-1;
if ne jump (pc, 0xfffffff2);

C ++

i5=0xb279a;
r15=0xc8;

下面是实际的循环:

i5=modify(i5,m6);
i4=dm(m7,i5);
r2=i4;
i4=dm(m6,i4);
r1=dm(0x3,i4);
r4=r2+r1;
i12=dm(0x5,i4);
r2=i6;
i6=i7;
jump (m13,i12) (db);
dm(i7,m7)=r2;
dm(i7,m7)=0x1279e2;
r15=r15-1;
if ne jump (pc, 0xffffffe7);

在此期间,我想我已经找到了某种答案。周期的最低量做最起码可能实现的。我必须获取一个数据指针,获取一个函数指针,并调用与数据指针作为参数的功能。当读取一个指针索引寄存器由一个恒定的自动修改,并且可以一样好让​​这个常数等于1.所以再次一个人发现自己与函数指针数组,数据指针数组。自然地,该限制是什么可以在组装完成,并且已经被研究。考虑到这一点,我现在明白,即使它涉及自然一个引进一个基类,它是不是真的有什么适合该法案。所以我想答案是,如果一个人想函数指针数组,应该使自己的函数指针数组...

In the meanwhile, I think I have found sort of an answer. The lowest amount of cycles is achieved by doing the very least possible. I have to fetch a data pointer, fetch a function pointer, and call the function with the data pointer as parameter. When fetching a pointer the index register is automatically modified by a constant, and one can just as well let this constant equal 1. So once again one finds oneself with an array of function pointers, and an array of data pointers. Naturally, the limit is what can be done in assembly, and that has now been explored. Having this in mind, I now understand that even though it comes natural to one to introduce a base class, it was not really what fit the bill. So I guess the answer is that if one wants an array of function pointers, one should make oneself an array of function pointers...

推荐答案

什么让你觉得虚函数表查找开销是20个周期?如果这是真的,你需要一个更好的C ++编译器。

What makes you think vtable lookup overhead is 20 cycles? If that's really true, you need a better C++ compiler.

我想这在Intel盒子,不知道你正在使用的处理器的任何东西,如预期的C发货$ ​​C $ C和C ++的虚函数表寄发之间的差异是一个指令,有与事实做该虚函数表需要额外的间接。

I tried this on an Intel box, not knowing anything about the processor you're using, and as expected the difference between the C despatch code and the C++ vtable despatch is one instruction, having to do with the fact that the vtable involves an extra indirect.

C code(基于OP):

C code (based on OP):

void (*todolist[200])(void *parameters);                                  
void *paramlist[200];
void realtime(void)
{       
  int i;
  for (i = 0; i < 200; i++)                                               
    (*todolist[i])(paramlist[i]);                                         
}       

C ++ code:

C++ code:

class Base {
  public:
    Base(void* unsafe_pointer) : unsafe_pointer_(unsafe_pointer) {}
    virtual void operator()() = 0;
  protected:
    void* unsafe_pointer_;
};

Base* todolist[200];
void realtime() {
  for (int i = 0; i < 200; ++i)
    (*todolist[i])();
}

这两个用gcc编译4.8,-O3:

Both compiled with gcc 4.8, -O3:

realtime:                             |_Z8realtimev:
.LFB0:                                |.LFB3:
        .cfi_startproc                |        .cfi_startproc
        pushq   %rbx                  |        pushq   %rbx
        .cfi_def_cfa_offset 16        |        .cfi_def_cfa_offset 16
        .cfi_offset 3, -16            |        .cfi_offset 3, -16
        xorl    %ebx, %ebx            |        movl    $todolist, %ebx
        .p2align 4,,10                |        .p2align 4,,10
        .p2align 3                    |        .p2align 3
.L3:                                  |.L3:
        movq    paramlist(%rbx), %rdi |        movq    (%rbx), %rdi
        call    *todolist(%rbx)       |        addq    $8, %rbx
        addq    $8, %rbx              |        movq    (%rdi), %rax
                                      |        call    *(%rax)
        cmpq    $1600, %rbx           |        cmpq    $todolist+1600, %rbx
        jne     .L3                   |        jne     .L3
        popq    %rbx                  |        popq    %rbx
        .cfi_def_cfa_offset 8         |        .cfi_def_cfa_offset 8
        ret                           |        ret

在C ++ code,第一个 MOVQ 获取虚函数表的地址,而呼叫然后通过该索引。所以这是一条指令的开销。

In the C++ code, the first movq gets the address of the vtable, and the call then indexes through that. So that's one instruction overhead.

据OP,DSP的C ++编译器产生以下code。我插根据我这是怎么回事(这可能是错误的)的理解意见。需要注意的是(IMO)在循环开始一个位置早于OP指示;否则,它是没有意义的(对我来说)。

According to OP, the DSP's C++ compiler produces the following code. I've inserted comments based on my understanding of what's going on (which might be wrong). Note that (IMO) the loop starts one location earlier than OP indicates; otherwise, it makes no sense (to me).

# Initialization.
# i3=todolist; i5=paramlist           | # i5=todolist holds paramlist
i3=0xb27ba;                           | # No paramlist in C++
i5=0xb28e6;                           | i5=0xb279a;
# r15=count
r15=0xc8;                             | r15=0xc8;

# Loop. We need to set up r4 (first parameter) and figure out the branch address.
# In C++ by convention, the first parameter is 'this'
# Note 1:
r4=dm(i5,m6); # r4 = *paramlist++;    | i5=modify(i5,m6); # i4 = *todolist++
                                      | i4=dm(m7,i5);     # ..
# Note 2:                            
                                      | r2=i4;            # r2 = obj
                                      | i4=dm(m6,i4);     # vtable = *(obj + 1)
                                      | r1=dm(0x3,i4);    # r1 = vtable[3]
                                      | r4=r2+r1;         # param = obj + r1

i12=dm(i3,m6); # i12 = *todolist++;   | i12=dm(0x5,i4);   # i12 = vtable[5]

# Boilerplate call. Set frame pointer, push return address and old frame pointer.
# The two (push) instructions after jump are actually executed before the jump.
r2=i6;                                | r2=i6;
i6=i7;                                | i6=i7;
jump (m13,i12) (db);                  | jump (m13,i12) (db);
dm(i7,m7)=r2;                         | dm(i7,m7)=r2;
dm(i7,m7)=0x1279de;                   | dm(i7,m7)=0x1279e2;

# if (count--) loop
r15=r15-1;                            | r15=r15-1;
if ne jump (pc, 0xfffffff2);          | if ne jump (pc, 0xffffffe7);

注:


  1. 在C ++版本,似乎编译器已经决定做增量后两个步骤,presumably因为它想要的结果,在 I 注册,而不是在 R4 。这无疑涉及到下面的问题。

  1. In the C++ version, it seems that the compiler has decided to do the post-increment in two steps, presumably because it wants the result in an i register rather than in r4. This is undoubtedly related to the issue below.

编译器已经决定以计算对象的真实类的基址,使用对象的虚表。这会占用三条指令,和presumably还需要在步骤1中使用 0-14 作为临时的虚函数表查找自身占用一个指令。

The compiler has decided to compute the base address of the object's real class, using the object's vtable. This occupies three instructions, and presumably also requires the use of i4 as a temporary in step 1. The vtable lookup itself occupies one instruction.

所以:这个问题是不是虚函数表的查找,这可能在一个额外的指令已经完成(但实际上需要两个)。问题是,编译器觉得需要查找的对象。但是,为什么不海合会/ I86需要做的?

So: the issue is not vtable lookup, which could have been done in a single extra instruction (but actually requires two). The problem is that the compiler feels the need to "find" the object. But why doesn't gcc/i86 need to do that?

答案是:它使用的,但它没有任何更多。在许多情况下(其中,不存在多重继承,例如),指针的浇铸到一个派生类的基类的一个指针不需要修改指针。因此,当我们调用派生类的方法,我们就可以给它的基类指针作为这个参数。但在其他情况下,不工作,我们有当我们投来调整指针,因此调整回来时,我们做了电话。

The answer is: it used to, but it doesn't any more. In many cases (where there is no multiple inheritance, for example), the cast of a pointer to a derived class to a pointer of a base class does not require modifying the pointer. Consequently, when we call a method of the derived class, we can just give it the base class pointer as its this parameter. But in other cases, that doesn't work, and we have to adjust the pointer when we do the cast, and consequently adjust it back when we do the call.

有(至少)两种方法来执行第二调整。一个是由所产生的DSP code,其中调整存储在V表中所示的方式 - 即使它是0 - 然后在呼叫期间使用。另一种方法,(被称为虚函数表-的thunk )是创建一个 - 可执行$ C一点点$ C - 这调整这个指针,然后跳转到该方法的切入点,并把一个指向这个的thunk到虚函数表。 (这可以全部在编译时进行。)在thunk溶液的优点是,在没有调整需要做的共同的情况下,我们可以优化走在thunk并没有调整code左侧。 (缺点是,如果我们确实需要一个调整,我们已经生成了一个额外的分支)。

There are (at least) two ways to perform the second adjustment. One is the way shown by the generated DSP code, where the adjustment is stored in the vtable -- even if it is 0 -- and then applied during the call. The other way, (called vtable-thunks) is to create a thunk -- a little bit of executable code -- which adjusts the this pointer and then jumps to the method's entry point, and put a pointer to this thunk into the vtable. (This can all be done at compile time.) The advantage of the thunk solution is that in the common case where no adjustment needs to be done, we can optimize away the thunk and there is no adjustment code left. (The disadvantage is that if we do need an adjustment, we've generated an extra branch.)

据我了解,VisualDSP ++的是基于GCC,以及它可能具有的 -fvtable-的thunk -fno-虚函数表-的thunk 选项。所以,你也许可以用编译-fvtable-的thunk 。但是,如果你这样做,你需要编译所有你该选项使用C ++库,因为你不能混用两种调用样式。此外,有(15岁前)在GCC的虚函数表-的thunk实现各种错误,因此,如果海湾合作委员会由VisualDSP ++的使用的版本是不够的时候,你可能会遇到这些问题太(IIRC,它们都涉及多重继承,所以他们可能并不适用于您的使用情况)。

As I understand it, VisualDSP++ is based on gcc, and it might have the -fvtable-thunks and -fno-vtable-thunks options. So you might be able to compile with -fvtable-thunks. But if you do that, you would need to compile all the C++ libraries you use with that option, because you cannot mix the two calling styles. Also, there were (15 years ago) various bugs in gcc's vtable-thunks implementation, so if the version of gcc used by VisualDSP++ is old enough, you might run into those problems too (IIRC, they all involved multiple inheritance, so they might not apply to your use case.)

(原测试,更新前):

我尝试以下简单的例子(没有多重继承,它可以慢下来):

I tried the following simple case (no multiple inheritance, which can slow things down):

class Base {
  public:
    Base(int val) : val_(val) {}
    virtual int binary(int a, int b) = 0;
    virtual int unary(int a) = 0;
    virtual int nullary() = 0;
  protected:
    int val_;
};

int binary(Base* begin, Base* end, int a, int b) {
  int accum = 0;
  for (; begin != end; ++begin) { accum += begin->binary(a, b); }
  return accum;
}

int unary(Base* begin, Base* end, int a) {
  int accum = 0;
  for (; begin != end; ++begin) { accum += begin->unary(a); }
  return accum;
}

int nullary(Base* begin, Base* end) {
  int accum = 0;
  for (; begin != end; ++begin) { accum += begin->nullary(); }
  return accum;
}

和使用-O3 GCC(4.8)编译它。如我所料,它产生完全相同的组装code作为C发货会做。下面是在一元功能的情况下,循环,例如:

And compiled it with gcc (4.8) using -O3. As I expected, it produced exactly the same assembly code as your C despatch would have done. Here's the for loop in the case of the unary function, for example:

.L9:
        movq    (%rbx), %rax
        movq    %rbx, %rdi
        addq    $16, %rbx
        movl    %r13d, %esi
        call    *8(%rax)
        addl    %eax, %ebp
        cmpq    %rbx, %r12
        jne     .L9

这篇关于在C V表查询的性能损失++的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆