最有效地将编译时大小的数组的所有元素相加 [英] Add up all elements of compile-time sized array most efficiently

查看:38
本文介绍了最有效地将编译时大小的数组的所有元素相加的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我正在尝试使用最少的指令有效地将所有内容添加到编译时大小的数组中.当然,我正在使用模板.我创建了这个.

I'm trying to efficiently add everything up in an compile-time sized array, using least amount of instructions. Naturally I'm using templates. I created this.

template<unsigned int startIndex, unsigned int count>
int AddCollapseArray(int theArray[])
{
    if(count == 1)
    {
        return theArray[startIndex];
    }
    else if(count == 2)
    {
        return theArray[startIndex] + theArray[startIndex + 1];
    }
    else if(count % 2 == 0)
    {
        return AddCollapseArray<startIndex, count / 2>(theArray) + AddCollapseArray<startIndex + count / 2, count / 2>(theArray));
    }
    else if (count % 2 == 1)
    {
        int newCount = count-1;
        return AddCollapseArray<startIndex, newCount/ 2>(theArray) + AddCollapseArray<startIndex + newCount/ 2, newCount/ 2>(theArray)) + theArray[startIndex + newCount];
    }
}

这对我来说似乎可以最有效地完成工作.我认为除了加法之外的分支和算术将被完全优化.这样做有什么缺陷吗?

This appears like it will get the job done most efficiently to me. I think the branching and the arithmetic besides the additions will be completely optimized out. Are there any flaws with doing it this way?

推荐答案

这里重要的限定词是最少指令数"的意思.如果这被解释为使 CPU 执行最少的步骤,并且我们进一步规定没有使用高级技术,如 SIMD、GPU 编程或 OMP(或其他自动并行技术)......只是 C 或C++,然后考虑:

The important qualifier here is the meaning of "least number of instructions". If that is to be interpreted as causing the CPU to perform the fewest steps, and we further stipulate there are no advanced techniques to be employed, like SIMD, GPU programming or OMP (or other auto parallel technologies)....just C or C++, then consider:

假设如下:

int a[ 10 ];

在运行时填充数据,并且将始终包含 10 个条目(0 到 9)

Which is filled with data at runtime, and will always contain 10 entries (0 through 9)

std::accumulate 在这里做得很好,在汇编器中创建了一个紧密的循环,没有混乱......只是快速:

The std::accumulate does a nice job here, creating a tight loop in the assembler, no mess...just quick:

int r = std::accumulate( &a[ 0 ], &a[ 9 ], 0 );

当然,一些表示数组 'a' 大小的 const int 就可以了.

If course, some const int signifying the size of the array 'a' would be in order.

奇怪的是:

for( int n=0; n < 10; ++n ) r += a[ n ];

编译器非常聪明地发出 10 条展开的添加指令 - 它甚至不打扰循环.

The compiler very smartly emits 10 add instructions unrolled - it doesn't even bother with a loop.

现在,这意味着在 std::accumulate 中,虽然循环很紧,但至少会有两个用于每个元素的加法指令(一个用于求和,一个用于增加迭代器).再加上比较指令和条件跳转,每个项目至少有 4 条指令,或者大约 40 个以滴答为单位的各种成本的机器语言步骤.

Now, this means that in std::accumulate, though the loop is tight, there will be, at the minimum, two add instructions for each element (one for the sum, and one to increment the iterator). Add to that the comparison instruction and a conditional jump, and there are at least 4 instructions per item, or about 40 machine language steps of various cost in ticks.

另一方面,for 循环的展开结果只有 10 个机器步骤,CPU 很可能可以以非常好的缓存友好性进行调度,并且没有跳转.

On the other hand, the unrolled result of the for loop is just 10 machine steps, which the CPU can very likely schedule with great cache friendliness, and no jumps.

for 循环肯定更快.

The for loop is definitely faster.

编译器知道"您正在尝试做什么,并且会像您使用您发布的建议代码仔细考虑它一样完成工作.

The compiler "knows" what you're trying to do, and gets to the job as well as you might think through it with the proposed code you posted.

此外,如果数组的大小对于展开循环来说太奇怪了,编译器会自动执行 std::accumulate 出于某种原因似乎没有执行的经典优化......即, 每个循环执行两次加法(当它由于元素数量而构造循环时).

Further, if the size of the array gets too outlandish for unrolling the loop, the compiler automatically performs the classic optimization that std::accumulate does not appear to do for some reason...i.e., performing two additions per loop (when it constructs a loop for reason of the number of elements).

使用 VC 2012,此来源:

Using VC 2012, this source:

 int r = std::accumulate( &a[ 0 ], &a[ 9 ], 0 );

 int z = 0;

 int *ap = a;
 int *ae = &a[9];
 while( ap <= ae ) { z += *ap; ++ap; }

 int z2 = 0;

 for (int n=0; n < 10; ++n ) z2 += a[ n ];

在 VC2012 的发布版本上生成以下汇编程序片段

Produces the following assembler snippets on a release build in VC2012

int r = std::accumulate( &a[ 0 ], &a[ 9 ], 0 );
00301270 33 D2                xor         edx,edx  
00301272 B8 D4 40 30 00       mov         eax,3040D4h  
00301277 EB 07                jmp         wmain+10h (0301280h)  
00301279 8D A4 24 00 00 00 00 lea         esp,[esp]  
00301280 03 10                add         edx,dword ptr [eax]  
00301282 83 C0 04             add         eax,4  
00301285 3D F8 40 30 00       cmp         eax,3040F8h  
0030128A 75 F4                jne         wmain+10h (0301280h) 

while( ap <= ae ) { z += *ap; ++ap; }
003012A0 03 08                add         ecx,dword ptr [eax]  
003012A2 03 70 04             add         esi,dword ptr [eax+4]  
003012A5 83 C0 08             add         eax,8  
003012A8 3D F4 40 30 00       cmp         eax,3040F4h  
003012AD 7E F1                jle         wmain+30h (03012A0h)  
003012AF 3D F8 40 30 00       cmp         eax,3040F8h  
003012B4 77 02                ja          wmain+48h (03012B8h)  
003012B6 8B 38                mov         edi,dword ptr [eax]  
003012B8 8D 04 0E             lea         eax,[esi+ecx]  
003012BB 03 F8                add         edi,eax  


for (int n=0; n < 10; ++n ) z2 += a[ n ];
003012BD A1 D4 40 30 00       mov         eax,dword ptr ds:[003040D4h]  
003012C2 03 05 F8 40 30 00    add         eax,dword ptr ds:[3040F8h]  
003012C8 03 05 D8 40 30 00    add         eax,dword ptr ds:[3040D8h]  
003012CE 03 05 DC 40 30 00    add         eax,dword ptr ds:[3040DCh]  
003012D4 03 05 E0 40 30 00    add         eax,dword ptr ds:[3040E0h]  
003012DA 03 05 E4 40 30 00    add         eax,dword ptr ds:[3040E4h]  
003012E0 03 05 E8 40 30 00    add         eax,dword ptr ds:[3040E8h]  
003012E6 03 05 EC 40 30 00    add         eax,dword ptr ds:[3040ECh]  
003012EC 03 05 F0 40 30 00    add         eax,dword ptr ds:[3040F0h]  
003012F2 03 05 F4 40 30 00    add         eax,dword ptr ds:[3040F4h]  

根据评论,我决定在 XCode 7 中尝试这个,结果截然不同.这是 for 循环的展开:

Based on comments I decided to try this in XCode 7, with drastically different results. This is it's unroll of the for loop:

    .loc    1 58 36                 ## /Users/jv/testclang/testcp/checkloop/checkloop/main.cpp:58:36
    movq    _a(%rip), %rax
Ltmp22:
    ##DEBUG_VALUE: do3:z2 <- EAX
    movq    %rax, %rcx
    shrq    $32, %rcx
    .loc    1 58 33 is_stmt 0       ## /Users/jv/testclang/testcp/checkloop/checkloop/main.cpp:58:33
    addl    %eax, %ecx
    .loc    1 58 36                 ## /Users/jv/testclang/testcp/checkloop/checkloop/main.cpp:58:36
    movq    _a+8(%rip), %rax
Ltmp23:
    .loc    1 58 33                 ## /Users/jv/testclang/testcp/checkloop/checkloop/main.cpp:58:33
    movl    %eax, %edx
    addl    %ecx, %edx
    shrq    $32, %rax
    addl    %edx, %eax
    .loc    1 58 36                 ## /Users/jv/testclang/testcp/checkloop/checkloop/main.cpp:58:36
    movq    _a+16(%rip), %rcx
    .loc    1 58 33                 ## /Users/jv/testclang/testcp/checkloop/checkloop/main.cpp:58:33
    movl    %ecx, %edx
    addl    %eax, %edx
    shrq    $32, %rcx
    addl    %edx, %ecx
    .loc    1 58 36                 ## /Users/jv/testclang/testcp/checkloop/checkloop/main.cpp:58:36
    movq    _a+24(%rip), %rax
    .loc    1 58 33                 ## /Users/jv/testclang/testcp/checkloop/checkloop/main.cpp:58:33
    movl    %eax, %edx
    addl    %ecx, %edx
    shrq    $32, %rax
    addl    %edx, %eax
    .loc    1 58 36                 ## /Users/jv/testclang/testcp/checkloop/checkloop/main.cpp:58:36
    movq    _a+32(%rip), %rcx
    .loc    1 58 33                 ## /Users/jv/testclang/testcp/checkloop/checkloop/main.cpp:58:33
    movl    %ecx, %edx
    addl    %eax, %edx
    shrq    $32, %rcx
    addl    %edx, %ecx

这可能看起来不像 VC 的简单列表那么干净,但它可能会运行得一样快,因为每次添加的设置(movq 或 movl)可能会在 CPU 中并行运行,因为前一个条目正在完成它的添加,几乎没有成本与 VC 在内存源上添加的简单、干净的外观"系列相比.

This may not look as clean as VC's simple list, but it may run as fast because the setup (movq or movl) for each addition may run parallel in the CPU as the previous entry is finishing it's addition, costing little to nothing by comparison to VC's simple, clean 'looking' series of adds on memory sources.

以下是 Xcode 的 std::accumulator.似乎需要一个 init,但随后它执行了一系列干净的添加,展开了循环,而 VC 没有这样做.

The following is Xcode's std::accumulator. It SEEMS there's a init required, but then it performs a clean series of additions having unrolled the loop, which VC did not do.

    .file   37 "/Applications/Xcode7.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/../include/c++/v1" "numeric"
    .loc    37 75 27 is_stmt 1      ## /Applications/Xcode7.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/../include/c++/v1/numeric:75:27
    movq    _a(%rip), %r14
Ltmp11:
    movq    %r14, -48(%rbp)         ## 8-byte Spill
Ltmp12:
    shrq    $32, %r14
    movq    _a+8(%rip), %rbx
    movq    %rbx, -56(%rbp)         ## 8-byte Spill
    shrq    $32, %rbx
    movq    _a+16(%rip), %r13
    movq    %r13, -72(%rbp)         ## 8-byte Spill
    shrq    $32, %r13
    movq    _a+24(%rip), %r15
    movq    %r15, %r12
    shrq    $32, %r12
Ltmp13:
    movl    _a+32(%rip), %eax
Ltmp14:
    movq    -48(%rbp), %rax         ## 8-byte Reload
    addl    %eax, %r14d
    movq    -56(%rbp), %rax         ## 8-byte Reload
    addl    %eax, %r14d
    addl    %ebx, %r14d
    movq    -72(%rbp), %rax         ## 8-byte Reload
    addl    %eax, %r14d
    addl    %r13d, %r14d
    addl    %r15d, %r14d
    addl    %r12d, %r14d
    addl    -64(%rbp), %r14d        ## 4-byte Folded Reload

这里的底线是,我们依赖于编译器的优化从一个编译器到另一个编译器的差异如此之大,以至于我们应该依赖它们,但要注意.

The bottom line here is that the optimizations we rely upon from compilers differs so widely and wildly from one compiler to another that we should rely upon them, but watch.

LLVM 非常具有示范性,并且似乎比 VC 更能理解 std::accumulate - 但是这个简短的调查无法揭示这是否是 libary 的实现或编译器.Xcode 的 std::accumulate 的实现可能存在重要差异,这使编译器比 VC 版本的库更深入.

LLVM is quite exemplary, and understands std::accumulate better than VC, it seems - but this short investigation can't reveal if that is a difference in the implementation of the libary or of the compiler. There could be important differences in the implementation of Xcode's std::accumulate which give the compiler more insight than VC's version of the library.

这更普遍地适用于算法,甚至是数字算法.std::accumulate 是一个 for 循环.它很可能作为基于指向数组的指针的 for 循环内联扩展,这就是为什么 VC 选择为 std::accumulate 创建循环的原因在它选择使用 int * 循环遍历数组,但展开 for 循环的循环,使用整数按索引引用数组中的条目.换句话说,当使用指针时,它在直接 for 循环中确实没有更好的效果,这表明在这种情况下它是 VC 的优化器,而不是库.

That applies more generally to algorithms, even those from numeric. std::accumulate is a for loop. It is likely expanded inline as for loop based on pointers into the array, which is why VC's choice to create a loop for std::accumulate was echoed in it's choice to produce a loop for the code using int * to loop through the array, but unrolled the loop for the for loop using an integer to reference entries in the array by index. In other words, it really did no better in a straight for loop when pointers were used, and that suggests it's VC's optimizer, not the library, in this case.

这遵循 Stroustrup 自己最喜欢的编译器可用信息思想的示例,比较了 C 中的 qsort 和 C++ 中的 sort.qsort 需要一个函数指针来执行比较,使编译器无法理解比较,迫使它通过指针调用函数.另一方面,C++ sort 函数采用函子,它传达有关比较的更多信息.这仍然可能导致函数调用,但优化器有机会充分理解比较以使其内联.

This follows Stroustrup's own favorite example of the idea of information available to the compiler, comparing qsort from C and sort from C++. qsort takes a function pointer to perform the comparison, cutting off the compiler from understand the comparison, forcing it to call a function via a pointer. The C++ sort function, on the other hand, takes a functor, which conveys more information about the comparison. That could still result in a function call, but the optimizer has the opportunity to understand the comparison sufficiently to make it inline.

在 VC 的情况下,无论出于何种原因(我们必须像微软一样),编译器在通过指针循环数组时会感到困惑.提供给它的信息与使用整数索引数组的循环不同.它理解这一点,但不理解指针.相比之下,LLVM 了解两者(甚至更多).信息的差异对LLVM来说并不重要,但对VC来说很重要.由于 std::accumulate 实际上是一个表示 for 循环的内联,并且该循环是通过指针进行处理的,因此它逃脱了 VC 的识别,就像 VC 在基于指针的直接 for 循环中所做的那样.如果可以对整数数组进行专门化,例如使用索引而不是指针进行累加循环,那么 VC 会以更好的输出响应,但事实并非如此.

In VC's case, for whatever reason (we'd have to as Microsoft), the compiler is confused when looping through the array via pointers. The information given to it is different than with the loop using an integer to index the array. It understands that, but not the pointers. LLVM, by contrast, understood both (and more). The difference of information is not important to LLVM, but it is to VC. Since std::accumulate is really an inline representing a for loop, and that loop is processed via pointers, it escaped VC's recognition, just as VC did in the straight for loop based on pointers. If a specialization could be made for integer arrays, such that accumulated looped with indexes rather than pointers, VC would respond with better output, but it shouldn't be so.

一个糟糕的优化器可能会错过重点,库的一个糟糕的实现可能会混淆优化器,这意味着在最好的情况下 std::accumulate 可以和 for 循环一样执行对于一个简单的整数数组,生成循环的展开版本创建总和,但并非总是如此.然而,在 for 循环中几乎没有妨碍编译器理解的东西......一切都在那里,库的实现不能把它搞砸,这完全取决于编译器.为此,VC 显示了它的弱点.

A poor optimizer can miss the point, and a poor implementation of the library could confuse the optimizer, which means that under the best circumstances std::accumulate can perform about as well as the for loop for a simple array of integers, producing an unrolled version of the loop creating the sum, but not always. However, there's little to get in the way of the compiler's understanding in a for loop..everything is right there, and the implementation of the library can't mess it up, it's all up to the compiler at that point. For that, VC shows it's weakness.

我尝试了 VC 上的所有设置,试图让它展开 std::accumulate,但到目前为止它从未成功(还没有尝试过较新版本的 VC).

I tried all settings on VC to try to get it to unroll std::accumulate, but so far it never did (haven't tried newer versions of VC).

让 Xcode 展开循环并没有花费太多时间;LLVM 似乎有更深层次的工程.它也可能有更好的库实现.

It didn't take much to get Xcode to unroll the loop; LLVM seems to have deeper engineering. It may have a better implementation of the library, too.

顺便说一句,我贴在上面的C代码示例是在VC中使用的,它没有认识到三个不同的求和是相关的.XCode 上的 LLVM 做到了,这意味着我第一次在那里尝试它时,它只是采用了 std::accumulate 的答案,其他什么都不做.VC 在这一点上真的很弱.为了让 Xcode 执行 3 个单独的测试,我在每次调用之前随机化了数组……否则 Xcode 会意识到我在做什么,而 VC 没有.

Incidentally, the C code example I posted at top was used in VC, which didn't recognize that the three different summations were related. LLVM on XCode did, which meant the first time I tried it there it simply adopted the answer from std::accumulate and did nothing otherwise. VC was really weak on that point. In order to get Xcode to perform 3 separate tests, I randomized the array before each call...otherwise Xcode realized what I was doing where VC did not.

这篇关于最有效地将编译时大小的数组的所有元素相加的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆