Simple for() loop benchmark takes the same time with any loop bound


Problem description

I want to write some code that makes my CPU execute some operations and see how much time it takes to solve them. I wanted to make a loop going from i=0 to i<5000, multiply i by a constant number inside it, and time that. I've ended up with this code; it has no errors, but it only takes 0.024 seconds to execute, and it takes the same amount of time whether I change the loop bound to i<49058349083 or to i<2. What's the error?

PS: I started learning C++ yesterday; I'm sorry if this is a really easy question to answer, but I couldn't find the solution.

#include <iostream>
#include <ctime>

using namespace std;

int main () {
    int start_s=clock();

    int i;
    for(i=0;i<5000;i++){
        i*434243;
    }

    int stop_s=clock();
    cout << "time: "<< (stop_s-start_s)/double(CLOCKS_PER_SEC)*1000;

    return 0;
}

Solution

BTW, if you'd actually done i<49058349083, gcc and clang create an infinite loop on systems with 32-bit int (including x86 and x86-64). 49058349083 is greater than INT_MAX. Large literal numbers are implicitly promoted to a type large enough to hold them, so you effectively did (int64_t)i < 49058349083LL, which is true for any possible value of int i.
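Here is a minimal sketch you can compile to see that promotion in action (it assumes a 32-bit int, which the static_assert checks; the names are just for illustration):

#include <climits>
#include <cstdio>

int main() {
    // The literal doesn't fit in int, so the comparison is done in a wider type.
    static_assert(49058349083LL > INT_MAX, "this demo assumes 32-bit int");
    int i = INT_MAX;             // the largest value an int can ever hold
    if (i < 49058349083)         // effectively (long long)i < 49058349083LL
        std::puts("still true even at INT_MAX, so the loop condition can never become false");
}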

Signed overflow is Undefined Behaviour in C++, and so is an infinite loop with no side-effects (e.g. a system call) inside it, so I checked on the Godbolt compiler explorer to see how it really compiles with optimization enabled. Fun fact: MSVC still optimizes away an empty loop (including one with i*10 not assigned to anything) when the condition is an always-true compare, rather than a non-zero constant like 42.


A loop like this is pretty fundamentally flawed.

You can microbenchmark a complete non-inline function using Google's benchmark package (how would you benchmark the performance of a function), but to learn anything useful by putting something inside a repeat-loop you have to know a lot about how your compiler compiles to asm, exactly what you're trying to measure, and how to get the optimizer to make asm that's similar to what you'd get from your code in some real use context. e.g. by using inline asm to require it to have a certain result in a register, or by assigning to a volatile variable (which also has the overhead of doing a store).
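For example, here is a sketch of the inline-asm approach (GNU extended asm, so gcc/clang only; the keep() helper name is made up for this example). Note that even this only stops the result from being thrown away; as discussed below, the compiler may still compute the sequence of values some cheaper way:

#include <cstdint>

// Empty asm statement that "consumes" a value in a register. The optimizer
// can't see inside it, so whatever computes the value can't be deleted.
static inline void keep(int64_t value) {
    asm volatile("" : : "r"(value));
}

void multiply_loop(int64_t n) {
    for (int64_t i = 0; i < n; i++) {
        keep(i * 434243);   // the result is now "used", so the loop body can't be removed outright
    }
}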

If this sounds a lot more complicated than you were hoping, well, too bad: it is, and for good reason.

That's because compilers are incredibly complex pieces of machinery that can usually produce fairly efficient executables out of source that's written to clearly express what's going on for human readers, not to avoid redundant work or look anything like an efficient machine-language implementation (which is what your CPU actually runs).


Microbenchmarking is hard - you can't get meaningful results unless you understand how your code compiles and what you're really measuring.

Optimizing compilers are designed to create an executable that produces the same results as your C++ source, but which runs as quickly as possible. Performance is not an observable result, so it's always legal to make a program more efficient. This is the "as-if" rule: What exactly is the "as-if" rule?
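For instance, this is the kind of rewrite the as-if rule allows (a sketch with made-up function names; clang really does turn simple sum loops into a closed form like this):

// What you write:
int sum_to(int n) {
    int total = 0;
    for (int i = 1; i <= n; ++i)
        total += i;
    return total;
}

// What the compiler may emit instead: same observable result, no loop at all.
// (Unsigned math for the intermediate, so the rewrite can't overflow in any
// case where the original loop was well-defined.)
int sum_to_closed_form(int n) {
    if (n <= 0) return 0;
    unsigned un = (unsigned)n;
    return (int)(un * (un + 1) / 2);
}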

You want your compiler to not waste time and code-size computing results that aren't used. After the compiler inlines a function into the caller, it often turns out that some of the things it computes aren't used. It's normal for well-written C++ code to have lots of work that can be thrown away, including optimizing away temporary variables entirely; this is not a bad thing and a compiler that didn't do this would suck.
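A toy illustration of that (hypothetical names, just to show the idea): a helper computes two things, but after inlining into a caller that only wants one of them, the other computation is dead and can be dropped.

#include <climits>

struct Stats { long sum; int max; };

static Stats summarize(const int* a, int n) {   // computes two results
    Stats s{0, INT_MIN};
    for (int i = 0; i < n; ++i) {
        s.sum += a[i];
        if (a[i] > s.max) s.max = a[i];
    }
    return s;
}

long sum_only(const int* a, int n) {
    // After inlining summarize(), nothing reads .max, so the compare/update
    // for it is dead code the optimizer is free to delete.
    return summarize(a, n).sum;
}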

Remember that you're writing for the C++ abstract machine, but your compiler's job is to translate that into assembly language (or machine code) for your CPU. Assembly language is quite different from C++. (And modern higher-performance CPUs can also execute instructions out of order, while following their own "as-if" rule to preserve the illusion of the compiler-generated code running in program order. But CPUs can't discard work, only overlap it.)

You can't microbenchmark the int * int binary operator in C++ in the general case, even just for your own desktop (nevermind on other hardware / different compilers). Different uses in different contexts will compile to different code. Even if you could create some version of your loop which measured something useful, it wouldn't necessarily tell you much about how expensive foo = a * b is in another program. The other program might bottleneck on multiply latency instead of throughput, or combine that multiply with some other nearby operation on a or b, or any number of things.


Don't disable optimization

You might think it would be useful to just disable optimization (like gcc -O0 instead of gcc -O3). But the only way to do that also introduces anti-optimizations like storing every value back to memory after every C++ statement, and reloading variables from memory for the next statement. This lets you modify variable values while debugging a compiled program, or even jump to a new line in the same function, and still get the results you'd expect from looking at the C++ source.

Supporting that level of interference imposes a huge burden on the compiler. Store/reload (store-forwarding) has about 5 cycle latency on a typical modern x86. This means an anti-optimized loop can only run one iteration per ~6 clock cycles at best, vs. 1 cycle for a tight loop like looptop: dec eax / jnz looptop where the loop counter is in a register.

There's not much middle ground: compilers don't have options to make asm that "looks like" the C++ source, but keeping values in registers across statements. That wouldn't be useful or meaningful anyway, because that's not how they compile with full optimization enabled. (gcc -Og might be a little bit like this.)

Spending time modifying your C++ to make it run faster with -O0 is a total waste of time: C loop optimization help for final assignment.

Note: anti-optimized debug mode (-O0) is the default for most compilers. It's also "compile-fast" mode, so it's good for seeing if your code compiles / runs at all, but it's useless for benchmarking. The resulting compiler-generated asm runs the speed it does for reasons which depend on the hardware, but don't tell you much of anything about how fast optimized code will run. (e.g. the answer to Adding a redundant assignment speeds up code when compiled without optimization is some fairly subtle Intel Sandybridge-family store-forwarding latency behaviour, directly caused by the store/reloads and having the loop bottleneck on the loop counter being in memory. Note that the first version of the question was asking about why doing that made the C faster, which was rightly downvoted because benchmarking -O0 is silly. It only turned into an interesting question when I edited it into an x86 asm question which is interesting because of the larger-but-faster asm, not because it came from gcc -O0 with any particular source changes.)

And this is not even mentioning C++ standard library functions like std::sort or std::vector::push_back, which depend on the optimizer to inline lots of nested calls to tiny helper / wrapper functions.


How compilers optimize

(I'm going to show transformations of C++ code. Remember that the compiler is actually transforming an internal representation of your program's logic and then producing asm. You can think of the transformed C++ as being pseudo-code for asm, where i++ represents an x86 inc eax instruction or something. Most C/C++ statements can map to 1 or a couple machine instructions. So it's a useful way of describing the logic of what the actual compiler-generated asm might be doing without actually writing it in asm.)

A result which is never used doesn't have to be computed in the first place. A loop with no side effects can be removed completely. Or a loop that assigns to a global variable (an observable side effect) could be optimized to just doing the last assignment. e.g.

// int gsink;  is a global.  
// "sink" is the opposite of a source: something we dump results into.
for (int i=0 ; i<n ; i++) {
    gsink = i*10;
}

is equivalent to this code, as far as an optimizing compiler is concerned:

if (0 < n) {          // you might not have noticed that your loop could run 0 times
    gsink = (n-1)*10; // but the compiler isn't allowed to do gsink=0 if n<1
}

If gsink were instead a local or a file-scoped static with nothing that reads it, the compiler could optimize it away entirely. But the compiler can't "see" code outside the current C++ source file ("compilation unit") while compiling a function containing this loop, so it can't remove the observable side-effect that when the function returns, gsink holds (n-1)*10 (for n > 0).

Nothing reads the intermediate values of gsink because there are no function calls to non-inline functions. (Because it's not an atomic<int>, the compiler can assume that no other threads or signal handlers read it; that would be data-race Undefined Behaviour.)


Using volatile to get the compiler to do some work.

If it were global volatile int gsink, the actual stores that place values in memory would be observable side-effects (that's what volatile means in C++). But does that mean we can benchmark multiplication this way? No, it doesn't. The side-effect the compiler has to preserve is only the sequence of values placed in memory, not the way each value is computed. If it can calculate each value more cheaply than with i * 10 every time through the loop, it will do so.

This loop would also produce the same result sequence of assignments to gsink, and thus is a valid option for an optimizing compiler. Transforming independent multiplies to a loop-carried add is called a "strength-reduction" optimization.

volatile int gsink;

int i10 = 0;   // I could have packed this all into a for() loop
int i=0;       // but this is more readable
while (i<n) {
    gsink = i10;
    i10 += 10;
    i++;
}

Could the compiler drop i altogether and use i10 < n*10 as the loop condition? (Of course hoisting the upperbound = n*10 calculation out of the loop.)

That wouldn't always give the same behaviour, because n*10 can overflow, so the loop can run at most INT_MAX/10 times if implemented that way. But signed overflow in C++ is undefined behaviour, and the i*10 in the loop body would overflow in any program where n*10 overflowed, so the compiler can safely introduce an n*10 without changing the behaviour of any legal / well-defined program. See What Every C Programmer Should Know About Undefined Behavior.

(Actually, i*10 is only evaluated for i up to n-1 at most, and n*10 could overflow while (n-1)*10 doesn't. What gcc actually does is more like while(i10 != n*10) using unsigned math, when compiling for x86. x86 is a 2's complement machine, so unsigned and signed are the same binary operation, and checking for != instead of signed less-than is safe even if (unsigned)n*10U == 0x80000000U, which is the bit pattern of INT_MIN.)

For more about looking at compiler asm output, and a total-beginner intro to x86 asm, see Matt Godbolt's CppCon2017 talk "What Has My Compiler Done for Me Lately? Unbolting the Compiler's Lid". (And related: How to remove "noise" from GCC/clang assembly output?). See http://agner.org/optimize/ for more about how current x86 CPUs perform.

Compiler output for this function from gcc7.3 -O3, compiling for x86-64, on the Godbolt compiler explorer:

volatile int gvsink;

void store_n(int n) {
    for(int i=0 ; i<n; i++) {
        gvsink = i*10;
    }
}

store_n(int):          # n in EDI  (x86-64 System V calling convention)
    test    edi, edi
    jle     .L5                   # if(n<=0) goto end

    lea     edx, [rdi+rdi*4]      # edx = n * 5
    xor     eax, eax              # tmp = 0
    add     edx, edx              # edx = n * 10

.L7:                                   # do {
    mov     DWORD PTR gvsink[rip], eax   # gvsink = tmp
    add     eax, 10                      # tmp += 10
    cmp     eax, edx
    jne     .L7                        # } while(tmp != n*10)
.L5:
    rep ret
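Writing that asm back as C++ pseudo-code (a sketch of the transformation, not anything gcc emits as source) shows both the hoisted n*10 bound and the unsigned != compare:

void store_n_transformed(int n) {
    if (n > 0) {                              // the test/jle at the top
        unsigned bound = (unsigned)n * 10u;   // the lea + add pair: computed once, before the loop
        unsigned i10 = 0;
        do {
            gvsink = (int)i10;                // same sequence of volatile stores as gvsink = i*10
            i10 += 10;
        } while (i10 != bound);               // unsigned != instead of signed <
    }
}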

The optimal / idiomatic asm loop structure is a do{}while(), so compilers always try to make loops like this. (That doesn't mean you have to write your source that way, but it does mean you can help the compiler skip the check for zero iterations in cases like this, where it can't prove on its own that the loop runs at least once.)

If we'd used unsigned int, overflow would be well-defined as wraparound, so there's no UB the compiler can use as an excuse to compile your code in a way you didn't expect. (Modern C++ is not a forgiving language. Optimizing compilers are quite hostile to programmers who don't carefully avoid any UB, and this can get pretty subtle because a lot of stuff is undefined behaviour. Compiling C++ for x86 is not like writing x86 assembly. But fortunately there are compiler options like gcc -fsanitize=undefined which will detect and warn about UB at runtime. You still have to check every possible input value you care about, though.)

void store_n(unsigned int n) {
    for(unsigned int i=0 ; i<n; i++) {
        gvsink = i*10;
    }
}

store_n(unsigned int):
    test    edi, edi
    je      .L9            # if (n==0) return;
    xor     edx, edx       # i10 = 0
    xor     eax, eax       # i = 0
.L11:                      # do{
    add     eax, 1         #    i++
    mov     DWORD PTR gvsink[rip], edx
    add     edx, 10        #    i10 += 10
    cmp     edi, eax       
    jne     .L11           # } while(i!=n)
.L9:
    rep ret

Clang compiles with two separate counters for signed and unsigned. Clang's code is more like

i10 = 0;
do {
    gvsink = i10;
    i10 += 10;
} while(--n != 0);

So it counts down the n register toward zero, avoiding a separate cmp instruction because add/sub instructions also set the condition-code flags which the CPU can branch on. (Clang unrolls small loops by default, generating i10, i10 + 10, i10 + 20, up to i10 + 70 in registers it can store from, while only running the loop-overhead instructions once. There's not a lot to be gained here from unrolling on typical modern CPUs, though: one store per clock cycle is a bottleneck, and so is 4 uops per clock (on Intel CPUs) issuing from the front-end into the out-of-order part of the core.)
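In C++ pseudo-code, clang's unrolled version has roughly this shape (a sketch of the pattern only; the real output also has to handle trip counts that aren't a multiple of 8):

do {
    gvsink = i10;          // eight volatile stores per trip around the loop,
    gvsink = i10 + 10;     // so the loop-overhead instructions run an eighth as often
    gvsink = i10 + 20;
    gvsink = i10 + 30;
    gvsink = i10 + 40;
    gvsink = i10 + 50;
    gvsink = i10 + 60;
    gvsink = i10 + 70;
    i10 += 80;
    n -= 8;
} while (n != 0);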


Getting the compiler to multiply:

How do we stop that strength-reduction optimization? Replacing *10 with * variable doesn't work: we just get code that adds a register instead of adding an immediate constant.
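In other words, for a runtime multiplier m the compiler can still produce the same sequence of stored values with a loop-carried add, roughly like this (a sketch reusing the gvsink global from above; the function name is made up):

void store_n_times_m(unsigned n, unsigned m) {
    unsigned im = 0;                 // running value of i*m
    for (unsigned i = 0; i < n; ++i) {
        gvsink = (int)im;            // same stores that gvsink = i*m would produce
        im += m;                     // add a register (m) instead of an immediate 10
    }
}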

We could turn it into a loop over arrays like a[i] = b[i] * 10;, but then we'd be dependent on memory bandwidth as well. Also, that could auto-vectorize with SIMD instructions, which we might or might not want to be testing.
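For example (a sketch; whether it actually vectorizes depends on the compiler, flags, and target):

void scale_by_10(int* a, const int* b, int n) {
    for (int i = 0; i < n; ++i)
        a[i] = b[i] * 10;    // may become SIMD shifts/multiplies, and may bottleneck on memory bandwidth
}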

We could do something like tmp *= i; (with unsigned, to avoid signed-overflow UB). But that makes the output of the multiply in each iteration an input for the next. So we'd be benchmarking multiply latency, not throughput. (CPUs are pipelined, and for example can start a new multiply every clock cycle, but the result of a single multiply isn't ready until 3 clock cycles later. So you'd need at least tmp1*=i, tmp2*=i, and tmp3*=i to keep the integer multiply unit on an Intel Sandybridge-family CPU saturated with work.)
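A sketch of that multiple-accumulator idea (unsigned so wraparound is well-defined; the results are combined and returned so none of the chains is dead code). Three chains match the 3-cycle latency figure above; other CPUs may want more:

unsigned mul_throughput(unsigned n) {
    unsigned tmp1 = 1, tmp2 = 2, tmp3 = 3;   // three independent dependency chains
    for (unsigned i = n; i != 0; --i) {
        tmp1 *= i;                           // each chain only waits on its own previous result,
        tmp2 *= i;                           // so a pipelined multiplier can overlap the three multiplies
        tmp3 *= i;
    }
    return tmp1 ^ tmp2 ^ tmp3;               // combine and return so none of the work is dead
}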

This comes back to having to know exactly what you're measuring to make a meaningful microbenchmark at this level of detail.

If this answer is going way over your head, stick to timing whole functions! It really isn't possible to say much about a single C arithmetic operator or an expression without understanding the surrounding context, and how it works in asm.

If you understand caching, you can understand memory access and arrays vs. linked lists pretty decently without getting into too much asm-level detail. It's possible to understand C++ performance in some level of detail without knowing much about asm (beyond the fact that it exists, and that compilers optimize heavily). But the more you know about asm, CPU performance tuning, and how compilers work, the more things start to make sense.


PS:

Any computation based on compile-time-constant values can be (and hopefully is) done at compile time. This is called "constant propagation". Hiding constants from the optimizer (e.g. by inputting them as command-line args with atoi(argv[1]), or with other tricks) can make the compiler-generated code for a microbenchmark look more like the real use-case, if that use-case also can't see constants at compile time. (But note that constants defined in other files become visible with link-time optimization, which is very good for projects with lots of small functions that call each other across source-file boundaries and aren't defined in headers (.h) where they can inline normally.)
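A minimal sketch of that hiding trick (the fallback numbers and names here are arbitrary):

#include <cstdio>
#include <cstdlib>

int main(int argc, char** argv) {
    // Both operands come from the command line, so neither is a compile-time
    // constant: no constant propagation, and no shift/lea trick chosen at compile time.
    int a = (argc > 1) ? std::atoi(argv[1]) : 7;
    int b = (argc > 2) ? std::atoi(argv[2]) : 10;
    std::printf("%d\n", a * b);   // printing the result keeps the multiply from being dead code
}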

Multiplying by 16 (or any other power of 2) will use a shift, because that's more efficient than an actual multiply instruction. This is a big deal for division, especially. See Why is this C++ code faster than my hand-written assembly for testing the Collatz conjecture?, and Why does GCC use multiplication by a strange number in implementing integer division?.

Other multiply constants with only a couple bits set in their binary representation can be done with some shift+add, often with lower latency than a general-purpose multiply instruction. See for example How to multiply a register by 37 using only 2 consecutive leal instructions in x86?.
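A rough C++ picture of those constant-multiply tricks (what compilers typically do; the exact instruction choice varies by compiler and target CPU):

unsigned times16(unsigned x) { return x * 16; }   // a single shift: x << 4
unsigned times10(unsigned x) { return x * 10; }   // lea + add, as in the gcc output above: (x + 4*x) * 2
unsigned times37(unsigned x) { return x * 37; }   // two lea instructions: x + 4*(x + 8*x)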

None of these optimizations are possible with a * b or a / b if neither input is known at compile time.


See also: How can I benchmark the performance of C++ code?, especially the links to Chandler Carruth's CppCon 2015 talk: "Tuning C++: Benchmarks, and CPUs, and Compilers! Oh My!".

And because it's worth mentioning twice: Matt Godbolt's CppCon2017 talk "What Has My Compiler Done for Me Lately? Unbolting the Compiler's Lid". It's a gentle enough introduction that a beginner can probably follow it well enough to look at how their loop compiled, and see if it optimized away or not.
