Mysteries of C++ optimization


Problem description

Take the following two snippets:

#include <iostream>
#include <chrono>
using namespace std;

// utime() is not shown in the question; a plausible stand-in (assumption,
// not the OP's code) returning wall-clock time in microseconds:
unsigned long int utime()
{
    using namespace std::chrono;
    return duration_cast<microseconds>(
        steady_clock::now().time_since_epoch()).count();
}

int main()
{
    unsigned long int start = utime();

    __int128_t n = 128;

    for(__int128_t i=1; i<1000000000; i++)
        n = (n * i);

    unsigned long int end = utime();

    cout<<(unsigned long int) n<<endl;

    cout<<end - start<<endl;
}

// Identical setup to the first snippet, except for the shift in the loop body.
int main()
{
    unsigned long int start = utime();

    __int128_t n = 128;

    for(__int128_t i=1; i<1000000000; i++)
        n = (n * i) >> 2;

    unsigned long int end = utime();

    cout<<(unsigned long int) n<<endl;

    cout<<end - start<<endl;
}


I am benchmarking 128 bit integers in C++. When executing the first one (just multiplication) everything runs in approx. 0.95 seconds. When I also add the bit shift operation (second snippet) the execution time raises to an astounding 2.49 seconds.


How is this possible? I thought that bit shifting was one of the lightest operations for a processor. How comes that there is so much overhead due to such a simple operation? I am compiling with O3 flag activated.

Any ideas?

Answer


This question has been bugging me for the past few days, so I decided to do some more investigation. My initial answer focused on the difference in data values between the two tests. My assertion was that the integer multiplication unit in the processor finishes an operation in fewer clock cycles if one of the operands is zero.


While there are instructions that are clearly documented to work that way (integer division, for example), there are very strong indications that integer multiplication is done in a constant number of cycles in modern processors, regardless of input. The note in Intel's documentation that initially made me think that the number of cycles for integer multiplication can depend on input data doesn't seem to apply to these instructions. Also, I did some more rigorous performance tests with the same sequence of instructions on both zero and non-zero operands and the results didn't yield significant differences. As far as I can tell, harold's comment on this subject is correct. My mistake; sorry.


While contemplating the possibility of deleting this answer altogether, so that it doesn't lead people astray in the future, I realized there were still quite a few more things worth saying on this subject. I also think there's at least one other way in which data values can influence performance in such calculations (included in the last section). So, I decided to restructure and enhance the rest of the information in my initial answer, started writing, and... didn't quite stop for a while. It's up to you to decide whether it was worth it.

The information is organized into the following sections:

  • What the code does
  • What the compiler does
  • What the processor does
  • What you can do
  • Unanswered questions

What the code does

Mainly: it overflows.


In the first version, n starts overflowing on the 33rd iteration. In the second version, with the shift, n starts overflowing on the 52nd iteration.


In the version without the shift, starting with the 128th iteration, n is zero (it overflows "cleanly", leaving only zeros in the least significant 128 bits of the result).


In the second version, the right shift (dividing by 4) takes out more factors of two from the value of n on each iteration than the new operands bring in, so the shift results in rounding on some iterations. Quick calculation: the total number of factors of two in all numbers from 1 to 128 is equal to

128/2 + 128/4 + ... + 2 + 1 = 2^6 + 2^5 + ... + 2 + 1 = 2^7 - 1


while the number of factors of two taken out by the right shift (if it had enough to take from) is 128 * 2, more than double.


Armed with this knowledge, we can give a first answer: from the point of view of the C++ standard, this code spends most of its time in undefined behaviour land, so all bets are off. Problem solved; stop reading now.

What the compiler does

If you're still reading, from this point forward we'll ignore the overflows and look at some implementation details. "The compiler", in this case, means GCC 4.9.2 or Clang 3.5.1. I've only done performance measurements on code generated by GCC. For Clang, I've looked at the generated code for a few test cases and noted some differences that I'll mention below, but I haven't actually run the code; I might have missed some things.

Both multiplication and shift operations are available for 64-bit operands, so 128-bit operations need to be implemented in terms of those. First, multiplication: n can be written as 2^64 nh + nl, where nh and nl are the high and low 64-bit halves, respectively. The same goes for i. So, the multiplication can be written:

(2^64 nh + nl)(2^64 ih + il) = 2^128 nh ih + 2^64 (nh il + nl ih) + nl il


The first term doesn't have any non-zero bits in the lower 128-bit part; it's either all overflow or all zero. Since ignoring integer overflows is valid and common for C++ implementations, that's what the compiler does: the first term is ignored completely.


The parenthesis only contributes bits to the upper 64-bit half of the 128-bit result; any overflow resulting from the two multiplications or the addition is also ignored (the result is truncated to 64 bits).


The last term determines the bits in the low 64-bit half of the result and, if the result of that multiplication has more than 64 bits, the extra bits need to be added to the high 64-bit half obtained from the parenthesis discussed before. There's a very useful multiplication instruction in x86-64 assembly that does just what's needed: takes two 64-bit operands and places the result in two 64-bit registers, so the high half is ready to be added to the result of the operations in the parenthesis.


That is how 128-bit integer multiplication is implemented: three 64-bit multiplications and two 64-bit additions.
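As a sketch (not the compiler's literal output), the decomposition can be written out with explicit 64-bit halves; only the low-half product needs a widening 64x64 -> 128 multiply (the MUL step), and the cross products are truncated to 64 bits:

```cpp
#include <cstdint>

struct U128 { std::uint64_t lo, hi; };   // hand-split 128-bit value

// 128-bit product modulo 2^128: three 64-bit multiplications, two additions.
U128 mul128(U128 a, U128 b)
{
    // Widening low-half product; its high 64 bits carry into the high half.
    unsigned __int128 low = static_cast<unsigned __int128>(a.lo) * b.lo;
    U128 r;
    r.lo = static_cast<std::uint64_t>(low);
    // Cross products only affect the high half; overflow there is discarded.
    r.hi = static_cast<std::uint64_t>(low >> 64) + a.hi * b.lo + a.lo * b.hi;
    return r;
}
```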


Now, the shift: using the same notations as above, the two least significant bits of nh need to become the two most significant bits of nl, after the contents of the latter is shifted right by two bits. Using C++ syntax, it would look like this:

nl = nh << 62 | nl >> 2 //Doesn't change nh, only uses its bits.


Besides that, nh also needs to be shifted, using something like

nh >>= 2;


That is how the compiler implements a 128-bit shift. For the first part, there's an x86-64 instruction that has the exact semantics of that expression; it's called SHRD. Using it can be good or bad, as we'll see below, and the two compilers make slightly different choices in this respect.

What the processor does ... is highly processor-dependent. (No... really?!)


Detailed information about what happens for Haswell processors can be found in harold's excellent answer. Here, I'll try to cover more ground at a higher level. For more detailed data, here are some sources:

  • Intel® 64 and IA-32 Architectures Optimization Reference Manual
  • Software Optimization Guide for AMD Family 15h Processors
  • Agner Fog's microarchitecture manual and instruction tables.

I'll be referring to the following architectures:

  • Intel Sandy Bridge / Ivy Bridge - abbreviated "IntelSB" going forward;
  • Intel Haswell / Broadwell - "IntelH" going forward;
  • I'll just use "Intel" for things that are the same between SB and H.
  • AMD Bulldozer / Piledriver / Steamroller - "AMD" going forward.


I have measurement data taken on an IntelSB system; I think it's precise enough, as long as the compiler doesn't act up. Unfortunately, when working with such tight loops, this can happen very easily. At various points during testing, I had to use all kinds of stupid tricks to avoid GCC's idiosyncrasies, usually related to register use. For example, it seems to have a tendency to shuffle registers around unnecessarily, when compiling simpler code than for other cases when it generates optimal assembly. Ironically, on my test setup, it tended to generate optimal code for the second sample, using the shift, and worse code for the first one, making the impact of the shift less visible. Clang/LLVM seems to have fewer of those bad habits, but then again, I looked at fewer examples using it and I didn't measure any of them, so this doesn't mean much. In the interest of comparing apples with apples, all measurement data below refers to the best code generated for each case.


First, let's rearrange the expression for 128-bit multiplication from the previous section into a (horrible) diagram:

nh * il
        \
         +  -> tmp
        /          \
nl * ih             + -> next nh
                   /
             high 64 bits
                 /
nl * il --------
         \
      low 64 bits 
           \
             -> next nl


(sorry, I hope it gets the point across)

A few important points:

  • The two additions can't execute until their respective inputs are ready; the final addition can't execute until everything else is ready.
  • The three multiplications can, theoretically, execute in parallel (no input depends on another multiplication's output).
  • In the ideal scenario above, the total number of cycles to complete the entire calculation for one iteration is the sum of the number of cycles for one multiplication and two additions.
  • The next nl can be ready early. This, together with the fact that the next il and ih are very cheap to calculate, means the nl * il and nl * ih calculations for the next iteration can start early, possibly before the next nh has been computed.


Multiplications can't really execute entirely in parallel on these processors, as there's only one integer multiplication unit for each core, but they can execute concurrently through pipelining. One multiplication can begin executing on each cycle on Intel, and every 4 cycles on AMD, even if previous multiplications haven't finished executing yet.


All of the above mean that, if the loop's body doesn't contain anything else that gets in the way, the processor can reorder those multiplications to achieve something as close as possible to the ideal scenario above. This applies to the first code snippet. On IntelH, as measured by harold, it's exactly the ideal scenario: 5 cycles per iteration are made up of 3 cycles for one multiplication and one cycle each for the two additions (impressive, to be honest). On IntelSB, I measured 6 cycles per iteration (closer to 5.5, actually).


The problem is that in the second code snippet something does get in the way:

nh * il
        \                              normal shift -> next nh
         +  -> tmp                   /
        /          \                /
nl * ih             + ----> temp nh
                   /                \
             high 64 bits            \
                 /                     "composite" shift -> next nl
nl * il --------                     /
         \                          /
      low 64 bits                  /
           \                      /
             -> temp nl ---------


The next nl is no longer ready early. temp nl has to wait for temp nh to be ready, so that both can be fed into the composite shift, and only then will we have the next nl. Even if both shifts are very fast and execute in parallel, they don't just add the execution cost of one shift to an iteration; they also change the dynamics of the loop's "pipeline", acting like a sort of synchronizing barrier.


If the two shifts finish at the same time, then all three multiplications for the next iteration will be ready to execute at the same time, and they can't all start in parallel, as explained above; they'll have to wait for one another, wasting cycles. This is the case on IntelSB, where the two shifts are equally fast (see below); I measured 8 cycles per iteration for this case.


If the two shifts don't finish at the same time, it will typically be the normal shift that finishes first (the composite shift is slower on most architectures). This means that the next nh will be ready early, so the top multiplication can start early for the next iteration. However, the other two multiplications still have to wait more (wasted) cycles for the composite shift to finish, and then they'll be ready at the same time and one will have to wait for the other to start, wasting some more time. This is the case on IntelH, measured by harold at 9 cycles per iteration.


I expect AMD to fall under this last category as well. While there's an even bigger difference in performance between the composite shift and normal shift on this platform, integer multiplications are also slower on AMD than on Intel (more than twice as slow), making the first sample slower to begin with. As a very rough estimate, I think the first version could take about 12 cycles on AMD, with the second one at around 16. It would be nice to have some concrete measurements, though.


Some more data on the troublesome composite shift, SHRD:

  • On IntelSB, it's exactly as cheap as a simple shift (great!); simple shifts are about as cheap as they come: they execute in one cycle, and two shifts can start executing each cycle.
  • On IntelH, SHRD takes 3 cycles to execute (yes, it got worse in the newer generation), and two shifts of any kind (simple or composite) can start executing each cycle;
  • On AMD, it's even worse. If I'm reading the data correctly, executing an SHRD keeps both shift execution units busy until execution finishes - no parallelism and no pipelining possible; it takes 3 cycles, during which no other shift can start executing.

What you can do

I can think of three possible improvements:

  1. replace SHRD with something faster on platforms where it makes sense;
  2. optimize the multiplication to take advantage of the data types involved here;
  3. restructure the loop.


1. SHRD can be replaced with two shifts and a bitwise OR, as described in the compiler section. A C++ implementation of a 128-bit shift right by two bits could look like this:

#include <cstdint>

__int128_t shr2(__int128_t n)
{
   using std::int64_t;
   using std::uint64_t;

   //Unpack the two halves.
   int64_t nh = n >> 64;
   uint64_t nl = static_cast<uint64_t>(n);

   //Do the actual work. (The cast avoids left-shifting a signed value.)
   uint64_t rl = nl >> 2 | static_cast<uint64_t>(nh) << 62;
   int64_t rh = nh >> 2;

   //Pack the result.
   return static_cast<__int128_t>(rh) << 64 | rl;
}


Although it looks like a lot of code, only the middle section doing the actual work generates shifts and ORs. The other parts merely indicate to the compiler which 64-bit parts we want to work with; since the 64-bit parts are already in separate registers, those are effectively no-ops in the generated assembly code.


However, keep in mind that this amounts to "trying to write assembly using C++ syntax", and it's generally not a very bright idea. I'm only using it because I verified that it works for GCC and I'm trying to keep the amount of assembly code in this answer to a minimum. Even so, there's one surprise: the LLVM optimizer detects what we're trying to do with those two shifts and one OR and... replaces them with an SHRD in some cases (more about this below).


Functions of the same form can be used for shifts by other numbers of bits, less than 64. From 64 to 127, it gets easier, but the form changes. One thing to keep in mind is that it would be a mistake to pass the number of bits for shifting as a runtime parameter to a shr function. Shift instructions by a variable number of bits are slower than the ones using a constant number on most architectures. You could use a non-type template parameter to generate different functions at compile time - this is C++, after all...
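A hypothetical sketch of that idea (the template form here is an illustration, not code from the original answer), covering shift counts from 1 to 63:

```cpp
#include <cstdint>

// Shift count as a non-type template parameter: every instantiation
// compiles down to constant-count shift instructions.
template <unsigned Bits>
__int128_t shr(__int128_t n)
{
    static_assert(Bits > 0 && Bits < 64, "this form only handles 1..63");

    // Unpack the two halves.
    std::int64_t  nh = n >> 64;
    std::uint64_t nl = static_cast<std::uint64_t>(n);

    // Low half takes its top bits from the high half; high half shifts
    // arithmetically to preserve the sign.
    std::uint64_t rl = nl >> Bits | static_cast<std::uint64_t>(nh) << (64 - Bits);
    std::int64_t  rh = nh >> Bits;

    // Pack the result.
    return static_cast<__int128_t>(rh) << 64 | rl;
}
```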


I think using such a function makes sense on all architectures except IntelSB, where SHRD is already as fast as it can get. On AMD, it will definitely be an improvement. Less so on IntelH: for our case, I don't think it will make a difference, but generally it could shave one cycle off some calculations; there could theoretically be cases where it could make things slightly worse, but I think those are very uncommon (as usual, there's no substitute for measuring). I don't think it will make a difference for our loop because it will change things from [nh being ready after one cycle and nl after three] to [both being ready after two]; this means all three multiplications for the next iteration will be ready at the same time and they'll have to wait for one another, essentially wasting the cycle that was gained by the shift.


GCC seems to use SHRD on all architectures, and the "assembly in C++" code above can be used as an optimization where it makes sense. The LLVM optimizer uses a more nuanced approach: it does the optimization (replaces SHRD) automatically for AMD, but not for Intel, where it even reverses it, as mentioned above. This may change in future releases, as indicated by the discussion on the patch for LLVM that implemented this optimization. For now, if you want to use the alternative with LLVM on Intel, you'll have to resort to assembly code.


2. Optimizing the multiplication: The test code uses a 128-bit integer for i, but that's not needed in this case, as its value fits easily in 64 bits (32, actually, but that doesn't help us here). What this means is that ih will always be zero; this reduces the diagram for 128-bit multiplication to the following:

nh * il
        \
         \
          \
           + -> next nh
          /
    high 64 bits
        /
nl * il 
        \
     low 64 bits 
          \
            -> next nl


Normally, I'd just say "declare i as long long and let the compiler optimize things" but unfortunately this doesn't work here; both compilers go for the standard behaviour of converting the two operands to their common type before doing the calculation, so i ends up on 128 bits even if it starts on 64. We'll have to do things the hard way:

#include <cstdint>

__int128_t mul(__int128_t n, long long i)
{
   using std::int64_t;
   using std::uint64_t;

   //Unpack the two halves.
   int64_t nh = n >> 64;
   uint64_t nl = static_cast<uint64_t>(n);

   //Do the actual work.
   __asm__(R"(
    movq    %0, %%r10
    imulq   %2, %%r10
    mulq    %2
    addq    %%r10, %0
   )" : "+d"(nh), "+a"(nl) : "r"(i) : "%r10");

   //Pack the result.
   return static_cast<__int128_t>(nh) << 64 | nl;
}


I said I tried to avoid assembly code in this answer, but it's not always possible. I managed to coax GCC into generating the right code with "assembly in C++" for the function above, but once the function is inlined everything falls apart - the optimizer sees what's going on in the complete loop body and converts everything to 128 bits. LLVM seems to behave in this case, but, since I was testing on GCC, I had to use a reliable way to get the right code in there.

i声明为long long并使用此函数代替普通的乘法运算符,我在IntelSB上测量了第一个样本的每个迭代5个周期和第二个样本的7个周期,每种情况下增加了一个周期.我希望它也可以在IntelH上的两个示例的迭代中减少一个周期.

Declaring i as long long and using this function instead of the normal multiplication operator, I measured 5 cycles per iteration for the first sample and 7 cycles for the second one on IntelSB, a gain of one cycle in each case. I expect it to shave one cycle off the iterations for both examples on IntelH as well.


3. The loop can sometimes be restructured to encourage pipelined execution, when (at least some) iterations don't really depend on previous results, even though it may look like they do. For example, we could replace the for loop for the second sample with something like this:

__int128_t n2 = 1;
long long j = 1000000000 / 2;
for(long long i = 1; i < 1000000000 / 2; ++i, ++j)
{
   n *= i;
   n2 *= j;
   n >>= 2;
   n2 >>= 2; 
}
n *= (n2 * j) >> 2;


This takes advantage of the fact that some partial results can be calculated independently and only aggregated at the end. We're also hinting to the compiler that we want to pipeline the multiplications and shifts (not always necessary, but it does make a small difference for GCC for this code).


The code above is nothing more than a naive proof of concept. Real code would need to handle the total number of iterations in a more reliable way. The bigger problem is that this code won't generate the same results as the original, because of different behaviour in the presence of overflow and rounding. Even if we stop the loop on the 51st iteration, to avoid overflow, the result will still be different by about 10%, because of rounding happening in different ways when shifting right. In real code, this would most likely be a problem; but then again, you wouldn't have real code like this, would you?


Assuming this technique is applied to a case where the problems above don't occur, I measured the performance of such code in a few cases, again on IntelSB. The results are given in "cycles per iteration", as before, where "iteration" means the one from the original code (I divided the total number of cycles for executing the whole loop by the total number of iterations executed by the original code, not for the restructured one, to have a meaningful comparison):

  • The code above executes in 7 cycles per iteration, a gain of one cycle over the original.
  • The code above with the multiplication operator replaced with our mul() function needs 6 cycles per iteration.


The restructured code does suffer from more register shuffling, which can't be avoided unfortunately (more variables). More recent processors like IntelH have architecture improvements that make register moves essentially free in many cases; this could make the code yield even larger gains. Using new instructions like MULX for IntelH could avoid some register moves altogether; GCC does use such instructions when compiling with -march=haswell.

Unanswered questions

None of the measurements that we have so far explain the large differences in performance reported by the OP, and observed by me on a different system.


My initial timings were taken on a remote system (Westmere family processor) where, of course, a lot of things could happen; yet, the results were strangely stable.


On that system, I also experimented with executing the second sample with a right shift and a left shift; the code using a right shift was consistently 50% slower than the other variant. I couldn't replicate that on my controlled test system on IntelSB, and I don't have an explanation for those results either.


We can discard all of the above as unpredictable side effects of compiler / processor / overall system behaviour, but I can't shake the feeling that not everything has been explained here.


It's true that it doesn't really make much sense to benchmark such tight loops without a controlled system, precise tools (counting cycles) and looking at the generated assembly code for each case. Compiler idiosyncrasies can easily result in code that artificially introduces differences of 50% or more in performance.


Another factor that could explain large differences is the presence of Intel Hyper-Threading. Different parts of the core behave differently when this is enabled, and the behaviour has also changed between processor families. This could have strange effects on tight loops.


To top everything off, here's a crazy hypothesis: Flipping bits consumes more power than keeping them constant. In our case, the first sample, working with zero values most of the time, will be flipping far fewer bits than the second one, so the latter will consume more power. Many modern processors have features that dynamically adjust the core frequency depending on electrical and thermal limits (Intel Turbo Boost / AMD Turbo Core). This means that, theoretically, under the right (or wrong?) conditions, the second sample could trigger a reduction of the core frequency, thus making the same number of cycles take longer time, and making the performance data-dependent.

