optimized sum of an array of doubles in C


Question

I've got an assignment where I must take a program and make it more efficient in terms of time. The original code is:

#include <stdio.h>
#include <stdlib.h>

// You are only allowed to make changes to this code as specified by the comments in it.

// The code you submit must have these two values.
#define N_TIMES     600000
#define ARRAY_SIZE   10000

int main(void)
{
    double  *array = calloc(ARRAY_SIZE, sizeof(double));
    double  sum = 0;
    int     i;

    // You can add variables between this comment ...
    long int help;
    // ... and this one.

    // Please change 'your name' to your actual name.
    printf("CS201 - Asgmt 4 - I. Forgot\n");

    for (i = 0; i < N_TIMES; i++) {

        // You can change anything between this comment ...

        int     j;

        for (j = 0; j < ARRAY_SIZE; j++) {
            sum += array[j];
            help++;
            }

        // ... and this one. But your inner loop must do the same
        // number of additions as this one does.

        }
    // You can add some final code between this comment ...

    // ... and this one.

    return 0;
}

I almost exclusively modified the second for loop by changing it to:

double *j=array;
double *p=array+ARRAY_SIZE;

for (; j < p; j += 10) {
    sum += j[0]+j[1]+j[2]+j[3]+j[4]+j[5]+j[6]+j[7]+j[8]+j[9];
}

This on its own was able to reduce the time down to the criteria. It already seems to work, but are there any mistakes I'm not seeing?

Answer

First of all, it's a really crap sample because it doesn't have anything to stop a smart compiler from optimizing away the entire thing. It doesn't even print the sum. Even gcc -O1 (instead of -O3) threw away some of the looping.

Normally you'd put your code in a function, and call it in a loop from main() in another file. And compile them separately, without whole-program cross-file optimisation, so the compiler can't do optimisations based on the compile-time constants you call it with. The repeat-loop being wrapped so tightly around the actual loop over the array is causing havoc with gcc's optimizer (see below).
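A minimal sketch of that split (the file and function names here are mine, not from the assignment): put the loop body in its own translation unit, and link without LTO so the call stays opaque to the optimizer.

/* sum_array.c -- compiled separately from the benchmark driver so the
 * compiler can't constant-fold or specialize on the caller's arguments. */
#include <stddef.h>

double sum_array(const double *a, size_t n)
{
    double sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += a[i];
    return sum;
}

The driver then calls sum_array(array, ARRAY_SIZE) in a loop N_TIMES times and prints the result, so nothing can be optimized away.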

Also:

gcc -Wall -O3 -march=native  fast-loop-cs201.c -o fl
fast-loop-cs201.c: In function ‘main’:
fast-loop-cs201.c:17:14: warning: ‘help’ is used uninitialized in this function [-Wuninitialized]
     long int help; 

I have to agree with EOF's disparaging remarks about your prof. Giving out code that optimizes away to nothing, and with uninitialized variables, is utter nonsense.

Some people are saying in comments that "the compiler doesn't matter", and that you're supposed to optimize your C source for the CPU microarchitecture, rather than letting the compiler do it. This is crap: for good performance, you have to be aware of what compilers can do, and can't do. Some optimizations are "brittle", and a small seemingly-innocent change to the source will stop the compiler from doing something.

I assume your prof mentioned a few things about performance. There are a crapton of different things that could come into play here, many of which I assume didn't get mentioned in a 2nd-year CS class.

Besides multithreading with OpenMP, there's vectorizing with SIMD. There are also optimizations for modern pipelined CPUs: specifically, avoid having one long dependency chain.
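For the OpenMP route, a reduction clause is the idiomatic way to parallelize this particular sum; a minimal sketch (sum_omp is a hypothetical helper, and this changes more than the assignment's rules allow):

// compile with: gcc -O3 -fopenmp ...
// reduction(+:sum) gives each thread a private partial sum and combines
// them once at the end, so there's no locking inside the hot loop.
double sum_omp(const double *a, long n)
{
    double sum = 0;
    #pragma omp parallel for reduction(+:sum)
    for (long j = 0; j < n; j++)
        sum += a[j];
    return sum;
}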

Other essential reading:

  • Agner Fog's guides for optimizing C and asm for x86. Some of it applies to all CPUs.
  • What Every Programmer Should Know About Memory

Your compiler manual is also essential, esp. for floating point code. Floating point has limited precision, and is not associative. The final sum does depend on which order you do the additions in. However, usually the difference in rounding error is small. So the compiler can get a big speedup by re-ordering things if you use -ffast-math to allow it. This may have been what your unroll-by-10 allowed.
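A quick demonstration of the non-associativity, with values picked to make the rounding visible:

#include <stdio.h>

int main(void)
{
    double a = 1e16, b = -1e16, c = 1.0;
    printf("%g\n", (a + b) + c);  // prints 1: a and b cancel exactly first
    printf("%g\n", a + (b + c));  // prints 0: b + c rounds back to -1e16
    return 0;
}

This is why the compiler won't reorder FP sums on its own without -ffast-math (or -fassociative-math).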

Instead of just unrolling, keeping multiple accumulators which you only add up at the end can keep the floating point execution units saturated, because FP instructions have latency != throughput. If you need the result of the last op to be complete before the next one can start, you're limited by latency. For FP add, that's one per 3 cycles. In Intel Sandybridge, IvB, Haswell, and Broadwell, the throughput of FP add is one per cycle. So you need to keep at least 3 independent ops that can be in flight at once to saturate the machine. For Skylake, it's 2 per cycle with latency of 4 clocks. (On the plus side for Skylake, FMA is down to 4 cycle latency.)

In this case, there's also basic stuff like pulling things out of the loop, e.g. help += ARRAY_SIZE.
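i.e. the per-iteration help++ can come out of the inner loop entirely, since its net effect per outer iteration is fixed:

for (j = 0; j < ARRAY_SIZE; j++) {
    sum += array[j];
}
help += ARRAY_SIZE;   // same final value as ARRAY_SIZE executions of help++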

I started out with the original inner loop, with just help += ARRAY_SIZE pulled out, and adding a printf at the end so gcc doesn't optimize everything away. Let's try some compiler options and see what we can achieve with gcc 4.9.2 (on my i5 2500k Sandybridge. 3.8GHz max turbo (slight OC), 3.3GHz sustained (irrelevant for this short benchmark)):


  • gcc -O0 fast-loop-cs201.c -o fl: 16.43s performance is a total joke. Variables are stored to memory after every operation, and re-loaded before the next. This is a bottleneck, and adds a lot of latency. Not to mention losing out on actual optimisations. Timing / tuning code with -O0 is not useful.
  • -O1: 4.87s
  • -O2: 4.89s
  • -O3: 2.453s (uses SSE to do 2 at once. I'm of course using a 64bit system, so hardware support for -msse2 is baseline.)
  • -O3 -ffast-math -funroll-loops: 2.439s
  • -O3 -march=sandybridge -ffast-math -funroll-loops: 1.275s (uses AVX to do 4 at once.)
  • -Ofast ...: no gain
  • -O3 -ftree-parallelize-loops=4 -march=sandybridge -ffast-math -funroll-loops: 0m2.375s real, 0m8.500s user. Looks like locking overhead killed it. It only spawns 4 threads total, but the inner loop is too short for it to be a win (because it collects the sums every time, instead of giving one thread the first 1/4 of the outer loop iterations).
  • -Ofast -fprofile-generate -march=sandybridge -ffast-math, run it, then
    -Ofast -fprofile-use -march=sandybridge -ffast-math: 1.275s

clang-3.5 -Ofast -march=native -ffast-math: 1.070s. (clang doesn't support -march=sandybridge).

gcc -O3 vectorizes in a hilarious way: The inner loop does 2 (or 4) iterations of the outer loop in parallel, by broadcasting one array element to all elements of an xmm (or ymm) register, and doing an addpd on that. So it sees the same values are being added repeatedly, but even -ffast-math doesn't let gcc just turn it into a multiply. Or switch the loops.

clang-3.5 vectorizes a lot better: it vectorizes the inner loop, instead of the outer, so it doesn't need to broadcast. It even uses 4 vector registers as 4 separate accumulators. However, it doesn't assume that calloc returns aligned memory, and for some reason it thinks the best bet is a pair of 128b loads.

vmovupd     -0x60(%rbx,%rcx,8),%xmm4
vinsertf128 $0x1,-0x50(%rbx,%rcx,8),%ymm4,%ymm4

It's actually slower when I tell it that the array is aligned. (With a stupid hack like array = (double*)((ptrdiff_t)array & ~31);, which actually generates an instruction to mask off the low 5 bits, because clang-3.5 doesn't support gcc's __builtin_assume_aligned.) I think the way the tight loop of 4x vaddpd mem, %ymmX,%ymmX is aligned puts cmp $0x271c,%rcx crossing a 32B boundary, so it can't macro-fuse with jne. Uop throughput shouldn't be an issue, though, since this code is only getting 0.65 insns per cycle (and 0.93 uops / cycle), according to perf.

Ahh, I checked with a debugger, and calloc is only returning a 16B-aligned pointer. So half the 32B memory accesses are crossing a cache line, causing a big slowdown. I guess it is slightly faster to do two separate 16B loads when your pointer is 16B-aligned but not 32B-aligned, on Sandybridge. The compiler is making a good choice here.

As we can see from clang beating gcc, multiple accumulators are excellent. The most obvious way to do this would be:

for (j = 0; j < ARRAY_SIZE; j+=4) {  // unroll 4 times
    sum0 += array[j];
    sum1 += array[j+1];
    sum2 += array[j+2];
    sum3 += array[j+3];
}

and then don't collect the 4 accumulators into one until after the end of the outer loop.
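In code, that final reduction happens just once, after everything (pairing the adds keeps even this last step from being one serial chain):

// after the outer loop finishes:
sum += (sum0 + sum1) + (sum2 + sum3);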

Your source change

sum += j[0]+j[1]+j[2]+j[3]+j[4]+j[5]+j[6]+j[7]+j[8]+j[9];

actually has a similar effect, thanks to out-of-order execution. Each group of 10 is a separate dependency chain. Order-of-operations rules say the j values get added together first, and then added to sum. So the loop-carried dependency chain is still only the latency of one FP add, and there's lots of independent work for each group of 10. Each group is a separate dependency chain of 9 adds, and takes few enough instructions for the out-of-order execution hardware to see the start of the next chain, and find the parallelism to keep those medium-latency, high-throughput FP execution units fed.
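Spelled out with explicit parentheses, this is just what C's evaluation order already guarantees:

// The 9 adds inside a group chain off each other, but not off sum:
double t = ((((((((j[0]+j[1])+j[2])+j[3])+j[4])+j[5])+j[6])+j[7])+j[8])+j[9];
sum += t;   // only this one add is on the loop-carried sum->sum chain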

With -O0, as your silly assignment apparently requires, values are stored to RAM at the end of every statement. (Technically, at every "sequence point", as the C standard calls it.) Writing longer expressions without updating any variables, even temporaries, will make -O0 run faster, but it's not a useful optimisation. Don't waste your time on changes that only help with -O0, esp. not at the expense of readability.
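To make that concrete (a sketch of the mechanism, not a recommendation):

// At -O0, sum is stored to memory and reloaded between these two statements:
sum += array[j];
sum += array[j+1];

// In one statement, the intermediate sum can stay in a register even at -O0:
sum += array[j] + array[j+1];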

Using 4 accumulators and not adding them together until the end of the outer loop defeats clang's auto-vectorizer. It still runs in only 1.66s (vs. 4.89s for gcc's non-vectorized -O2 with one accumulator). Even gcc -O2 without -ffast-math also gets 1.66s for this source change. Note that ARRAY_SIZE is known to be a multiple of 4, so I didn't include any cleanup code to handle the last up-to-3 elements (or to avoid reading past the end of the array, which would happen as written now). It's really easy to get something wrong and read past the end of the array when doing this.
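For sizes that aren't a multiple of 4, a scalar cleanup loop along these lines would be needed (a sketch):

int j;
for (j = 0; j + 4 <= ARRAY_SIZE; j += 4) {   // main unrolled loop
    sum0 += array[j];
    sum1 += array[j+1];
    sum2 += array[j+2];
    sum3 += array[j+3];
}
for (; j < ARRAY_SIZE; j++)    // cleanup: the last 0 to 3 elements
    sum0 += array[j];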

gcc, on the other hand, does vectorize this, but it also pessimises (un-optimises) the inner loop into a single dependency chain. I think it's doing multiple iterations of the outer loop, again.

Using gcc's platform-independent vector extensions, I wrote a version which compiles into apparently-optimal code:

// compile with gcc -g -Wall -std=gnu11 -Ofast -fno-tree-vectorize -march=native fast-loop-cs201.vec.c -o fl3-vec

#include <stdio.h>
#include <stdlib.h>
#include <stddef.h>
#include <assert.h>
#include <string.h>

// You are only allowed to make changes to this code as specified by the comments in it.

// The code you submit must have these two values.
#define N_TIMES     600000
#define ARRAY_SIZE   10000

int main(void)
{
    double  *array = calloc(ARRAY_SIZE, sizeof(double));
    double  sum = 0;
    int     i;

    // You can add variables between this comment ...
    long int help = 0;

    typedef double v4df __attribute__ ((vector_size (8*4)));
    v4df sum0={0}, sum1={0}, sum2={0}, sum3={0};

    const size_t array_bytes = ARRAY_SIZE*sizeof(double);
    double *aligned_array = NULL;

    // this more-than-declaration could go in an if(i == 0) block for strict compliance with the rules
    if ( posix_memalign((void**)&aligned_array, 32, array_bytes) ) {
        exit (1);
    }
    memcpy(aligned_array, array, array_bytes);  // In this one case: faster to align once and have no extra overhead for N_TIMES through the loop

    // ... and this one.

    // Please change 'your name' to your actual name.
    printf("CS201 - Asgmt 4 - I. Forgot\n");

    for (i = 0; i < N_TIMES; i++) {

        // You can change anything between this comment ...
    /*
    #if defined(__GNUC__) && (__GNUC__ * 100 + __GNUC_MINOR__) >= 407 // GCC 4.7 or later.
        array = __builtin_assume_aligned(array, 32);
    #else
        // force-align for other compilers.  This loop-invariant will be done outside the loop.
        array = (double*) ((ptrdiff_t)array & ~31);
    #endif
    */

        assert ( ARRAY_SIZE / (4*4) == (ARRAY_SIZE+15) / (4*4) );  // We don't have a cleanup loop to handle where the array size isn't a multiple of 16


        // incrementing pointers can be more efficient than indexing arrays
        // esp. on recent Intel where micro-fusion only works with one-register addressing modes
        // of course, the compiler can always generate pointer-incrementing asm from array-indexing source
        const double *start = aligned_array;

        while ( (ptrdiff_t)start & 31 ) {
            // annoying loops like this are the reason people use aligned buffers
            sum += *start++;        // scalar until we reach 32B alignment
            // in practice, this loop doesn't run, because we copy into an aligned buffer
            // This will also require a cleanup loop, and break our multiple-of-16 doubles assumption.
        }

        const v4df *end = (v4df *)(aligned_array+ARRAY_SIZE);
        for (const v4df *p = (v4df *)start ; p+3 < end; p+=4) {
            sum0 += p[0];   // p+=4 increments the pointer by 4 * 4 * 8 bytes
            sum1 += p[1];       // make sure you keep track of what you're incrementing
            sum2 += p[2];
            sum3 += p[3];

        }

        // the compiler might be smart enough to pull this out of the inner loop
        // in fact, gcc turns this into a 64bit movabs outside of both loops :P
        help+= ARRAY_SIZE;

            // ... and this one. But your inner loop must do the same
            // number of additions as this one does.

        /* You could argue legalese and say that
         if (i == 0) {
             for (j ...)
                 sum += array[j];
             sum *= N_TIMES;
         }
         * still does as many adds in its *INNER LOOP*, but it just doesn't run it as often
         */
    }

    // You can add some final code between this comment ...
    sum0 = (sum0 + sum1) + (sum2 + sum3);
    sum += sum0[0] + sum0[1] + sum0[2] + sum0[3];
    printf("sum = %g; help=%ld\n", sum, help);  // defeat the compiler.

    free (aligned_array);
    free (array);  // not strictly necessary, because this is the end of main().  Leaving it out for this special case is a bad example for a CS class, though.
    // ... and this one.

    return 0;
}

The inner loop compiles to:

  4007c0:       c5 e5 58 19             vaddpd (%rcx),%ymm3,%ymm3
  4007c4:       48 83 e9 80             sub    $0xffffffffffffff80,%rcx   # subtract -128, because -128 fits in imm8 instead of requiring an imm32 to encode add $128, %rcx
  4007c8:       c5 f5 58 49 a0          vaddpd -0x60(%rcx),%ymm1,%ymm1   # one-register addressing mode can micro-fuse
  4007cd:       c5 ed 58 51 c0          vaddpd -0x40(%rcx),%ymm2,%ymm2
  4007d2:       c5 fd 58 41 e0          vaddpd -0x20(%rcx),%ymm0,%ymm0
  4007d7:       4c 39 c1                cmp    %r8,%rcx  # compare p with end
  4007da:       75 e4                   jne    4007c0 <main+0xb0>

(For more, see online compiler output on godbolt. Note that I had to cast the return value of calloc, because godbolt uses C++ compilers, not C compilers. The inner loop is from .L3 to jne .L3. See http://stackoverflow.com/tags/x86/info for x86 asm links. See also Micro fusion and addressing modes, because this Sandybridge change hasn't made it into Agner Fog's manuals yet.)

Performance:

$ perf stat -e task-clock,cycles,instructions,r1b1,r10e,stalled-cycles-frontend,stalled-cycles-backend,L1-dcache-load-misses,cache-misses ./fl3-vec 
CS201 - Asgmt 4 - I. Forgot
sum = 0; help=6000000000

 Performance counter stats for './fl3-vec':

       1086.571078      task-clock (msec)         #    1.000 CPUs utilized          
     4,072,679,849      cycles                    #    3.748 GHz                    
     2,629,419,883      instructions              #    0.65  insns per cycle        
                                                  #    1.27  stalled cycles per insn
     4,028,715,968      r1b1                      # 3707.733 M/sec  # unfused uops
     2,257,875,023      r10e                      # 2077.982 M/sec  # fused uops.  lower than insns because of macro-fusion
     3,328,275,626      stalled-cycles-frontend   #   81.72% frontend cycles idle   
     1,648,011,059      stalled-cycles-backend    #   40.47% backend  cycles idle   
       751,736,741      L1-dcache-load-misses     #  691.843 M/sec                  
            18,772      cache-misses              #    0.017 M/sec                  

       1.086925466 seconds time elapsed

I still don't know why it's getting such low instructions per cycle. The inner loop is using 4 separate accumulators, and I checked with gdb that the pointers are aligned, so cache-bank conflicts shouldn't be the problem. Sandybridge's L2 cache can sustain one 32B transfer per cycle, which should keep up with one 32B FP vector add per cycle.

32B loads from L1 take 2 cycles (it wasn't until Haswell that Intel made 32B loads a single-cycle operation). However, there are 2 load ports, so the sustained throughput is 32B per cycle (which we're not reaching).

Perhaps the loads need to be pipelined ahead of when they're used, to minimize having the ROB (re-order buffer) fill up when a load stalls? But the perf counters indicate a fairly high L1 cache hit rate, so hardware prefetch from L2 to L1 seems to be doing its job.

0.65 instructions per cycle is only about half way to saturating the vector FP adder. This is frustrating. Even IACA says the loop should run in 4 cycles per iteration. (i.e. saturate the load ports and port1 (where the FP adder lives)) :/

Update: I guess L2 latency was the problem after all. Reducing ARRAY_SIZE to 1008 (a multiple of 16) and increasing N_TIMES by a factor of 10 brought the runtime down to 0.5s. That's 1.68 insns per cycle. (The inner loop is 7 total instructions for 4 FP adds, so we're finally saturating the vector FP add unit and the load ports.) IDK why the HW prefetcher can't get ahead after one stall and then stay ahead. Possibly software prefetch could help? Maybe somehow avoid having the HW prefetcher run past the end of the array, and instead start prefetching the start of the array again. (Loop tiling is a much better solution, see below.)
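If software prefetch were worth trying, it would slot into the inner loop of the vector version above something like this (a sketch; __builtin_prefetch is the gcc/clang builtin, and the 512B distance is a guess that would need tuning):

for (const v4df *p = (v4df *)start; p + 3 < end; p += 4) {
    __builtin_prefetch(p + 16, 0, 0);  // read, ~512B ahead, low temporal locality
    sum0 += p[0];
    sum1 += p[1];
    sum2 += p[2];
    sum3 += p[3];
}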

Intel CPUs only have 32k each L1-data and L1-instruction caches. I think your array would just barely fit in the L1 on an AMD CPU.

Gcc's attempt to vectorize by broadcasting the same value into a parallel add doesn't seem so crazy. If it had managed to get this right (using multiple accumulators to hide latency), that would have allowed it to saturate the vector FP adder with only half the memory bandwidth. As-is, it was pretty much a wash, probably because of overhead in broadcasting.

Also, it's pretty silly. The N_TIMES is just a make-work repeat. We don't actually want to optimize doing the identical work multiple times, unless we want to win at silly assignments like this. A source-level way to do this would be to increment i in the part of the code we're allowed to modify:

for (...) {
    sum += a[j] + a[j] + a[j] + a[j];
}
i += 3;  // The inner loop does 4 total iterations of the outer loop

More realistically, to deal with this you could interchange your loops (loop over the array once, adding each value N_TIMES times). I think I've read that Intel's compiler will sometimes do that for you.
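An interchanged version is a short sketch; the total number of additions is unchanged, but each element is loaded from memory only once:

for (j = 0; j < ARRAY_SIZE; j++) {
    double x = array[j];          // loaded once
    for (i = 0; i < N_TIMES; i++)
        sum += x;                 // N_TIMES adds, no memory traffic
}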

A more general technique is called cache blocking, or loop tiling. The idea is to work on your input data in small blocks that fit in cache. Depending on your algorithm, it can be possible to do various stages of thing on a chunk, then repeat for the next chunk, instead of having each stage loop over the whole input. As always, once you know the right name for a trick (and that it exists at all), you can google up a ton of info.
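Applied to this make-work benchmark, tiling would look something like this sketch (BLOCK is my choice; 1024 doubles = 8kB, far inside a 32k L1D):

#define BLOCK 1024

for (size_t base = 0; base < ARRAY_SIZE; base += BLOCK) {
    // do all N_TIMES passes over this chunk while it's hot in L1
    for (int t = 0; t < N_TIMES; t++)
        for (size_t k = base; k < base + BLOCK && k < ARRAY_SIZE; k++)
            sum += array[k];
}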

You could rules-lawyer your way into putting an interchanged loop inside an if (i == 0) block in the part of the code you're allowed to modify. It would still do the same number of additions, but in a more cache-optimal order.
