为什么我的程序慢慢循环完全8192元素？ [英] Why is my program slow when looping over exactly 8192 elements?

查看：173 发布时间：2016/10/11 20:30:26 c++ performance memory-management gcc

本文介绍了为什么我的程序慢慢循环完全8192元素？的处理方法，对大家解决问题具有一定的参考价值，需要的朋友们下面随着小编来一起学习吧！

问题描述

这是从有问题的程序的提取。矩阵 img [] [] 的大小为SIZE×SIZE，初始化为：

img [j] [i] = 2 * j + i

res [] [] ，这里的每个字段都是img矩阵中的9个字段的平均值。

  for（i = 1; i  res [j] [i] = 0; 
 for（k = -1; k <2; k ++）
 for（l = -1; l <2; l ++）
 res [j] [i] + = img [j + l] [i + k]; 
 res [j] [i] / = 9; 
}

这是程序的全部。为了完整起见，这里是以前的。没有代码。正如你所看到的，它只是初始化。

  #define SIZE 8192 
 float img [SIZE] [SIZE] // input image 
 float res [SIZE] [SIZE]; // mean of mean filter 
 int i，j，k，l; （i = 0; i  for（j = 0; j  img [j] [i] =（2 * j + i ）％8196;

基本上，当SIZE是2048的倍数时，执行时间：

  SIZE = 8191：3.44 secs 
 SIZE = 8192：7.20 secs 
 SIZE = 8193：3.18 secs

编译器是GCC。
从我知道，这是因为内存管理，但我不太了解这个主题，这就是为什么我在这里问。

我已经知道malloc / free了，但是如果有人能解释这些执行时间，我已经足够快乐。，但是问题不是使用的内存量，它只是执行时间，所以我不知道这将如何帮助。

解决方案

p>这是因为以下相关问题导致了同样的超级对齐问题：

为什么转置512x512的矩阵比转置513x513的矩阵要慢得多？

矩阵乘法：矩阵大小差异很小，计时差异很大

但是这只是因为代码还有一个问题。

从原始循环开始：

<$ p （j = 1; j res [j]）中的（i = 1; i

 [i] = 0; 
 for（k = -1; k <2; k ++）
 for（l = -1; l <2; l ++）
 res [j] [i] + = img [j + l] [i + k]; 
 res [j] [i] / = 9; 
}

首先注意两个内部循环是微不足道的。它们可以展开如下：

  for（i = 1; i  for（j = 1; j  res [j] [i] = 0; 
 res [j] [i] + = img [j-1] [i-1]; 
 res [j] [i] + = img [j] [i-1]; 
 res [j] [i] + = img [j + 1] [i-1]; 
 res [j] [i] + = img [j-1] [i]; 
 res [j] [i] + = img [j] [i]; 
 res [j] [i] + = img [j + 1] [i]; 
 res [j] [i] + = img [j-1] [i + 1]; 
 res [j] [i] + = img [j] [i + 1]; 
 res [j] [i] + = img [j + 1] [i + 1]; 
 res [j] [i] / = 9; 
} 
}

这样就留下了两个外循环，

现在我们可以看到这个问题是一样的：为什么迭代二维数组时循环的顺序会影响性能？

您是逐列迭代矩阵而不是逐行迭代。

要解决这个问题，您应该互换两个循环。

  for（j = 1; j  for（i = 1; i  res [j] [i] = 0; 
 res [j] [i] + = img [j-1] [i-1]; 
 res [j] [i] + = img [j] [i-1]; 
 res [j] [i] + = img [j + 1] [i-1]; 
 res [j] [i] + = img [j-1] [i]; 
 res [j] [i] + = img [j] [i]; 
 res [j] [i] + = img [j + 1] [i]; 
 res [j] [i] + = img [j-1] [i + 1] 
 res [j] [i] + = img [j] [i + 1]; 
 res [j] [i] + = img [j + 1] [i + 1] 
 res [j] [i] / = 9; 
} 
}

这完全消除了所有非顺序访问

Core i7 920 @ 3.5 GHz

原始代码：

  8191：1.499秒
 8192：2.122秒
 8193：1.582秒

互换外环：

  8191：0.376秒
 8192：0.357秒
 8193：0.351秒

Here is the extract from the program in question. The matrix img[][] has the size SIZE×SIZE, and is initialized at:



img[j][i] = 2 * j + i

Then, you make a matrix res[][], and each field in here is made to be the average of the 9 fields around it in the img matrix. The border is left at 0 for simplicity.
for(i=1;i<SIZE-1;i++) 
    for(j=1;j<SIZE-1;j++) {
        res[j][i]=0;
        for(k=-1;k<2;k++) 
            for(l=-1;l<2;l++) 
                res[j][i] += img[j+l][i+k];
        res[j][i] /= 9;
}
That's all there's to the program. For completeness' sake, here is what comes before. No code comes after. As you can see, it's just initialization.
#define SIZE 8192
float img[SIZE][SIZE]; // input image
float res[SIZE][SIZE]; //result of mean filter
int i,j,k,l;
for(i=0;i<SIZE;i++) 
    for(j=0;j<SIZE;j++) 
        img[j][i] = (2*j+i)%8196;
Basically, this program is slow when SIZE is a multiple of 2048, e.g. the execution times:
SIZE = 8191: 3.44 secs
SIZE = 8192: 7.20 secs
SIZE = 8193: 3.18 secs
The compiler is GCC.
From what I know, this is because of memory management, but I don't really know too much about that subject, which is why I'm asking here.

Also how to fix this would be nice, but if someone could explain these execution times I'd already be happy enough.

I already know of malloc/free, but the problem is not amount of memory used, it's merely execution time, so I don't know how that would help.
 解决方案 
The difference is caused by the same super-alignment issue from the following related questions:


Why is transposing a matrix of 512x512 much slower than transposing a matrix of 513x513?
Matrix multiplication: Small difference in matrix size, large difference in timings


But that's only because there's one other problem with the code.

Starting from the original loop:
for(i=1;i<SIZE-1;i++) 
    for(j=1;j<SIZE-1;j++) {
        res[j][i]=0;
        for(k=-1;k<2;k++) 
            for(l=-1;l<2;l++) 
                res[j][i] += img[j+l][i+k];
        res[j][i] /= 9;
}
First notice that the two inner loops are trivial. They can be unrolled as follows:
for(i=1;i<SIZE-1;i++) {
    for(j=1;j<SIZE-1;j++) {
        res[j][i]=0;
        res[j][i] += img[j-1][i-1];
        res[j][i] += img[j  ][i-1];
        res[j][i] += img[j+1][i-1];
        res[j][i] += img[j-1][i  ];
        res[j][i] += img[j  ][i  ];
        res[j][i] += img[j+1][i  ];
        res[j][i] += img[j-1][i+1];
        res[j][i] += img[j  ][i+1];
        res[j][i] += img[j+1][i+1];
        res[j][i] /= 9;
    }
}
So that leaves the two outer-loops that we're interested in.

Now we can see the problem is the same in this question: Why does the order of the loops affect performance when iterating over a 2D array?

You are iterating the matrix column-wise instead of row-wise.



To solve this problem, you should interchange the two loops.
for(j=1;j<SIZE-1;j++) {
    for(i=1;i<SIZE-1;i++) {
        res[j][i]=0;
        res[j][i] += img[j-1][i-1];
        res[j][i] += img[j  ][i-1];
        res[j][i] += img[j+1][i-1];
        res[j][i] += img[j-1][i  ];
        res[j][i] += img[j  ][i  ];
        res[j][i] += img[j+1][i  ];
        res[j][i] += img[j-1][i+1];
        res[j][i] += img[j  ][i+1];
        res[j][i] += img[j+1][i+1];
        res[j][i] /= 9;
    }
}
This eliminates all the non-sequential access completely so you no longer get random slow-downs on large powers-of-two.



Core i7 920 @ 3.5 GHz

Original code:
8191: 1.499 seconds
8192: 2.122 seconds
8193: 1.582 seconds
Interchanged Outer-Loops:
8191: 0.376 seconds
8192: 0.357 seconds
8193: 0.351 seconds


                        
这篇关于为什么我的程序慢慢循环完全8192元素？的文章就介绍到这了，希望我们推荐的答案对大家有所帮助，也希望大家多多支持IT屋！


                    
                        查看全文

为什么我的程序慢慢循环完全8192元素？ [英] Why is my program slow when looping over exactly 8192 elements?

问题描述

相关文章

C/C++开发最新文章

热门教程

热门工具

登录关闭

为什么我的程序慢慢循环完全8192元素？ [英] Why is my program slow when looping over exactly 8192 elements?

问题描述

相关文章

C/C++开发最新文章

热门教程

热门工具

登录 关闭

登录关闭