C ++:提高3d阵列中的缓存性能 [英] C++: Improving cache performance in a 3d array

查看:101
本文介绍了C ++:提高3d阵列中的缓存性能的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我不知道如何在一个非常低的水平优化缓存性能,考虑缓存行大小或关联性。这不是你可以一夜之间学到的东西。考虑到我的程序将运行在许多不同的系统和架构,我不认为这是值得的。但是,仍然可能有一些步骤,我可以采取减少缓存未命中一般。



这里是我的问题的描述:



我有一个3d整数数组,表示空间中的点处的值,如[x] [y] [z]。每个维度都是相同的大小,所以它就像一个立方体。从那里,我需要做另一个3d数组,其中每个值在这个新的数组是一个函数的7​​个参数:原始3d数组中的对应值,加上6个触摸空间中的索引。我现在不用担心立方体的边和角。



这是我在C ++代码中的意思:



void process3DArray(int input [LENGTH] [LENGTH] [LENGTH],
int output [LENGTH] [LENGTH] [LENGTH])
{
for(int i = 1; i< LENGTH-1; i ++)
for(int j = 1; j for ; k // for循环从1开始,在LENGTH-1之前停止
//或者其他方式我会超出边界错误
//我不关心
// 3d数组立方体的边和角。
{
int value = input [i] [j] [k];

//我期望在这里疯狂的缓存未命中:
int posX = input [i + 1] [j] [k];
int negX = input [i-1] [j] [k];
int posY = input [i] [j + 1] [k];
int negY = input [i] [j-1] [k];
int posZ = input [i] [j] [k + 1];
int negZ = input [i] [j] [k-1];

output [i] [j] [k] =
process(value,posX,negX,posY,negY,posZ,negZ)
}
}

然而,似乎如果LENGTH足够大,当我获取进程的参数时,我会得到吨的缓存未命中。有没有一个缓存友好的方式来做到这一点,或更好的方式来表示我的数据而不是一个3d数组?



如果你有时间来回答这些额外问题,我要考虑LENGTH的价值吗?喜欢不同的是LENGTH是20 vs 100 vs 10000.此外,如果我使用整数之外的其他东西,如可能一个64字节的结构吗?



<对我有用[0]丢个板砖[0]引用|举报|编辑删除管理c >重要。但是如果是,我想知道为什么。

  int main(){
int data [LENGTH] [LENGTH] [LENGTH];
for(int i = 0; i for(int j = 0; j for(int k = 0; k < LENGTH; k ++)
data [i] [j] [k] = rand()*(i + j + k)

int result [LENGTH] [LENGTH] [LENGTH];
process3DArray(data,result);
}


解决方案

有权。如果你使用Fortran,你会做的完全错了,但这是另一个故事。你有权利的是,你正沿着内存地址最接近的方向在内循环中进行处理。单个存储器提取(超出高速缓存)将提取多个值,对应于k的一系列相邻值。在你的循环中,缓存将包含一些来自i,j的值;从i +/- 1,j和从i,j +/- 1的类似数。所以你基本上有五个不相交的内存段。对于较小的LENGTH值,这些将只是1或3部分的存储器。这是在如何建立缓存的性质,你可以有多个不同的内存在活动集中的不相交的部分。



我希望process()是小,和内联。否则这可能是不重要的。此外,它会影响您的代码是否适合指令高速缓存。



由于您对性能感兴趣,因此几乎总是更好地初始化五个指针(您只需要一个用于值,posZ和negZ),然后在循环中取*(p ++)。

  input [i + 1] [j] [k] 

要求编译器生成3个加法和两个乘法,除非你有一个非常好的优化器。如果你的编译器特别懒惰的寄存器分配,你也得到四个内存访问;否则一个。

  * inputIplusOneJK ++ 

要求一个add和一个内存引用。


I don't know how to optimize cache performance at a really low level, thinking about cache-line size or associativity. That's not something you can learn overnight. Considering my program will run on many different systems and architectures, I don't think it would be worth it anyway. But still, there are probably some steps I can take to reduce cache misses in general.

Here is a description of my problem:

I have a 3d array of integers, representing values at points in space, like [x][y][z]. Each dimension is the same size, so it's like a cube. From that I need to make another 3d array, where each value in this new array is a function of 7 parameters: the corresponding value in the original 3d array, plus the 6 indices that "touch" it in space. I'm not worried about the edges and corners of the cube for now.

Here is what I mean in C++ code:

void process3DArray (int input[LENGTH][LENGTH][LENGTH], 
                     int output[LENGTH][LENGTH][LENGTH])
{
    for(int i = 1; i < LENGTH-1; i++)
        for (int j = 1; j < LENGTH-1; j++)
            for (int k = 1; k < LENGTH-1; k++)
            //The for loops start at 1 and stop before LENGTH-1
            //or other-wise I'll get out-of-bounds errors
            //I'm not concerned with the edges and corners of the 
            //3d array "cube" at the moment.
            {
                int value = input[i][j][k];

                //I am expecting crazy cache misses here:
                int posX = input[i+1] [j]   [k];
                int negX = input[i-1] [j]   [k];
                int posY = input[i]   [j+1] [k];
                int negY = input[i]   [j-1] [k];
                int posZ = input[i]   [j]   [k+1];
                int negZ = input[i]   [j]   [k-1];

                output [i][j][k] = 
                    process(value, posX, negX, posY, negY, posZ, negZ);
            }
}

However, it seems like if LENGTH is large enough, I'll get tons of cache misses when I'm fetching the parameters for process. Is there a cache-friendlier way to do this, or a better way to represent my data other than a 3d array?

And if you have the time to answer these extra questions, do I have to consider the value of LENGTH? Like it's different whether LENGTH is 20 vs 100 vs 10000. Also, would I have to do something else if I used something other than integers, like maybe a 64-byte struct?

@ ildjarn:

Sorry, I did not think that the code that generates the arrays I am passing into process3DArray mattered. But if it does, I would like to know why.

int main() {
    int data[LENGTH][LENGTH][LENGTH];
    for(int i = 0; i < LENGTH; i++)
        for (int j = 0; j < LENGTH; j++)
            for (int k = 0; k < LENGTH; k++)
                data[i][j][k] = rand() * (i + j + k);

    int result[LENGTH][LENGTH][LENGTH];
    process3DArray(data, result);
}

解决方案

The most important thing you already have right. If you were using Fortran, you'd be doing it exactly wrong, but that's another story. What you have right is that you are processing in the inner loop along the direction where memory addresses are closest together. A single memory fetch (beyond the cache) will pull in multiple values, corresponding to a series of adjacent values of k. Inside your loop the cache will contain some number of values from i,j; a similar number from i+/-1, j and from i,j+/-1. So you basically have five disjoint sections of memory active. For small values of LENGTH these will only be 1 or three sections of memory. It is in the nature of how caches are built that you can have more than this many disjoint sections of memory in your active set.

I hope process() is small, and inline. Otherwise this may well be insignificant. Also, it will affect whether your code fits in the instruction cache.

Since you're interested in performance, it is almost always better to initialize five pointers (you only need one for value, posZ and negZ), and then take *(p++) inside the loop.

input[i+1] [j]   [k];

is asking the compiler to generate 3 adds and two multiplies, unless you have a very good optimizer. If your compiler is particularly lazy about register allocation, you also get four memory accesses; otherwise one.

*inputIplusOneJK++ 

is asking for one add and a memory reference.

这篇关于C ++:提高3d阵列中的缓存性能的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆