加强与本地缓存的C函数的性能? [英] Improve C function performance with cache locality?

查看:224
本文介绍了加强与本地缓存的C函数的性能?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我必须找到psented作为二维数组和函数原型为

I have to find a diagonal difference in a matrix represented as 2d array and the function prototype is

int diagonal_diff(int x[512][512])

我必须使用一个二维数组,数据是512×512。这是一个SPARC机器上测试:我现在的时间是6ms的,但我需要在2毫秒

I have to use a 2d array, and the data is 512x512. This is tested on a SPARC machine: my current timing is 6ms but I need to be under 2ms.

示例数据:

[3][4][5][9]
[2][8][9][4]
[6][9][7][3]
[5][8][8][2]

的区别是:

|4-2| + |5-6| + |9-5| + |9-9| + |4-8| + |3-8| = 2 + 1 + 4 + 0 + 4 + 5 = 16

在为了做到这一点,我用下面的算法:

In order to do that, I use the following algorithm:

int i,j,result=0;
for(i=0; i<4; i++)
    for(j=0; j<4; j++)
        result+=abs(array[i][j]-[j][i]);

return result;

但这种算法保持访问列,行,列,行等,这使低效利用高速缓存。

But this algorithm keeps accessing the column, row, column, row, etc which make inefficient use of cache.

有没有改进我的功能的一种方式?

Is there a way to improve my function?

推荐答案

编辑:为什么在面向块的方法更快?我们通过确保我们是否遍历块按行或列,我们保证整块适合缓存利用CPU的数据缓存。

Why is a block oriented approach faster? We are taking advantage of the CPU's data cache by ensuring that whether we iterate over a block by row or by column, we guarantee that the entire block fits into the cache.

例如,如果你有32个字节,并且缓存行的 INT 为4个字节,可以容纳一个8x8的 INT 矩阵到8高速缓存行。假设你有一个足够大的数据缓存,您可以按行或按列迭代的矩阵,保证您不会鞭打缓存。想想它的另一种方法是,如果你的矩阵在缓存适合,你可以穿越它,你想要的任何方式。

For example, if you have a cache line of 32-bytes and an int is 4 bytes, you can fit a 8x8 int matrix into 8 cache lines. Assuming you have a big enough data cache, you can iterate over that matrix either by row or by column and be guaranteed that you do not thrash the cache. Another way to think about it is if your matrix fits in the cache, you can traverse it any way you want.

如果你有一个矩阵就是要大得多,说512×512,那么你就需要调整您的矩阵遍历,这样你不鞭打缓存。例如,如果您遍历矩阵的布局相反的顺序基质中,你几乎永远怀念您每次访问元素缓存。

If you have a matrix that is much bigger, say 512x512, then you need to tune your matrix traversal such that you don't thrash the cache. For example, if you traverse the matrix in the opposite order of the layout of the matrix, you will almost always miss the cache on every element you visit.

一个块导向的方法可以确保你只有一个缓存丢失数据,你会最终参观前的CPU必须刷新高速缓存行。换句话说,调整到高速缓存行大小的块为导向的方法将确保你不会鞭打缓存。

A block oriented approach ensures that you only have a cache miss for data you will eventually visit before the CPU has to flush that cache line. In other words, a block oriented approach tuned to the cache line size will ensure you don't thrash the cache.

所以,如果你想优化你的机器上运行的高速缓存行的大小,您可以通过以块的形式矩阵迭代,并确保你只访问的每个矩阵元素一次:

So, if you are trying to optimize for the cache line size of the machine you are running on, you can iterate over the matrix in block form and ensure you only visit each matrix element once:

int sum_diagonal_difference(int array[512][512], int block_size)
{
    int i,j, block_i, block_j,result=0;

     // sum diagonal blocks
    for (block_i= 0; block_i<512; block_i+= block_size)
        for (block_j= block_i + block_size; block_j<512; block_j+= block_size)
            for(i=0; i<block_size; i++)
                for(j=0; j<block_size; j++)
                    result+=abs(array[block_i + i][block_j + j]-array[block_j + j][block_i + i]);

    result+= result;

     // sum diagonal
    for (int block_offset= 0; block_offset<512; block_offset+= block_size)
    {
        for (i= 0; i<block_size; ++i)
        {
            for (j= i+1; j<block_size; ++j)
            {
                int value= abs(array[block_offset + i][block_offset + j]-array[block_offset + j][block_offset + i]);
                result+= value + value;
            }
        }
    }

    return result;
}

您应该与 BLOCK_SIZE 各种数值试验。在我的机器, 8 导致的最大速度可达(2.5倍)相比,1的 BLOCK_SIZE (和〜 5倍相比原来迭代在整个矩阵)。在 BLOCK_SIZE 最好应 cache_line_size_in_bytes /的sizeof(INT)

You should experiment with various values for block_size. On my machine, 8 lead to the biggest speed up (2.5x) compared to a block_size of 1 (and ~5x compared to the original iteration over the entire matrix). The block_size should ideally be cache_line_size_in_bytes/sizeof(int).

这篇关于加强与本地缓存的C函数的性能?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆