Cache Memory Optimization Array Transpose: C

Problem Description

typedef int array[2][2];

void transpose(array dst, array src) {
    int i, j;
    for (j = 0; j < 2; j++) {
        for (i = 0; i < 2; i++) {
            /* reads src row by row, writes dst column by column */
            dst[i][j] = src[j][i];
        }
    }
}

The src array starts at address 0 and the dst array starts at address 0x10.

The L1 data cache is direct-mapped and write-allocate, with an 8-byte block size.

The total cache size is 16 data bytes.

For each access to an entry of src and dst, is it a cache hit or a miss?

The answer is:

src: 
[0][0] -> miss,
[0][1] -> miss,
[1][0] -> miss,
[1][1] -> hit

dst:
[0][0] -> miss,
[0][1] -> miss,
[1][0] -> miss,
[1][1] -> miss

If the total cache size is 32 data bytes, the answer is:

src: 
[0][0] -> miss,
[0][1] -> hit,
[1][0] -> miss,
[1][1] -> hit

dst:
[0][0] -> miss,
[0][1] -> hit,
[1][0] -> miss,
[1][1] -> hit

I am unsure about both outcomes. I don't really understand how arrays and caching interact.

Recommended Answer

So, in the first instance you have two cache lines of 8 bytes each, for a total of 16 bytes. I'll assume an int size of 4 bytes. Given the way C lays out arrays and the offsets you've provided, these are the memory lines that can be cached:

Cacheable lines:
#A: &src[0][0] = 0x00, &src[0][1] = 0x04
#B: &src[1][0] = 0x08, &src[1][1] = 0x0C
#C: &dst[0][0] = 0x10, &dst[0][1] = 0x14
#D: &dst[1][0] = 0x18, &dst[1][1] = 0x1C
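
If it helps to see where those addresses come from: a 2x2 int array is stored row-major, so (assuming 4-byte ints and the base addresses given in the problem) &a[j][i] sits at base + (2*j + i)*4, and dividing by the 8-byte block size tells you which memory line an element belongs to. Here is a minimal sketch that just reproduces that arithmetic; the base addresses are taken from the question, everything else is an assumption:

#include <stdio.h>

/* Sketch only: reproduces the address arithmetic above, assuming 4-byte ints
 * and the base addresses from the problem (src at 0x00, dst at 0x10).
 * addr / 8 identifies the 8-byte memory line an element falls into. */
int main(void) {
    const char *name[] = { "src", "dst" };
    const unsigned base[] = { 0x00, 0x10 };

    for (int a = 0; a < 2; a++)
        for (int j = 0; j < 2; j++)
            for (int i = 0; i < 2; i++) {
                unsigned addr = base[a] + (2 * j + i) * 4;  /* row-major layout */
                printf("&%s[%d][%d] = 0x%02X  (memory line %u)\n",
                       name[a], j, i, addr, addr / 8);
            }
    return 0;
}

Running it lists lines 0 and 1 for src and lines 2 and 3 for dst, which is the #A-#D grouping above.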

Then we need to know the order in which the program accesses each memory address. I'm assuming no compiler optimizations that might reorder the accesses.

Access order and cache behavior (assuming initially empty):
#1: load src[0][0] --> Miss line A = cache slot 1
#2: save dst[0][0] --> Miss line C = cache slot 2
#3: load src[0][1] --> Hit  line A = cache slot 1
#4: save dst[0][1] --> Hit  line C = cache slot 2
#5: load src[1][0] --> Miss line B = cache slot 1 (LRU, replaces line A)
#6: save dst[1][0] --> Miss line D = cache slot 2 (LRU, replaces line C)
#7: load src[1][1] --> Hit  line B = cache slot 1
#8: save dst[1][1] --> Hit  line D = cache slot 2

Which, I think, matches your second answer. Then with a cache size of 32 bytes (4 lines), assuming all other factors are constant:

Access order and cache behavior (assuming initially empty):
#1: load src[0][0] --> Miss line A = cache slot 1
#2: save dst[0][0] --> Miss line C = cache slot 2
#3: load src[0][1] --> Hit  line A = cache slot 1
#4: save dst[0][1] --> Hit  line C = cache slot 2
#5: load src[1][0] --> Miss line B = cache slot 3
#6: save dst[1][0] --> Miss line D = cache slot 4
#7: load src[1][1] --> Hit  line B = cache slot 3
#8: save dst[1][1] --> Hit  line D = cache slot 4

They are identical. The only difference would appear if you ran transpose again. In case 1 you would get exactly the same behavior (well, you'd start with the cache full of all the wrong things, so it might as well be empty). In the larger-cache case, though, everything you need for the second call is already cached, so there would be no cache misses.

The difference between my answers and yours is most likely due to our assumptions about what the compiler does with your loop counters (i and j). I would assume they are both kept in registers (and so would not affect the data cache). You may need to assume they live somewhere in memory (and therefore interact with the cache) to get the expected results.
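
If you want to experiment with those assumptions yourself, here is a minimal sketch of a direct-mapped cache simulator. Everything in it is an assumption spelled out in the comments: 4-byte ints, the base addresses from the question, write-allocate (so loads and stores are treated alike), the standard direct-mapped indexing set = (address / block size) mod number-of-lines, i and j kept in registers, and the loads/stores replayed in the order the transpose code issues them. Change the parameters to see how the hit/miss pattern moves.

#include <stdio.h>

#define ELEM  4u               /* assumed sizeof(int)         */
#define BLOCK 8u               /* block size in bytes         */

static long tag[16];           /* one tag per set; -1 = empty */
static unsigned nsets;

static void cache_init(unsigned cache_bytes) {
    nsets = cache_bytes / BLOCK;
    for (unsigned s = 0; s < nsets; s++) tag[s] = -1;
}

/* Direct-mapped, write-allocate: loads and stores behave the same way. */
static void cache_access(const char *what, unsigned addr) {
    unsigned block = addr / BLOCK;
    unsigned set   = block % nsets;
    int hit = (tag[set] == (long)block);
    tag[set] = block;
    printf("%-16s addr 0x%02X  set %u  %s\n", what, addr, set, hit ? "hit" : "MISS");
}

int main(void) {
    const unsigned src = 0x00, dst = 0x10;    /* base addresses from the question */
    const unsigned sizes[] = { 16, 32 };      /* the two cache sizes to compare   */
    char label[32];

    for (int c = 0; c < 2; c++) {
        printf("--- cache size %u bytes ---\n", sizes[c]);
        cache_init(sizes[c]);
        /* Replay the accesses of: for j { for i { dst[i][j] = src[j][i]; } }
         * i and j are assumed to stay in registers, so they add no traffic. */
        for (int j = 0; j < 2; j++)
            for (int i = 0; i < 2; i++) {
                sprintf(label, "load src[%d][%d]", j, i);
                cache_access(label, src + (2 * j + i) * ELEM);
                sprintf(label, "save dst[%d][%d]", i, j);
                cache_access(label, dst + (2 * i + j) * ELEM);
            }
        printf("\n");
    }
    return 0;
}

Under these particular assumptions the 16-byte run happens to reproduce the first answer key and the 32-byte run the second; if you instead assume i and j live in memory, add cache_access calls for their addresses and watch the pattern change.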
