Efficient access to matrix columns
Question
I need to access a large matrix (larger than 2000x2000) column-wise. My algorithm requires one row pass and one column pass. The row pass is fine for memory efficiency (few cache misses), but how can I reduce the cache misses in the column pass? I need efficiency.
The only idea I have so far is to declare n local variables (n based on the memory fetch size) and process several columns at once:
int a1, a2, a3, a4;
for (int j = 0; j < DIM_Y; j += 4)
    for (int i = 0; i < DIM_X; i++) {
        a1 = matrix[i][j]; ...; a4 = matrix[i][j + 3];
        // do the column processing on the 4 variables
    }
It's in C or C++, and the array holds int or char.
Any suggestions and comments are welcome.
Thanks.
Recommended answer
Two basic techniques apply:
1) loop blocking
instead of
for (j=0;j<2000;j++)
for (i=0;i<2000;i++)
process_element(i,j);
use
for (j=0;j<2000;j+=8)
for (i=0;i<2000;i+=8)
process_block_of_8x8(i,j);
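
As a rough illustration of loop blocking, here is a minimal sketch of a column pass that sums each column of a row-major int matrix in 8x8 tiles. The function name column_sums_blocked, the BLOCK constant, and the stride parameter (row pitch in elements) are assumptions made for this example; they do not come from the question or the answer, and the best tile size depends on the target cache.

```c
#include <assert.h>
#include <stddef.h>

#define BLOCK 8  /* assumed tile edge; tune for the target cache */

/* Column pass over a row-major matrix: sums[j] = sum of column j.
   m holds rows * stride ints; stride (>= cols) is the row pitch in elements. */
void column_sums_blocked(const int *m, size_t rows, size_t cols,
                         size_t stride, long *sums)
{
    assert(stride >= cols);

    for (size_t j = 0; j < cols; j++)
        sums[j] = 0;

    /* Visit the matrix tile by tile instead of one full column at a time. */
    for (size_t jb = 0; jb < cols; jb += BLOCK) {
        size_t jend = (cols - jb < BLOCK) ? cols : jb + BLOCK;
        for (size_t ib = 0; ib < rows; ib += BLOCK) {
            size_t iend = (rows - ib < BLOCK) ? rows : ib + BLOCK;
            for (size_t i = ib; i < iend; i++) {
                const int *row = m + i * stride;
                /* These elements are adjacent in memory, so each cache
                   line fetched for row i is used several times. */
                for (size_t j = jb; j < jend; j++)
                    sums[j] += row[j];
            }
        }
    }
}
```

For an unpadded matrix, pass stride equal to cols.
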
2) non-power-of-2 row stride (e.g. 8192 bytes + 64); pad if necessary
in this case row[i] .. row[i+7] will not compete for the same cache set and evict one another
the data should be in one contiguous memory region, with the padding calculated manually.
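
The padding calculation might look like the sketch below. It assumes a 64-byte cache line and int elements. The names LINE_BYTES, padded_stride_bytes, and alloc_padded_matrix are hypothetical helpers invented for this example. It only avoids exact powers of two, as the answer suggests; a real cache may also conflict on other strides.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define LINE_BYTES 64  /* assumed cache line size */

static int is_power_of_two(size_t x)
{
    return x != 0 && (x & (x - 1)) == 0;
}

/* Row pitch in bytes: round the row up to a whole number of cache lines,
   then add one more line if the result is a power of two. */
size_t padded_stride_bytes(size_t row_bytes)
{
    size_t stride = (row_bytes + LINE_BYTES - 1) / LINE_BYTES * LINE_BYTES;
    if (is_power_of_two(stride))
        stride += LINE_BYTES;
    return stride;
}

/* Allocate a zeroed rows x cols int matrix in one contiguous region.
   Element (i, j) lives at m[i * (*stride_elems) + j].
   Returns NULL on allocation failure; release with free(). */
int *alloc_padded_matrix(size_t rows, size_t cols, size_t *stride_elems)
{
    size_t stride = padded_stride_bytes(cols * sizeof(int));
    assert(stride % sizeof(int) == 0);
    *stride_elems = stride / sizeof(int);
    return calloc(rows, stride);
}
```

With 4-byte ints, a 2048-column row is 8192 bytes and gets a stride of 8256 bytes, matching the "8192 bytes + 64" example, while a 2000-column row (8000 bytes) is left unchanged.
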