Code accessing continuous L1 cache memory slower than code accessing 3x non-continuous L1 memory
Question
This question is an update to this one: Accessing three static arrays is quicker than one static array containing 3x data? (http://stackoverflow.com/questions/23162398/accessing-three-static-arrays-is-quicker-than-one-static-array-containing-3x-dat) but I have too many updates so it made sense to start a new one.
I have 700 "items" and each item has three attributes - a, b and e. I loop through every item and use the three attributes to calculate values for two "global" attributes called c and d.
I used two techniques to implement this:
1) Three 700-element arrays, one array for each of the three attributes:
item0.a = array1[0]
item0.b = array2[0]
item0.e = array3[0]
2) One 2100-element array containing data for the three attributes consecutively:
item0.a = array[offset + 0]
item0.b = array[offset + 1]
item0.e = array[offset + 2]
Now the three item attributes a, b and e are used together within the loop - therefore it would make sense that if you store them in one array the performance should be better than with the three-array technique.
This is because in technique #1 each of the three attributes (for the same item) is located 700 x 4 bytes apart from the next. I would therefore have to retrieve three different cache lines per item.
However, with technique #2 the three attributes are adjacent in memory, so one cache line will give me all three attributes of an item. In fact, because a cache line is 64 bytes and I am using integers, there are 16 integers available per cache line load. This means I can process 16/3 = 5 items per cache line.
In other words, technique #1 should require at least 3 cache line loads to process 5 items, whereas technique #2 should require only one. I was expecting technique #2 to be a lot faster.
Results (Win 7 64, MSVC 2012 with desktop Ivy Bridge CPU):
- Technique #1: 3000 CPU cycles (average)
- Technique #2: 3400 CPU cycles (average)
Technique #1 C++:
unsigned int x;
unsigned int y;
double c = 0;
double d = 0;
bool data_for_all_items = true;
unsigned long long start = 0;
unsigned long long finish = 0;
unsigned int array1[700];
unsigned int array2[700];
unsigned int array3[700];
//I have left out code for simplicity. You can assume by now the arrays are populated.
start = __rdtscp(&x);
for(int i=0; i < 700; i++){
    unsigned int a = array1[i]; //Array 1
    unsigned int b = array2[i]; //Array 2
    data_for_all_items = data_for_all_items & (a != -1 & b != -1);
    unsigned int e = array3[i]; //Array 3
    c += (a * e);
    d += (b * e);
}
finish = __rdtscp(&y);
if(data_for_all_items){
    std::cout << finish - start << std::endl;
    //This line prevents compiler over-optimizing
    std::cout << c << d << std::endl;
}
Assembly for Technique #1:
unsigned int x;
unsigned int y;
start = __rdtscp(&x);
rdtscp
shl rdx,20h
lea r8,[this]
for(int i=0; i < 700; i++){
sub rdi,rsi
or rax,rdx
lea r10,[rsi+4]
mov ebx,46h
mov dword ptr [r8],ecx
mov r11,rax
sub r9,rsi
unsigned int a = array1[i];
unsigned int b = array2[i];
all_instr_found = all_instr_found & (a != -1 & b != -1);
cmp dword ptr [r10-4],0FFFFFFFFh
xorps xmm0,xmm0
setne cl
cmp dword ptr [r10+rdi-4],0FFFFFFFFh
setne al
and cl,al
and bpl,cl
unsigned int e = array3[i];
mov ecx,dword ptr [r10+r9-4]
c += (a * e);
mov eax,ecx
d += (b * e);
imul ecx,dword ptr [r10-4]
imul eax,dword ptr [r10+rdi-4]
cmp dword ptr [r10],0FFFFFFFFh
cvtsi2sd xmm0,rax
mov eax,ecx
setne cl
cmp dword ptr [rdi+r10],0FFFFFFFFh
addsd xmm6,xmm0
xorps xmm0,xmm0
cvtsi2sd xmm0,rax
setne al
and cl,al
and bpl,cl
mov ecx,dword ptr [r9+r10]
mov eax,ecx
addsd xmm7,xmm0
imul ecx,dword ptr [r10]
xorps xmm0,xmm0
imul eax,dword ptr [rdi+r10]
cmp dword ptr [r10+4],0FFFFFFFFh
cvtsi2sd xmm0,rax
mov eax,ecx
setne cl
cmp dword ptr [rdi+r10+4],0FFFFFFFFh
addsd xmm6,xmm0
xorps xmm0,xmm0
cvtsi2sd xmm0,rax
setne al
and cl,al
and bpl,cl
mov ecx,dword ptr [r9+r10+4]
mov eax,ecx
addsd xmm7,xmm0
imul ecx,dword ptr [r10+4]
xorps xmm0,xmm0
imul eax,dword ptr [rdi+r10+4]
cmp dword ptr [r10+8],0FFFFFFFFh
cvtsi2sd xmm0,rax
mov eax,ecx
setne cl
cmp dword ptr [rdi+r10+8],0FFFFFFFFh
addsd xmm6,xmm0
xorps xmm0,xmm0
cvtsi2sd xmm0,rax
setne al
and cl,al
and bpl,cl
mov ecx,dword ptr [r10+r9+8]
mov eax,ecx
addsd xmm7,xmm0
imul ecx,dword ptr [r10+8]
xorps xmm0,xmm0
imul eax,dword ptr [rdi+r10+8]
cmp dword ptr [r10+0Ch],0FFFFFFFFh
cvtsi2sd xmm0,rax
mov eax,ecx
setne cl
cmp dword ptr [r10+rdi+0Ch],0FFFFFFFFh
addsd xmm6,xmm0
xorps xmm0,xmm0
cvtsi2sd xmm0,rax
setne al
and cl,al
and bpl,cl
mov ecx,dword ptr [r10+r9+0Ch]
mov eax,ecx
d += (b * e);
addsd xmm7,xmm0
xorps xmm0,xmm0
imul eax,dword ptr [r10+rdi+0Ch]
imul ecx,dword ptr [r10+0Ch]
cvtsi2sd xmm0,rax
addsd xmm6,xmm0
cmp dword ptr [r10+10h],0FFFFFFFFh
mov eax,ecx
xorps xmm0,xmm0
setne cl
cmp dword ptr [r10+rdi+10h],0FFFFFFFFh
cvtsi2sd xmm0,rax
setne al
and cl,al
and bpl,cl
mov ecx,dword ptr [r10+r9+10h]
addsd xmm7,xmm0
mov eax,ecx
xorps xmm0,xmm0
imul ecx,dword ptr [r10+10h]
imul eax,dword ptr [r10+rdi+10h]
cmp dword ptr [r10+14h],0FFFFFFFFh
cvtsi2sd xmm0,rax
mov eax,ecx
setne cl
cmp dword ptr [r10+rdi+14h],0FFFFFFFFh
addsd xmm6,xmm0
xorps xmm0,xmm0
cvtsi2sd xmm0,rax
setne al
and cl,al
and bpl,cl
mov ecx,dword ptr [r10+r9+14h]
mov eax,ecx
addsd xmm7,xmm0
imul ecx,dword ptr [r10+14h]
xorps xmm0,xmm0
imul eax,dword ptr [r10+rdi+14h]
cmp dword ptr [r10+18h],0FFFFFFFFh
cvtsi2sd xmm0,rax
mov eax,ecx
setne cl
cmp dword ptr [r10+rdi+18h],0FFFFFFFFh
addsd xmm6,xmm0
xorps xmm0,xmm0
cvtsi2sd xmm0,rax
setne al
and cl,al
and bpl,cl
mov ecx,dword ptr [r10+r9+18h]
mov eax,ecx
addsd xmm7,xmm0
imul ecx,dword ptr [r10+18h]
xorps xmm0,xmm0
imul eax,dword ptr [r10+rdi+18h]
cmp dword ptr [r10+1Ch],0FFFFFFFFh
cvtsi2sd xmm0,rax
mov eax,ecx
setne cl
cmp dword ptr [r10+rdi+1Ch],0FFFFFFFFh
addsd xmm6,xmm0
xorps xmm0,xmm0
cvtsi2sd xmm0,rax
setne al
and cl,al
and bpl,cl
mov ecx,dword ptr [r10+r9+1Ch]
mov eax,ecx
addsd xmm7,xmm0
imul ecx,dword ptr [r10+1Ch]
xorps xmm0,xmm0
imul eax,dword ptr [r10+rdi+1Ch]
cmp dword ptr [r10+20h],0FFFFFFFFh
cvtsi2sd xmm0,rax
mov eax,ecx
setne cl
cmp dword ptr [r10+rdi+20h],0FFFFFFFFh
addsd xmm6,xmm0
xorps xmm0,xmm0
cvtsi2sd xmm0,rax
setne al
and cl,al
and bpl,cl
mov ecx,dword ptr [r10+r9+20h]
mov eax,ecx
addsd xmm7,xmm0
imul eax,dword ptr [r10+rdi+20h]
imul ecx,dword ptr [r10+20h]
xorps xmm0,xmm0
add r10,28h
cvtsi2sd xmm0,rax
mov eax,ecx
addsd xmm6,xmm0
xorps xmm0,xmm0
cvtsi2sd xmm0,rax
addsd xmm7,xmm0
dec rbx
jne 013FC6DA71h
}
finish = __rdtscp(&y);
rdtscp
shl rdx,20h
lea r8,[y]
or rax,rdx
mov dword ptr [r8],ecx
(It looks like the compiler was unrolling the loop five times for Technique #1)
Technique #2 C++:
unsigned int x;
unsigned int y;
unsigned short pos = 0;
double c = 0;
double d = 0;
bool data_for_all_items = true;
unsigned long long start = 0;
unsigned long long finish = 0;
unsigned int array[2100];
//I have left out code for simplicity. You can assume by now the array is populated.
start = __rdtscp(&x);
while(pos < 2100){
    unsigned int a = array[pos + 0];
    unsigned int b = array[pos + 1];
    data_for_all_items = data_for_all_items & (a != -1 & b != -1);
    unsigned int e = array[pos + 2];
    c += (a * e);
    d += (b * e);
    pos = pos + 3;
}
finish = __rdtscp(&y);
if(data_for_all_items){
    std::cout << finish - start << std::endl;
    //This line prevents compiler over-optimizing
    std::cout << c << d << std::endl;
}
Assembly for Technique #2:
start = __rdtscp(&x);
rdtscp
shl rdx,20h
lea r9,[this]
mov r11d,2BCh
or rax,rdx
mov dword ptr [r9],ecx
add r10,8
mov rbx,rax
nop word ptr [rax+rax]
while(pos < 2100){
unsigned int a = array[pos];
unsigned int b = array[pos + 1];
all_instr_found = all_instr_found & (a != -1 & b != -1);
cmp dword ptr [r10-4],0FFFFFFFFh
xorps xmm0,xmm0
setne dl
cmp dword ptr [r10-8],0FFFFFFFFh
setne cl
unsigned int e = array[pos + 2];
c += (a * e);
d += (b * e);
pos = pos + 3;
add r10,0Ch
and dl,cl
mov ecx,dword ptr [r10-0Ch]
mov eax,ecx
and r15b,dl
imul ecx,dword ptr [r10-10h]
imul eax,dword ptr [r10-14h]
cvtsi2sd xmm0,rax
mov eax,ecx
addsd xmm6,xmm0
xorps xmm0,xmm0
cvtsi2sd xmm0,rax
addsd xmm7,xmm0
dec r11
jne 013F21DA30h
}
finish = __rdtscp(&y);
rdtscp
shl rdx,20h
lea r8,[y]
or rax,rdx
mov dword ptr [r8],ecx
So no loop unrolling? That must be the reason, right? Well, I manually unrolled the loop using this C++:
while(pos < 2100){
    unsigned int a1 = array[pos + 0];
    unsigned int b1 = array[pos + 1];
    unsigned int e1 = array[pos + 2];
    unsigned int a2 = array[pos + 3];
    unsigned int b2 = array[pos + 4];
    unsigned int e2 = array[pos + 5];
    unsigned int a3 = array[pos + 6];
    unsigned int b3 = array[pos + 7];
    unsigned int e3 = array[pos + 8];
    unsigned int a4 = array[pos + 9];
    unsigned int b4 = array[pos + 10];
    unsigned int e4 = array[pos + 11];
    unsigned int a5 = array[pos + 12];
    unsigned int b5 = array[pos + 13];
    unsigned int e5 = array[pos + 14];
    c += (a1 * e1) + (a2 * e2) + (a3 * e3) + (a4 * e4) + (a5 * e5);
    d += (b1 * e1) + (b2 * e2) + (b3 * e3) + (b4 * e4) + (b5 * e5);
    data_for_all_items = data_for_all_items & (a1 != -1 & b1 != -1) & (a2 != -1 & b2 != -1) & (a3 != -1 & b3 != -1) & (a4 != -1 & b4 != -1) & (a5 != -1 & b5 != -1);
    pos += 15;
}
if(data_for_all_items){
    std::cout << finish - start << std::endl;
    //This line prevents compiler over-optimizing
    std::cout << c << d << std::endl;
}
Which generates this assembly:
start = __rdtscp(&x);
rdtscp
lea r9,[x]
shl rdx,20h
mov qword ptr [rsp+108h],8Ch
mov dword ptr [r9],ecx
lea r9,[r8+8]
or rax,rdx
mov qword ptr [y],r9
mov qword ptr [rsp+20h],rax
nop
unsigned int a1 = array[pos + 0];
unsigned int b1 = array[pos + 1];
unsigned int e1 = array[pos + 2];
mov edi,dword ptr [r9]
unsigned int a2 = array[pos + 3];
mov r13d,dword ptr [r9+4]
unsigned int b2 = array[pos + 4];
mov r12d,dword ptr [r9+8]
xorps xmm0,xmm0
unsigned int e2 = array[pos + 5];
mov r10d,dword ptr [r9+0Ch]
unsigned int a3 = array[pos + 6];
mov r15d,dword ptr [r9+10h]
unsigned int b3 = array[pos + 7];
mov r14d,dword ptr [r9+14h]
unsigned int e3 = array[pos + 8];
mov r8d,dword ptr [r9+18h]
unsigned int a4 = array[pos + 9];
mov ebp,dword ptr [r9+1Ch]
unsigned int b4 = array[pos + 10];
mov esi,dword ptr [r9+20h]
unsigned int e4 = array[pos + 11];
mov edx,dword ptr [r9+24h]
unsigned int a5 = array[pos + 12];
mov ebx,dword ptr [r9+28h]
unsigned int b5 = array[pos + 13];
mov r11d,dword ptr [r9+2Ch]
unsigned int e5 = array[pos + 14];
mov r9d,dword ptr [r9+30h]
c += (a1 * e1) + (a2 * e2) + (a3 * e3) + (a4 * e4) + (a5 * e5);
mov eax,edx
mov dword ptr [x],r13d
d += (b1 * e1) + (b2 * e2) + (b3 * e3) + (b4 * e4) + (b5 * e5);
imul edx,esi
imul eax,ebp
mov ecx,r9d
imul r9d,r11d
imul ecx,ebx
add r9d,edx
add ecx,eax
mov eax,r8d
imul r8d,r14d
imul eax,r15d
add r9d,r8d
all_instr_found = all_instr_found & (a1 != -1 & b1 != -1) & (a2 != -1 & b2 != -1) & (a3 != -1 & b3 != -1) & (a4 != -1 & b4 != -1) & (a5 != -1 & b5 != -1);
movzx r8d,byte ptr [this]
add ecx,eax
mov eax,r10d
imul r10d,r12d
add r9d,r10d
imul eax,r13d
mov r13,qword ptr [y]
add ecx,eax
mov eax,edi
imul eax,dword ptr [r13-8]
add eax,ecx
cvtsi2sd xmm0,rax
mov rax,r13
mov edx,dword ptr [rax-4]
imul edi,edx
cmp r11d,0FFFFFFFFh
addsd xmm6,xmm0
setne cl
all_instr_found = all_instr_found & (a1 != -1 & b1 != -1) & (a2 != -1 & b2 != -1) & (a3 != -1 & b3 != -1) & (a4 != -1 & b4 != -1) & (a5 != -1 & b5 != -1);
cmp ebx,0FFFFFFFFh
lea eax,[rdi+r9]
mov r9,r13
xorps xmm0,xmm0
cvtsi2sd xmm0,rax
setne al
and cl,al
cmp esi,0FFFFFFFFh
setne al
and cl,al
cmp ebp,0FFFFFFFFh
setne al
and cl,al
cmp r14d,0FFFFFFFFh
addsd xmm7,xmm0
setne al
and cl,al
cmp r15d,0FFFFFFFFh
setne al
and cl,al
cmp r12d,0FFFFFFFFh
setne al
and cl,al
cmp dword ptr [x],0FFFFFFFFh
setne al
and cl,al
cmp edx,0FFFFFFFFh
setne al
and cl,al
cmp dword ptr [r13-8],0FFFFFFFFh
setne al
and cl,al
and r8b,cl
pos += 15;
add r9,3Ch
mov qword ptr [y],r9
mov byte ptr [this],r8b
while(pos < 2100){
dec qword ptr [rsp+108h]
jne 013F31DA50h
}
finish = __rdtscp(&y);
rdtscp
shl rdx,20h
lea r9,[y]
mov dword ptr [r9],ecx
mov r15,qword ptr [rsp+0A8h]
mov r13,qword ptr [rsp+0B0h]
mov r12,qword ptr [rsp+0B8h]
or rax,rdx
The manually unrolled loop takes 3500 CPU cycles on average, slower than the original rolled version at 3400 CPU cycles.
Can somebody please explain why code that should perform 3x fewer cache line fetches runs slower, and whether there is a way to help the compiler or otherwise make this code faster than technique #1?
Answer
Use some prefetching in #2 and you might just reach that lofty performance goal.
For gcc, use inline assembler and the 3DNow! prefetch instructions (just about all that is left of that extension nowadays).
Your example number one probably runs faster because making 3 independent memory requests (possible thanks to out-of-order execution) means better throughput on the memory bus (less idle time).
This is a nice example of micro-optimisation not only making things harder for you, but being self-defeating.
Processor designers and manufacturers, as well as compiler writers, spend truly amazing amounts of money, time and talent optimizing for the common case, so second-guessing them is hard.