热点在for循环 [英] Hotspot in a for loop
问题描述
我试图优化这个code。
I am trying to optimize this code.
static
lvh_distance levenshtein_distance( const std::string & s1, const std::string & s2 )
{
const size_t len1 = s1.size(), len2 = s2.size();
std::vector<unsigned int> col( len2+1 ), prevCol( len2+1 );
const size_t prevColSize = prevCol.size();
for( unsigned int i = 0; i < prevColSize; i++ )
prevCol[i] = i;
for( unsigned int i = 0, j; i < len1; ++i )
{
col[0] = i+1;
const char s1i = s1[i];
for( j = 0; j < len2; ++j )
{
const auto minPrev = 1 + std::min( col[j], prevCol[1 + j] );
col[j+1] = std::min( minPrev, prevCol[j] + ( s1i == s2[j] ? 0 : 1 ) );
}
col.swap( prevCol );
}
return prevCol[len2];
}
英特尔VTune显示的处理器时间大约一半在第二花在为
指令,不是的 2号线内的为
的循环的。当我解开汇编源,我可以看到为
C ++指令已被翻译成数运算codeS,其中3似乎被吞噬的CPU时间:
Intel VTune shows that roughly half of the processor time is spent in the second for
instruction, not the 2 lines inside the for
loop. When I unroll the assembly source, I can see that the for
c++ instruction has been translated into a number of opcodes, 3 of which seem to be devouring the CPU time:
Code Location Source Line Assembly CPU Time
Block 14: [Unknown]
0x420c00 31 movq (%r12), %rcx 19.969ms
0x420c04 30 add $0x1, %r11d [Unknown]
0x420c08 32 test %rbx, %rbx [Unknown]
0x420c0b 30 movl %r11d, (%r8) [Unknown]
0x420c0e 31 movzxb (%rcx,%rdx,1), %r9d 19.964ms
0x420c13 32 jz 0x420c53 <Block 17> [Unknown]
Block 15: [Unknown]
0x420c15 32 movq (%rbp), %r10 [Unknown]
0x420c19 32 mov %r11d, %edx [Unknown]
0x420c1c 32 xor %ecx, %ecx 39.928ms
0x420c1e 32 xor %edi, %edi [Unknown]
Block 16: [Unknown]
0x420c20 34 add $0x1, %edi 29.994ms
0x420c23 34 mov %edi, %esi 30.956ms
0x420c25 34 movl (%rax,%rsi,4), %r15d 180.659ms
0x420c29 34 cmp %r15d, %edx 39.896ms
0x420c2c 34 cmovbe %edx, %r15d 19.951ms
0x420c30 35 xor %edx, %edx 460.772ms
0x420c32 34 add $0x1, %r15d 19.946ms
0x420c36 35 cmpb (%r10,%rcx,1), %r9b 169.659ms
0x420c3a 35 setnz %dl 49.815ms
0x420c3d 35 addl (%rax,%rcx,4), %edx [Unknown]
0x420c40 32 mov %rsi, %rcx 210.615ms <------------------
0x420c43 32 cmp %edx, %r15d 29.936ms
0x420c46 32 cmovbe %r15d, %edx 29.938ms
0x420c4a 32 cmp %rsi, %rbx 558.298ms <-------------------
0x420c4d 35 movl %edx, (%r8,%rsi,4) 19.965ms
0x420c51 32 jnbe 0x420c20 <Block 16> 200.625ms <-------------------
我不理解简单的举动,比较可能是耗时的。
I don’t understand how a simple move and compare could be that time-consuming.
推荐答案
由于所有现代的CPU使用外的顺序和推测执行探查不能告诉你确切的最耗时的指令。这是不寻常,从最耗时的指令看到最大的测量时间一两行了。
Profiler cannot show you the exact most time-consuming instructions because all modern CPUs use out-of-order and speculative execution. It is not unusual to see the largest measured time one or two lines away from the most time-consuming instruction.
正如预期的那样,最耗时的说明这里有 cmovbe
(实施的std ::分钟
)。你可以看到他们附近最大时间:460.772ms和558.298ms。 cmovbe
正在消耗指令,因为他们通常有较大的延迟和preceding说明更多的依赖关系最多的时间。
As expected, the most time consuming instructions here are cmovbe
(implementing std::min
). You can see the largest times near them: 460.772ms and 558.298ms. cmovbe
are the most time consuming instructions because usually they have large latency and more dependencies on preceding instructions.
这篇关于热点在for循环的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!