SSE4.2 and STTNI: PcmpEstrM is twice as slow as PcmpIstrM, is it true?
Question
I'm experimenting with the SSE4.2 STTNI instructions and got a strange result: PcmpEstrM (which works with explicit-length strings) runs twice as slow as PcmpIstrM (implicit-length strings).
- On my i7 3610QM the difference is 2366.2 ms vs. 1202.3 ms, i.e. 97%.
- On an i5 3470 the difference is smaller but still significant: 3206.2 ms vs. 2623.2 ms, i.e. 22%.
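The percentages above follow from simple arithmetic on the timings. A minimal C++ sketch of that calculation; the function names are hypothetical, and the iteration count used in the note below is taken from the test program further down (cCycles × cIterations = 1000 × 1,000,000 instruction executions per timer):

```cpp
#include <cassert>

// How much slower (in percent) the slow run is relative to the fast run.
double percent_slower(double slow_ms, double fast_ms) {
    assert(fast_ms > 0.0);
    return (slow_ms / fast_ms - 1.0) * 100.0;
}

// Average wall-clock time per loop iteration, in nanoseconds.
double ns_per_iteration(double total_ms, long long iterations) {
    assert(iterations > 0);
    return total_ms * 1.0e6 / static_cast<double>(iterations);
}
```

With the numbers above, percent_slower(2366.2, 1202.3) is about 96.8 and percent_slower(3206.2, 2623.2) is about 22.2. ns_per_iteration(1202.3, 1000000000LL) is about 1.2 ns per loop iteration for the faster run on the i7.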
Both are "Ivy Bridge" parts, so it is strange that the gap differs so much between them (at least I can't see any relevant technical differences in their specs: http://www.cpu-world.com/Compare_CPUs/Intel_AW8063801013511,Intel_CM8063701093302/).
The Intel 64 and IA-32 Architectures Optimization Reference Manual lists the same throughput (11) and latency (3) for both PcmpEstrM and PcmpIstrM, so I expected similar performance from both.
Q: Is the difference I measured expected by design, or am I using these instructions the wrong way?
Below is my dummy test scenario (VS 2012). The logic is simple: scan 16 MB of text to find a matching character. Since neither the haystack nor the needle string contains a zero terminator, I expect the E and I variants to perform similarly.
PS: I tried posting this question on Intel's developer forum, but it was flagged as spam :(
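#include "stdafx.h"
#include <windows.h>
#include <stdio.h>   // printf, sprintf_s
#include <string.h>  // memset
#include <tchar.h>   // _tmain, _TCHAR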
#define BEGIN_TIMER(NAME) \
    { \
        LARGE_INTEGER __freq; \
        LARGE_INTEGER __t0; \
        LARGE_INTEGER __t1; \
        double __tms; \
        const char* __tname = NAME; \
        char __tbuf[0xff]; \
        \
        QueryPerformanceFrequency(&__freq); \
        QueryPerformanceCounter(&__t0);

#define END_TIMER() \
        QueryPerformanceCounter(&__t1); \
        __tms = (__t1.QuadPart - __t0.QuadPart) * 1000.0 / __freq.QuadPart; \
        sprintf_s(__tbuf, sizeof(__tbuf), "%-32s = %6.1f ms\n", __tname, __tms ); \
        OutputDebugStringA(__tbuf); \
        printf(__tbuf); \
    }

// 4.1.3 Aggregation Operation
#define SSE42_AGGOP_BITBASE 2
#define SSE42_AGGOP_EQUAL_ANY (00b << SSE42_AGGOP_BITBASE)
#define SSE42_AGGOP_RANGES (01b << SSE42_AGGOP_BITBASE)
#define SSE42_AGGOP_EQUAL_EACH (10b << SSE42_AGGOP_BITBASE)
#define SSE42_AGGOP_EQUAL_ORDERED (11b << SSE42_AGGOP_BITBASE)

int _tmain(int argc, _TCHAR* argv[])
{
    int cIterations = 1000000;
    int cCycles = 1000;
    int cchData = 16 * cIterations;
    char* testdata = new char[cchData + 16];

    // Haystack: all '*', a single '+' in the last byte, terminator after the scanned region
    memset(testdata, '*', cchData);
    testdata[cchData - 1] = '+';
    testdata[cchData] = '\0';

    BEGIN_TIMER("PcmpIstrM") {
        for( int i = 0; i < cCycles; i++ ) {
            __asm {
                push ecx
                push edx
                push ebx
                mov edi, testdata
                mov ebx, cIterations
                mov al, '+'
                mov ah, al
                movd xmm1, eax           // low word now holds the pattern
                pshuflw xmm1, xmm1, 0    // broadcast it across the low qword
                movlhps xmm1, xmm1       // ... and copy it to the high qword
            loop_pcmpistri:
                PcmpIstrM xmm1, [edi], SSE42_AGGOP_EQUAL_EACH
                add edi, 16
                sub ebx, 1
                jnz loop_pcmpistri
                pop ebx
                pop edx
                pop ecx
            }
        }
    } END_TIMER();

    BEGIN_TIMER("PcmpEstrM") {
        for( int i = 0; i < cCycles; i++ ) {
            __asm {
                push ecx
                push edx
                push ebx
                mov edi, testdata
                mov ebx, cIterations
                mov al, '+'
                mov ah, al
                movd xmm1, eax           // low word now holds the pattern
                pshuflw xmm1, xmm1, 0    // broadcast it across the low qword
                movlhps xmm1, xmm1       // ... and copy it to the high qword
                mov eax, 15              // explicit length of xmm1
                mov edx, 15              // explicit length of [edi]
            loop_pcmpestri:
                PcmpEstrM xmm1, [edi], SSE42_AGGOP_EQUAL_EACH
                add edi, 16
                sub ebx, 1
                jnz loop_pcmpestri
                pop ebx
                pop edx
                pop ecx
            }
        }
    } END_TIMER();

    delete[] testdata;
    return 0;
}
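To make the "both should behave the same" expectation concrete, here is a scalar C++ sketch of what the two instructions compute in "equal each" mode with unsigned byte elements, default polarity and bit-mask output, as far as I understand the instruction description. It is only a model, not the hardware, and every function name in it is hypothetical. The one difference it captures is where the lengths come from: PcmpIstrM stops at the first zero byte of each operand, while PcmpEstrM takes the absolute values of EAX and EDX, saturated to 16.

```cpp
#include <cassert>
#include <cstdint>

// Implicit length: index of the first zero byte, or 16 if there is none.
int implicit_length(const std::uint8_t* v) {
    for (int i = 0; i < 16; ++i)
        if (v[i] == 0) return i;
    return 16;
}

// Explicit length: absolute value of the register, saturated to 16 bytes.
int explicit_length(std::int32_t reg) {
    std::int64_t n = reg;
    if (n < 0) n = -n;
    return n > 16 ? 16 : static_cast<int>(n);
}

// "Equal each": bit i is set when a[i] == b[i]; past the end of both
// strings the bit is forced to 1, past the end of only one it is forced to 0.
std::uint16_t equal_each_mask(const std::uint8_t* a, int len_a,
                              const std::uint8_t* b, int len_b) {
    assert(a != nullptr && b != nullptr);
    std::uint16_t mask = 0;
    for (int i = 0; i < 16; ++i) {
        const bool a_valid = i < len_a;
        const bool b_valid = i < len_b;
        const bool bit = (a_valid && b_valid) ? (a[i] == b[i])
                                              : (!a_valid && !b_valid);
        if (bit) mask = static_cast<std::uint16_t>(mask | (1u << i));
    }
    return mask;
}

// Model of PcmpIstrM: lengths are derived from the data itself.
std::uint16_t model_pcmpistrm(const std::uint8_t* a, const std::uint8_t* b) {
    return equal_each_mask(a, implicit_length(a), b, implicit_length(b));
}

// Model of PcmpEstrM: lengths come from EAX (first operand) and EDX (second).
std::uint16_t model_pcmpestrm(const std::uint8_t* a, std::int32_t eax,
                              const std::uint8_t* b, std::int32_t edx) {
    return equal_each_mask(a, explicit_length(eax), b, explicit_length(edx));
}
```

Under this model, the two variants agree on zero-free 16-byte blocks only when both explicit lengths are 16. With the lengths of 15 used in the second loop above, byte 15 lies past the end of both strings, so its mask bit is forced to 1 even for a block consisting only of '*'. That does not affect the timing question, but it suggests the two loops are not computing exactly the same thing.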
Recommended answer
According to Agner Fog's instruction tables, pcmpestrm takes 8 µops, whereas pcmpistrm takes 3 µops on most architectures. This should explain the performance difference you observe. Consider rewriting your code so that you can use pcmpistrm instead of pcmpestrm if possible.
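As a rough illustration of that rewrite, the sketch below uses the compiler intrinsic for pcmpistrm rather than inline assembly. It assumes a compiler that defines __SSE4_2__ when SSE4.2 code generation is enabled (GCC and Clang do with -msse4.2; other compilers take the scalar fallback here), and the helper names are hypothetical. The implicit-length form is only a drop-in replacement when the scanned bytes contain no zero, which is what the first helper checks.

```cpp
#include <cstddef>
#include <cstring>
#if defined(__SSE4_2__)
#include <nmmintrin.h>  // SSE4.2 string intrinsics
#endif

// Hypothetical helper: true when [data, data + len) holds no zero byte,
// i.e. implicit-length comparison would see every 16-byte block as full.
bool can_use_implicit_length(const char* data, std::size_t len) {
    return std::memchr(data, 0, len) == nullptr;
}

// Bit i of the result is set when needle16[i] == block16[i].
// Precondition: both 16-byte blocks are free of zero bytes.
unsigned equal_each_mask16(const char* needle16, const char* block16) {
#if defined(__SSE4_2__)
    const __m128i a = _mm_loadu_si128(reinterpret_cast<const __m128i*>(needle16));
    const __m128i b = _mm_loadu_si128(reinterpret_cast<const __m128i*>(block16));
    // Implicit-length compare; the bit mask lands in the low 16 bits.
    const __m128i m = _mm_cmpistrm(
        a, b, _SIDD_UBYTE_OPS | _SIDD_CMP_EQUAL_EACH | _SIDD_BIT_MASK);
    return static_cast<unsigned>(_mm_cvtsi128_si32(m)) & 0xFFFFu;
#else
    // Scalar fallback with the same result for zero-free blocks.
    unsigned mask = 0;
    for (int i = 0; i < 16; ++i)
        if (needle16[i] == block16[i]) mask |= 1u << i;
    return mask;
#endif
}
```

For the test data in the question, can_use_implicit_length(testdata, cchData) would hold, since the only zero byte sits just past the scanned region. Whether the intrinsic version matches the inline-assembly timings is something to measure, not assume.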