SSE4.2 and STTNI: is PcmpEstrM really twice as slow as PcmpIstrM?

Views: 274   Posted: 2020/9/27 4:34:49   Tags: c++, performance, sse, sse4


Question

I'm experimenting with the SSE4.2 STTNI instructions and got a strange result: PcmpEstrM (which works with explicit-length strings) runs about twice as slow as PcmpIstrM (implicit-length strings).

On my i7 3610QM the difference is 2366.2 ms vs. 1202.3 ms, i.e. 97% slower.

On an i5 3470 the difference is smaller but still significant: 3206.2 ms vs. 2623.2 ms, i.e. 22% slower.

Both are "Ivy Bridge" parts, so it is strange that the gap differs so much between them (at least I can't see any relevant technical differences in their specs: http://www.cpu-world.com/Compare_CPUs/Intel_AW8063801013511,Intel_CM8063701093302/).

The Intel 64 and IA-32 Architectures Optimization Reference Manual lists the same throughput (11) and latency (3) for both PcmpEstrM and PcmpIstrM, so I expected similar performance from both.

Q: Is the difference I'm seeing expected by design, or am I using these instructions the wrong way?

Below is my dummy test scenario (VS 2012). The logic is simple: scan 16 MB of text for a matching character. Since neither the haystack nor the needle string contains a zero terminator, I expect the E and I variants to perform similarly.

PS: I tried posting this question on Intel's developer forum, but it was flagged as spam :(

    #include "stdafx.h"
    #include <windows.h>
    #include <stdio.h>
    #include <string.h>

    #define BEGIN_TIMER(NAME) \
        { \
            LARGE_INTEGER __freq; \
            LARGE_INTEGER __t0; \
            LARGE_INTEGER __t1; \
            double __tms; \
            const char* __tname = NAME; \
            char __tbuf[0xff]; \
            \
            QueryPerformanceFrequency(&__freq); \
            QueryPerformanceCounter(&__t0);

    #define END_TIMER() \
            QueryPerformanceCounter(&__t1); \
            __tms = (__t1.QuadPart - __t0.QuadPart) * 1000.0 / __freq.QuadPart; \
            sprintf_s(__tbuf, sizeof(__tbuf), "%-32s = %6.1f ms\n", __tname, __tms ); \
            OutputDebugStringA(__tbuf); \
            printf("%s", __tbuf); \
        }

    // 4.1.3 Aggregation Operation
    // (the binary literals below are only valid inside MSVC __asm blocks)
    #define SSE42_AGGOP_BITBASE         2
    #define SSE42_AGGOP_EQUAL_ANY       (00b << SSE42_AGGOP_BITBASE)
    #define SSE42_AGGOP_RANGES          (01b << SSE42_AGGOP_BITBASE)
    #define SSE42_AGGOP_EQUAL_EACH      (10b << SSE42_AGGOP_BITBASE)
    #define SSE42_AGGOP_EQUAL_ORDERED   (11b << SSE42_AGGOP_BITBASE)

    int _tmain(int argc, _TCHAR* argv[])
    {
        int cIterations = 1000000;          // 16-byte blocks per pass
        int cCycles = 1000;                 // passes over the buffer
        int cchData = 16 * cIterations;     // 16 MB of text

        // Haystack: all '*', with a single '+' in the last position.
        char* testdata = new char[cchData + 16];
        memset(testdata, '*', cchData);
        testdata[cchData - 1] = '+';
        testdata[cchData] = '\0';

        BEGIN_TIMER("PcmpIstrM") {
            for( int i = 0; i < cCycles; i++ ) {
                __asm {
                    push ecx
                    push edx
                    push ebx
                    mov edi, testdata
                    mov ebx, cIterations
                    mov al, '+'
                    mov ah, al
                    movd xmm1, eax          // fill low word with pattern
                    pshuflw xmm1, xmm1, 0   // fill low qword with pattern
                    movlhps xmm1, xmm1      // ... and copy it to the high qword
                loop_pcmpistrm:
                    PcmpIstrM xmm1, [edi], SSE42_AGGOP_EQUAL_EACH
                    add edi, 16
                    sub ebx, 1
                    jnz loop_pcmpistrm
                    pop ebx
                    pop edx
                    pop ecx
                }
            }
        } END_TIMER();

        BEGIN_TIMER("PcmpEstrM") {
            for( int i = 0; i < cCycles; i++ ) {
                __asm {
                    push ecx
                    push edx
                    push ebx
                    mov edi, testdata
                    mov ebx, cIterations
                    mov al, '+'
                    mov ah, al
                    movd xmm1, eax          // fill low word with pattern
                    pshuflw xmm1, xmm1, 0   // fill low qword with pattern
                    movlhps xmm1, xmm1      // ... and copy it to the high qword
                    mov eax, 15             // explicit length of the first operand
                    mov edx, 15             // explicit length of the second operand
                loop_pcmpestrm:
                    PcmpEstrM xmm1, [edi], SSE42_AGGOP_EQUAL_EACH
                    add edi, 16
                    sub ebx, 1
                    jnz loop_pcmpestrm
                    pop ebx
                    pop edx
                    pop ecx
                }
            }
        } END_TIMER();

        delete[] testdata;
        return 0;
    }
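The expectation that both variants should behave alike rests on how the two instructions decide where a string ends. A minimal sketch with compiler intrinsics instead of inline assembly may make that concrete. The helper names `mask_implicit` and `mask_explicit` and the `SSE42_FN` macro are made up for illustration, the needle is assumed to be non-zero, and explicit lengths of 16 are used here (the test above passes 15):

```cpp
#include <nmmintrin.h>  // SSE4.2 intrinsics: _mm_cmpistrm, _mm_cmpestrm

#if defined(__GNUC__) || defined(__clang__)
#define SSE42_FN __attribute__((target("sse4.2")))  // enable SSE4.2 for one function
#else
#define SSE42_FN  // MSVC needs no per-function switch
#endif

// Unsigned bytes, "equal each" aggregation, result returned as a bit mask.
constexpr int kEqualEach = _SIDD_UBYTE_OPS | _SIDD_CMP_EQUAL_EACH | _SIDD_BIT_MASK;

// Hypothetical helper: implicit-length compare (PCMPISTRM).
// Each operand ends at its first zero byte, if it has one.
SSE42_FN int mask_implicit(const char* block, char needle) {
    const __m128i pattern = _mm_set1_epi8(needle);
    const __m128i data = _mm_loadu_si128(reinterpret_cast<const __m128i*>(block));
    return _mm_cvtsi128_si32(_mm_cmpistrm(pattern, data, kEqualEach)) & 0xFFFF;
}

// Hypothetical helper: explicit-length compare (PCMPESTRM).
// Both lengths are passed in, so zero bytes are ordinary data.
SSE42_FN int mask_explicit(const char* block, char needle) {
    const __m128i pattern = _mm_set1_epi8(needle);
    const __m128i data = _mm_loadu_si128(reinterpret_cast<const __m128i*>(block));
    return _mm_cvtsi128_si32(_mm_cmpestrm(pattern, 16, data, 16, kEqualEach)) & 0xFFFF;
}
```

For a 16-byte block without zero bytes the two helpers return the same mask. If the block contains a zero byte, the implicit-length version treats everything from that byte onward as past the end of the string, so matches beyond it are not reported.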

Recommended answer

According to Agner Fog's instruction tables, pcmpestrm takes 8 µops, whereas pcmpistrm takes 3 µops on most architectures. This should explain the performance difference you observe. Consider rewriting your code so that you can use pcmpistrm instead of pcmpestrm where possible.
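One way to follow that advice, sketched here under stated assumptions rather than as a definitive implementation: when the buffer length is known and the data contains no zero bytes, full 16-byte blocks can go through the implicit-length instruction and only the tail needs separate handling. The function name `find_char_istrm` and the `SSE42_FN` macro are hypothetical, and the needle is assumed to be non-zero.

```cpp
#include <cstddef>
#include <nmmintrin.h>  // SSE4.2 intrinsics: _mm_cmpistrm

#if defined(__GNUC__) || defined(__clang__)
#define SSE42_FN __attribute__((target("sse4.2")))  // enable SSE4.2 for one function
#else
#define SSE42_FN  // MSVC needs no per-function switch
#endif

// Hypothetical helper: index of the first `needle` in text[0..len), or len if absent.
// Assumes needle != '\0' and that text[0..len) contains no zero bytes.
SSE42_FN std::size_t find_char_istrm(const char* text, std::size_t len, char needle) {
    constexpr int mode = _SIDD_UBYTE_OPS | _SIDD_CMP_EQUAL_EACH | _SIDD_BIT_MASK;
    const __m128i pattern = _mm_set1_epi8(needle);

    std::size_t i = 0;
    // Full 16-byte blocks: implicit-length compare, one mask bit per byte.
    for (; i + 16 <= len; i += 16) {
        const __m128i data = _mm_loadu_si128(reinterpret_cast<const __m128i*>(text + i));
        const int mask = _mm_cvtsi128_si32(_mm_cmpistrm(pattern, data, mode)) & 0xFFFF;
        if (mask != 0) {
            int bit = 0;
            while (((mask >> bit) & 1) == 0) ++bit;  // lowest set bit marks the first match
            return i + static_cast<std::size_t>(bit);
        }
    }
    // Tail shorter than 16 bytes: plain scalar scan, so nothing is read past len.
    for (; i < len; ++i) {
        if (text[i] == needle) return i;
    }
    return len;
}
```

Calling `find_char_istrm(buf, n, '+')` returns `n` when the character is absent. If the data may contain zero bytes, this shortcut is not safe and the explicit-length form (or a different instruction) is needed.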
