A faster integer SSE unaligned load that's rarely used


Question

I would like to know more about the _mm_lddqu_si128 intrinsic (the lddqu instruction, available since SSE3), particularly compared with the _mm_loadu_si128 intrinsic (the movdqu instruction, available since SSE2).

I only discovered _mm_lddqu_si128 today. The Intel Intrinsics Guide says

this intrinsic may perform better than _mm_loadu_si128 when the data crosses a cache line boundary

and a comment said

will perform better under certain circumstances, but never perform worse.

So why is it not used more (SSE3 is a pretty low bar since all Core2 processors have it)? Why may it perform better when data crosses a cache line? Is lddqu only possibly better on a certain subset of processors, e.g. before Nehalem?

I realize I could probably find the answer by reading through an Intel manual, but I think this question may be interesting to other people.

Recommended answer

lddqu used a different strategy than movdqu on P4, but runs identically on all other CPUs that support it. There's no particular downside (since SSE3 instructions don't take any extra bytes of machine code, and are fairly widely supported even by AMD at this point), but no upside at all unless you care about P4.
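To make the comparison concrete, here is a minimal sketch showing that the two intrinsics are drop-in replacements for each other as far as the loaded value goes. It assumes an x86-64 target and GCC or Clang (the `target("sse3")` attribute is a compiler extension); `loads_agree` is a hypothetical helper name, not something from the original posts.

```c
#include <emmintrin.h>  /* SSE2: _mm_loadu_si128 -> movdqu */
#include <pmmintrin.h>  /* SSE3: _mm_lddqu_si128 -> lddqu */
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: returns 1 if movdqu and lddqu read the same
 * 16 bytes from p. The caller must make 16 bytes readable at p.
 * target("sse3") enables lddqu for this function only (GCC/Clang). */
__attribute__((target("sse3")))
static int loads_agree(const uint8_t *p) {
    __m128i a = _mm_loadu_si128((const __m128i *)p);
    __m128i b = _mm_lddqu_si128((const __m128i *)p);
    return memcmp(&a, &b, sizeof a) == 0;
}
```

Any difference between the two is purely one of timing on the few microarchitectures that implement lddqu specially; the sketch says nothing about performance.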

Dark Shikari (one of the x264 video encoder lead developers, responsible for a lot of SSE speedups) went into detail about it in a blog post in 2008. This is an archive.org link since the original is offline, but there's a lot of good stuff in his blog.

The most interesting point he makes is that Core2 still has slow unaligned loads, where manually doing two aligned loads and a palignr can be faster, but palignr is only available with an immediate shift count. Since Core2 runs lddqu the same as movdqu, it doesn't help.
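A rough sketch of the two-aligned-loads-plus-palignr trick, assuming x86-64 with GCC or Clang. The function name and the fixed offset of 5 are made up for illustration; the point is that the byte offset has to be a compile-time constant because palignr takes an immediate.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2: _mm_load_si128 */
#include <tmmintrin.h>  /* SSSE3: _mm_alignr_epi8 -> palignr */
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: reads the 16 bytes at aligned_base + 5 using two
 * aligned loads and one palignr. aligned_base must be 16-byte aligned
 * and 32 bytes must be readable from it. */
__attribute__((target("ssse3")))
static __m128i load_offset5_palignr(const uint8_t *aligned_base) {
    assert(((uintptr_t)aligned_base & 15) == 0);
    __m128i lo = _mm_load_si128((const __m128i *)aligned_base);
    __m128i hi = _mm_load_si128((const __m128i *)(aligned_base + 16));
    /* Concatenate hi:lo, shift right by 5 bytes, keep the low 16 bytes.
     * The count must be an immediate, so it cannot come from a variable. */
    return _mm_alignr_epi8(hi, lo, 5);
}
```

Code that needs a run-time offset typically ends up with one specialized copy of the loop per offset, which is a large part of why this trick only pays off in hot loops.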

Apparently Core1 does implement lddqu specially, so it's not just P4 after all.

This Intel blog post about the history of lddqu/movdqu (which I found in 2 seconds by googling "lddqu vs movdqu", /scold @Zboson) explains:

(on P4 only): The instruction works by loading a 32-byte block aligned on a 16-byte boundary, extracting the 16 bytes corresponding to the unaligned access.

Because the instruction loads more bytes than requested, some usage restrictions apply. Lddqu should be avoided on Uncached (UC) and Write-Combining (USWC) memory regions. Also, by its implementation, lddqu should be avoided in situations where store-load forwarding is expected.
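The strategy described in the quote can be modelled in software. The sketch below is only an illustration of the idea, not a description of what the hardware literally does; `lddqu_p4_model` is a hypothetical name, and it assumes x86-64 with GCC or Clang.

```c
#include <emmintrin.h>  /* SSE2 loads and stores */
#include <stdint.h>
#include <string.h>

/* Hypothetical software model of the P4-style load: fetch the 32-byte
 * block that starts at the 16-byte boundary at or below p, then extract
 * the 16 bytes the caller asked for.
 * Assumption: all 32 bytes of that block are readable. This is the
 * over-read the quoted restrictions on UC/USWC memory are about. */
static __m128i lddqu_p4_model(const uint8_t *p) {
    uintptr_t addr = (uintptr_t)p;
    const uint8_t *base = (const uint8_t *)(addr & ~(uintptr_t)15);
    size_t off = (size_t)(addr & 15);

    __attribute__((aligned(16))) uint8_t block[32];
    _mm_store_si128((__m128i *)block,
                    _mm_load_si128((const __m128i *)base));
    _mm_store_si128((__m128i *)(block + 16),
                    _mm_load_si128((const __m128i *)(base + 16)));

    /* Extract the requested 16 bytes from the 32-byte block. */
    return _mm_loadu_si128((const __m128i *)(block + off));
}
```

Note that even when p is already aligned the model still touches the following 16 bytes, so it reads memory the program never asked for.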

So I guess this explains why they didn't just use that strategy to implement movdqu all the time.

I guess the decoders don't have the memory-type information available, and that's when the decision has to be made on which uops to decode the instruction to. So trying to be "smart" about using the better strategy opportunistically on WB memory probably wasn't possible, even if it was desirable. (Which it isn't, because of store-forwarding.)

The summary from that blog post:

starting from Intel Core 2 brand (Core microarchitecture, from mid 2006, Merom CPU and higher) up to the future: lddqu does the same thing as movdqu

In other words:
* If the CPU supports Supplemental Streaming SIMD Extensions 3 (SSSE3) -> lddqu does the same thing as movdqu.
* If the CPU doesn't support SSSE3 but supports SSE3 -> go for lddqu (and note that story about memory types).
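If you really did care about those old CPUs, the rule of thumb in the two bullets could be turned into a run-time dispatch along these lines. This is a sketch under assumptions: the function names are invented for illustration, and `__builtin_cpu_init` / `__builtin_cpu_supports` are GCC/Clang built-ins on x86, not portable C.

```c
#include <emmintrin.h>  /* SSE2: _mm_loadu_si128 */
#include <pmmintrin.h>  /* SSE3: _mm_lddqu_si128 */
#include <stdint.h>
#include <string.h>

typedef __m128i (*load16_fn)(const void *);

/* movdqu-based unaligned load (SSE2 baseline). */
static __m128i load16_movdqu(const void *p) {
    return _mm_loadu_si128((const __m128i *)p);
}

/* lddqu-based unaligned load; SSE3 is enabled for this function only. */
__attribute__((target("sse3")))
static __m128i load16_lddqu(const void *p) {
    return _mm_lddqu_si128((const __m128i *)p);
}

/* The rule from the list above: lddqu only when SSE3 is present but
 * SSSE3 is not; otherwise plain movdqu. */
static load16_fn pick_unaligned_load(int has_sse3, int has_ssse3) {
    return (has_sse3 && !has_ssse3) ? load16_lddqu : load16_movdqu;
}

/* Query the running CPU through the GCC/Clang built-ins. */
static load16_fn pick_unaligned_load_for_this_cpu(void) {
    __builtin_cpu_init();
    return pick_unaligned_load(__builtin_cpu_supports("sse3"),
                               __builtin_cpu_supports("ssse3"));
}
```

On anything from Core2 onward this always selects the movdqu version, which matches the answer's point that there is no upside unless you care about P4-era hardware. The memory-type caveat (UC/USWC regions, store forwarding) is not something a feature check like this can capture.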
