what's the difference between _mm256_lddqu_si256 and _mm256_loadu_si256


Question

I had been using _mm256_lddqu_si256 based on an example I found online. Later I discovered _mm256_loadu_si256. The Intel Intrinsics guide only states that the lddqu version may perform better when crossing a cache line boundary. What might be the advantages of loadu? In general how are these functions different?

Answer

There's no reason to ever use _mm256_lddqu_si256, consider it a synonym for _mm256_loadu_si256. lddqu only exists for historical reasons as x86 evolved towards having better unaligned vector load support, and CPUs that support the AVX version run them identically. There's no AVX512 version.

Compilers do still respect the lddqu intrinsic and emit that instruction, so you could use it if you want your code to run identically but have a different checksum or machine code bytes.

No x86 microarchitectures run vlddqu any differently from vmovdqu. I.e. the two opcodes probably decode to the same internal uop on all AVX CPUs. They probably always will, unless some very-low-power or specialized microarchitecture comes along without efficient unaligned vector loads (which have been a thing since Nehalem). Compilers never use vlddqu when auto-vectorizing.

lddqu was different from movdqu on Pentium 4. See History of … one CPU instructions: Part 1. LDDQU/movdqu explained.

lddqu is allowed to (and on P4 does) do two aligned 16B loads and take a window of that data. movdqu architecturally only ever loads from the expected 16 bytes. This has implications for store-forwarding: if you're loading data that was just stored with an unaligned store, use movdqu, because store-forwarding only works for loads that are fully contained within a previous store. But otherwise you generally always wanted to use lddqu. (This is why they didn't just make movdqu always use "the good way", and instead introduced a new instruction for programmers to worry about. But luckily for us, they changed the design so we don't have to worry about which unaligned load instruction to use anymore.)

It also has implications for correctness of observable behaviour on UnCacheable (UC) or Uncacheable Speculative Write-Combining (USWC, aka WC) memory types (which might have MMIO registers behind them).

There's no code-size difference in the two asm instructions:

  # SSE packed-single instructions are shorter than SSE2 integer / packed-double
  4000e3:       0f 10 07                movups xmm0, [rdi]   

  4000e6:       f2 0f f0 07             lddqu  xmm0, [rdi]
  4000ea:       f3 0f 6f 07             movdqu xmm0, [rdi]

  4000ee:       c5 fb f0 07             vlddqu xmm0, [rdi]
  4000f2:       c5 fa 6f 07             vmovdqu xmm0, [rdi]
  # AVX-256 is the same as AVX-128, but with one more bit set in the VEX prefix


On Core2 and later, there's no reason to use lddqu, but also no downside vs. movdqu. Intel dropped the special lddqu stuff for Core2, so both options suck equally.

On Core2 specifically, avoiding cache-line splits in software with two aligned loads and SSSE3 palignr is sometimes a win vs. movdqu, especially on 2nd-gen Core2 (Penryn) where palignr is only one shuffle uop instead of 2 on Merom/Conroe. (Penryn widened the shuffle execution unit to 128b).

See Dark Shikari's 2009 Diary Of An x264 Developer blog post: Cacheline splits, take two, for more about unaligned-load strategies back in the bad old days.

The generation after Core2 is Nehalem, where movdqu is a single uop instruction with dedicated hardware support in the load ports. It's still useful to tell compilers when pointers are aligned (especially for auto-vectorization, and especially without AVX), but it's not a performance disaster for them to just use movdqu everywhere, especially if the data is in fact aligned at run-time.

I don't know why Intel even made an AVX version of lddqu at all. I guess it's simpler for the decoders to just treat that opcode as an alias for movdqu / vmovdqu in all modes (with legacy SSE prefixes, or with AVX128 / AVX256), instead of having that opcode decode to something else with VEX prefixes.

All current AVX-supporting CPUs have efficient hardware unaligned-load / store support that handles it as optimally as possible. e.g. when the data is aligned at runtime, there's exactly zero performance difference vs. vmovdqa.

This was not the case before Nehalem; movdqu and lddqu used to decode to multiple uops to handle potentially-misaligned addresses, instead of putting hardware support for that right in the load ports where a single uop can activate it instead of faulting on unaligned addresses.

However, Intel's ISA ref manual entry for lddqu says the 256b version can load up to 64 bytes (implementation dependent):

This instruction may improve performance relative to (V)MOVDQU if the source operand crosses a cache line boundary. In situations that require the data loaded by (V)LDDQU be modified and stored to the same location, use (V)MOVDQU or (V)MOVDQA instead of (V)LDDQU. To move a double quadword to or from memory locations that are known to be aligned on 16-byte boundaries, use the (V)MOVDQA instruction.

IDK how much of that was written deliberately, and how much of that just came from prepending (V) when updating the entry for AVX. I don't think Intel's optimization manual recommends really using vlddqu anywhere, but I didn't check.

There is no AVX512 version of vlddqu, so I think that means Intel has decided that an alternate-strategy unaligned load instruction is no longer useful, and isn't even worth keeping their options open.
