What is the difference between _mm512_load_epi32 and _mm512_load_si512?

Question

The Intel intrinsics guide states simply that _mm512_load_epi32:

Load[s] 512-bits (composed of 16 packed 32-bit integers) from memory into dst

and that _mm512_load_si512:

Load[s] 512-bits of integer data from memory into dst

What is the difference between these two? The documentation isn't clear.

Recommended answer

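There's no difference; it's just silly redundant naming. Use _mm512_load_si512 for clarity. Thanks, Intel. As usual, it's easier to understand the underlying asm for AVX512, and then you can see what the clumsy intrinsic naming is trying to say. Or at least you can understand how we ended up with this mess of different documentation suggesting _mm512_load_epi32 vs. _mm512_load_si512.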

Almost all AVX512 instructions support merge-masking and zero-masking. For example, vmovdqa32 can do a masked load like vmovdqa32 zmm0{k1}{z}, [rdi] to zero the vector elements where k1 has a zero bit. That is why different element-size versions of things like vector loads and bitwise operations exist (e.g. vpxord vs. vpxorq).

But these intrinsics are for the no-masking version, where the element size is totally irrelevant. I'm guessing _mm512_load_epi32 exists for consistency with _mm512_mask_load_epi32 (merge-masking) and _mm512_maskz_load_epi32 (zero-masking). See the docs for the vmovdqa32 asm instruction.
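As a rough check of that claim, the sketch below loads the same buffer with both intrinsics and compares the results byte for byte. The function names loads_match and loads_match_avx512 are made up for illustration, and the target attribute and __builtin_cpu_supports are GCC/Clang extensions assumed here so the file builds without -mavx512f and does not fault on a CPU that lacks AVX-512F.

```c
#include <immintrin.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: src must be 64-byte aligned (both loads require it). */
__attribute__((target("avx512f")))
static int loads_match_avx512(const int32_t *src)
{
    __m512i a = _mm512_load_epi32(src);  /* "16 packed 32-bit integers" */
    __m512i b = _mm512_load_si512(src);  /* "512 bits of integer data"  */

    /* Store both vectors back to memory and compare the raw bytes. */
    int32_t out_a[16], out_b[16];
    _mm512_storeu_si512(out_a, a);
    _mm512_storeu_si512(out_b, b);
    return memcmp(out_a, out_b, sizeof out_a) == 0;
}

/* Returns 1 if the two loads agree, 0 if they differ,
 * and -1 if this CPU does not report AVX-512F support. */
int loads_match(void)
{
    static const int32_t data[16] __attribute__((aligned(64))) = {
        0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15
    };
    if (!__builtin_cpu_supports("avx512f"))
        return -1;
    return loads_match_avx512(data);
}
```

On a machine with AVX-512F, loads_match() should return 1; a compiler is free to emit the same load instruction for both intrinsics.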

For example, _mm512_maskz_loadu_epi64(0x55, x) zeros the odd elements for free while loading. (At least it's free if the cost of putting 0x55 into a k register can be hoisted out of a loop, and if we haven't defeated the chance for the compiler to fold a load into a memory operand for an ALU instruction.)
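To make the two masking modes concrete, here is a small sketch that applies the 0x55 mask with both a zero-masking and a merge-masking load. The names masked_load_demo and masked_loads_avx512, the input values, and the -1 fill value are illustrative assumptions; the target attribute and __builtin_cpu_supports are GCC/Clang extensions.

```c
#include <immintrin.h>
#include <stdint.h>

/* Hypothetical helper: load 8 x 64-bit elements under mask 0x55. */
__attribute__((target("avx512f")))
static void masked_loads_avx512(const int64_t *src,
                                int64_t *zeroed, int64_t *merged)
{
    const __mmask8 even = 0x55;            /* bits 0, 2, 4, 6 are set */
    __m512i fill = _mm512_set1_epi64(-1);  /* value kept where the mask bit is 0 */

    /* Zero-masking: elements whose mask bit is 0 become 0. */
    __m512i z = _mm512_maskz_loadu_epi64(even, src);
    /* Merge-masking: elements whose mask bit is 0 keep the value from fill. */
    __m512i m = _mm512_mask_loadu_epi64(fill, even, src);

    _mm512_storeu_si512(zeroed, z);
    _mm512_storeu_si512(merged, m);
}

/* Returns 1 after filling both outputs, or -1 (outputs untouched)
 * if this CPU does not report AVX-512F support. */
int masked_load_demo(int64_t zeroed[8], int64_t merged[8])
{
    static const int64_t src[8] = {1, 2, 3, 4, 5, 6, 7, 8};
    if (!__builtin_cpu_supports("avx512f"))
        return -1;
    masked_loads_avx512(src, zeroed, merged);
    return 1;
}
```

With the input 1..8, the zero-masked result is {1, 0, 3, 0, 5, 0, 7, 0} and the merge-masked result is {1, -1, 3, -1, 5, -1, 7, -1}. Here the element width does matter, because it decides which bytes each mask bit controls.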

When elements are all loaded into the destination unchanged, element boundaries are meaningless. That's why AVX2 and earlier don't have different element-size versions of bitwise booleans like _mm_xor_si128 and loads/stores like _mm_load_si128.

Some compilers don't support the element-width names for unaligned unmasked loads. For example, gcc (at the time of writing) doesn't support _mm512_loadu_epi64, even though it has supported _mm512_load_epi64 since the first gcc version to support AVX512 intrinsics at all. (See error: '_mm512_loadu_epi64' was not declared in this scope)
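One way to sidestep that portability gap is to spell unmasked unaligned loads with the si512 name, which carries no element width. The sketch below does this for a deliberately misaligned buffer; sum8_unaligned_demo and sum8_avx512 are hypothetical names, and the target attribute and __builtin_cpu_supports are GCC/Clang extensions assumed for the sake of a self-contained example.

```c
#include <immintrin.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: sum 8 x 64-bit integers starting at p (any alignment). */
__attribute__((target("avx512f")))
static int64_t sum8_avx512(const void *p)
{
    /* Width-agnostic spelling of the unmasked unaligned load;
     * _mm512_loadu_epi64(p) would request the same 64 bytes where it is declared. */
    __m512i v = _mm512_loadu_si512(p);

    int64_t lanes[8];
    _mm512_storeu_si512(lanes, v);

    int64_t sum = 0;
    for (int i = 0; i < 8; i++)
        sum += lanes[i];
    return sum;
}

/* Returns the sum of 1..8 (36), or -1 if this CPU does not report AVX-512F. */
int64_t sum8_unaligned_demo(void)
{
    const int64_t values[8] = {1, 2, 3, 4, 5, 6, 7, 8};
    unsigned char buf[1 + sizeof values];

    /* Copy to an odd byte offset so the source is not 8-byte aligned. */
    memcpy(buf + 1, values, sizeof values);

    if (!__builtin_cpu_supports("avx512f"))
        return -1;
    return sum8_avx512(buf + 1);
}
```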

There are no CPUs where the choice of vmovdqa64 vs. vmovdqa32 matters at all for efficiency, so there's zero point in trying to hint the compiler to use one or the other, regardless of the natural element width of your data.

Only FP vs. integer might matter for loads, and Intel's intrinsics already use different types (__m512 vs. __m512i) for that.
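The type distinction can be sketched as follows: the FP load returns __m512, the integer load returns __m512i, and a cast intrinsic reinterprets one as the other without changing any bits. The names fp_and_int_loads_match and same_bits_avx512 are invented for this example, and the target attribute and __builtin_cpu_supports are GCC/Clang extensions.

```c
#include <immintrin.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: src must be 64-byte aligned. */
__attribute__((target("avx512f")))
static int same_bits_avx512(const float *src)
{
    __m512  f = _mm512_load_ps(src);     /* FP-typed load      */
    __m512i i = _mm512_load_si512(src);  /* integer-typed load */

    /* Reinterpret the float vector as integer data; no conversion happens. */
    __m512i f_bits = _mm512_castps_si512(f);

    uint32_t a[16], b[16];
    _mm512_storeu_si512(a, f_bits);
    _mm512_storeu_si512(b, i);

    /* Both loads should carry identical bits; 1.0f is 0x3F800000 in IEEE-754. */
    return memcmp(a, b, sizeof a) == 0 && a[0] == 0x3F800000u;
}

/* Returns 1 if both loads produced the same bits, 0 if not,
 * and -1 if this CPU does not report AVX-512F support. */
int fp_and_int_loads_match(void)
{
    static const float data[16] __attribute__((aligned(64))) = {
        1.0f, 2.0f, 3.0f, 4.0f, 5.0f, 6.0f, 7.0f, 8.0f,
        9.0f, 10.0f, 11.0f, 12.0f, 13.0f, 14.0f, 15.0f, 16.0f
    };
    if (!__builtin_cpu_supports("avx512f"))
        return -1;
    return same_bits_avx512(data);
}
```

The data in the register is the same either way; the type only tells the compiler which family of instructions the value is meant for.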
