Move an int64_t to the high quadwords of an AVX2 __m256i vector
Question
This question is similar to [1]. However, I didn't quite understand how it addressed inserting into the high quadwords of a ymm using a GPR. Additionally, I want the operation not to use any intermediate memory accesses.

Can it be done with AVX2 or below (I don't have AVX512)?
[1]

Answer
My answer on the linked question didn't show a way to do that, because it can't be done very efficiently without AVX512F for a masked broadcast (vpbroadcastq zmm0{k1}, rax). But it's actually not all that bad using a scratch register: about the same cost as a vpinsrq + an immediate blend. (On Intel, 3 uops total: 2 uops for port 5 (vmovq + broadcast), plus an immediate blend that can run on any port. See https://agner.org/optimize/.) I updated my answer there with asm for this.

In C++ with Intel's intrinsics, you'd do something like:

#include <immintrin.h>
#include <stdint.h>

// integer version. An FP version would still use _mm256_set1_epi64x, then a cast
template<unsigned elem>
static inline
__m256i merge_epi64(__m256i v, int64_t newval)
{
    static_assert(elem <= 3, "a __m256i only has 4 qword elements");
    __m256i splat = _mm256_set1_epi64x(newval);

    constexpr unsigned dword_blendmask = 0b11 << (elem*2); // vpblendd uses 2 bits per qword
    return _mm256_blend_epi32(v, splat, dword_blendmask);
}

__m256i merge3(__m256i v, int64_t newval) {
    return merge_epi64<3>(v, newval);
}
// and so on for 2..0

Clang compiles this nearly perfectly efficiently for all 4 possible element positions, which really shows off how nice its shuffle optimizer is. It takes advantage of all the special cases. And as a bonus, it comments its asm to show you which elements come from where in blends and shuffles.

From the Godbolt compiler explorer, some test functions to see what happens with args in regs:
# clang7.0 -O3 -march=haswell
merge3(long long __vector(4), long):
vmovq xmm1, rdi
vpbroadcastq ymm1, xmm1
vpblendd ymm0, ymm0, ymm1, 192 # ymm0 = ymm0[0,1,2,3,4,5],ymm1[6,7]
# 192 = 0xC0 = 0b11000000
ret
merge2(long long __vector(4), long):
vmovq xmm1, rdi
vinserti128 ymm1, ymm0, xmm1, 1 # Runs on more ports than vbroadcast on AMD Ryzen
# But it introduced a dependency on v (ymm0) before the blend for no reason, for the low half of ymm1. Could have used xmm1, xmm1.
vpblendd ymm0, ymm0, ymm1, 48 # ymm0 = ymm0[0,1,2,3],ymm1[4,5],ymm0[6,7]
ret
merge1(long long __vector(4), long):
vmovq xmm1, rdi
vpbroadcastq xmm1, xmm1 # only an *XMM* broadcast, 1c latency instead of 3.
vpblendd ymm0, ymm0, ymm1, 12 # ymm0 = ymm0[0,1],ymm1[2,3],ymm0[4,5,6,7]
ret
merge0(long long __vector(4), long):
vmovq xmm1, rdi
# broadcast optimized away, newval is already in the low element
vpblendd ymm0, ymm0, ymm1, 3 # ymm0 = ymm1[0,1],ymm0[2,3,4,5,6,7]
ret
Other compilers blindly broadcast to the full YMM and then blend, even for elem=0. You can specialize the template, or add if() conditions in the template that will optimize away, e.g. splat = (elem?) set1() : v; to save the broadcast for elem==0. You could capture the other optimizations, too, if you wanted.
GCC 8.x and earlier use a normally-bad way of broadcasting the integer: they store/reload. This avoids using any ALU shuffle ports, because broadcast-loads are free on Intel CPUs, but it introduces store-forwarding latency into the chain from the integer to the final vector result.

This is fixed in current trunk for gcc9, but I don't know if there's a workaround to get non-silly code-gen with earlier gcc. Normally, -march=<an intel uarch> favours ALU instead of store/reload for integer -> vector and vice versa, but in this case the cost model still picks store/reload with -march=haswell.

# gcc8.2 -O3 -march=haswell
merge0(long long __vector(4), long):
push rbp
mov rbp, rsp
and rsp, -32 # align the stack even though no YMM is spilled/loaded
mov QWORD PTR [rsp-8], rdi
vpbroadcastq ymm1, QWORD PTR [rsp-8] # 1 uop on Intel
vpblendd ymm0, ymm0, ymm1, 3
leave
ret
; GCC trunk: g++ (GCC-Explorer-Build) 9.0.0 20190103 (experimental)
; MSVC and ICC do this, too. (For MSVC, make sure to compile with -arch:AVX2)
merge0(long long __vector(4), long):
vmovq xmm2, rdi
vpbroadcastq ymm1, xmm2
vpblendd ymm0, ymm0, ymm1, 3
ret
For a runtime-variable element position, the shuffle still works, but you'd have to create a blend-mask vector with the high bit set in the right element, e.g. with a vpmovsxbq load from mask[3-elem] in alignas(8) int8_t mask[] = { 0,0,0,-1,0,0,0 };. But vpblendvb or vblendvpd is slower than an immediate blend, especially on Haswell, so avoid that if possible.