Move an int64_t to the high quadwords of an AVX2 __m256i vector

Question

This question is similar to [1]. However, I didn't quite understand how it addressed inserting into the high quadwords of a ymm using a GPR. Additionally, I want the operation to not use any intermediate memory accesses.

Can it be done with AVX2 or below (I don't have AVX512)?

[1]

Answer

My answer on the linked question didn't show a way to do that because it can't be done very efficiently without AVX512F for a masked broadcast (vpbroadcastq zmm0{k1}, rax). But it's actually not all that bad using a scratch register, about the same cost as a vpinsrq + an immediate blend.

(On Intel, 3 uops total. 2 uops for port 5 (vmovq + broadcast), and an immediate blend that can run on any port. See https://agner.org/optimize/).

I updated my answer there with asm for this. In C++ with Intel's intrinsics, you'd do something like:

#include <immintrin.h>
#include <stdint.h>

// integer version.  An FP version would still use _mm256_set1_epi64x, then a cast
template<unsigned elem>
static inline
__m256i merge_epi64(__m256i v, int64_t newval)
{
    static_assert(elem <= 3, "a __m256i only has 4 qword elements");

    __m256i splat = _mm256_set1_epi64x(newval);

    constexpr unsigned dword_blendmask = 0b11 << (elem*2);  // vpblendd uses 2 bits per qword
    return  _mm256_blend_epi32(v, splat, dword_blendmask);
}
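
As the comment at the top of merge_epi64 notes, an FP version would still use _mm256_set1_epi64x, then a cast. A minimal sketch of one way to read that (my own addition, not from the original answer; merge_pd is a made-up name, and <string.h> is only needed for memcpy):

#include <string.h>   // for memcpy; immintrin.h / stdint.h are already included above

// Hypothetical double version: type-pun the scalar to its bit pattern, let
// merge_epi64 do the _mm256_set1_epi64x + vpblendd, then cast the result back.
// The casts are free: they only change the vector type, not the bits.
template<unsigned elem>
static inline
__m256d merge_pd(__m256d v, double newval)
{
    static_assert(elem <= 3, "a __m256d only has 4 double elements");

    int64_t bits;
    memcpy(&bits, &newval, sizeof(bits));    // bit pattern of the double
    __m256i merged = merge_epi64<elem>(_mm256_castpd_si256(v), bits);
    return _mm256_castsi256_pd(merged);
}

(You could also stay in the FP domain with _mm256_set1_pd + _mm256_blend_pd; this sketch just follows the comment's suggestion.)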

Clang compiles merge_epi64 nearly perfectly efficiently for all 4 possible element positions, which really shows off how nice its shuffle optimizer is. It takes advantage of all the special cases. And as a bonus, it comments its asm to show you which elements come from where in blends and shuffles.

From the Godbolt compiler explorer, here are some test functions to see what happens with args in regs:

__m256i merge3(__m256i v, int64_t newval) {
    return merge_epi64<3> (v, newval);
}
// and so on for 2..0

# clang7.0 -O3 -march=haswell
merge3(long long __vector(4), long):
    vmovq   xmm1, rdi
    vpbroadcastq    ymm1, xmm1
    vpblendd        ymm0, ymm0, ymm1, 192 # ymm0 = ymm0[0,1,2,3,4,5],ymm1[6,7]
                      # 192 = 0xC0 = 0b11000000
    ret

merge2(long long __vector(4), long):
    vmovq   xmm1, rdi
    vinserti128     ymm1, ymm0, xmm1, 1          # Runs on more ports than vbroadcast on AMD Ryzen
        #  But it introduced a dependency on  v (ymm0) before the blend for no reason, for the low half of ymm1.  Could have used xmm1, xmm1.
    vpblendd        ymm0, ymm0, ymm1, 48 # ymm0 = ymm0[0,1,2,3],ymm1[4,5],ymm0[6,7]
    ret

merge1(long long __vector(4), long):
    vmovq   xmm1, rdi
    vpbroadcastq    xmm1, xmm1           # only an *XMM* broadcast, 1c latency instead of 3.
    vpblendd        ymm0, ymm0, ymm1, 12 # ymm0 = ymm0[0,1],ymm1[2,3],ymm0[4,5,6,7]
    ret

merge0(long long __vector(4), long):
    vmovq   xmm1, rdi
           # broadcast optimized away, newval is already in the low element
    vpblendd        ymm0, ymm0, ymm1, 3 # ymm0 = ymm1[0,1],ymm0[2,3,4,5,6,7]
    ret

Other compilers blindly broadcast to the full YMM and then blend, even for elem=0. You can specialize the template, or add if() conditions in the template that will optimize away, e.g. splat = (elem ? set1() : v); to save the broadcast for elem==0 (sketched more fully below). You could capture the other optimizations, too, if you wanted.
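
One way to write those special cases out by hand (a sketch of the if() idea; my own addition, untested, and merge_epi64_v2 is a made-up name; it just reproduces what clang already finds on its own):

// Plain if() on a template parameter dead-code-eliminates at compile time;
// with C++17 you could use `if constexpr` instead.
template<unsigned elem>
static inline
__m256i merge_epi64_v2(__m256i v, int64_t newval)
{
    static_assert(elem <= 3, "a __m256i only has 4 qword elements");

    __m256i splat;
    if (elem == 0) {
        // the blend only reads the low qword of splat: vmovq alone is enough
        splat = _mm256_castsi128_si256(_mm_cvtsi64_si128(newval));
    } else if (elem == 1) {
        // the blend only reads the low 128 bits: an XMM broadcast is enough
        splat = _mm256_castsi128_si256(_mm_set1_epi64x(newval));
    } else {
        // elem 2 or 3: the high lane is needed (clang would use vinserti128 for elem 2)
        splat = _mm256_set1_epi64x(newval);
    }

    constexpr unsigned dword_blendmask = 0b11 << (elem*2);  // vpblendd uses 2 bits per qword
    return _mm256_blend_epi32(v, splat, dword_blendmask);
}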

GCC 8.x and earlier use a normally-bad way of broadcasting the integer: they store/reload. This avoids using any ALU shuffle ports because broadcast-loads are free on Intel CPUs, but it introduces store-forwarding latency into the chain from the integer to the final vector result.

This is fixed in current trunk for gcc9, but I don't know if there's a workaround to get non-silly code-gen with earlier gcc. Normally -march=<an intel uarch> favours ALU instead of store/reload for integer -> vector and vice versa, but in this case the cost model still picks store/reload with -march=haswell.

# gcc8.2 -O3 -march=haswell
merge0(long long __vector(4), long):
    push    rbp
    mov     rbp, rsp
    and     rsp, -32          # align the stack even though no YMM is spilled/loaded
    mov     QWORD PTR [rsp-8], rdi
    vpbroadcastq    ymm1, QWORD PTR [rsp-8]   # 1 uop on Intel
    vpblendd        ymm0, ymm0, ymm1, 3
    leave
    ret

; GCC trunk: g++ (GCC-Explorer-Build) 9.0.0 20190103 (experimental)
; MSVC and ICC do this, too.  (For MSVC, make sure to compile with -arch:AVX2)
merge0(long long __vector(4), long):
    vmovq   xmm2, rdi
    vpbroadcastq    ymm1, xmm2
    vpblendd        ymm0, ymm0, ymm1, 3
    ret


For a runtime-variable element position, the shuffle still works but you'd have to create a blend mask vector with the high bit set in the right element. e.g. with a vpmovsxbq load from mask[3-elem] in alignas(8) int8_t mask[] = { 0,0,0,-1,0,0,0 };. But vpblendvb or vblendvpd is slower than an immediate blend, especially on Haswell, so avoid that if possible.
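
If you do need the runtime-variable version, here's a minimal sketch of that mask-table idea (my own addition, untested; merge_epi64_var and the memcpy-based 4-byte load are illustrative choices):

#include <string.h>   // for memcpy; immintrin.h / stdint.h are already included above

// Runtime-variable element index: variable blend (vpblendvb) instead of an
// immediate vpblendd, so expect it to be slower, as noted above.
static inline
__m256i merge_epi64_var(__m256i v, int64_t newval, unsigned elem)
{
    // sliding window: the 4 bytes starting at mask[3-elem] have -1 exactly at position elem
    alignas(8) static const int8_t mask[7] = { 0,0,0,-1,0,0,0 };

    int32_t window;
    memcpy(&window, mask + 3 - (elem & 3), sizeof(window));   // elem & 3 just keeps the sketch in bounds

    __m256i blendmask = _mm256_cvtepi8_epi64(_mm_cvtsi32_si128(window));  // vpmovsxbq
    __m256i splat     = _mm256_set1_epi64x(newval);
    return _mm256_blendv_epi8(v, splat, blendmask);                       // vpblendvb
}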
