Howto vblend for 32-bit integer? or: Why is there no _mm256_blendv_epi32?


Question


I'm using the AVX2 x86 256-bit SIMD extensions. I want to do a 32-bit integer component-wise if-then-else instruction. In the Intel documentation such an instruction is called vblend.

The Intel intrinsic guide contains the function _mm256_blendv_epi8. This function does nearly what I need. The only problem is that it works with 8-bit integers. Unfortunately there is no _mm256_blendv_epi32 in the docs. My first question is: Why does this function not exist? My second question is: How to emulate it?

After some searching I found _mm256_blendv_ps which does what I want for 32-bit floating points. Further I found cast functions _mm256_castsi256_ps and _mm256_castps_si256 which cast from integers to 32-bit floats and back. Putting these together gives:

inline __m256i _mm256_blendv_epi32 (__m256i a, __m256i b, __m256i mask){
    return _mm256_castps_si256( 
        _mm256_blendv_ps(
            _mm256_castsi256_ps(a),
            _mm256_castsi256_ps(b),
            _mm256_castsi256_ps(mask)
        ) 
    );
}

While this looks like 5 functions, 4 of them are only glorified casts and one maps directly onto a processor instruction. The whole function therefore boils down to one processor instruction.

The real awkward part therefore is that there seems to be a 32-bit blendv, except that the corresponding intrinsic is missing.

Is there some border case where this will fail miserably? For example, what happens when the integer bit pattern happens to represent a floating point NAN? Does blendv simply ignore this or will it raise some signal?

In case this works: Am I correct that there is an 8-bit, a 32-bit and a 64-bit blendv but a 16-bit blendv is missing?

Solution

If your mask is already all-zero / all-one for the whole 32-bit element (like a vpcmpgtd result), use _mm256_blendv_epi8 directly.

My code relies on blendv only checking the highest bit.

Then you have two good options:

  • Broadcast the high bit within each element with an arithmetic right shift by 31, i.e. VPSRAD: mask = _mm256_srai_epi32(mask, 31), to set up for _mm256_blendv_epi8.

    VPSRAD is 1 uop for port 0 on Intel Haswell. (More throughput on Skylake: p01.) If your algorithm bottlenecks on port 0 (e.g. integer multiply and shift), that's not great.

  • Use VBLENDVPS. You're correct that all the casts are just to keep the compiler happy, and that VBLENDVPS will do exactly what you want in one instruction.

static inline
__m256i blendvps_si256(__m256i a, __m256i b, __m256i mask) {
    __m256 res = _mm256_blendv_ps(_mm256_castsi256_ps(a), _mm256_castsi256_ps(b), _mm256_castsi256_ps(mask));
    return _mm256_castps_si256(res);
}

However, Intel SnB-family CPUs have a bypass-delay latency of 1 cycle when forwarding integer results to the FP blend unit, and another 1c latency when forwarding the blend results to other integer instructions. This might not hurt throughput if latency isn't the bottleneck.

For more about bypass-delay latency, see Agner Fog's microarchitecture guide. It's the reason they don't make __m256i intrinsics for FP instructions, and vice versa. Note that since Sandybridge, FP shuffles don't have extra latency to forward from/to instructions like PADDD. So SHUFPS is a great way to combine data from two integer vectors if PUNPCK* or PALIGNR don't do exactly what you want. (SHUFPS on integers can be worth it even on Nehalem, where it does have a 2c penalty both ways, if throughput is your bottleneck).

Try both ways and benchmark. Either way could be better, depending on surrounding code.

Latency might not matter compared to uop throughput / instruction count. Also note that if you're just storing the result to memory, store instructions don't care which domain the data was coming from.

But if you are using this as part of a long dependency chain, then it might be worth the extra instruction to avoid the extra 2 cycles of latency for the data being blended.

Note that if the mask-generation is on the critical path, then VPSRAD's 1 cycle latency is equivalent to the bypass-delay latency, so using an FP blend is only 1 extra cycle of latency for the mask->result chain, vs. 2 extra cycles for the data->result chain.


For example, what happens when the integer bit pattern happens to represent a floating point NAN?

BLENDVPS doesn't care. Intel's insn ref manual fully documents everything an instruction can/can't do, and SIMD Floating-Point Exceptions: None means that this isn't a problem. See also the tag wiki for links to docs.

FP blend/shuffle/bitwise-boolean/load/store instructions don't care about NaNs. Only instructions that do actual FP math (including CMPPS, MINPS, and stuff like that) raise FP exceptions or can possibly slow down with denormals.


Am I correct that there is an 8-bit, a 32-bit and a 64-bit blendv but a 16-bit blendv is missing?

Yes. But there are 32 and 16-bit arithmetic shifts, so it costs at most one extra instruction to use the 8-bit granularity blend. (There is no PSRAQ, so blendv of 64-bit integers is often best done with BLENDVPD, unless maybe the mask-generation is off the critical path and/or the same mask will be reused many times on the critical path.)

The most common use-case is for compare-masks where each element is all-ones or all-zeros already, so you could blend with PAND/PANDN => POR. Of course, clever tricks that leave just the sign-bit of your mask with the truth value can save instructions and latency, especially since variable-blends are somewhat faster than three boolean bitwise instructions. (e.g. ORPS two float vectors to see if they're both non-negative, instead of 2x CMPPS and ORing the masks. This can work great if you don't care about negative zero, or you're happy to treat underflow to -0.0 as negative).
