Is there an intrinsic function to zero out the last n bytes of a __m128i vector?

Question

Given n, I want to zero out the last n bytes of a __m128i vector.

For instance consider the following __m128i vector:

11111111 11111111 11111111 11111111 11111111 11111111 11111111 11111111 11111111 11111111 11111111 11111111 11111111 11111111 11111111 11111111

After zeroing out the last n = 4 bytes, the vector should look like:

11111111 11111111 11111111 11111111 11111111 11111111 11111111 11111111 11111111 11111111 11111111 11111111 00000000 00000000 00000000 00000000

Is there an SSE intrinsic function that does this (by accepting a __m128i vector and n as parameters)?

Recommended answer

There are various options that don't rely on AVX512. For example:

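#include <stdint.h>
#include <emmintrin.h>  // SSE2

// 16 zero bytes followed by 16 all-ones bytes.
static const int8_t mask[32] = { 0, 0, 0, 0, 0, 0, 0, 0,
                                 0, 0, 0, 0, 0, 0, 0, 0,
                                 -1, -1, -1, -1, -1, -1, -1, -1,
                                 -1, -1, -1, -1, -1, -1, -1, -1};

// Zeroes the n lowest bytes of x; n must be in [0, 16].
__m128i zeroLowestNBytes(__m128i x, uint32_t n)
{
    // The 16-byte window starting at mask[16 - n] holds n zero bytes
    // followed by 16 - n bytes of 0xFF.
    __m128i m = _mm_loadu_si128((const __m128i*)&mask[16 - n]);
    return _mm_and_si128(x, m);
}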

With AVX, the load can become a memory operand of the vpand. Without AVX it's still fine, with movdqu and pand.

The load being unaligned isn't normally a problem, unless it crosses a 4K boundary. If you can get mask 32-aligned then that problem would go away. The load would still be unaligned, but wouldn't hit that particular edge case.
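A rough sketch of the 32-aligned variant, assuming a C11 compiler (for `_Alignas`). The names `kMask` and `zero_lowest_n_bytes_aligned` are made up for this illustration. Because the 32-byte table starts on a 32-byte boundary, it fits inside a single 64-byte cache line, so no 16-byte window within it can straddle a cache-line or 4K page boundary.

```c
#include <stdint.h>
#include <emmintrin.h>  /* SSE2 */

/* 16 zero bytes followed by 16 all-ones bytes, aligned to 32 bytes so the
   whole table sits inside one cache line (and therefore one page). */
static const _Alignas(32) int8_t kMask[32] = {
     0,  0,  0,  0,  0,  0,  0,  0,
     0,  0,  0,  0,  0,  0,  0,  0,
    -1, -1, -1, -1, -1, -1, -1, -1,
    -1, -1, -1, -1, -1, -1, -1, -1
};

/* Hypothetical name. Zeroes the n lowest bytes of x; n must be in [0, 16]. */
static inline __m128i zero_lowest_n_bytes_aligned(__m128i x, uint32_t n)
{
    /* The load itself is still unaligned for most n, but it never leaves
       the aligned 32-byte table. */
    __m128i m = _mm_loadu_si128((const __m128i *)&kMask[16 - n]);
    return _mm_and_si128(x, m);
}
```

The logic is the same as in the version above; only the placement of the table changes.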

n is a uint32_t to avoid a sign extension when it is used in the address computation.

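Another option computes the mask with a comparison instead of loading it from an n-dependent address:

// Zeroes the n lowest bytes of x; n is expected to be in [0, 16].
__m128i zeroLowestNBytes(__m128i x, int n)
{
    __m128i threshold = _mm_set1_epi8((char)n);
    // Byte i of index holds the value i (_mm_set_epi8 lists the highest byte first).
    __m128i index = _mm_set_epi8(15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0);
    // The compare yields 0xFF where n > i; andnot clears exactly those bytes of x.
    return _mm_andnot_si128(_mm_cmpgt_epi8(threshold, index), x);
}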

This avoids the unaligned load, but that shouldn't really matter. More importantly, it avoids the "input-dependent load": in the version with the unaligned load, the load depends on n. In this version, the load is independent of n. For example, that allows a compiler to hoist it out of a loop, if this function is inlined. It also allows out-of-order execution more freedom to start the load early, perhaps before n has been computed.
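To illustrate the hoisting point, here is a hand-written sketch of what a compiler might produce after inlining the compare-based version into a loop where n changes on every iteration. The function name `zero_lowest_bytes_each` and its parameters are hypothetical, and each n[i] is assumed to be in [0, 16].

```c
#include <stddef.h>
#include <stdint.h>
#include <emmintrin.h>  /* SSE2 */

/* Hypothetical helper: zeroes the n[i] lowest bytes of v[i] for every i.
   Assumes each n[i] is in [0, 16]. */
void zero_lowest_bytes_each(__m128i *v, const uint8_t *n, size_t count)
{
    /* This constant does not depend on n, so it is loaded once, before the loop. */
    const __m128i index =
        _mm_set_epi8(15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0);

    for (size_t i = 0; i < count; i++) {
        /* Only the broadcast and the compare depend on n[i]. */
        __m128i threshold = _mm_set1_epi8((char)n[i]);
        __m128i clear = _mm_cmpgt_epi8(threshold, index); /* 0xFF where byte index < n[i] */
        __m128i x = _mm_loadu_si128(&v[i]);
        _mm_storeu_si128(&v[i], _mm_andnot_si128(clear, x));
    }
}
```

With the table-lookup version, the mask load would have to stay inside the loop, because its address depends on n[i].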

The flip side is that it basically requires AVX2 (vpbroadcastb) or SSSE3 (pshufb with an all-zero shuffle control) for an efficient implementation of _mm_set1_epi8(n). It also normally costs more instructions, which may be worse for throughput. The latency should be better, since there is no load in the "main chain" (there is a load, but it's off to the side, so it doesn't add its latency to the latency of the computation).
