在16位字中仅保留10个有用位 [英] Keep only the 10 useful bits in 16-bit words

查看:90
本文介绍了在16位字中仅保留10个有用位的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有_m256i个向量,这些向量在16位整数内包含10位字(因此16 * 16位仅包含16 * 10个有用位).最好/最快的方法是仅提取那些10位并打包以产生10位值的输出位流?

I have _m256i vectors that contain 10-bit words inside 16-bit integers (so 16*16-bit containing only 16*10 useful bits). What is the best/fastest way to extract only those 10-bits and pack them to produce an output bitstream of 10-bit values?

推荐答案

这是我的尝试.

尚未进行基准测试,但我认为它总体上应该可以很快运行:指令太多,在现代处理器上所有指令都有1个延迟周期.存储也很有效,有2条存储指令可存储20个字节的数据.

Have not benchmarked, but I think it should work pretty fast overall: not too many instructions, all of them have 1 cycle of latency on modern processors. Also the stores are efficient, 2 store instructions for 20 bytes of data.

该代码仅使用3个常量.如果您在循环中调用此函数,好的编译器应在循环外部加载所有三个函数,并将它们保存在寄存器中.

The code only uses 3 constants. If you call this function in a loop, good compilers should load all three outside of the loop and keep them in registers.

// bitwise blend according to a mask
inline void combineHigh( __m256i& vec, __m256i high, const __m256i lowMask )
{
    vec = _mm256_and_si256( vec, lowMask );
    high = _mm256_andnot_si256( lowMask, high );
    vec = _mm256_or_si256( vec, high );
}

// Store 10-bit pieces from each of the 16-bit lanes of the AVX2 vector.
// The function writes 20 bytes to the pointer.
inline void store_10x16_avx2( __m256i v, uint8_t* rdi )
{
    // Pack pairs of 10 bits into 20, into 32-bit lanes
    __m256i high = _mm256_srli_epi32( v, 16 - 10 );
    const __m256i low10 = _mm256_set1_epi32( ( 1 << 10 ) - 1 ); // Bitmask of 10 lowest bits in 32-bit lanes
    combineHigh( v, high, low10 );

    // Now the vector contains 32-bit lanes with 20 payload bits / each
    // Pack pairs of 20 bits into 40, into 64-bit lanes
    high = _mm256_srli_epi64( v, 32 - 20 );
    const __m256i low20 = _mm256_set1_epi64x( ( 1 << 20 ) - 1 ); // Bitmask of 20 lowest bits in 64-bit lanes
    combineHigh( v, high, low20 );

    // Now the vector contains 64-bit lanes with 40 payload bits / each
    // 40 bits = 5 bytes, store initial 4 bytes of the result
    _mm_storeu_si32( rdi, _mm256_castsi256_si128( v ) );

    // Shuffle the remaining 16 bytes of payload into correct positions.
    // The indices of the payload bytes are [ 0 .. 4 ] and [ 8 .. 12 ]
    // _mm256_shuffle_epi8 can only move data within 16-byte lanes
    const __m256i shuffleIndices = _mm256_setr_epi8(
        // 6 remaining payload bytes from the lower half of the vector
        4, 8, 9, 10, 11, 12,
        // 10 bytes gap, will be zeros
        -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
        // 6 bytes gap, will be zeros
        -1, -1, -1, -1, -1, -1,
        // 10 payload bytes from the higher half of the vector
        0, 1, 2, 3, 4,
        8, 9, 10, 11, 12
    );
    v = _mm256_shuffle_epi8( v, shuffleIndices );

    // Combine and store the final 16 bytes of payload
    const __m128i low16 = _mm256_castsi256_si128( v );
    const __m128i high16 = _mm256_extracti128_si256( v, 1 );
    const __m128i result = _mm_or_si128( low16, high16 );
    _mm_storeu_si128( ( __m128i* )( rdi + 4 ), result );
}

此代码将未使用的值的高6位截断.

This code truncates unused higher 6 bits of the values.

如果要饱和,则需要另外一条指令 _mm256_min_epu16 .

If you want to saturate instead, you’ll need one more instruction, _mm256_min_epu16.

此外,如果执行此操作,则该函数的第一步可以使用 pmaddwd .这是使源编号饱和的完整功能,并进行了一些额外的调整.

Also, if you do that, the first step of the function can use pmaddwd. Here’s the complete function which saturates the source numbers, with couple extra adjustments.

// Store 10-bit pieces from 16-bit lanes of the AVX2 vector, with saturation.
// The function writes 20 bytes to the pointer.
inline void store_10x16_avx2( __m256i v, uint8_t* rdi )
{
    const __m256i low10 = _mm256_set1_epi16( ( 1 << 10 ) - 1 );
#if 0
    // Truncate higher 6 bits; pmaddwd won't truncate, it needs zeroes in the unused higher bits.
    v = _mm256_and_si256( v, low10 );
#else
    // Saturate numbers into the range instead of truncating
    v = _mm256_min_epu16( v, low10 );
#endif

    // Pack pairs of 10 bits into 20, into 32-bit lanes
    // pmaddwd computes a[ 0 ] * b[ 0 ] + a[ 1 ] * b[ 1 ] for pairs of 16-bit lanes, making a single 32-bit number out of two pairs.
    // Initializing multiplier with pairs of [ 1, 2^10 ] to implement bit shifts + packing
    const __m256i multiplier = _mm256_set1_epi32( 1 | ( 1 << ( 10 + 16 ) ) );
    v = _mm256_madd_epi16( v, multiplier );

    // Now the vector contains 32-bit lanes with 20 payload bits / each
    // Pack pairs of 20 bits into 40 in 64-bit lanes
    __m256i low = _mm256_slli_epi32( v, 12 );
    v = _mm256_blend_epi32( v, low, 0b01010101 );
    v = _mm256_srli_epi64( v, 12 );

    // Now the vector contains 64-bit lanes with 40 payload bits / each
    // 40 bits = 5 bytes, store initial 4 bytes of the result
    _mm_storeu_si32( rdi, _mm256_castsi256_si128( v ) );

    // Shuffle the remaining 16 bytes of payload into correct positions.
    const __m256i shuffleIndices = _mm256_setr_epi8(
        // Lower half
        4, 8, 9, 10, 11, 12,
        -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
        // Higher half
        -1, -1, -1, -1, -1, -1,
        0, 1, 2, 3, 4,
        8, 9, 10, 11, 12
    );
    v = _mm256_shuffle_epi8( v, shuffleIndices );

    // Combine and store the final 16 bytes of payload
    const __m128i low16 = _mm256_castsi256_si128( v );
    const __m128i high16 = _mm256_extracti128_si256( v, 1 );
    const __m128i result = _mm_or_si128( low16, high16 );
    _mm_storeu_si128( ( __m128i* )( rdi + 4 ), result );
}

根据处理器,编译器和调用该函数的代码的不同,总体上这可能更快或更慢,但是绝对有助于代码大小.没有人再关心二进制大小了,但是CPU的L1I和µop缓存有限.

This may be slightly faster or slower overall depending on the processor, compiler, and the code calling the function, but definitely helps with code size. No one cares about binary size anymore, but CPUs have limited L1I and µop caches.

为完整起见,这是另一种使用SSE2和可选的SSSE3而不是AVX2的方法,实际上只是稍慢一点.

For completeness here’s another one that uses SSE2 and optionally SSSE3 instead of AVX2, only slightly slower in practice.

// Compute v = ( v & lowMask ) | ( high & ( ~lowMask ) ), for 256 bits of data in two registers
inline void combineHigh( __m128i& v1, __m128i& v2, __m128i h1, __m128i h2, const __m128i lowMask )
{
    v1 = _mm_and_si128( v1, lowMask );
    v2 = _mm_and_si128( v2, lowMask );
    h1 = _mm_andnot_si128( lowMask, h1 );
    h2 = _mm_andnot_si128( lowMask, h2 );
    v1 = _mm_or_si128( v1, h1 );
    v2 = _mm_or_si128( v2, h2 );
}

inline void store_10x16_sse( __m128i v1, __m128i v2, uint8_t* rdi )
{
    // Pack pairs of 10 bits into 20, in 32-bit lanes
    __m128i h1 = _mm_srli_epi32( v1, 16 - 10 );
    __m128i h2 = _mm_srli_epi32( v2, 16 - 10 );
    const __m128i low10 = _mm_set1_epi32( ( 1 << 10 ) - 1 );
    combineHigh( v1, v2, h1, h2, low10 );

    // Pack pairs of 20 bits into 40, in 64-bit lanes
    h1 = _mm_srli_epi64( v1, 32 - 20 );
    h2 = _mm_srli_epi64( v2, 32 - 20 );
    const __m128i low20 = _mm_set1_epi64x( ( 1 << 20 ) - 1 );
    combineHigh( v1, v2, h1, h2, low20 );

#if 1
    // 40 bits is 5 bytes, for the final shuffle we use pshufb instruction from SSSE3 set
    // If you don't have SSSE3, below under `#else` there's SSE2-only workaround.
    const __m128i shuffleIndices = _mm_setr_epi8(
        0, 1, 2, 3, 4,
        8, 9, 10, 11, 12,
        -1, -1, -1, -1, -1, -1 );
    v1 = _mm_shuffle_epi8( v1, shuffleIndices );
    v2 = _mm_shuffle_epi8( v2, shuffleIndices );
#else
    // SSE2-only version of the above, uses 8 instructions + 2 constants to emulate 2 instructions + 1 constant
    // Need two constants because after this step we want zeros in the unused higher 6 bytes.
    h1 = _mm_srli_si128( v1, 3 );
    h2 = _mm_srli_si128( v2, 3 );
    const __m128i low40 = _mm_setr_epi8( -1, -1, -1, -1, -1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 );
    const __m128i high40 = _mm_setr_epi8( 0, 0, 0, 0, 0, -1, -1, -1, -1, -1, 0, 0, 0, 0, 0, 0 );
    const __m128i l1 = _mm_and_si128( v1, low40 );
    const __m128i l2 = _mm_and_si128( v2, low40 );
    h1 = _mm_and_si128( h1, high40 );
    h2 = _mm_and_si128( h2, high40 );
    v1 = _mm_or_si128( h1, l1 );
    v2 = _mm_or_si128( h2, l2 );
#endif

    // Now v1 and v2 vectors contain densely packed 10 bytes / each.
    // Produce final result: 16 bytes in the low part, 4 bytes in the high part
    __m128i low16 = _mm_or_si128( v1, _mm_slli_si128( v2, 10 ) );
    __m128i high16 = _mm_srli_si128( v2, 6 );
    // Store these 20 bytes with 2 instructions
    _mm_storeu_si128( ( __m128i* )rdi, low16 );
    _mm_storeu_si32( rdi + 16, high16 );
}

这篇关于在16位字中仅保留10个有用位的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆