SIMD:位包有符号整数 [英] SIMD: Bit-pack signed integers

查看:57
本文介绍了SIMD:位包有符号整数的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

无符号整数可以通过使用位打包"进行压缩技术:在无符号整数块内仅存储有效位,当块中的所有整数都为小"时导致数据压缩.该方法称为 FOR(参考框架).

Unsigned integers can be compressed by using "bit-packing" techniques: Within a block of unsigned integers only the significant bits are stored, resulting in data compression when all integers in a block are "small". The method is known as FOR (frame of reference).

有 SIMD 可以非常有效地执行此操作.

There are SIMD libraries that do this very efficiently.

现在我想使用类似 FOR 的技术来编码 signed 整数,例如来自未排序的无符号整数的差分序列.每个有符号整数的符号都需要存储在某处,有两种选择:

Now I want to use FOR-like techniques to encode signed integers, e.g. from a differenced sequence of unsorted unsigned integers. The sign of each signed integer needs to be stored somewhere, there are two options:

  1. 将标志存储在单独的数据块中.这会增加开销.
  2. 将符号与每个有符号整数的绝对值存储在一起.

我现在正在遵循路径 2.2-s 补码在 msb 中有符号位(最重要的位),因此它不适用于像 FOR 那样的位打包.一种可能性是将符号存储在 lsb(最低有效位)中.以这种方式存储有符号整数是非常不寻常的,据我所知,没有支持这种方式的指令.现在的问题是:能否使用 SIMD 指令有效地对这些 lsb 签名的整数进行编码/解码?

I'm following path 2 right now. The 2-s complement has the sign bit in the msb (most signfificant bit), so that won't work for bit-packing à la FOR. One possibility is to store the sign in the lsb (least significant bit). Storing signed integers this way is very unusual, there are no instruction that support this, as far as I know. The question now is: Can these lsb-signed-integers be encoded/decoded efficiently using SIMD instructions?

我认为 AVX-512 _mm_testn_epi32_mask 可用于从每个 uint32 中提取 lsb,然后是移位,然后是某种类型的两个 mask_extract?相当复杂.

I think AVX-512 _mm_testn_epi32_mask can be used to extract the lsb from each uint32, followed by a shift, then two mask_extract of some sort? Quite convoluted.

推荐答案

未经测试 锯齿形调整浪 C 中使用 SSE2 处理 64 位整数的示例:

Untested ZigZag examples in C using SSE2 for 64-bit integers:

(注意:SSE2 缺少一些 64 位指令...)

(note: SSE2 is missing some 64-bit instructions...)

#include <emmintrin.h>


// from comment by Peter-Cordes 
__m128i zigzag_encode_epi64(__m128i v) {
    __m128i signmask = _mm_shuffle_epi32(v, _MM_SHUFFLE(3,3,1,1));
    signmask = _mm_srai_epi32(signmask, 31);
    return _mm_xor_si128(_mm_add_epi64(v, v), signmask);
}


__m128i zigzag_decode_epi64 (__m128i v) {
    __m128i signmask = _mm_and_si128(_mm_set_epi32(0, 1, 0, 1), v);
    signmask = _mm_sub_epi64(_mm_setzero_si128(), signmask);
    return _mm_xor_si128(_mm_srli_epi64(v, 1), signmask);
}


// no constant
__m128i zigzag_decodev3_epi64 (__m128i v) {
    __m128i t = _mm_srli_epi64(v, 1);
    __m128i signmask = _mm_sub_epi64(_mm_slli_epi64(t, 1), v);
    return _mm_xor_si128(t, signmask);
}

Zigzag 适用于按位 varint.然而,逐字节组变量可能希望从可变位宽进行符号扩展".

Zigzag is good for bitwise varints. However, a bytewise group-varint may wish to "sign extend from a variable bit-width".

32 位示例

我更喜欢比较而不是算术移位.我假设 - 展开时 - 比较的延迟会降低 1 个周期.

I favored compares over arithmetic shifts. I assume - when unrolled - compares will have 1 cycle lower latency.

__m128i zigzag_encode_epi32 (__m128i v) {
    __m128i signmask =_mm_cmpgt_epi32(_mm_setzero_si128(), v);
    return _mm_xor_si128(_mm_add_epi32(v, v), signmask);
}


__m128i zigzag_decode_epi32 (__m128i v) {
    const __m128i m = _mm_set1_epi32(1);
    __m128i signmask =_mm_cmpeq_epi32(_mm_and_si128(m, v), m);
    return _mm_xor_si128(_mm_srli_epi32(v, 1), signmask);
}


__m128i delta_encode_epi32 (__m128i v, __m128i prev) {
    return _mm_sub_epi32(v, _mm_alignr_epi8(v, prev, 12));
}


// prefix sum (see many of answers around stackoverflow...)
__m128i delta_decode_epi32 (__m128i v, __m128i prev) {
    prev = _mm_shuffle_epi32(prev, _MM_SHUFFLE(3,3,3,3)); // [P P P P]
    v = _mm_add_epi32(v, _mm_slli_si128(v, 4)); // [A AB BC CD]
    prev = _mm_add_epi32(prev, v); // [PA PAB PBC PCD]
    v = _mm_slli_si128(v, 8); // [0 0 A AB]
    return _mm_add_epi32(prev, v); // [PA PAB PABC PABCD]
}


__m128i delta_zigzag_encode_epi32 (__m128i v, __m128i prev) {
    return zigzag_encode_epi32(delta_encode_epi32(v, prev));
}


__m128i delta_zigzag_decode_epi32 (__m128i v, __m128i prev) {
    return delta_decode_epi32(zigzag_decode_epi32(v), prev);
}

注意:Delta 编码会更快(往返/解码)在编码时转置元素,然后在解码期间将它们再次转回;水平前缀和真的很慢.然而,确定每批转置的最佳元素数量似乎是一个难题.

Note: Delta coding would be faster (round-trip/decoding) to transpose the elements while encoding then transpose them back again during decoding; horizontal prefix sums are really slow. However, determining the optimum number of elements to transpose in each batch seems like a hard problem.

这篇关于SIMD:位包有符号整数的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆