Extract the low bit of each bool byte in a __m128i? bool array to packed bitmap

Question

(Editor's note: this question was originally "How should one access the m128i_i8 member, or members in general, of the __m128i object?", an attempt to use an MSVC-specific method on GCC's definition of __m128i. But this was an XY problem, and the accepted answer addresses the underlying problem. Another answer does answer the question as originally asked.)

I realize that Microsoft advises against directly accessing the members of these objects, but I need to set them and the documentation is sorely lacking.

I keep getting the error "request for member ‘m128i_i8’ in ‘(my var name)’, which is of non-class type ‘wirelabel {aka __vector(2) long long int}’", which I don't understand, because I've included all the correct headers and the compiler does recognize __m128i variables.

Note 1: wirelabel is a typedef for __m128i, i.e. a header contains

typedef __m128i wirelabel;

Note 2: The reason for Note 1 is explained in this other question: tbb::cache_aligned_allocator: Getting "request for member...which is of non-class type" with __m128i. User error or bug?

Note 3: I'm using the g++ compiler.

Note 4: The following question doesn't answer mine, but it does discuss related information: Why should you not access the __m128i fields directly?

I also know that there is a _mm_set_epi8 function, but it requires setting all sixteen 8-bit elements at once, and that is not an option for me currently.

I was asked for more specifics on why I think I need to access each of the sixteen 8-bit parts of the __m128i object. Here is why: I have a bool array of size 'n*128' (n is a size_t), and I need to store these in an array of 'wirelabel' of size 'n'.

Now, because wirelabel is just an alias/typedef (correct me if there is a difference) for __m128i, each of the 'n' groups of 128 bools can be stored in one element of the 'wirelabel' array.

However, in order to do this I believe I need to convert every 8 bits into its signed equivalent and store it at the correct 8-bit index in each 'wirelabel' in the array.

Accepted answer

So your source data is contiguous? You should use _mm_load_si128 instead of messing around with scalar components of vector types.

Your real problem is packing an array of bool (1 byte per element in the ABI used by g++ on x86) into a bitmap. You should do this with SIMD, not with scalar code that sets 1 bit or byte at a time.
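For reference, the packing described here can be written as a plain scalar loop. This is only a baseline sketch to pin down the bit order (element k of each group of 16 lands in bit k, which is the order pmovmskb produces). The name pack_bools_scalar is hypothetical and does not come from the original answer.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical scalar baseline: one bit per bool, 16 bools per uint16_t.
// Element k of each 16-element group goes to bit k of the output word.
void pack_bools_scalar(std::uint16_t *dst, const bool *src, std::size_t n)
{
    for (std::size_t i = 0; i < n; ++i) {
        std::uint16_t word = 0;
        for (std::size_t k = 0; k < 16; ++k) {
            if (src[i * 16 + k]) {
                word = static_cast<std::uint16_t>(word | (1u << k));
            }
        }
        dst[i] = word;
    }
}
```

A loop like this is mainly useful as a correctness check for the SIMD version, since both should produce identical output words.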

pmovmskb (_mm_movemask_epi8) is fantastic for extracting one bit per byte of input. You just need to arrange to get the bit you want into the high bit of each byte.

The obvious choice would be a shift, but vector shift instructions compete for the same execution port as pmovmskb on Haswell (port 0). (See http://agner.org/optimize/.) Instead, adding 0x7F will produce 0x80 (high bit set) for an input of 1, but 0x7F (high bit clear) for an input of 0. (And a bool in the x86-64 System V ABI must be stored in memory as an integer 0 or 1, not simply 0 vs. any non-zero value.)
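The byte arithmetic behind the add trick can be checked in isolation. The sketch below models a single byte lane of paddb with an ordinary 8-bit wrapping add; high_bit_after_add is a hypothetical helper used only for illustration.

```cpp
#include <cstdint>

// Models what paddb does to a single byte: 8-bit wrapping add of 0x7F,
// then reports the resulting high bit (the bit pmovmskb would collect).
constexpr bool high_bit_after_add(std::uint8_t b)
{
    return (static_cast<std::uint8_t>(b + 0x7F) & 0x80) != 0;
}

static_assert(!high_bit_after_add(0), "0 + 0x7F = 0x7F, high bit clear");
static_assert(high_bit_after_add(1), "1 + 0x7F = 0x80, high bit set");
```

The trick relies on the ABI guarantee mentioned above that each bool is exactly 0 or 1. A byte of 0x81 or more would wrap around and come out with the high bit clear.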

Why not pcmpeqb against _mm_set1_epi8(1)? Skylake runs pcmpeqb on ports 0/1, but paddb on all 3 vector ALU ports (0/1/5). It's very common to use pmovmskb on the result of pcmpeqb/w/d/q, though.
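For comparison only, the pcmpeqb alternative could be sketched as follows. The function name pack_bools_cmpeq is hypothetical. The answer's own paddb version, which it prefers for the port-pressure reasons above, comes next.

```cpp
#include <immintrin.h>
#include <cstddef>
#include <cstdint>

// Hypothetical variant: compare each byte against 1 instead of adding 0x7F.
// pcmpeqb yields 0xFF for bytes equal to 1 and 0x00 otherwise,
// so pmovmskb still collects exactly one bit per bool.
void pack_bools_cmpeq(std::uint16_t *dst, const bool *src, std::size_t n)
{
    const __m128i ones = _mm_set1_epi8(1);
    for (std::size_t i = 0; i < n; ++i) {
        __m128i boolvec = _mm_loadu_si128(
            reinterpret_cast<const __m128i *>(src + i * 16));
        __m128i is_true = _mm_cmpeq_epi8(boolvec, ones);
        dst[i] = static_cast<std::uint16_t>(_mm_movemask_epi8(is_true));
    }
}
```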

#include <immintrin.h>
#include <stddef.h>
#include <stdint.h>

// n is the number of uint16_t dst elements.
// We read n*16 bool elements from src, each stored as 0 or 1.
void pack_bools(uint16_t *dst, const bool *src, size_t n)
{
    // you can later access dst with __m128i loads/stores

    // Adding 0x7F carries a bool's low bit into bit 7 of its byte.
    const __m128i carry_to_highbit = _mm_set1_epi8(0x7F);
    for (size_t i = 0 ; i < n ; i+=1) {
        __m128i boolvec = _mm_loadu_si128( (const __m128i*)&src[i*16] );
        __m128i highbits = _mm_add_epi8(boolvec, carry_to_highbit);
        // pmovmskb gathers the 16 high bits into the low 16 bits of an int
        dst[i] = (uint16_t)_mm_movemask_epi8(highbits);
    }
}

Because we want to use scalar stores when writing this bitmap, we want dst to be uint16_t for strict-aliasing reasons. With AVX2, you'd want uint32_t. (The same goes if you did combine = tmp1 << 16 | tmp to combine two pmovmskb results, but probably don't do that.)
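To connect this back to the question's layout of 128 bools per wirelabel, one possible sketch packs eight 16-bit masks and reloads them as a single __m128i. The helper name pack_128_bools and the temporary buffer are assumptions for illustration, not part of the original answer.

```cpp
#include <immintrin.h>
#include <cstddef>
#include <cstdint>

// Hypothetical helper: packs 128 bools (each stored as 0 or 1) into one
// __m128i. Bit k of 16-bit element j comes from src[j*16 + k].
__m128i pack_128_bools(const bool *src)
{
    const __m128i carry_to_highbit = _mm_set1_epi8(0x7F);
    alignas(16) std::uint16_t words[8];
    for (std::size_t j = 0; j < 8; ++j) {
        __m128i boolvec = _mm_loadu_si128(
            reinterpret_cast<const __m128i *>(src + j * 16));
        __m128i highbits = _mm_add_epi8(boolvec, carry_to_highbit);
        words[j] = static_cast<std::uint16_t>(_mm_movemask_epi8(highbits));
    }
    // Reload the eight 16-bit masks as a single 128-bit vector.
    return _mm_load_si128(reinterpret_cast<const __m128i *>(words));
}
```

In practice it is probably simpler to run pack_bools over the whole input with n*8 output words and then read the result with __m128i loads, as the comment in the answer's code suggests. Reloading right after narrow stores, as this sketch does, is likely to cost a store-forwarding stall.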

This compiles into an asm loop like this (with gcc7.3 -O3, on the Godbolt compiler explorer):

.L3:
    movdqu  xmm0, XMMWORD PTR [rsi]
    add     rsi, 16
    add     rdi, 2
    paddb   xmm0, xmm1
    pmovmskb        eax, xmm0
    mov     WORD PTR [rdi-2], ax
    cmp     rdx, rsi
    jne     .L3

So it's not wonderful (7 fused-domain uops -> front-end bottleneck at 16 bools per ~1.75 clock cycles). Clang unrolls by 2, and should manage 16 bools per 1.5 cycles.

Using a shift (pslld xmm0, 7) would only run at one iteration per 2 cycles on Haswell, bottlenecked on port 0.
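For completeness, the shift-based variant discussed above might look like the sketch below; pack_bools_shift is a hypothetical name. pslld shifts within 32-bit lanes, so bits do cross byte boundaries, but bit 0 of every byte still ends up in bit 7 of the same byte, which is all that pmovmskb reads.

```cpp
#include <immintrin.h>
#include <cstddef>
#include <cstdint>

// Hypothetical variant: move each byte's low bit to its high bit with a
// 32-bit lane shift (pslld by 7), then collect the high bits with pmovmskb.
void pack_bools_shift(std::uint16_t *dst, const bool *src, std::size_t n)
{
    for (std::size_t i = 0; i < n; ++i) {
        __m128i boolvec = _mm_loadu_si128(
            reinterpret_cast<const __m128i *>(src + i * 16));
        __m128i highbits = _mm_slli_epi32(boolvec, 7);
        dst[i] = static_cast<std::uint16_t>(_mm_movemask_epi8(highbits));
    }
}
```

This gives the same result as the paddb version for 0/1 inputs. The difference is purely in throughput on CPUs where the shift and pmovmskb share a port.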
