How to use vindex and scale with _mm_i32gather_epi32 to gather elements?

Question

Intel's intrinsics guide says:

__m128i _mm_i32gather_epi32 (int const* base_addr, __m128i vindex, const int scale)

and:

Description

Gather 32-bit integers from memory using 32-bit indices. 32-bit elements are loaded from addresses starting at base_addr and offset by each 32-bit element in vindex (each index is scaled by the factor in scale). Gathered elements are merged into dst. scale should be 1, 2, 4 or 8.

Operation

FOR j := 0 to 3
  i := j*32
  dst[i+31:i] := MEM[base_addr + SignExtend(vindex[i+31:i])*scale]
ENDFOR
dst[MAX:128] := 0
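
The pseudocode above can be read as plain byte arithmetic. The following is a minimal scalar sketch of it, not Intel's implementation: the function name `gather_epi32_model` is hypothetical, the four lanes are passed as ordinary arrays, and the zeroing of `dst[MAX:128]` is not modeled.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Hypothetical scalar model of the documented operation:
//   dst[j] = MEM[base_addr + SignExtend(vindex[j]) * scale]   for j = 0..3
// Note that the offset is applied to a byte address, not to an int pointer.
void gather_epi32_model(const void* base_addr,
                        const std::int32_t vindex[4],
                        int scale,
                        std::int32_t dst[4]) {
    assert(scale == 1 || scale == 2 || scale == 4 || scale == 8);
    const unsigned char* base = static_cast<const unsigned char*>(base_addr);
    for (int j = 0; j < 4; ++j) {
        // Sign-extend the 32-bit index, then scale it to get a byte offset.
        const std::int64_t byte_offset =
            static_cast<std::int64_t>(vindex[j]) * scale;
        // memcpy performs a possibly unaligned 4-byte load without aliasing issues.
        std::memcpy(&dst[j], base + byte_offset, sizeof dst[j]);
    }
}
```

In this model, `scale` is the only thing that converts an index into a byte offset; nothing multiplies by `sizeof(int)` implicitly.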

If I am reading this correctly, vindex (together with scale) holds the indices into base_addr that are used to build the __m128i result.

Below I am trying to create val = arr[1] << 96 | arr[5] << 64 | arr[9] << 32 | arr[13] << 0. That is, starting at index 1, take every 4th element.

$ cat gather.cxx
#include <immintrin.h>
typedef unsigned int u32;
int main(int argc, char* argv[])
{
        u32 arr[16] = {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15};
        __m128i idx = _mm_set_epi32(1,5,9,13);
        __m128i val = _mm_i32gather_epi32(arr, idx, 1);
        return 0;
}

But when I examine val:

(gdb) n
6               __m128i idx = _mm_set_epi32(1,5,9,13);
(gdb) n
7               __m128i val = _mm_i32gather_epi32(arr, idx, 1);
(gdb) n
8               return 0;
(gdb) p val
$1 = {0x300000004000000, 0x100000002000000}

It appears I am using vindex incorrectly: I seem to be selecting indices 1, 2, 3, 4.

How do I use vindex and scale to select array indices 1, 5, 9, 13?

Answer

Your array elements are 4 bytes wide, so use a scale factor of 4 in the VSIB addressing mode when vindex holds element indices rather than byte offsets.
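
As a sketch of that fix, the function below performs the gather from the question with scale = 4. The wrapper name `gather_stride4` and the scalar fallback are assumptions added here so the snippet still compiles when AVX2 is not enabled; the fallback only mirrors what the intrinsic path is expected to produce.

```cpp
#include <cassert>
#include <cstdint>
#if defined(__AVX2__)
#include <immintrin.h>
#endif

// Hypothetical wrapper: gathers arr[13], arr[9], arr[5], arr[1] into out[0..3].
// _mm_set_epi32 takes its arguments from the highest element down, so the
// index vector (1, 5, 9, 13) puts 13 in element 0 and 1 in element 3.
void gather_stride4(const std::uint32_t* arr, std::int32_t out[4]) {
    assert(arr != nullptr && out != nullptr);
#if defined(__AVX2__)
    const __m128i idx = _mm_set_epi32(1, 5, 9, 13);
    // scale = 4: each index is an element index into an array of 4-byte ints.
    const __m128i val =
        _mm_i32gather_epi32(reinterpret_cast<const int*>(arr), idx, 4);
    _mm_storeu_si128(reinterpret_cast<__m128i*>(out), val);
#else
    // Assumed scalar fallback for builds without AVX2 support.
    static const int idx[4] = {13, 9, 5, 1};
    for (int j = 0; j < 4; ++j) {
        out[j] = static_cast<std::int32_t>(arr[idx[j]]);
    }
#endif
}
```

With GCC or Clang, building with `-mavx2` selects the intrinsic path. Storing the vector to an `int` array also gives a way to inspect the four elements without reading them from a register.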

The base_addr parameter is declared as int const*, but no C pointer arithmetic is ever done with it. It is fed directly to the asm instruction, so you have to account for byte offsets yourself. (You should also take care of strict aliasing if you want to grab dwords out of a uint64_t[] or char[].) The parameter could just as well have been a const void*.

If the intrinsic multiplied your scale factor by 4 for you, you could only use it with int indices, never with byte offsets. The asm instruction can scale by 1, 2, 4, or 8, using the usual x86 addressing-mode encoding: a 2-bit shift count.
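
To illustrate why byte offsets are worth keeping available, here is a small sketch with a made-up packed layout: 5-byte records consisting of one tag byte followed by a 4-byte value. Neither the layout nor the helper names come from the question. Because 5 is not an encodable scale, the indices have to be byte offsets used with scale = 1. The loads are modeled with memcpy rather than the intrinsic.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Hypothetical packed layout: 1 tag byte followed by a 4-byte value.
constexpr int kRecordSize = 5;

// Byte offset of record i's value field from the start of the buffer.
// These are the numbers one would place in vindex with scale = 1.
std::int32_t value_byte_offset(std::int32_t i) {
    assert(i >= 0);
    return i * kRecordSize + 1;
}

// Scalar stand-in for one gather lane with scale = 1:
// load 4 bytes starting at base + byte_offset.
std::int32_t load_lane(const void* base, std::int32_t byte_offset) {
    std::int32_t v;
    std::memcpy(&v, static_cast<const unsigned char*>(base) + byte_offset,
                sizeof v);
    return v;
}
```

For records 0 through 3 the offsets are 1, 6, 11 and 16, none of which is a multiple of 4, so an interface that always scaled by the element size could not express them.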

With scale = 1, your strided indices (stride 4, starting at 1) are treated as byte offsets, so each load starts 1 byte past the start of an array element. Because x86 is little-endian, each gathered dword is zero everywhere except its high byte.

Notice that you didn't get 1, 2, 3, 4; you got 1<<24, 2<<24, and so on. Printing the vector as 64-bit integers makes that harder to spot.
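
The byte-level reasoning can be checked with a small sketch that assembles a dword the way a little-endian load would. The helper names `le_dword_at` and `fill_le_image` are hypothetical; the 64-byte image is simply the question's array written out byte by byte.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Value a little-endian 32-bit load produces from the four bytes starting at
// bytes + offset: the lowest address supplies the least significant byte.
std::uint32_t le_dword_at(const unsigned char* bytes, std::size_t offset) {
    assert(bytes != nullptr);
    return  static_cast<std::uint32_t>(bytes[offset])
         | (static_cast<std::uint32_t>(bytes[offset + 1]) << 8)
         | (static_cast<std::uint32_t>(bytes[offset + 2]) << 16)
         | (static_cast<std::uint32_t>(bytes[offset + 3]) << 24);
}

// Byte image of u32 arr[16] = {0, 1, ..., 15} as stored on a little-endian
// machine: element i occupies bytes 4*i .. 4*i+3, low byte first.
void fill_le_image(unsigned char image[64]) {
    for (int i = 0; i < 16; ++i) {
        image[4 * i + 0] = static_cast<unsigned char>(i);
        image[4 * i + 1] = 0;
        image[4 * i + 2] = 0;
        image[4 * i + 3] = 0;
    }
}
```

Reading at byte offset 1 picks up the three zero high bytes of arr[0] followed by the low byte of arr[1], which is why the value 1 ends up in the top byte of the result (0x01000000). Offsets 5, 9 and 13 do the same with arr[2], arr[3] and arr[4].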

After changing scale from 1 to 4 in the source, the gather is an identity mapping: since arr[i] == i here, the gathered values equal the indices:

(gdb) p  $xmm7.v4_int32
$2 = {13, 9, 5, 1}

I'm not sure whether GDB has a convenient way to print the elements of a __m128i variable without knowing which register it is in.
