Efficient way to convert scatter indices into gather indices?
Problem description
I'm trying to write a stream compaction (take an array and get rid of empty elements) with SIMD intrinsics. Each iteration of the loop processes 8 elements at a time (SIMD width).
With SSE intrinsics, I can do this fairly efficiently with _mm_shuffle_epi8(), which does a 16 entry table lookup (gather in parallel computing terminology). The shuffle indices are precomputed, and looked up with a bit mask.
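for (i = 0; i < n; i += 8)
{
    // assumes data holds 16-bit elements, so one 128-bit load covers 8 of them
    v8n_Data = _mm_load_si128((const __m128i *)&data[i]);
    // is_valid is a byte array; movemask reads the top bit of each byte, so valid flags must be 0x80..0xFF
    v_Valid = _mm_loadl_epi64((const __m128i *)&is_valid[i]);
    mask = _mm_movemask_epi8(v_Valid) & 0xff;
    v8n_Compacted = _mm_shuffle_epi8(v8n_Data, v16n_ShuffleIndices[mask]);
    _mm_storeu_si128((__m128i *)&compacted[count], v8n_Compacted);
    count += bitCount[mask];
}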
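The precomputed tables that the loop indexes by mask could be built roughly as follows. This is a minimal sketch, not the asker's actual code: `build_compaction_tables` is a hypothetical name, and it assumes 8 elements of 16 bits each per vector, with bit i of the mask corresponding to element i.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: fills the shuffle-index and bit-count tables for
   every possible 8-bit validity mask. Assumes 2-byte elements. */
void build_compaction_tables(uint8_t shuffle[256][16], uint8_t bit_count[256])
{
    assert(shuffle != 0 && bit_count != 0);
    for (int mask = 0; mask < 256; ++mask) {
        int out = 0;                          /* next free output slot */
        for (int elem = 0; elem < 8; ++elem) {
            if (mask & (1 << elem)) {
                /* copy both bytes of the valid element to slot `out` */
                shuffle[mask][2 * out]     = (uint8_t)(2 * elem);
                shuffle[mask][2 * out + 1] = (uint8_t)(2 * elem + 1);
                ++out;
            }
        }
        bit_count[mask] = (uint8_t)out;
        /* an index byte with bit 7 set makes _mm_shuffle_epi8 write zero */
        for (int b = 2 * out; b < 16; ++b)
            shuffle[mask][b] = 0x80;
    }
}
```

Each 16-byte row can then be loaded as one shuffle-control vector. The zeroed tail is harmless because the next iteration's unaligned store overwrites it.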
My problem is now I would like to implement this for Altivec SIMD too (don't ask why - misguided business decision). Altivec doesn't have an equivalent for _mm_movemask_epi8(), a critical ingredient. So, I will need to find a way to either
emulate _mm_movemask_epi8(), which seems expensive (several shifts and ORs), or
directly generate the shuffle indices efficiently: index i will be the index of the i-th valid element in the uncompacted data
element_valid:   0 0 1 0 1 0 0 1 0
gather_indices:  x x x x x x 6 4 1
scatter_indices: 3 3 2 2 1 1 1 0 0
(element 0 is the rightmost column)
It's simple to do this serially, but I need it to be parallel (SIMD). It seems easy to generate scatter indices with a prefix sum, but since neither AltiVec nor SSE has a scatter instruction, I need gather indices instead. Gather indices are the inverse function of the scatter indices, but how can they be obtained in parallel? I know that in the pioneering days of GPU programming, converting scatters to gathers was a common technique, but neither of the two methods described seems practical.
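For reference, the serial version of the relationship between the two index sets might look like the sketch below. It is only a scalar baseline for checking a SIMD implementation, not a parallel solution, and `scatter_to_gather` is a hypothetical name.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical scalar reference. scatter[i] is the exclusive prefix sum of
   valid[], i.e. the output slot element i would be written to. gather[k] is
   the source index of the k-th valid element. Returns the valid count. */
size_t scatter_to_gather(const unsigned char *valid, size_t n,
                         size_t *scatter, size_t *gather)
{
    assert(valid && scatter && gather);
    size_t count = 0;
    for (size_t i = 0; i < n; ++i) {
        scatter[i] = count;           /* number of valid elements before i */
        if (valid[i]) {
            gather[scatter[i]] = i;   /* invert: output slot -> source index */
            ++count;
        }
    }
    return count;
}
```

With the validity flags from the example, taken in index order as 0 1 0 0 1 0 1 0 0, this yields gather indices 1, 4, 6 and a count of 3. Only the first `count` entries of `gather` are written.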
Maybe dropping the requirement that the compaction preserve element order would allow a more efficient implementation? I can give that up.
Recommended answer
If you want to emulate _mm_movemask_epi8 and you just need an 8-bit scalar mask from 8 byte elements, then you can do something like this using AltiVec:
#include <stdio.h>

int main(void)
{
    // constant shift vector
    const vector unsigned char vShift = { 0, 1, 2, 3, 4, 5, 6, 7, 0, 0, 0, 0, 0, 0, 0, 0 };
    // sample input
    vector unsigned char isValid = { 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0 };

    // shift input values
    vector unsigned char v1 = vec_sl(isValid, vShift);

    // sum shifted values
    vector unsigned int v2 = vec_sum4s(v1, (vector unsigned int)(0));
    vector signed int v3 = vec_sum2s((vector signed int)v2, (vector signed int)(0));

    // store sum in scalar
    vector signed int v4 = vec_splat(v3, 1);
    unsigned int mask __attribute__ ((aligned(16)));
    vec_ste((vector unsigned int)v4, 0, &mask);

    printf("v1 = %vu\n", v1);
    printf("v2 = %#vlx\n", v2);
    printf("v3 = %#vlx\n", v3);
    printf("v4 = %#vlx\n", v4);
    printf("mask = %#x\n", mask);
    return 0;
}
This is 5 AltiVec instructions versus 1 in SSE. You might be able to lose the vec_splat and get it down to 4.
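To make the shift-and-sum idea easier to check without PowerPC hardware, here is a portable scalar model of what the vector sequence computes. It is a sketch under stated assumptions rather than a replacement for the AltiVec code: `movemask8_model` is a hypothetical name, and the flags are assumed to be exactly 0 or 1, as in the sample input.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical scalar model of the AltiVec sequence: shift byte i left by i
   (the vec_sl step), then add all eight bytes together (the vec_sum4s and
   vec_sum2s steps). Assumes every flag is exactly 0 or 1. */
unsigned movemask8_model(const uint8_t is_valid[8])
{
    assert(is_valid != 0);
    unsigned mask = 0;
    for (int i = 0; i < 8; ++i) {
        /* per-byte shift: the result is truncated to 8 bits */
        uint8_t shifted = (uint8_t)(is_valid[i] << i);
        mask += shifted;   /* horizontal sum; equals OR when flags are 0/1 */
    }
    return mask;
}
```

Because each 0/1 flag lands on its own bit, the sum behaves like an OR and bit i of the result reflects element i. Flags stored as 0xFF, as a compare instruction or the SSE version would produce, would not work directly: they would first need reducing to 0 or 1, for example by ANDing each byte with 1.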