如何使用SIMD计算数组中一个字节的出现? [英] How can I count the occurrence of a byte in array using SIMD?

查看:67
本文介绍了如何使用SIMD计算数组中一个字节的出现?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

给出以下输入字节:

var vBytes = new Vector<byte>(new byte[] {72, 101, 55, 08, 108, 111, 55, 87, 111, 114, 108, 55, 100, 55, 55, 20});

和给定的蒙版:

var mask = new Vector<byte>(55);

如何在输入数组中找到字节 55 的计数?

How can I find the count of byte 55 in the input array?

我尝试使用 mask vBytes 进行 xoring :

var xored = Vector.Xor(mask, vBytes);

给出:

< 127,82,0,91,91,88,0,96,88,69,91,0,83,0,0,35>

<127, 82, 0, 91, 91, 88, 0, 96, 88, 69, 91, 0, 83, 0, 0, 35>

但是不知道如何从中获得计数.

But don't know how I can get the count from that.

为简单起见,我们假设输入字节的长度始终等于 Vector< byte> .Count .

For the sake of simplicity let's assume that the input byte length is always equal to the size of Vector<byte>.Count.

推荐答案

(以下示例的AVX2 C内部函数实现,以防具体示例有所帮助:

(AVX2 C intrinsics implementation of the below idea, in case a concrete example helps: How to count character occurrences using SIMD)

在asm中,您希望 pcmpeqb 产生一个0或0xFF的向量.视为有符号整数,即0/-1.

In asm, you want pcmpeqb to produce a vector of 0 or 0xFF. Treated as signed integers, that's 0/-1.

然后使用比较结果作为整数值 psubb 将0/1添加到该元素的计数器.(减去-1 =加+1)

Then use the compare-result as integers values with psubb to add 0 / 1 to the counter for that element. (Subtract -1 = add +1)

这可能会在256次迭代后溢出,因此在此之前的某个时间,对 _mm_setzero_si128()使用 psadbw 将这些无符号字节水平求和(无低位)为64位整数(每组8个字节一个64位整数).然后 paddq 累计64位总数.

That can overflows after 256 iterations, so sometime before that, use psadbw against _mm_setzero_si128() to horizontally sum those unsigned bytes (without overlow) into 64-bit integers (one 64-bit integer per group of 8 bytes). Then paddq to accumulate 64-bit totals.

可以使用嵌套循环或在常规展开循环结束时进行溢出之前的累加. psadbw 快速(因为它是视频编码运动搜索的关键构建块),因此仅累加每4个比较,甚至是每1个比较并跳过 psubb .

Accumulating before you overflow can be done with a nested loop, or just at the end of a regular unrolled loop. psadbw is fast (because it's a key building block for video encoding motion-search), so it's not bad to just accumulate every 4 compares, or even every 1 and skip the psubb.

有关x86的更多详细信息,请参见 Agner Fog的优化指南.根据他的指令表, psadbw xmm / vpsadbw ymm 在Skylake上每个时钟周期以1个向量运行,并具有3个周期延迟.(仅1 uop的前端带宽.)上面提到的所有指令也是单uop的,并且在多个端口上运行(因此不必为吞吐量而相互冲突).它们的128位版本仅需要SSE2.

See Agner Fog's optimization guides for more details on x86. According to his instruction tables, psadbw xmm / vpsadbw ymm runs at 1 vector per clock cycle on Skylake, with 3 cycle latency. (Only 1 uop of front-end bandwidth.) All the instructions mentioned above are also single-uop, and run on more than one port (so don't necessarily conflict with each other for throughput). Their 128-bit versions only require SSE2.

如果您一次确实只有一个向量要计数,并且没有遍历内存,那么可能 pcmpeqb / psadbw / pshufd (将高位从一半复制到低位)/ paddd /移动的eax,xmm0 为您提供255 *整数寄存器中的匹配项数.一条额外的向量指令(例如,从零开始减去,或与1进行AND运算,或 pabsb (绝对值)将删除x255比例因子.

If you really only have one vector at a time to count, and aren't looping over memory, then probably pcmpeqb / psadbw / pshufd (copy high half to low) / paddd / movd eax, xmm0 gives you 255 * number of matches in an integer register. One extra vector instruction (like subtract from zero, or AND with 1, or pabsb (absolute value) would remove the x255 scale factor.

IDK如何在C#SIMD中编写代码,但是您肯定不是想要点积!解压缩并转换为FP的速度比上述速度慢大约4倍,这是因为固定宽度的向量比浮点数和 dpps ( _mm_dp_ps )容纳的字节数多4倍不是很快.在Skylake上为4 oups,每1.5周期吞吐量中的一个.如果要做必须对除无符号字节以外的内容进行水平求和,请参见

IDK how to write that in C# SIMD, but you definitely do not want a dot-product! Unpack and convert to FP would be about 4x slower than the above, just from the fact that a fixed-width vector holds 4x more bytes than floats, and dpps (_mm_dp_ps) is not fast. 4 uops, and one per 1.5 cycle throughput on Skylake. If you do have to horizontal-sum something other than unsigned bytes, see Fastest way to do horizontal SSE vector sum (or other reduction) (my answer also include integer).

或者,如果 Vector.Dot pmaddubsw / pmaddwd 用于整数矢量,则可能不那么糟糕,但是可以进行多次与 psadbw 相比,比较结果的每个向量的步进水平和都是不好的,尤其是与字节累加器(有时只是水平和)相比.

Or if Vector.Dot uses pmaddubsw / pmaddwd for integer vectors, then that might not be as bad, but doing a multi-step horizontal sum for each vector of compare results is just bad compared to psadbw, or especially to byte accumulators that you only horizontal sum occasionally.

或者,如果C#使用常数向量 1 优化了任何实际的乘法运算.无论如何,此答案的第一部分是您希望CPU运行的代码.做到这一点,但是您喜欢使用任何源代码来实现它.

Or if C# optimizes out any actual multiplying with a constant vector of 1. Anyway, the first part of this answer is the code you want the CPU to be running. Make that happen however you like using whatever source code gets it to happen.

这篇关于如何使用SIMD计算数组中一个字节的出现?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆