Find NaN in array of doubles using SIMD

Question

This question is very similar to:

SIMD instructions for a floating point equality comparison (with NaN == NaN)

Although that question focused on 128 bit vectors and had requirements about identifying +0 and -0.

I had a feeling I might be able to get this one myself but the intel intrinsics guide page seems to be down :/

My goal is to take an array of doubles and to return whether a NaN is present in the array. I am expecting that the majority of the time that there won't be one, and would like that route to have the best performance.

Initially I was going to compare 4 doubles to themselves, mirroring the non-SIMD approach for NaN detection (i.e. NaN is the only value for which a != a is true). Something like:

double *data = ...;
__m256d a, b;
int temp = 0;

// This bit would be in a loop over the array
// I'd probably put a sentinel in and loop while !temp
a = _mm256_loadu_pd(data);
b = _mm256_cmp_pd(a, a, _CMP_NEQ_UQ);
temp = temp | _mm256_movemask_pd(b);

However, in some of the examples of comparison it looks like there is some sort of NaN detection already going on in addition to the comparison itself. I briefly thought, well if something like _CMP_EQ_UQ will detect NaNs, I can just use that, compare 4 doubles to 4 doubles, and magically look at 8 doubles at once.

__m256d a, b, c;
a = _mm256_loadu_pd(data);
b = _mm256_loadu_pd(data+4);
c = _mm256_cmp_pd(a, b, _CMP_EQ_UQ);

At this point I realized I wasn't quite thinking straight because I might happen to compare a number to itself that is not a NaN (i.e. 3 == 3) and get a hit that way.

So my question is, is comparing 4 doubles to themselves (as done above) the best I can do or is there some other better approach to finding out whether my array has a NaN?

Answer

You might be able to avoid this entirely by checking fenv status, or if not then cache block it and/or fold it into another pass over the same data, because it's very low computational intensity (work per byte loaded/stored), so it easily bottlenecks on memory bandwidth. See below.

The compare predicate you are looking for is _CMP_UNORD_Q or _CMP_ORD_Q, to tell you that the comparison is unordered or ordered, i.e. that at least one operand is NaN, or that both operands are non-NaN, respectively. See also: What does ordered / unordered comparison mean?

The asm docs for cmppd list the predicates and have equal or better details than the intrinsics guide.
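
In scalar terms, the two predicates can be modelled like this (my own illustration, not from the original answer; `isnan` is from `<math.h>`):

```c
#include <math.h>

/* Scalar models of the two vector compare predicates:
 * _CMP_UNORD_Q is true when at least one operand is NaN;
 * _CMP_ORD_Q   is true when both operands are non-NaN. */
static int cmp_unord(double a, double b) { return isnan(a) || isnan(b); }
static int cmp_ord(double a, double b)   { return !isnan(a) && !isnan(b); }
```

So `_mm256_cmp_pd(v, v, _CMP_UNORD_Q)` sets a lane to all-ones exactly when that lane of `v` is NaN.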

So yes, if you expect NaN to be rare and want to quickly scan through lots of non-NaN values, you can vcmppd two different vectors against each other. If you cared about where the NaN was, you could do extra work to sort that out once you know that there is at least one in either of two input vectors. (Like _mm256_cmp_pd(a,a, _CMP_UNORD_Q) to feed movemask + bitscan for lowest set bit.)

Like with other SSE/AVX search loops, you can also amortize the movemask cost by combining a few compare results with _mm256_or_pd (find any unordered) or _mm256_and_pd (check for all ordered). E.g. check a couple cache lines (4x _mm256d with 2x _mm256_cmp_pd) per movemask / test/branch. (glibc's asm memchr and strlen use this trick.) Again, this optimizes for your common case where you expect no early-outs and have to scan the whole array.

Also remember that it's totally fine to check the same element twice, so your cleanup can be simple: a vector that loads up to the end of the array, potentially overlapping with elements you already checked.

// checks 4 vectors = 16 doubles
// non-zero means there was a NaN somewhere in p[0..15]
static inline
int any_nan_block(double *p) {
    __m256d a = _mm256_loadu_pd(p+0);
    __m256d abnan = _mm256_cmp_pd(a, _mm256_loadu_pd(p+ 4), _CMP_UNORD_Q);
    __m256d c = _mm256_loadu_pd(p+8);
    __m256d cdnan = _mm256_cmp_pd(c, _mm256_loadu_pd(p+12), _CMP_UNORD_Q);
    __m256d abcdnan = _mm256_or_pd(abnan, cdnan);
    return _mm256_movemask_pd(abcdnan);
}
// more aggressive ORing is possible but probably not needed
// especially if you expect any memory bottlenecks.

I wrote the C as if it were assembly, one instruction per source line. (load / memory-source cmppd). These 6 instructions are all single-uop in the fused-domain on modern CPUs, if using non-indexed addressing modes on Intel. test/jnz as a break condition would bring it up to 7 uops.

In a loop, an add reg, 16*8 pointer increment is another 1 uop, and cmp / jne as a loop condition is one more, bringing it up to 9 uops. So unfortunately on Skylake this bottlenecks on the front-end at 4 uops / clock, taking at least 9/4 cycles to issue 1 iteration, not quite saturating the load ports. Zen 2 or Ice Lake could sustain 2 loads per clock without any more unrolling or another level of vorpd combining.

Another trick that might be possible is to use vptest or vtestpd on two vectors to check that they're both non-zero. But I'm not sure it's possible to correctly check that every element of both vectors is non-zero. Can PTEST be used to test if two registers are both zero or some other condition? shows that the other way (that _CMP_UNORD_Q inputs are both all-zero) is not possible.

But this wouldn't really help: vtestpd / jcc is 3 uops total, vs. vorpd / vmovmskpd / test+jcc also being 3 fused-domain uops on existing Intel/AMD CPUs with AVX, so it's not even a win for throughput when you're branching on the result. So even if it's possible, it's probably break even, although it might save a bit of code size. And wouldn't be worth considering if it takes more than one branch to sort out the all-zeros or mix_zeros_and_ones cases from the all-ones case.

If your array was the result of computation in this thread, just check the FP exception sticky flags (in MXCSR manually, or via fenv.h fegetexcept) to see if an FP "invalid" exception has happened since you last cleared FP exceptions. If not, I think that means the FPU hasn't produced any NaN outputs and thus there are none in arrays written since then by this thread.

If it is set, you'll have to check; the invalid exception might have been raised for a temporary result that didn't propagate into this array.

If/when fenv flags don't let you avoid the work entirely, or aren't a good strategy for your program, try to fold this check into whatever produced the array, or into the next pass that reads it. So you're reusing data while it's already loaded into vector registers, increasing computational intensity. (ALU work per load/store.)

Even if data is already hot in L1d, it will still bottleneck on load port bandwidth: 2 loads per cmppd still bottlenecks on 2/clock load port bandwidth, on CPUs with 2/clock vcmppd ymm (Skylake but not Haswell).

Also worthwhile to align your pointers to make sure you're getting full load throughput from L1d cache, especially if data is sometimes already hot in L1d.

Or at least cache-block it so you check a 128kiB block before running another loop on that same block while it's hot in cache. That's half the size of 256k L2 so your data should still be hot from the previous pass, and/or hot for the next pass.
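
The blocking structure might look like this (a sketch; `has_nan` and `do_pass` are hypothetical stand-ins for the NaN scan and whatever other loop reads the data):

```c
#include <stddef.h>

enum { BLOCK_DOUBLES = 128 * 1024 / sizeof(double) };  // 128 KiB per block

static size_t block_len(size_t n, size_t i)
{
    return (n - i < (size_t)BLOCK_DOUBLES) ? n - i : (size_t)BLOCK_DOUBLES;
}

// Run the NaN check and the "real" pass block by block, so each block
// is still hot in L2 when the second loop touches it.
void process(double *p, size_t n,
             int (*has_nan)(const double *, size_t),
             void (*do_pass)(double *, size_t))
{
    for (size_t i = 0; i < n; i += BLOCK_DOUBLES) {
        size_t len = block_len(n, i);
        if (has_nan(p + i, len)) {
            /* report / handle the NaN for this block */
        }
        do_pass(p + i, len);
    }
}
```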

Definitely avoid running this over a whole multi-megabyte array and paying the cost of getting it into the CPU core from DRAM or L3 cache, then evicting again before another loop reads it. That's worst case computational intensity, paying the cost of getting it into a CPU core's private cache more than once.
