How to speed up this histogram of LUT lookups?

Question

First, I have an array int a[1000][1000]. All these integers are between 0 and 32767, and they are known constants: they never change during a run of the program.

Second, I have an array b[32768], which contains integers between 0 and 32. I use this array to map all arrays in a to 32 bins:

int bins[32]{};
for (auto e : a[i])//mapping a[i] to 32 bins.
    bins[b[e]]++;

Each time, array b is initialized with a new mapping, and I need to hash all 1000 arrays in a (each containing 1000 ints) into 1000 arrays of 32 ints, each counting how many values fall into each bin.

int new_array[32768] = {some new mapping};
copy(begin(new_array), end(new_array), begin(b)); // reload array b

int bins[1000][32]{}; // output array to store results
for (int i = 0; i < 1000; i++)
    for (auto e : a[i]) // hashing a[i] into 32 bins
        bins[i][b[e]]++;

I can map 1000*1000 values in 0.00237 seconds. Is there any other way that I can speed up my code? (Like SIMD?) This piece of code is the bottleneck of my program.

Answer

This is essentially a histogram problem. You're mapping 16-bit values to 5-bit values with a 16-bit-indexed lookup table, but after that it's just histogramming the LUT results. See below for more about histograms.

First of all, you can use the smallest possible data-types to increase the density of your LUT (and of the original data). On x86, a zero or sign-extending load of 8-bit or 16-bit data into a register is almost exactly the same cost as a regular 32-bit int load (assuming both hit in cache), and an 8-bit or 16-bit store is also just as cheap as a 32-bit store.
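A quick back-of-the-envelope on the footprint (assuming the usual 4-byte int; the constant names are illustrative): narrowing the LUT shrinks it from 128 KiB, which by itself overflows a 32 KiB L1d, to 32 KiB, and narrowing the data array halves it.

```cpp
#include <cstddef>
#include <cstdint>

// Cache footprint of the original int layout vs. narrowed types
// (assumes a 4-byte int, as on all mainstream x86 ABIs):
constexpr std::size_t lut_bytes_int  = sizeof(int32_t)  * 32768;       // 128 KiB
constexpr std::size_t lut_bytes_u8   = sizeof(uint8_t)  * 32768;       //  32 KiB
constexpr std::size_t data_bytes_int = sizeof(int32_t)  * 1000 * 1000; // ~3.8 MiB
constexpr std::size_t data_bytes_u16 = sizeof(uint16_t) * 1000 * 1000; // half that
```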

Since your data size exceeds L1 d-cache size (32kiB for all recent Intel designs), and you access it in a scattered pattern, you have a lot to gain from shrinking your cache footprint. (For more x86 perf info, see the x86 tag wiki, especially Agner Fog's stuff).

Since a has less than 65536 entries in each plane, your bin counts will never overflow a 16-bit counter, so bins can be uint16_t as well.

Your copy() makes no sense. Why are you copying into b[32768] instead of having your inner loop use a pointer to the current LUT? You use it read-only. The only reason you'd still want to copy is to convert from int to uint8_t, if you can't change the code that produces the different LUTs to produce int8_t or uint8_t in the first place.
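A minimal sketch of the pointer approach (current_lut/set_lut are made-up names, not from the question's code):

```cpp
#include <cstdint>

// Point at whichever LUT is current instead of copying 32768 entries
// into a fixed b[] each time.
const uint8_t *current_lut = nullptr;

void set_lut(const uint8_t *new_lut) {
    current_lut = new_lut;  // O(1) retarget; the histogram loop then
}                           // reads current_lut[e] instead of b[e]
```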

This version takes advantage of those ideas and a few histogram tricks, and compiles to asm that looks good (Godbolt compiler explorer: gcc6.2 -O3 -march=haswell (AVX2)):

// untested
//#include <algorithm>
#include <stdint.h>

const int PLANES = 1000;
void use_bins(uint16_t bins[PLANES][32]);   // pass the result to an extern function so it doesn't optimize away

                       // 65536 or higher triggers the static_assert
alignas(64) static uint16_t a[PLANES][1000];  // static/global, I guess?

void lut_and_histogram(uint8_t __restrict__ lut[32768])
{

    alignas(16) uint16_t bins[PLANES][32];  // don't zero the whole thing up front: that would evict more data from cache than necessary
                                            // Better would be zeroing the relevant plane of each bin right before using.
                                            // you pay the rep stosq startup overhead more times, though.

    for (int i = 0; i < PLANES;i++) {
        alignas(16) uint16_t tmpbins[4][32] = {0};
        constexpr int a_elems = sizeof(a[0])/sizeof(uint16_t);
        static_assert(a_elems > 1, "someone changed a[] into a* and forgot to update this code");
        static_assert(a_elems <= UINT16_MAX, "bins could overflow");
        const uint16_t *ai = a[i];
        for (int j = 0 ; j<a_elems ; j+=4) { //hashing a[i] to 32 bins.
            // Unrolling to separate bin arrays reduces serial dependencies
            // to avoid bottlenecks when the same bin is used repeatedly.
            // This has to be balanced against using too much L1 cache for the bins.

            // TODO: load a vector of data from ai[j] and unpack it with pextrw.
            // even just loading a uint64_t and unpacking it to 4 uint16_t would help.
            tmpbins[0][ lut[ai[j+0]] ]++;
            tmpbins[1][ lut[ai[j+1]] ]++;
            tmpbins[2][ lut[ai[j+2]] ]++;
            tmpbins[3][ lut[ai[j+3]] ]++;
            static_assert(a_elems % 4 == 0, "unroll factor doesn't divide a element count");
        }

        // TODO: do multiple a[i] in parallel instead of slicing up a single run.
        for (int k = 0 ; k<32 ; k++) {
            // gcc does auto-vectorize this with a short fully-unrolled VMOVDQA / VPADDW x3
            bins[i][k] = tmpbins[0][k] + tmpbins[1][k] +
                         tmpbins[2][k] + tmpbins[3][k];
        }
    }

    // do something with bins.  An extern function stops it from optimizing away.
    use_bins(bins);
}

The inner loop asm looks like this:

.L2:
    movzx   ecx, WORD PTR [rdx]
        add     rdx, 8                    # pointer increment over ai[]
    movzx   ecx, BYTE PTR [rsi+rcx]
    add     WORD PTR [rbp-64272+rcx*2], 1     # memory-destination increment of a histogram element
    movzx   ecx, WORD PTR [rdx-6]
    movzx   ecx, BYTE PTR [rsi+rcx]
    add     WORD PTR [rbp-64208+rcx*2], 1
    ... repeated twice more

With those 32-bit offsets from rbp (instead of 8-bit offsets from rsp, or using another register :/) the code density isn't wonderful. Still, the average instruction length isn't so long that it's likely to bottleneck on instruction decode on any modern CPU.

Since you need to do multiple histograms anyway, just do 4 to 8 of them in parallel instead of slicing the bins for a single histogram. The unroll factor doesn't even have to be a power of 2.

That eliminates the need for the bins[i][k] = sum(tmpbins[0..3][k]) loop over k at the end.
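As a sketch of that idea (untested, with small hypothetical sizes standing in for 1000x1000), processing 4 planes per outer iteration gives each plane its own bin row, so increments for different planes never touch the same counter and there is nothing to sum at the end:

```cpp
#include <cstdint>
#include <cstring>

constexpr int PLANES = 8;   // stand-in for 1000 (a multiple of 4 here)
constexpr int N      = 16;  // stand-in for 1000 elements per plane

// Hypothetical helper: 4 planes per outer iteration, one bin row per plane.
// The bins[i][k] = sum(tmpbins[0..3][k]) pass from the earlier version
// disappears, since no two planes share a counter.
void histogram_4planes(const uint16_t a[PLANES][N], const uint8_t *lut,
                       uint16_t bins[PLANES][32])
{
    for (int i = 0; i < PLANES; i += 4) {
        std::memset(bins[i], 0, 4 * 32 * sizeof(uint16_t)); // zero just before use
        for (int j = 0; j < N; j++) {
            bins[i+0][ lut[a[i+0][j]] ]++;  // 4 independent increment chains
            bins[i+1][ lut[a[i+1][j]] ]++;
            bins[i+2][ lut[a[i+2][j]] ]++;
            bins[i+3][ lut[a[i+3][j]] ]++;
        }
    }
}
```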

Zero bins[i..i+unroll_factor][0..31] right before use, instead of zeroing the whole thing outside the loop. That way all the bins will be hot in L1 cache when you start, and this work can overlap with the more load-heavy work of the inner loop.

Hardware prefetchers can keep track of multiple sequential streams, so don't worry about having a lot more cache misses in loading from a. (Also use vector loads for this, and slice them up after loading).

Other questions with useful answers about histograms:

  • Methods to vectorise histogram in SIMD? suggests the multiple-bin-arrays and sum at the end trick.
  • Optimizing SIMD histogram calculation x86 asm loading a vector of a values and extracting to integer registers with pextrb. (In your code, you'd use pextrw / _mm_extract_epi16). With all the load/store uops happening, doing a vector load and using ALU ops to unpack makes sense. With good L1 hit rates, memory uop throughput may be the bottleneck, not memory / cache latency.
  • How to optimize histogram statistics with neon intrinsics? some of the same ideas: multiple copies of the bins array. It also has an ARM-specific suggestion for doing address calculations in a SIMD vector (ARM can get two scalars from a vector in a single instruction), and laying out the multiple-bins array the opposite way.

If you're going to run this on Intel Skylake, you could maybe even do the LUT lookups with AVX2 gather instructions. (On Broadwell, it's probably a break-even, and on Haswell it would lose; they support vpgatherdd (_mm_i32gather_epi32), but don't have as efficient an implementation. Hopefully Skylake avoids hitting the same cache line multiple times when there is overlap between elements).

And yes, you can still gather from an array of uint16_t (with scale factor = 2), even though the smallest gather granularity is 32-bit elements. It means you get garbage in the high half of each 32-bit vector element instead of 0, but that shouldn't matter. Cache-line splits aren't ideal, since we're probably bottlenecked on cache throughput.

Garbage in the high half of gathered elements doesn't matter because you're extracting only the useful 16 bits anyway with pextrw. (And doing the histogram part of the process with scalar code).
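To illustrate that layout, here is a scalar, little-endian emulation of one lane of a scale-2 gather (the helper name is made up, and a real gather of the last element would read past the end of the table in the same way):

```cpp
#include <cstdint>
#include <cstring>

// Scalar emulation (little-endian, as on x86) of one lane of a 32-bit
// gather with scale = 2 from a uint16_t table: the wanted entry lands in
// the low 16 bits, and the *next* table entry shows up as garbage in the
// high 16 bits.
uint32_t gather32_scale2(const uint16_t *table, int idx) {
    uint32_t dword;  // reads 4 bytes at byte offset 2*idx, so idx must
                     // not point at the last element of the table
    std::memcpy(&dword,
                reinterpret_cast<const unsigned char *>(table) + 2 * idx,
                sizeof(dword));
    return dword;
}
```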

You could potentially use another gather to load from the histogram bins, as long as each element comes from a separate slice/plane of histogram bins. Otherwise, if two elements come from the same bin, it would only be incremented by one when you manually scatter the incremented vector back into the histogram (with scalar stores). This kind of conflict detection for scatter stores is why AVX512CD exists. AVX512 does have scatter instructions, as well as gather (already added in AVX2).

AVX512

See page 50 of Kirill Yukhin's slides from 2014 for an example loop that retries until there are no conflicts; but it doesn't show how get_conflict_free_subset() is implemented in terms of __m512i _mm512_conflict_epi32 (__m512i a) (vpconflictd) (which returns a bitmap in each element of all the preceding elements it conflicts with). As @Mysticial points out, a simple implementation is less simple than it would be if the conflict-detect instruction simply produced a mask-register result, instead of another vector.

I searched for but didn't find an Intel-published tutorial/guide on using AVX512CD, but presumably they think using _mm512_lzcnt_epi32 (vplzcntd) on the result of vpconflictd is useful for some cases, because it's also part of AVX512CD.

Maybe you're "supposed" to do something more clever than just skipping all elements that have any conflicts? Maybe to detect a case where a scalar fallback would be better, e.g. all 16 dword elements have the same index? vpbroadcastmw2d broadcasts a mask register to all 32-bit elements of the result, so that lets you line up a mask-register value with the bitmaps in each element from vpconflictd. (And there are already compare, bitwise, and other operations between elements from AVX512F).

Kirill's slides list VPTESTNM{D,Q} (from AVX512F) along with the conflict-detection instructions. It generates a mask from DEST[j] = (SRC1[i+31:i] BITWISE AND SRC2[i+31:i] == 0)? 1 : 0. i.e. AND elements together, and set the mask result for that element to 1 if they don't intersect.
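The vpconflictd semantics described above can be modeled in scalar code (a sketch of the semantics only, not of AVX-512 performance): element i gets a bitmap with bit j set for every preceding element j that has the same index, and a zero result means the lane is conflict-free.

```cpp
#include <cstdint>

// Scalar model of _mm512_conflict_epi32 (vpconflictd) on 16 dword indices.
void conflict_epi32(const uint32_t idx[16], uint32_t out[16]) {
    for (int i = 0; i < 16; i++) {
        out[i] = 0;
        for (int j = 0; j < i; j++)        // only less-significant elements
            if (idx[j] == idx[i])
                out[i] |= 1u << j;         // bit j: conflicts with element j
    }
}
```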

Possibly also relevant: http://colfaxresearch.com/knl-avx512/ says "For a practical illustration, we construct and optimize a micro-kernel for binning particles", with some code for AVX2 (I think). But it's behind a free registration which I haven't done. Based on the diagram, I think they're doing the actual scatter part as scalar, after some vectorized stuff to produce the data they want to scatter. The first link says the 2nd link is "for previous instruction sets".
