Micro Optimization of a 4-bucket histogram of a large array or list


Question


I have a special question. I will try to describe this as accurately as possible.

I am doing a very important "micro-optimization": a loop that runs for days at a time. So if I can cut this loop's time in half, 10 days would decrease to only 5 days, etc.

The loop I have now is the function: "testbenchmark1".

I have 4 indexes that I need to increase in a loop like this. But accessing an index in a list actually takes some extra time, as I have noticed. This is what I am trying to see if there is another solution to.

indexes[n]++; //increase correct index

Complete code for "testbenchmark1" which takes 122 ms:

void testbenchmark00()
{
    Random random = new Random();
    List<int> indexers = new List<int>();
    for (int i = 0; i < 9256408; i++)
    {
        indexers.Add(random.Next(0, 4));
    }
    int[] valueLIST = indexers.ToArray();


    Stopwatch stopWatch = new Stopwatch();
    stopWatch.Start();

    int[] indexes = { 0, 0, 0, 0 };
    foreach (int n in valueLIST) //Takes 122 ms
    {
        indexes[n]++; //increase correct index
    }

    stopWatch.Stop();
    MessageBox.Show("stopWatch: " + stopWatch.ElapsedMilliseconds.ToString() + " milliseconds");
}

Now the below "testbenchmark2" code is just experimental and I know it is not correct, but I wonder if there is any similar way to use such kinds of numbers: "1_00_00_00_00", and if it would be possible to see the "00_00_00_00" as four different integer numbers. For example, if I would do a summation of 1_00_00_00_00 + 1_00_01_00_00 = 2_00_01_00_00, then one could in the end extract each of the four numbers like this: 00, 01, 00, 00.

But I don't know if this is possible in any way, even using binary numbers. Yes, any kind of solution that just adds numbers like this. Just as a test, that loop took only 59 ms, which is half the time of 122 ms. So I am interested to see if there is any idea for this?

double num3 = 1_00_00_00_00;
double num4 = 1_00_01_00_00;
for (int i = 0; i < valueLIST.Count; i++) //Takes 59 ms
{
    num3 += num4;
}

Complete code for "testbenchmark2" which takes 59 ms:

void testbenchmark2()
{
    List<String> valueLIST = new List<String>(); 
    for (int i = 0; i < 9256408; i++) //56
    {
        valueLIST.Add(i.ToString());
    }

    //https://www.geeksforgeeks.org/binary-literals-and-digit-separators-in-c-sharp/
    double num3 = 1_00_00_00_00;
    double num4 = 1_00_01_00_00;

    Stopwatch stopWatch = new Stopwatch();
    stopWatch.Start();
    for (int i = 0; i < valueLIST.Count; i++) //Takes 59 ms
    {
        num3 += num4;
    }
    stopWatch.Stop();
    MessageBox.Show("stopWatch: " + stopWatch.ElapsedMilliseconds.ToString() + " milliseconds\n\n" + num3);
}

EDIT
The below is cleaner code showing exactly what I am trying to do!
But the below code will probably not be correct or the solution; it just shows what I am trying to do, I believe.

        void newtest()
        {
            double num1 = 1_00_00_00_00;
            double num2 = 1_00_01_00_00;
            double num3 = 1_00_01_01_00;

            List<double> testnumbers = new List<double>();
            testnumbers.Add(num1);
            testnumbers.Add(num2);
            testnumbers.Add(num3);

            double SUM = 0;
            for (int i = 0; i < testnumbers.Count; i++)
            {
                SUM += testnumbers[i];
            }

            //The result is
            //300020100

            //Would it be possible to extract the "four buckets" that I am interested in somehow?
            //00_02_01_00
        }

Solution

This should be possible at about 8 elements (1 AVX2 vector) per 2.5 clock cycles or so (per core) on a modern x86-64 like Skylake or Zen 2, using AVX2. Or per 2 clocks with unrolling. Or on your Piledriver CPU, maybe 1x 16-byte vector of indexes per 3 clocks with AVX1 _mm_cmpeq_epi32.

The general strategy works with 2 to 8 buckets. And for byte, 16-bit, or 32-bit elements. (So byte elements gives you 32 elements histogrammed per 2 clock cycles best case, with a bit of outer-loop overhead to collect byte counters before they overflow.)

Update: or mapping an int to 1UL << (array[i]*8) to increment one of 4 bytes of a counter with SIMD / SWAR addition, we can go close to 1 clock per vector of 8 int on SKL, or per 2 clocks on Zen2. (This is even more specific to 4 or fewer buckets, and int input, and doesn't scale down to SSE2. It needs variable-shifts or at least AVX1 variable-shuffles.) Using byte elements with the first strategy is probably still better in terms of elements per cycle.

As @JonasH points out, you could have different cores working on different parts of the input array. A single core can come close to saturating memory bandwidth on typical desktops, but many-core Xeons have lower per-core memory bandwidth and higher aggregate, and need more cores to saturate L3 or DRAM bandwidth. Why is Skylake so much better than Broadwell-E for single-threaded memory throughput?


A loop that runs for days at a time.

On a single input list that's very, very slow to iterate, so it still doesn't overflow int counters? Or repeated calls with different large Lists (like your ~9M-element test array)?

I believe I want to avoid increasing an index for a list or array as it seem to consume a lot of time?

That's probably because you were benchmarking with optimization disabled. Don't do that; it's not meaningful at all, because different code is slowed down by different amounts when optimization is disabled. More explicit steps and tmp vars can often make debug-mode code slower because there are more things that have to be there to look at with a debugger. But they can just optimize into a normal pointer-increment loop when you compile with normal optimization.

Iterating through an array can compile efficiently into asm.

The slow part is the dependency chain through memory for incrementing a variable index of the array. For example on a Skylake CPU, memory-destination add with the same address repeatedly bottlenecks at about one increment per 6 clock cycles because the next add has to wait to load the value stored by the previous one. (Store-forwarding from the store buffer means it doesn't have to wait for it to commit to cache first, but it's still much slower than add into a register.) See also Agner Fog's optimization guides: https://agner.org/optimize/

With counts only distributed across 4 buckets, you'll have a lot of cases where instructions are waiting to reload the data stored by another recent instruction, so you can't even achieve the nearly 1 element per clock cycle you might if counts were well distributed over more counters that still were all hot in L1d cache.

One good solution to this problem is unrolling the loop with multiple arrays of counters (see: Methods to vectorise histogram in SIMD?). Like instead of int[] indexes = { 0, 0, 0, 0 }; you can make it a 2D array of four counters each. You'd have to manually unroll the loop in the source to iterate over the input array, and handle the last 0..3 left-over elements after the unrolled part.

This is a good technique for small to medium arrays of counts, but becomes bad if replicating the counters starts to lead to cache misses.
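
For example, a rough C++ sketch (untested) of the replicated-counter idea; the unroll factor of 4 and the int element type are arbitrary choices here:

// 4 replicated sets of counters so consecutive elements update different
// memory locations, breaking the store/reload dependency chain.
#include <stddef.h>
#include <stdint.h>

void histo_unrolled(const int *arr, size_t size, uint32_t final_counts[4])
{
    uint32_t counts[4][4] = {};                // counts[replica][bucket]
    size_t i = 0;
    for ( ; i + 4 <= size ; i += 4) {          // manual unroll by 4
        counts[0][arr[i + 0]]++;
        counts[1][arr[i + 1]]++;
        counts[2][arr[i + 2]]++;
        counts[3][arr[i + 3]]++;
    }
    for ( ; i < size ; i++)                    // last 0..3 left-over elements
        counts[0][arr[i]]++;
    for (int b = 0 ; b < 4 ; b++)              // combine the replicas at the end
        final_counts[b] = counts[0][b] + counts[1][b] + counts[2][b] + counts[3][b];
}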


Use narrow integers to save cache footprint / mem bandwidth.

Another thing you can/should do is use as narrow a type as possible for your arrays of 0..3 values: each number fits in a byte, so using 8-bit integers would save you a factor of 4 in cache footprint / memory bandwidth.

x86 can efficiently load/store bytes to/from full registers. With SSE4.1 you also have SIMD pmovzxbd to make it more efficient to auto-vectorize when you have a byte_array[i] used with an int_array[i] in a loop.
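
As a small illustration (untested; this uses the 256-bit AVX2 form, _mm256_cvtepu8_epi32, rather than the 128-bit SSE4.1 one), widening 8 byte elements to 8 ints is a single vpmovzxbd:

#include <immintrin.h>
#include <stdint.h>

// Load 8 values stored as bytes and zero-extend them to 8x int32 in one instruction.
__m256i load8_bytes_as_ints(const uint8_t *p)
{
    __m128i bytes = _mm_loadl_epi64((const __m128i*)p);   // load the low 64 bits = 8 bytes
    return _mm256_cvtepu8_epi32(bytes);                   // vpmovzxbd: zero-extend to 8x int32
}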

(When I say x86 I mean including x86-64, as opposed to ARM or PowerPC. Of course you don't actually want to compile 32-bit code, what Microsoft calls "x86")


With very small numbers of buckets, like 4

This looks like a job for SIMD compares. With x86 SSE2 the number of int elements per 16-byte vector of data is equal to your number of histogram bins.

You already had a SIMD sort of idea with trying to treat a number as four separate byte elements. See https://en.wikipedia.org/wiki/SIMD#Software

But 00_01_10_11 is just source-level syntax for human-readable separators in numbers, and double is a floating-point type whose internal representation isn't the same as for integers. And you definitely don't want to use strings; SIMD lets you do stuff like operating on 4 elements of an integer array at once.

The best way I can see to approach this is to separately count matches for each of the 4 values, rather than map elements to counters. We want to process multiple elements in parallel but mapping them to counters can have collisions when there are repeated values in one vector of elements. You'd need to increment that counter twice.

The scalar equivalent of this is:

int counts[4] = {0,0,0,0};
for () {
    counts[0] += (arr[i] == 0);
    counts[1] += (arr[i] == 1);
    counts[2] += (arr[i] == 2);  // count matches
  //counts[3] += (arr[i] == 3);  // we assume any that aren't 0..2 are this
}
counts[3] = size - counts[0] - counts[1] - counts[2];
// calculate count 3 from other counts

which (in C++) GCC -O3 will actually auto-vectorize exactly the way I did manually below: https://godbolt.org/z/UJfzuH. Clang even unrolls it when auto-vectorizing, so it should be better than my hand-vectorized version for int inputs. Still not as good as the alternate vpermilps strategy for that case, though.

(And you do still need to manually vectorize if you want byte elements with efficient narrow sums, only widening in an outer loop.)


With byte elements, see How to count character occurrences using SIMD. The element size is too narrow for a counter; it would overflow after 256 counts. So you have to widen either in the inner loop, or use nested loops to do some accumulating before widening.

I don't know C#, so I could write the code in x86 assembly or in C++ with intrinsics. Perhaps C++ intrinsics are more useful for you. C# has some kind of vector extensions that should make it possible to port this.

This is C++ for x86-64, using AVX2 SIMD intrinsics. See https://stackoverflow.com/tags/sse/info for some info.

// Manually vectorized for AVX2, for int element size
// Going nearly 4x as fast should be possible for byte element size

#include <immintrin.h>

void count_elements_avx2(const std::vector<int> &input,  unsigned output_counts[4])
{
    __m256i  counts[4] = { _mm256_setzero_si256() };  // 4 vectors of zeroed counters
                  // each vector holds counts for one bucket, to be hsummed at the end

    size_t size = input.size();
    for(size_t i = 0 ; i<size ; i+=8) {  // 8x 32-bit elements per vector
        __m256i v = _mm256_loadu_si256((const __m256i*)&input[i]);  // unaligned load of 8 ints
        for (int val = 0 ; val < 3; val++) {
           // C++ compilers will unroll this with 3 vector constants and no memory access
            __m256i match = _mm256_cmpeq_epi32(v, _mm256_set1_epi32(val));  // 0 or all-ones aka -1
            counts[val] = _mm256_sub_epi32(counts[val], match);   // x -= -1 or 0 conditional increment
        }
    }


    // transpose and sum 4 vectors of 8 elements down to 1 vector of 4 elements
    __m128i summed_counts = hsum_xpose(counts);   // helper function defined in Godbolt link
    _mm_storeu_si128((__m128i*)output_counts, summed_counts);

    output_counts[3] = size - output_counts[0]
                       - output_counts[1] - output_counts[2];

    // TODO: handle the last size%8 input elements; scalar would be easy
}
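
The hsum_xpose helper is defined in the Godbolt link; as a stand-in, here's a simpler version I'd write (no transpose, so slower, but it only runs once per call). hsum_epi32 is a hypothetical helper name, and both would need to be defined or declared before count_elements_avx2:

#include <immintrin.h>

static inline int hsum_epi32(__m256i v)                     // sum of the 8 dword elements
{
    __m128i s = _mm_add_epi32(_mm256_castsi256_si128(v),
                              _mm256_extracti128_si256(v, 1));
    s = _mm_add_epi32(s, _mm_shuffle_epi32(s, _MM_SHUFFLE(1, 0, 3, 2)));
    s = _mm_add_epi32(s, _mm_shuffle_epi32(s, _MM_SHUFFLE(2, 3, 0, 1)));
    return _mm_cvtsi128_si32(s);
}

static inline __m128i hsum_xpose(const __m256i counts[4])   // one total per bucket
{
    return _mm_setr_epi32(hsum_epi32(counts[0]), hsum_epi32(counts[1]),
                          hsum_epi32(counts[2]), hsum_epi32(counts[3]));
}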

This compiles nicely with clang (on the Godbolt compiler explorer). Presumably you can write C# that compiles to similar machine code. If not, consider calling native code from a C++ compiler (or hand-written in asm if you can't get truly optimal code from the compiler). If your real use-case runs as many iterations as your benchmark, that could amortize the extra overhead if the input array doesn't have to get copied.

 # from an earlier version of the C++, doing all 4 compares in the inner loop
 # clang -O3 -march=skylake
.LBB0_2:                                     # do {
    vmovdqu ymm7, ymmword ptr [rcx + 4*rdx]    # v = load arr[i + 0..7]
    vpcmpeqd        ymm8, ymm7, ymm3           # compare v == 0
    vpsubd  ymm4, ymm4, ymm8                   # total0 -= cmp_result
    vpcmpeqd        ymm8, ymm7, ymm5
    vpsubd  ymm2, ymm2, ymm8
    vpcmpeqd        ymm7, ymm7, ymm6           # compare v == 2
    vpsubd  ymm1, ymm1, ymm7                   # total2 -= cmp_result
    add     rdx, 8                             # i += 8
    cmp     rdx, rax
    jb      .LBB0_2                          # }while(i < size)

Estimated best-case Skylake performance: ~2.5 cycles per vector (8 int or 32 int8_t)

Or 2 with unrolling.

Without AVX2, using only SSE2, you'd have some extra movdqa instructions and only be doing 4 elements per vector. This would still be a win vs. scalar histogram in memory, though. Even 1 element / clock is nice, and should be doable with SSE2 that can run on any x86-64 CPU.

Assuming no cache misses of course, with hardware prefetch into L1d staying ahead of the loop. That will probably only happen if the data is already hot in L2 cache at least. I'm also assuming no stalls from memory alignment; ideally your data is aligned by 32 bytes. If it usually isn't, it's possibly worth processing the first unaligned part and then using aligned loads, if the array is large enough.

For byte elements, the inner-most loop will look similar (with vpcmpeqb and vpsubb), but it can run only at most 255 (not 256) iterations before hsumming into 64-bit counters, to avoid overflow. So throughput per vector will be the same, but with 4x as many elements per vector.
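
A sketch of that nested-loop structure (untested; the function name and totals layout are just for illustration, and the scalar tail / bucket 3 are left out):

#include <immintrin.h>
#include <stdint.h>
#include <stddef.h>

void count_bytes_avx2(const uint8_t *arr, size_t size, uint64_t totals[3])
{
    __m256i wide[3] = { _mm256_setzero_si256(), _mm256_setzero_si256(),
                        _mm256_setzero_si256() };            // 64-bit counters
    size_t i = 0;
    while (i + 32 <= size) {
        __m256i byte_counts[3] = { _mm256_setzero_si256(), _mm256_setzero_si256(),
                                   _mm256_setzero_si256() };
        size_t chunk_end = i + 32 * 255;                     // at most 255 inner iterations
        if (chunk_end > size) chunk_end = size;
        for ( ; i + 32 <= chunk_end ; i += 32) {
            __m256i v = _mm256_loadu_si256((const __m256i*)&arr[i]);
            for (int val = 0 ; val < 3 ; val++) {
                __m256i match = _mm256_cmpeq_epi8(v, _mm256_set1_epi8((char)val));
                byte_counts[val] = _mm256_sub_epi8(byte_counts[val], match);  // x -= -1 or 0
            }
        }
        for (int val = 0 ; val < 3 ; val++) {                // widen: vpsadbw against zero
            __m256i qsums = _mm256_sad_epu8(byte_counts[val], _mm256_setzero_si256());
            wide[val] = _mm256_add_epi64(wide[val], qsums);
        }
    }
    for (int val = 0 ; val < 3 ; val++) {                    // hsum the 4 qwords per bucket
        uint64_t q[4];
        _mm256_storeu_si256((__m256i*)q, wide[val]);
        totals[val] = q[0] + q[1] + q[2] + q[3];
    }
    // still to do: scalar tail for the last size%32 elements, and
    // bucket 3 = size - (totals[0] + totals[1] + totals[2])
}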

See https://agner.org/optimize/ and https://uops.info/ for performance analysis details. e.g. vpcmpeqd on uops.info

The inner loop is only 9 fused-domain uops for Haswell/Skylake, so best case front-end bottleneck of about 1 iteration per 2.25 cycles (the pipeline is 4 uops wide). Small-loop effects get in the way somewhat: Is performance reduced when executing loops whose uop count is not a multiple of processor width? - Skylake has its loop buffer disabled by a microcode update for an erratum, but even before that a 9 uop loop ended up issuing slightly worse than one iter per 2.25 cycles on average, let's say 2.5 cycles.

Skylake runs vpsubd on ports 0,1, or 5, and runs vpcmpeqd on ports 0 or 1. So the back-end bottleneck on ports 0,1,5 is 6 vector ALU uops for 3 ports, or 1 iteration per 2 cycles. So the front-end bottleneck dominates. (Ice Lake's wider front-end may let it bottleneck on the back-end even without unrolling; same back-end throughputs there unless you use AVX512...)

If clang had indexed from the end of the array and counted the index up towards zero (since it chose to use an indexed addressing mode anyway) it could have saved a uop for a total of 8 uops = one iter per 2 cycles in the front-end, matching the back-end bottleneck. (Either way, scalar add and macro-fused cmp/jcc, or add/jcc loop branch can run on port 6, and the load doesn't compete for ALU ports.) Uop replays of ALU uops dependent on the load shouldn't be a problem even on cache misses, if ALU uops are the bottleneck there will normally be plenty of older uops just waiting for an execution unit to be ready, not waiting for load data.

Unrolling by 2 would have the same benefit: amortizing that 2 uops of loop overhead. So 16 uops for 2 input vectors. That's a nice multiple of the pipeline width on SKL and IceLake, and of the single-uop pipeline width on Zen. Unrolling even more could let the front end stay ahead of execution, but even so, any back-end delays will let the front end build up a cushion of uops in the scheduler. This will let it execute loads early enough.
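
A sketch of what unrolling by 2 looks like in the intrinsics version (counts2 here is a hypothetical second set of accumulators, added into counts before the horizontal sum):

for (size_t i = 0 ; i + 16 <= size ; i += 16) {       // 2 vectors = 16 ints per iteration
    __m256i v0 = _mm256_loadu_si256((const __m256i*)&input[i]);
    __m256i v1 = _mm256_loadu_si256((const __m256i*)&input[i + 8]);
    for (int val = 0 ; val < 3 ; val++) {
        __m256i k = _mm256_set1_epi32(val);
        counts[val]  = _mm256_sub_epi32(counts[val],  _mm256_cmpeq_epi32(v0, k));
        counts2[val] = _mm256_sub_epi32(counts2[val], _mm256_cmpeq_epi32(v1, k));
    }
}
// then counts[val] = _mm256_add_epi32(counts[val], counts2[val]) before hsum_xpose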

Zen2 has a wider front-end (6 uops or 5 instructions wide, IIUC). None of these instructions are multi-uop because Zen2 widened the vector ALUs to 256-bit, so that's 5 single-uop instructions. vpcmpeq* runs on FP 0,1, or 3, same as vpsubd, so the back-end bottleneck is the same as on Skylake: 1 vector per 2 cycles. But the wider front-end removes that bottleneck, leaving the critical path being the back-end even without unrolling.

Zen1 takes 2 uops per 256-bit vector operation (or more for lane-crossing operations, but these are simple 2-uop instructions). So presumably 12/3 = 4 cycles per vector of 8 or 32 elements, assuming it can get those uops through the front-end efficiently.

I'm assuming that the 1-cycle latency dependency chains through the count vectors are scheduled well by the back-ends and don't result in many wasted cycles. Probably not a big deal, especially if you have any memory bottlenecks in real life. (On Piledriver, SIMD-integer operations have 2 cycle latency, but 6 ALU uops for 2 vector ALU ports that can run them is 1 vector (128-bit) per 3 cycles so even without unrolling there's enough work to hide that latency.)

I didn't analyze the horizontal-sum part of this. It's outside the loop so it only has to run once per call. You did tag this micro-optimization, but we probably don't need to worry about that part.


Other numbers of buckets

The base case of this strategy is 2 buckets: count matches for one thing, count_other = size - count.

We know that every element is one of these 4 possibilities, so we can assume that any x that isn't 0, 1, or 2 is a 3 without checking. This means we don't have to count matches for 3 at all, and can get the count for that bucket from size - sum(counts[0..2]).

(See the edit history for the above perf analysis before doing this optimization. I changed the numbers after doing this optimization and updating the Godbolt link, hopefully I didn't miss anything.)


AVX512 on Skylake-Xeon

For 64-byte vectors there is no vpcmpeqd to make a vector of all-zero (0) or all-one (-1) elements. Instead you'd compare into a mask register and use that to do a merge-masked add of set1(1). Like c = _mm512_mask_add_epi32(c, mask, c, _mm512_set1_epi32(1)).

Unfortunately it's not efficient to do a scalar popcount of the compare-result bitmasks.
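
A sketch of that inner loop (untested, assuming AVX-512F; count0..count2 are __m512i accumulators declared outside the loop, and the tail / horizontal sums are omitted):

const __m512i ones = _mm512_set1_epi32(1);
for (size_t i = 0 ; i + 16 <= size ; i += 16) {            // 16 ints per 64-byte vector
    __m512i v = _mm512_loadu_si512(&arr[i]);
    __mmask16 m0 = _mm512_cmpeq_epi32_mask(v, _mm512_setzero_si512());
    __mmask16 m1 = _mm512_cmpeq_epi32_mask(v, ones);
    __mmask16 m2 = _mm512_cmpeq_epi32_mask(v, _mm512_set1_epi32(2));
    count0 = _mm512_mask_add_epi32(count0, m0, count0, ones);   // +1 only where the mask is set
    count1 = _mm512_mask_add_epi32(count1, m1, count1, ones);
    count2 = _mm512_mask_add_epi32(count2, m2, count2, ones);
}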


Random code review: in your first benchmark:

int[] valueLIST = indexers.ToArray();

This seems pointless; according to MS's docs (https://docs.microsoft.com/en-us/dotnet/standard/collections/), a List is efficiently indexable. I think it's equivalent to C++ std::vector<T>. You can just iterate it without copying to an array.


Alt strategy - map 0..3 to set a bit in one byte of an int

Good if you can't narrow your elements to bytes for the input to save mem bandwidth.

But speaking of which, maybe worth it to use 2x _mm256_packs_epi32 (vpackssdw) and _mm256_packs_epi16 (vpacksswb) to narrow down to 8-bit integers before counting with 3x pcmpeqb / psubb. That costs 3 uops per 4 input vectors to pack down to 1 with byte elements.
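
For example (untested fragment), 4 vectors of 8 ints pack down to one vector of 32 bytes like this; the in-lane packs permute the element order, but that's harmless when all we do afterwards is count matches:

__m256i a = _mm256_loadu_si256((const __m256i*)&input[i]);
__m256i b = _mm256_loadu_si256((const __m256i*)&input[i +  8]);
__m256i c = _mm256_loadu_si256((const __m256i*)&input[i + 16]);
__m256i d = _mm256_loadu_si256((const __m256i*)&input[i + 24]);
__m256i ab   = _mm256_packs_epi32(a, b);        // 16x int16 (in-lane)
__m256i cd   = _mm256_packs_epi32(c, d);
__m256i abcd = _mm256_packs_epi16(ab, cd);      // 32x int8, ready for pcmpeqb / psubb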

But if your input has int elements to start with, this may be best instead of packing and then comparing 3 ways.

You have 4 buckets, and an int has 4 bytes. If we can transform each int element to a 1 at the bottom of the appropriate byte, that would let us add with _mm256_add_epi8 for up to 255 inner-loop iterations before widening to 64-bit counters. (With the standard _mm256_sad_epu8 against zero trick to hsum unsigned bytes without overflow.)

There are 2 ways to do this. The first: use a shuffle as a lookup table. AVX2 vpermd works (_mm256_permutexvar_epi32) using the data as the index vector and a constant _mm256_set_epi32(0,0,0,0, 1UL<<24, 1UL<<16, 1UL<<8, 1UL<<0) as the data being shuffled. Or type-pun the vector to use AVX1 vpermilps as a LUT with the LUT vector having those bytes in the upper half as well.

vpermilps is better: it's fewer uops on AMD Zen 1, and lower latency everywhere because it's in-lane. (Might cause a bypass delay on some CPUs, cutting into the latency benefit, but still not worse than vpermd).

For some reason vpermilps with a vector control has 2 cycle throughput on Zen2 even though it's still a single uop. Or 4 cycles on Zen1 (for the 2 uop YMM version). It's 1 cycle on Intel. vpermd is even worse on AMD: more uops and same poor throughput.

vpermilps xmm (16-byte vector) on Piledriver has 1/clock throughput according to Agner Fog's testing, and runs in the "ivec" domain. (So it actually has extra bypass delay latency when used on the "intended" floating point operands, but not on integer).

   // Or for Piledriver, __m128 version of this

    __m256 bytepatterns = _mm256_castsi256_ps(_mm256_set_epi32(
         1<<24, 1<<16, 1<<8, 1<<0,
         1<<24, 1<<16, 1<<8, 1<<0) );
    __m256i v = _mm256_loadu_si256((const __m256i*)&input[i]);
    v = _mm256_castps_si256(_mm256_permutevar_ps(bytepatterns, v));  // vpermilps 32-bit variable shuffle
    counts = _mm256_add_epi8(counts, v);

     // after some inner iterations, separate out the 
     // set1_epi32(0x000000ff) counts, 0x0000ff00 counts, etc.

This will produce interleaved counters inside each int element. They will overflow if you don't accumulate them before 256 counts. See How to count character occurrences using SIMD for a simple version of that with a single counter.
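
One simple way (untested; less clever than the blend/shuffle cleanup described further down, with totals0..totals3 being hypothetical __m256i qword accumulators) to accumulate in time: every at most 255 inner iterations, isolate each byte lane and vpsadbw it against zero:

__m256i zero = _mm256_setzero_si256();
__m256i mask = _mm256_set1_epi32(0xFF);
totals0 = _mm256_add_epi64(totals0, _mm256_sad_epu8(_mm256_and_si256(counts, mask), zero));
totals1 = _mm256_add_epi64(totals1, _mm256_sad_epu8(_mm256_and_si256(_mm256_srli_epi32(counts,  8), mask), zero));
totals2 = _mm256_add_epi64(totals2, _mm256_sad_epu8(_mm256_and_si256(_mm256_srli_epi32(counts, 16), mask), zero));
totals3 = _mm256_add_epi64(totals3, _mm256_sad_epu8(_mm256_and_si256(_mm256_srli_epi32(counts, 24), mask), zero));
counts  = zero;     // restart the byte counters for the next chunk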

Here we might unroll and use 2 different LUT vectors so when we want to group all the counts for 0 together, we could blend 2 vectors together and mask away the others.


Alternatively to shuffling, we can do this with AVX2 variable shifts.

sums += 1UL << (array[i]*8); where the *8 is the number of bits in a byte, also done with a shift. I wrote it as a scalar C++ expression because now's your chance to see how your bytes-in-an-integer idea can really work. As long as we don't let an individual byte overflow, it doesn't matter if SIMD bytes adds block carry between bytes or if we use 32-bit dword elements.
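
To make that concrete, here's a scalar C++ sketch (untested) of the whole idea; the flush-every-255 structure is what keeps any single byte counter from ever reaching 256:

#include <stdint.h>
#include <stddef.h>

void histo_swar_scalar(const int *arr, size_t size, uint64_t totals[4])
{
    for (int b = 0 ; b < 4 ; b++) totals[b] = 0;
    uint32_t packed = 0;                         // four byte-sized counters in one integer
    unsigned in_chunk = 0;
    for (size_t i = 0 ; i < size ; i++) {
        packed += 1u << (arr[i] * 8);            // bump the byte for bucket arr[i]
        if (++in_chunk == 255) {                 // flush before any byte could overflow
            for (int b = 0 ; b < 4 ; b++)
                totals[b] += (packed >> (8 * b)) & 0xFF;
            packed = 0;
            in_chunk = 0;
        }
    }
    for (int b = 0 ; b < 4 ; b++)                // flush the final partial chunk
        totals[b] += (packed >> (8 * b)) & 0xFF;
}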

We'd do this with AVX2 as:

__m256i v = loadu...();
v = _mm256_slli_epi32(v, 3);  // v *= 8
v = _mm256_sllv_epi32(_mm256_set1_epi32(1), v);
counts = _mm256_add_epi8(counts, v);

This is 2 shift instructions plus the vpaddb. On Skylake the variable-count shift vpsllvd is cheap: single uop and runs on multiple ports. But on Haswell and Zen it's slower. (Same throughput as vpermilps on AMD.)

And 2 uops for 2 ports still doesn't beat 1 uop for 1 port for the shuffle version. (Unless you use both strategies alternating to distribute the work over all ALU ports on SKL.)

So either way the inner-most loop can go 1 vector per clock or maybe slightly better with careful interleaving of shift vs. shuffle methods.

But it will require some small amount of overhead amortized over 128 or 255 inner loop iterations.

That cleanup at the end might blend 2 vectors together to get a vector with counts for just 2 buckets, then vpshufb (_mm256_shuffle_epi8) to group byte counters for the same bucket into the same qwords. Then vpsadbw (_mm256_sad_epu8) against zero can horizontal sum those byte elements within each qword for _mm256_add_epi64. So outer-loop work should be 2 vpblendw, 2x vpshufb, 2x vpsadbw, 2x vpaddq then back into another 255 iterations of the inner loop. Probably also checking if you're within 255 iterations of the end of the array to set the loop bound for the inner iteration.
