Micro Optimization of a 4-bucket histogram of a large array or list


Problem Description


I have a special question. I will try to describe this as accurate as possible.

I am doing a very important "micro-optimization": a loop that runs for days at a time. So if I can cut this loop's time in half, 10 days would decrease to only 5 days, etc.

The loop I have now is the function: "testbenchmark1".

I have 4 indexes that I need to increase in a loop like this. But accessing an index in a list actually takes some extra time, as I have noticed. This is what I am trying to see if there is another solution to.

indexes[n]++; //increase correct index

Complete code for "testbenchmark1" which takes 122 ms:

void testbenchmark00()
{
    Random random = new Random();
    List<int> indexers = new List<int>();
    for (int i = 0; i < 9256408; i++)
    {
        indexers.Add(random.Next(0, 4));
    }
    int[] valueLIST = indexers.ToArray();


    Stopwatch stopWatch = new Stopwatch();
    stopWatch.Start();

    int[] indexes = { 0, 0, 0, 0 };
    foreach (int n in valueLIST) //Takes 122 ms
    {
        indexes[n]++; //increase correct index
    }

    stopWatch.Stop();
    MessageBox.Show("stopWatch: " + stopWatch.ElapsedMilliseconds.ToString() + " milliseconds");
}

Now the "testbenchmark2" code below is just experimental and I know it is not correct, but I wonder if there is any similar way to use this kind of number: "1_00_00_00_00", and if it would be possible to see the "00_00_00_00" part as four different integer numbers. For example, if I did a summation of 1_00_00_00_00 + 1_00_01_00_00 = 2_00_01_00_00, then one could in the end extract each of the four numbers like this: 00, 01, 00, 00.

But I don't know if this is possible in any way, even using binary numbers. Yes, any kind of solution that just adds numbers like this. Just as a test, that loop took only 59 ms, which is half of the 122 ms. So I am interested to see if there is any idea for this?

double num3 = 1_00_00_00_00;
double num4 = 1_00_01_00_00;
for (int i = 0; i < valueLIST.Count; i++) //Takes 59 ms
{
    num3 += num4;
}

Complete code for "testbenchmark2" which takes 59 ms:

void testbenchmark2()
{
    List<String> valueLIST = new List<String>(); 
    for (int i = 0; i < 9256408; i++) //56
    {
        valueLIST.Add(i.ToString());
    }

    //https://www.geeksforgeeks.org/binary-literals-and-digit-separators-in-c-sharp/
    double num3 = 1_00_00_00_00;
    double num4 = 1_00_01_00_00;

    Stopwatch stopWatch = new Stopwatch();
    stopWatch.Start();
    for (int i = 0; i < valueLIST.Count; i++) //Takes 59 ms
    {
        num3 += num4;
    }
    stopWatch.Stop();
    MessageBox.Show("stopWatch: " + stopWatch.ElapsedMilliseconds.ToString() + " milliseconds

" + num3);
}

EDIT
The below is cleaner code of what I am trying to do, exactly!
The code below will probably not be correct or the solution, but it shows what I am trying to do, I believe.

        void newtest()
        {
            double num1 = 1_00_00_00_00;
            double num2 = 1_00_01_00_00;
            double num3 = 1_00_01_01_00;

            List<double> testnumbers = new List<double>();
            testnumbers.Add(num1);
            testnumbers.Add(num2);
            testnumbers.Add(num3);

            double SUM = 0;
            for (int i = 0; i < testnumbers.Count; i++)
            {
                SUM += testnumbers[i];
            }

            //The result is
            //300020100

            //Would it be possible to extract the "four buckets" that I am interested in somehow?
            //00_02_01_00
        }

Solution

This should be possible at about 8 elements (1 AVX2 vector) per 2.5 clock cycles or so (per core) on a modern x86-64 like Skylake or Zen 2, using AVX2. Or per 2 clocks with unrolling. Or on your Piledriver CPU, maybe 1x 16-byte vector of indexes per 3 clocks with AVX1 _mm_cmpeq_epi32.

The general strategy works with 2 to 8 buckets. And for byte, 16-bit, or 32-bit elements. (So byte elements gives you 32 elements histogrammed per 2 clock cycles best case, with a bit of outer-loop overhead to collect byte counters before they overflow.)

Update: or mapping an int to 1UL << (array[i]*8) to increment one of 4 bytes of a counter with SIMD / SWAR addition, we can go close to 1 clock per vector of 8 int on SKL, or per 2 clocks on Zen2. (This is even more specific to 4 or fewer buckets, and int input, and doesn't scale down to SSE2. It needs variable-shifts or at least AVX1 variable-shuffles.) Using byte elements with the first strategy is probably still better in terms of elements per cycle.

As @JonasH points out, you could have different cores working on different parts of the input array. A single core can come close to saturating memory bandwidth on typical desktops, but many-core Xeons have lower per-core memory bandwidth and higher aggregate, and need more cores to saturate L3 or DRAM bandwidth. Why is Skylake so much better than Broadwell-E for single-threaded memory throughput?


A loop that runs for days at a time.

On a single input list that's very very slow to iterate so it still doesn't overflow int counters? Or repeated calls with different large Lists (like your ~900k test array)?

I believe I want to avoid increasing an index for a list or array as it seem to consume a lot of time?

That's probably because you were benchmarking with optimization disabled. Don't do that, it's not meaningful at all; different code is slowed down different amounts by disabling optimization. More explicit steps and tmp vars can often make slower debug-mode code because there are more things that have to be there to look at with a debugger. But they can just optimize into a normal pointer-increment loop when you compile with normal optimization.

Iterating through an array can compile efficiently into asm.

The slow part is the dependency chain through memory for incrementing a variable index of the array. For example on a Skylake CPU, memory-destination add with the same address repeatedly bottlenecks at about one increment per 6 clock cycles because the next add has to wait to load the value stored by the previous one. (Store-forwarding from the store buffer means it doesn't have to wait for it to commit to cache first, but it's still much slower than add into a register.) See also Agner Fog's optimization guides: https://agner.org/optimize/

With counts only distributed across 4 buckets, you'll have a lot of cases where instructions are waiting to reload the data stored by another recent instruction, so you can't even achieve the nearly 1 element per clock cycle you might if counts were well distributed over more counters that still were all hot in L1d cache.

One good solution to this problem is unrolling the loop with multiple arrays of counters. Methods to vectorise histogram in SIMD?. Like instead of int[] indexes = { 0, 0, 0, 0 }; you can make it a 2D array of four counters each. You'd have to manually unroll the loop in the source to iterate over the input array, and handle the last 0..3 left over elements after the unrolled part.
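For concreteness, a minimal scalar C++ sketch of that replicated-counter unrolling might look like the following (arr and size are placeholder names, and this is an untested illustration of the shape, not tuned code; the same structure works for the C# int[] version):

unsigned counts[4][4] = {};            // 4 copies of the 4 counters, so consecutive
                                       // elements never hammer the same memory slot
size_t i = 0;
for ( ; i + 4 <= size; i += 4) {       // manually unrolled by 4
    counts[0][arr[i + 0]]++;
    counts[1][arr[i + 1]]++;
    counts[2][arr[i + 2]]++;
    counts[3][arr[i + 3]]++;
}
for ( ; i < size; i++)                 // handle the last 0..3 leftover elements
    counts[0][arr[i]]++;

unsigned totals[4];
for (int b = 0; b < 4; b++)            // combine the replicated counters at the end
    totals[b] = counts[0][b] + counts[1][b] + counts[2][b] + counts[3][b];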

This is a good technique for small to medium arrays of counts, but becomes bad if replicating the counters starts to lead to cache misses.


Use narrow integers to save cache footprint / mem bandwidth.

Another thing you can/should do is use as narrow a type as possible for your arrays of 0..3 values: each number can fit in a byte so using 8-bit integers would save you a factor of 4 cache footprint / memory bandwidth.
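For example, if the data already exists as int, a one-time narrowing pass could look like this (a hypothetical helper, not from the original code; ideally you'd generate the values as bytes in the first place):

#include <cstddef>
#include <cstdint>
#include <vector>

// Convert the 0..3 values to bytes once, quartering cache footprint and bandwidth.
std::vector<uint8_t> narrow_to_bytes(const std::vector<int>& input)
{
    std::vector<uint8_t> out(input.size());
    for (size_t i = 0; i < input.size(); i++)
        out[i] = (uint8_t)input[i];    // values are only 0..3, so nothing is lost
    return out;
}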

x86 can efficiently load/store bytes to/from full registers. With SSE4.1 you also have SIMD pmovzxbd to make it more efficient to auto-vectorize when you have a byte_array[i] used with an int_array[i] in a loop.

(When I say x86 I mean including x86-64, as opposed to ARM or PowerPC. Of course you don't actually want to compile 32-bit code, what Microsoft calls "x86")


With very small numbers of buckets, like 4

This looks like a job for SIMD compares. With x86 SSE2 the number of int elements per 16-byte vector of data is equal to your number of histogram bins.

You already had a SIMD sort of idea with trying to treat a number as four separate byte elements. See https://en.wikipedia.org/wiki/SIMD#Software

But 00_01_10_11 is just source-level syntax for human-readable separators in numbers, and double is a floating-point type whose internal representation isn't the same as for integers. And you definitely don't want to use strings; SIMD lets you do stuff like operating on 4 elements of an integer array at once.

The best way I can see to approach this is to separately count matches for each of the 4 values, rather than map elements to counters. We want to process multiple elements in parallel but mapping them to counters can have collisions when there are repeated values in one vector of elements. You'd need to increment that counter twice.

The scalar equivalent of this is:

int counts[4] = {0,0,0,0};
for () {
    counts[0] += (arr[i] == 0);
    counts[1] += (arr[i] == 1);
    counts[2] += (arr[i] == 2);  // count matches
  //counts[3] += (arr[i] == 3);  // we assume any that aren't 0..2 are this
}
counts[3] = size - counts[0] - counts[1] - counts[2];
// calculate count 3 from other counts

which (in C++) GCC -O3 will actually auto-vectorize exactly the way I did manually below: https://godbolt.org/z/UJfzuH. Clang even unrolls it when auto-vectorizing, so it should be better than my hand-vectorized version for int inputs. Still not as good as the alternate vpermilps strategy for that case, though.

(And you do still need to manually vectorize if you want byte elements with efficient narrow sums, only widening in an outer loop.)


With byte elements, see How to count character occurrences using SIMD. The element size is too narrow for a counter; it would overflow after 256 counts. So you have to widen either in the inner loop, or use nested loops to do some accumulating before widening.

I don't know C#, so I could write the code in x86 assembly or in C++ with intrinsics. Perhaps C++ intrinsics is more useful for you. C# has some kind of vector extensions that should make it possible to port this.

This is C++ for x86-64, using AVX2 SIMD intrinsics. See https://stackoverflow.com/tags/sse/info for some info.

// Manually vectorized for AVX2, for int element size
// Going nearly 4x as fast should be possible for byte element size

#include <immintrin.h>
#include <vector>

void count_elements_avx2(const std::vector<int> &input,  unsigned output_counts[4])
{
    __m256i  counts[4] = { _mm256_setzero_si256() };  // 4 vectors of zeroed counters
                  // each vector holds counts for one bucket, to be hsummed at the end

    size_t size = input.size();
    for(size_t i = 0 ; i<size ; i+=8) {  // 8x 32-bit elements per vector
        __m256i v = _mm256_loadu_si256((const __m256i*)&input[i]);  // unaligned load of 8 ints
        for (int val = 0 ; val < 3; val++) {
           // C++ compilers will unroll this with 3 vector constants and no memory access
            __m256i match = _mm256_cmpeq_epi32(v, _mm256_set1_epi32(val));  // 0 or all-ones aka -1
            counts[val] = _mm256_sub_epi32(counts[val], match);   // x -= -1 or 0 conditional increment
        }
    }


    // transpose and sum 4 vectors of 8 elements down to 1 vector of 4 elements
    __m128i summed_counts = hsum_xpose(counts);   // helper function defined in Godbolt link
    _mm_storeu_si128((__m128i*)output_counts, summed_counts);

    output_counts[3] = size - output_counts[0]
                       - output_counts[1] - output_counts[2];

    // TODO: handle the last size%8 input elements; scalar would be easy
}

This compiles nicely with clang (on the Godbolt compiler explorer). Presumably you can write C# that compiles to similar machine code. If not, consider calling native code from a C++ compiler (or hand-written in asm if you can't get truly optimal code from the compiler). If your real use-case runs as many iterations as your benchmark, that could amortize the extra overhead if the input array doesn't have to get copied.

 # from an earlier version of the C++, doing all 4 compares in the inner loop
 # clang -O3 -march=skylake
.LBB0_2:                                     # do {
    vmovdqu ymm7, ymmword ptr [rcx + 4*rdx]    # v = load arr[i + 0..7]
    vpcmpeqd        ymm8, ymm7, ymm3           # compare v == 0
    vpsubd  ymm4, ymm4, ymm8                   # total0 -= cmp_result
    vpcmpeqd        ymm8, ymm7, ymm5
    vpsubd  ymm2, ymm2, ymm8
    vpcmpeqd        ymm7, ymm7, ymm6           # compare v == 2
    vpsubd  ymm1, ymm1, ymm7                   # total2 -= cmp_result
    add     rdx, 8                             # i += 8
    cmp     rdx, rax
    jb      .LBB0_2                          # }while(i < size)

Estimated best-case Skylake performance: ~2.5 cycles per vector (8 int or 32 int8_t)

Or 2 with unrolling.

Without AVX2, using only SSE2, you'd have some extra movdqa instructions and only be doing 4 elements per vector. This would still be a win vs. scalar histogram in memory, though. Even 1 element / clock is nice, and should be doable with SSE2 that can run on any x86-64 CPU.
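For illustration, an SSE2-only version might look roughly like this (my sketch under the same assumptions as the AVX2 function above, not code from the Godbolt link):

#include <emmintrin.h>   // SSE2
#include <cstddef>

static void count_elements_sse2(const int* input, size_t size, unsigned output_counts[4])
{
    __m128i c0 = _mm_setzero_si128(), c1 = c0, c2 = c0;
    size_t i = 0;
    for ( ; i + 4 <= size; i += 4) {                // 4x 32-bit elements per 16-byte vector
        __m128i v = _mm_loadu_si128((const __m128i*)&input[i]);
        c0 = _mm_sub_epi32(c0, _mm_cmpeq_epi32(v, _mm_set1_epi32(0)));  // x -= -1 on match
        c1 = _mm_sub_epi32(c1, _mm_cmpeq_epi32(v, _mm_set1_epi32(1)));
        c2 = _mm_sub_epi32(c2, _mm_cmpeq_epi32(v, _mm_set1_epi32(2)));
    }
    unsigned t[12];
    _mm_storeu_si128((__m128i*)&t[0], c0);          // spill and hsum the 4 lanes of each counter
    _mm_storeu_si128((__m128i*)&t[4], c1);
    _mm_storeu_si128((__m128i*)&t[8], c2);
    unsigned n0 = t[0]+t[1]+t[2]+t[3], n1 = t[4]+t[5]+t[6]+t[7], n2 = t[8]+t[9]+t[10]+t[11];
    for ( ; i < size; i++) {                        // scalar cleanup for the last size%4 elements
        n0 += (input[i] == 0); n1 += (input[i] == 1); n2 += (input[i] == 2);
    }
    output_counts[0] = n0;  output_counts[1] = n1;  output_counts[2] = n2;
    output_counts[3] = (unsigned)size - n0 - n1 - n2;
}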

Assuming no cache misses of course, with hardware prefetch into L1d staying ahead of the loop. This might only happen with the data already hot in L2 cache at least. I'm also assuming no stalls from memory alignment; ideally your data is aligned by 32 bytes. If it usually isn't, possibly worth processing the first unaligned part and then using aligned loads, if the array is large enough.

For byte elements, the inner-most loop will look similar (with vpcmpeqb and vpsubb), but run only at most 255 (not 256) iterations before hsumming to 64-bit counters, to avoid overflow. So throughput per vector will be the same, but with 4x as many elements per vector.
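A sketch of that nested-loop structure for byte elements (my guess at the shape, not the answer's code; bytes and size are placeholder names, and size is assumed to be a multiple of 32, with any leftover handled by a scalar cleanup):

#include <immintrin.h>
#include <algorithm>
#include <cstdint>

__m256i wide[3] = { _mm256_setzero_si256() };        // 64-bit per-qword accumulators, one per bucket
size_t i = 0;
while (i < size) {
    __m256i narrow[3] = { _mm256_setzero_si256() };  // byte counters, reset every chunk
    size_t chunk_end = std::min(size, i + 255 * 32); // at most 255 inner iterations before widening
    for ( ; i < chunk_end; i += 32) {                // 32 byte elements per vector
        __m256i v = _mm256_loadu_si256((const __m256i*)&bytes[i]);
        for (int val = 0; val < 3; val++) {
            __m256i match = _mm256_cmpeq_epi8(v, _mm256_set1_epi8((char)val)); // 0 or -1 per byte
            narrow[val] = _mm256_sub_epi8(narrow[val], match);                 // conditional increment
        }
    }
    for (int val = 0; val < 3; val++)                // widen: vpsadbw against zero sums bytes per qword
        wide[val] = _mm256_add_epi64(wide[val],
                        _mm256_sad_epu8(narrow[val], _mm256_setzero_si256()));
}
// finally hsum the 4 qwords of each wide[val], and derive bucket 3 from size as before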

See https://agner.org/optimize/ and https://uops.info/ for performance analysis details. e.g. vpcmpeqd on uops.info

The inner loop is only 9 fused-domain uops for Haswell/Skylake, so best case front-end bottleneck of about 1 iteration per 2.25 cycles (the pipeline is 4 uops wide). Small-loop effects get in the way somewhat: Is performance reduced when executing loops whose uop count is not a multiple of processor width? - Skylake has its loop buffer disabled by a microcode update for an erratum, but even before that a 9 uop loop ended up issuing slightly worse than one iter per 2.25 cycles on average, let's say 2.5 cycles.

Skylake runs vpsubd on ports 0,1, or 5, and runs vpcmpeqd on ports 0 or 1. So the back-end bottleneck on ports 0,1,5 is 6 vector ALU uops for 3 ports, or 1 iteration per 2 cycles. So the front-end bottleneck dominates. (Ice Lake's wider front-end may let it bottleneck on the back-end even without unrolling; same back-end throughputs there unless you use AVX512...)

If clang had indexed from the end of the array and counted the index up towards zero (since it chose to use an indexed addressing mode anyway) it could have saved a uop for a total of 8 uops = one iter per 2 cycles in the front-end, matching the back-end bottleneck. (Either way, scalar add and macro-fused cmp/jcc, or add/jcc loop branch can run on port 6, and the load doesn't compete for ALU ports.) Uop replays of ALU uops dependent on the load shouldn't be a problem even on cache misses, if ALU uops are the bottleneck there will normally be plenty of older uops just waiting for an execution unit to be ready, not waiting for load data.

Unrolling by 2 would have the same benefit: amortizing that 2 uops of loop overhead. So 16 uops for 2 input vectors. That's a nice multiple of the pipeline width on SKL and IceLake, and the single-uop pipeline width on Zen. Unrolling even more could let the front-end stay ahead of execution, but even so any back-end delays will let the front-end build up a cushion of uops in the scheduler. This will let it execute loads early enough.

Zen2 has a wider front-end (6 uops or 5 instructions wide, IIUC). None of these instructions are multi-uop because Zen2 widened the vector ALUs to 256-bit, so that's 5 single-uop instructions. vpcmpeq* runs on FP 0,1, or 3, same as vpsubd, so the back-end bottleneck is the same as on Skylake: 1 vector per 2 cycles. But the wider front-end removes that bottleneck, leaving the critical path being the back-end even without unrolling.

Zen1 takes 2 uops per 256-bit vector operation (or more for lane-crossing, but these are simple 2 uop). So presumably 12/3 = 4 cycles per vector of 8 or 32 elements, assuming it can get those uops through the front-end efficiently.

I'm assuming that the 1-cycle latency dependency chains through the count vectors are scheduled well by the back-ends and don't result in many wasted cycles. Probably not a big deal, especially if you have any memory bottlenecks in real life. (On Piledriver, SIMD-integer operations have 2 cycle latency, but 6 ALU uops for 2 vector ALU ports that can run them is 1 vector (128-bit) per 3 cycles so even without unrolling there's enough work to hide that latency.)

I didn't analyze the horizontal-sum part of this. It's outside the loop so it only has to run once per call. You did tag this micro-optimization, but we probably don't need to worry about that part.


Other numbers of buckets

The base case of this strategy is 2 buckets: count matches for one thing, count_other = size - count.

We know that every element is one of these 4 possibilities, so we can assume that any x that isn't 0, 1, or 2 is a 3 without checking. This means we don't have to count matches for 3 at all, and can get the count for that bucket from size - sum(counts[0..2]).

(See the edit history for the above perf analysis before doing this optimization. I changed the numbers after doing this optimization and updating the Godbolt link; hopefully I didn't miss anything.)


AVX512 on Skylake-Xeon

For 64-byte vectors there is no vpcmpeqd to make a vector of all-zero (0) or all-one (-1) elements. Instead you'd compare into a mask register and use that to do a merge-masked add of set1(1). Like c = _mm512_mask_add_epi32(c, cmp_mask, c, _mm512_set1_epi32(1)), where cmp_mask is the compare result.

Unfortunately it's not efficient to do a scalar popcount of the compare-result bitmasks.
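A sketch of what that could look like with AVX-512 intrinsics (my illustration, untested; input and size as in the AVX2 function above):

__m512i c0 = _mm512_setzero_si512();                 // counter vector for bucket 0
const __m512i ones = _mm512_set1_epi32(1);
for (size_t i = 0; i + 16 <= size; i += 16) {        // 16x 32-bit elements per 64-byte vector
    __m512i v = _mm512_loadu_si512(&input[i]);
    __mmask16 m0 = _mm512_cmpeq_epi32_mask(v, _mm512_setzero_si512());
    c0 = _mm512_mask_add_epi32(c0, m0, c0, ones);    // c0 += 1 only in lanes where v == 0
    // ... same pattern for buckets 1 and 2, then hsum and derive bucket 3 from size as before
}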


Random code review: in your first benchmark:

int[] valueLIST = indexers.ToArray();

This seems pointless; According to MS's docs (https://docs.microsoft.com/en-us/dotnet/standard/collections/), a List is efficiently indexable. I think it's equivalent to C++ std::vector<T>. You can just iterate it without copying to an array.


Alt strategy - map 0..3 to a set bit in one byte of an int

Good if you can't narrow your elements to bytes for the input to save mem bandwidth.

But speaking of which, maybe worth it to use 2x _mm256_packs_epi32 (vpackssdw) and _mm256_packs_epi16 (vpacksswb) to narrow down to 8-bit integers before counting with 3x pcmpeqb / psubb. That costs 3 uops per 4 input vectors to pack down to 1 with byte elements.

But if your input has int elements to start with, this may be best instead of packing and then comparing 3 ways.
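A sketch of that packing step (my illustration of the idea, not code from the answer): vpackssdw/vpacksswb work within 128-bit lanes, so the byte order comes out scrambled across lanes, but a histogram only cares how many of each value there are, not where they sit, so no lane-fixup shuffle is needed.

__m256i a = _mm256_loadu_si256((const __m256i*)&input[i +  0]);   // 4 vectors of 8 ints
__m256i b = _mm256_loadu_si256((const __m256i*)&input[i +  8]);
__m256i c = _mm256_loadu_si256((const __m256i*)&input[i + 16]);
__m256i d = _mm256_loadu_si256((const __m256i*)&input[i + 24]);
__m256i ab   = _mm256_packs_epi32(a, b);      // 16x int16 (signed saturation; 0..3 is unaffected)
__m256i cd   = _mm256_packs_epi32(c, d);
__m256i abcd = _mm256_packs_epi16(ab, cd);    // 32x int8, order scrambled across lanes
// then 3x _mm256_cmpeq_epi8 / _mm256_sub_epi8 against set1_epi8(0..2), as in the byte version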

You have 4 buckets, and an int has 4 bytes. If we can transform each int element to a 1 at the bottom of the appropriate byte, that would let us add with _mm256_add_epi8 for up to 255 inner-loop iterations before widening to 64-bit counters. (With the standard _mm256_sad_epu8 against zero trick to hsum unsigned bytes without overflow.)

There are 2 ways to do this. The first: use a shuffle as a lookup table. AVX2 vpermd works (_mm256_permutexvar_epi32) using the data as the index vector and a constant _mm256_set_epi32(0,0,0,0, 1UL<<24, 1UL<<16, 1UL<<8, 1UL<<0) as the data being shuffled. Or type-pun the vector to use AVX1 vpermilps as a LUT with the LUT vector having those bytes in the upper half as well.

vpermilps is better: it's fewer uops on AMD Zen 1, and lower latency everywhere because it's in-lane. (Might cause a bypass delay on some CPUs, cutting into the latency benefit, but still not worse than vpermd).

For some reason vpermilps with a vector control has 2 cycle throughput on Zen2 even though it's still a single uop. Or 4 cycles on Zen1 (for the 2 uop YMM version). It's 1 cycle on Intel. vpermd is even worse on AMD: more uops and same poor throughput.

vpermilps xmm (16-byte vector) on Piledriver has 1/clock throughput according to Agner Fog's testing, and runs in the "ivec" domain. (So it actually has extra bypass delay latency when used on the "intended" floating point operands, but not on integer).

   // Or for Piledriver, __m128 version of this

    __m256 bytepatterns = _mm256_castsi256_ps(_mm256_set_epi32(
         1<<24, 1<<16, 1<<8, 1<<0,
         1<<24, 1<<16, 1<<8, 1<<0) );
    __m256i v = _mm256_loadu_si256((const __m256i*)&input[i]);
    v = _mm256_castps_si256(_mm256_permutevar_ps(bytepatterns, v));  // vpermilps 32-bit variable shuffle
    counts = _mm256_add_epi8(counts, v);

     // after some inner iterations, separate out the 
     // set1_epi32(0x000000ff) counts, 0x0000ff00 counts, etc.

This will produce interleaved counters inside each int element. They will overflow if you don't accumulate them before 256 counts. See How to count character occurrences using SIMD for a simple version of that with a single counter.

Here we might unroll and use 2 different LUT vectors so when we want to group all the counts for 0 together, we could blend 2 vectors together and mask away the others.


Alternatively to shuffling, we can do this with AVX2 variable shifts.

sums += 1UL << (array[i]*8); where the *8 is the number of bits in a byte, also done with a shift. I wrote it as a scalar C++ expression because now's your chance to see how your bytes-in-an-integer idea can really work. As long as we don't let an individual byte overflow, it doesn't matter whether SIMD byte adds block carry between bytes or whether we use 32-bit dword elements.
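To make that concrete, here is a scalar SWAR sketch of the idea, including pulling the four byte-sized buckets back out (my illustration with placeholder names, spilling well before any byte can overflow):

#include <cstdint>
#include <cstddef>

void swar_histogram_sketch(const uint8_t* arr, size_t n, uint64_t counts[4])   // counts[] pre-zeroed
{
    uint64_t sums = 0;
    for (size_t i = 0; i < n; i++) {
        sums += 1ULL << (arr[i] * 8);          // bump the byte that belongs to value arr[i] (0..3)
        if ((i & 0x7F) == 0x7F) {              // every 128 elements: spill, well under the 255 limit
            for (int b = 0; b < 4; b++)
                counts[b] += (sums >> (8 * b)) & 0xFF;
            sums = 0;
        }
    }
    for (int b = 0; b < 4; b++)                // final spill of the leftover partial sums
        counts[b] += (sums >> (8 * b)) & 0xFF;
}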

We'd do this with AVX2 as:

__m256i v = loadu...();
v = _mm256_slli_epi32(v, 3);  // v *= 8
v = _mm256_sllv_epi32(_mm256_set1_epi32(1), v);
counts = _mm256_add_epi8(counts, v);

This is 2 shift instructions plus the vpaddb. On Skylake the variable-count shift vpsllvd is cheap: single-uop and runs on multiple ports. But on Haswell and Zen it's slower. (Same throughput as vpermilps on AMD)

And 2 uops for 2 ports still doesn't beat 1 uop for 1 port for the shuffle version. (Unless you use both strategies alternating to distribute the work over all ALU ports on SKL.)

So either way the inner-most loop can go 1 vector per clock or maybe slightly better with careful interleaving of shift vs. shuffle methods.

But it will require some small amount of overhead amortized over 128 or 255 inner loop iterations.

That cleanup at the end might blend 2 vectors together to get a vector with counts for just 2 buckets, then vpshufb (_mm256_shuffle_epi8) to group byte counters for the same bucket into the same qwords. Then vpsadbw (_mm256_sad_epu8) against zero can horizontal sum those byte elements within each qword for _mm256_add_epi64. So outer-loop work should be 2 vpblendw, 2x vpshufb, 2x vpsadbw, 2x vpaddq then back into another 255 iterations of the inner loop. Probably also checking if you're within 255 iterations of the end of the array to set the loop bound for the inner iteration.
