What do you do without fast gather and scatter in AVX2 instructions?

Problem description

I'm writing a program to detect prime numbers. One part is bit-sieving possible candidates out. I've written a fairly fast program, but I thought I'd see if anyone has some better ideas. My program could use some fast gather and scatter instructions, but I'm limited to AVX2 hardware for the x86 architecture (I know AVX-512 has these, though I'm not sure how fast they are).

#include <stdint.h>
#include <immintrin.h>
#include <utility>      // for std::swap in the scalar fallback

#define USE_AVX2

// Sieve the bits in array sieveX for later use
void sieveFactors(uint64_t *sieveX)
{
    const uint64_t totalX = 5000000;
#ifdef USE_AVX2
    uint64_t indx[4], bits[4];

    const __m256i sieveX2 = _mm256_set1_epi64x((uint64_t)(sieveX));
    const __m256i total = _mm256_set1_epi64x(totalX - 1);
    const __m256i mask = _mm256_set1_epi64x(0x3f);

    // Just filling with some typical values (not really constant)
    __m256i ans = _mm256_set_epi64x(58, 52, 154, 1);
    __m256i ans2 = _mm256_set_epi64x(142, 70, 136, 100);

    __m256i sum = _mm256_set_epi64x(201, 213, 219, 237);    // 3x primes
    __m256i sum2 = _mm256_set_epi64x(201, 213, 219, 237);   // These aren't always the same

    // The algorithm can actually change these
    __m256i mod1 = _mm256_set1_epi64x(1);
    __m256i mod3 = _mm256_set1_epi64x(1);

    __m256i mod2, mod4, sum3;

    // Sieve until all factors (start under 32-bit threshold) exceed the limit
    do {
        // Sieve until one of the factors exceeds the limit
        do {
            // Compiler does a nice job converting these into extracts
            *(__m256i *)(&indx[0]) = _mm256_add_epi64(_mm256_srli_epi64(_mm256_andnot_si256(mask, ans), 3), sieveX2);
            *(__m256i *)(&bits[0]) = _mm256_sllv_epi64(mod1, _mm256_and_si256(mask, ans));

            ans = _mm256_add_epi64(ans, sum);

            // Early on these locations can overlap
            *(uint64_t *)(indx[0]) |= bits[0];
            *(uint64_t *)(indx[1]) |= bits[1];
            *(uint64_t *)(indx[2]) |= bits[2];
            *(uint64_t *)(indx[3]) |= bits[3];

            mod2 = _mm256_sub_epi64(total, ans);

            *(__m256i *)(&indx[0]) = _mm256_add_epi64(_mm256_srli_epi64(_mm256_andnot_si256(mask, ans2), 3), sieveX2);
            *(__m256i *)(&bits[0]) = _mm256_sllv_epi64(mod3, _mm256_and_si256(mask, ans2));

            ans2 = _mm256_add_epi64(ans2, sum2);

            // Two types of candidates are being performed at once
            *(uint64_t *)(indx[0]) |= bits[0];
            *(uint64_t *)(indx[1]) |= bits[1];
            *(uint64_t *)(indx[2]) |= bits[2];
            *(uint64_t *)(indx[3]) |= bits[3];

            mod4 = _mm256_sub_epi64(total, ans2);
        } while (!_mm256_movemask_pd(_mm256_castsi256_pd(_mm256_or_si256(mod2, mod4))));

        // Remove one factor
        mod2 = _mm256_castpd_si256(_mm256_blendv_pd(_mm256_setzero_pd(), _mm256_castsi256_pd(sum), _mm256_castsi256_pd(mod2)));
        mod4 = _mm256_castpd_si256(_mm256_blendv_pd(_mm256_setzero_pd(), _mm256_castsi256_pd(sum2), _mm256_castsi256_pd(mod4)));
        ans = _mm256_sub_epi64(ans, mod2);
        ans2 = _mm256_sub_epi64(ans2, mod4);
        sum = _mm256_sub_epi64(sum, mod2);
        sum2 = _mm256_sub_epi64(sum2, mod4);
        sum3 = _mm256_or_si256(sum, sum2);
     } while (!_mm256_testz_si256(sum3, sum3));
#else
     // Just some example values (not really constant - otherwise the compiler would optimize the code away)
     uint64_t cur = 58;
     uint64_t cur2 = 142;
     uint64_t factor = 67;

     if (cur < cur2) {
        std::swap(cur, cur2);
    }
    while (cur < totalX) {
        sieveX[cur >> 6] |= (1ULL << (cur & 0x3f));
        sieveX[cur2 >> 6] |= (1ULL << (cur2 & 0x3f));
        cur += factor;
        cur2 += factor;
    }
    while (cur2 < totalX) {
        sieveX[cur2 >> 6] |= (1ULL << (cur2 & 0x3f));
        cur2 += factor;
    }
#endif
}

Be warned that the locations can overlap at first. After a short while in the loop, this is not the case. I'd be happy to use a different approach if possible. Around 82% of the time within this part of the algorithm is spent in this loop. Hopefully this isn't too close to other posted questions.

Solution

IDK why you use different parts of the same cur[8] array for indices and values; it made the source hard to understand until I figured out that there was only one real array. The other part was just used to bounce vectors to scalars.

It looks like you're only ever going vector -> scalar, not inserting scalars back into a vector. And also that nothing inside the loop depends on any data in sieveX[]; I'm not familiar with your sieving algorithm but I guess the point of this is to create data in memory for later use.


AVX2 has gathers (not scatters), but they're only fast on Skylake and newer. They're ok on Broadwell, slowish on Haswell, and slow on AMD. (Like one per 12 clocks for Ryzen's vpgatherqq). See http://agner.org/optimize/ and other performance links in the x86 tag wiki.

Intel's optimization manual has a small section on manual gather / scatter (using insert/extract or movhps) vs. hardware instructions, possibly worth reading. In this case where the indices are runtime variables (not a constant stride or something), I think Skylake can benefit from AVX2 gather instructions here.

See Intel's intrinsics guide to look up the intrinsic for asm instructions like movhps. I'm just talking about what you want to get your compiler to emit, because that's what's important and the asm mnemonics are shorter to type and don't need casting. You have to know the asm mnemonic to look them up in Agner Fog's instruction tables, or to read compiler output from auto-vectorization, so I usually think in asm and then translate that to intrinsics.
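
For example, a movhps load on integer data has no epi64-typed intrinsic, so you go through the pd domain with (free) casts. A minimal sketch of that mapping, with a made-up helper name, assuming the <immintrin.h> / <stdint.h> includes from the question's code:

// movhps/movhpd-style 64-bit load into the high half of an XMM register.
// The casts generate no instructions; the compiler may emit vmovhpd instead of
// vmovhps (same load), or even vpinsrq if it decides that's better.
static inline __m128i loadh_epi64(__m128i lo, const uint64_t *p)
{
    return _mm_castpd_si128(
        _mm_loadh_pd(_mm_castsi128_pd(lo), (const double *)p));
}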


With AVX, you have 3 main options:

  • do everything scalar. Register pressure may be a problem, but generating indices as needed (instead of doing all 4 adds or subs to generate curr[4..7] at once) might help. Unless those mask vectors have different values in different elements.

    (Using memory sources for scalar constants might not be bad, though, if they don't fit in 32-bit immediates and if you don't bottleneck on 2 memory ops per clock. The memory-destination or instructions would use indexed addressing modes, so the dedicated store-AGU on port 7 on Haswell and later couldn't be used. Thus AGU throughput could be a bottleneck.)

    Extracting all 4 elements of a vector as scalar is more expensive than 4x scalar add or shift instructions, but you're doing more work than that. Still, with BMI2 for 1-uop variable-count shifts (instead of 3 uops on Intel), it might not be terrible. I think we can do better with SIMD, though, especially with careful tuning.

  • extract indices and values to scalar like you're doing now, so the OR into sieveX[] is pure scalar. Works even when two or more indices are the same.

    This costs you about 7 uops per ymm vector -> 4x scalar registers using extract ALU instructions, or 5 uops using store/reload (worth considering for the compiler, maybe for one or two of the 4 vector extracts, because this code probably doesn't manage to bottleneck on load / store port throughput.) If the compiler turns store/reload in the C source into shuffle/extract instructions, though, you can't easily override its strategy except maybe by using volatile. And BTW, you'd want to use alignas(32) cur[8] to make sure actual vector stores don't cross a cache-line boundary.

    or [rdi + rax*8], rdx (with an indexed addressing mode preventing full micro-fusion) is 3 uops on modern Intel CPUs (Haswell and later). We could avoid an indexed addressing mode (making it 2 uops for the front-end) by scaling + adding to the array base address using SIMD: e.g. srli by 3 instead of 6, mask off the low 3 bits (vpand), and vpaddq with set1_epi64(sieveX). So this costs 2 extra SIMD instructions to save 4 uops on SnB-family, per vector of indices. (You'd be extracting uint64_t* pointer elements instead of uint64_t indices. Or if sieveX can be a 32-bit absolute address (footnote 1), you could skip the vpaddq and extract already-scaled indices for the same gain.) This address calculation is part of the sketch after this list.

    It would also enable the store-address uops to run on port 7 (Haswell and later); the simple AGU on port7 can only handle non-indexed addressing modes. (This makes extracting values to scalar with store+reload more attractive. You want lower latency for extracting indices, because the values aren't needed until after the load part of a memory-dst or completes.) It does mean more unfused-domain uops for the scheduler / execution units, but could well be worth the tradeoff.

    This isn't a win on other AVX2 CPUs (Excavator / Ryzen or Xeon Phi); only SnB-family has a front-end cost and execution-port restrictions for indexed addressing modes.

  • extract indices, manually gather into a vector with vmovq / vmovhps for a SIMD vpor, then scatter back with vmovq / vmovhps.

    Just like a HW gather/scatter, correctness requires that all indices are unique, so you'll want to use one of the above options until you get to that point in your algo. (vector conflict detection + fallback would not be worth the cost vs. just always extracting to scalar: Fallback implementation for conflict detection in AVX2).

    See selectively xor-ing elements of a list with AVX2 instructions for an intrinsics version. (I knew I'd recently written an answer with a manual gather / scatter, but took me a while to find it!) In that case I only used 128-bit vectors because there wasn't any extra SIMD work to justify the extra vinserti128 / vextracti128.

    Actually I think here you'd want to extract the high half of the _mm256_sllv_epi64 result so you have (the data that would be) cur[4..5] and cur[6..7] in two separate __m128i variables. You'd have vextracti128 / 2x vpor xmm instead of vinserti128 / vpor ymm / vextracti128.

    The former has less port5 pressure, and has better instruction-level parallelism: The two 128-bit halves are separate dependency chains that don't get coupled to each other, so store/reload bottlenecks (and cache misses) impact fewer dependent uops, allowing out-of-order execution to keep working on more stuff while waiting.

    Doing address calculation in a 256b vector and extracting pointers instead of indices would make vmovhps loads cheaper on Intel (indexed loads can't stay micro-fused to vmovhps; see footnote 2). See the previous bullet point. But vmovq loads/stores are always a single uop, and vmovhps indexed stores can stay micro-fused on Haswell and later, so it's break-even for front-end throughput and worse on AMD or KNL. It also means more unfused-domain uops for the scheduler / execution units, which looks like more of a potential bottleneck than port2/3 AGU pressure. The only advantage is that the store-address uops can run on port 7, relieving some pressure.
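
Below is a minimal, untested intrinsics sketch of that manual gather / vpor / scatter, combined with the SIMD address calculation from the second bullet (srli by 3 so the result is a byte offset, then vpaddq with set1_epi64(sieveX) to turn it into pointers). The function and variable names are made up for illustration, bit_vec would be the question's _mm256_sllv_epi64 result, all four addresses must be distinct, and the compiler decides whether this really becomes vmovq / vmovhps:

// Hedged sketch: manual gather, vpor, manual scatter for 4 qword bit-positions.
// 'ans' is the vector of bit positions, 'mask' = set1(0x3f),
// 'sieve_base' = set1((uint64_t)sieveX), as in the question's code.
static inline void or_bits4(__m256i ans, __m256i bit_vec,
                            __m256i mask, __m256i sieve_base)
{
    // SIMD address calc: (ans & ~63) >> 3 is a byte offset; add the base so the
    // extracted values are pointers and the stores use non-indexed addressing.
    __m256i addr_vec = _mm256_add_epi64(
        _mm256_srli_epi64(_mm256_andnot_si256(mask, ans), 3), sieve_base);

    __m128i a_lo = _mm256_castsi256_si128(addr_vec);
    __m128i a_hi = _mm256_extracti128_si256(addr_vec, 1);      // vextracti128
    uint64_t *p0 = (uint64_t *)(uintptr_t)_mm_cvtsi128_si64(a_lo);    // vmovq
    uint64_t *p1 = (uint64_t *)(uintptr_t)_mm_extract_epi64(a_lo, 1); // vpextrq
    uint64_t *p2 = (uint64_t *)(uintptr_t)_mm_cvtsi128_si64(a_hi);
    uint64_t *p3 = (uint64_t *)(uintptr_t)_mm_extract_epi64(a_hi, 1);

    // Manual gather into two independent 128-bit halves (vmovq + vmovhps).
    // The pd casts only satisfy the intrinsic's prototype; no instructions.
    __m128i g_lo = _mm_loadl_epi64((const __m128i *)p0);
    g_lo = _mm_castpd_si128(_mm_loadh_pd(_mm_castsi128_pd(g_lo), (const double *)p1));
    __m128i g_hi = _mm_loadl_epi64((const __m128i *)p2);
    g_hi = _mm_castpd_si128(_mm_loadh_pd(_mm_castsi128_pd(g_hi), (const double *)p3));

    // OR in the new bits with 2x vpor xmm (vextracti128 to split bit_vec).
    g_lo = _mm_or_si128(g_lo, _mm256_castsi256_si128(bit_vec));
    g_hi = _mm_or_si128(g_hi, _mm256_extracti128_si256(bit_vec, 1));

    // Manual scatter back (vmovq + vmovhps stores).
    _mm_storel_epi64((__m128i *)p0, g_lo);
    _mm_storeh_pd((double *)p1, _mm_castsi128_pd(g_lo));
    _mm_storel_epi64((__m128i *)p2, g_hi);
    _mm_storeh_pd((double *)p3, _mm_castsi128_pd(g_hi));
}

The two 128-bit halves stay as independent dependency chains, as described above; whether this beats extract + scalar or in practice depends on the surrounding loop.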

AVX2 gives us one new option:

  • AVX2 vpgatherqq for the gather (_mm256_i64gather_epi64(sieveX, srli_result, 8)), then extract indices and manually scatter. So it's exactly like the manual gather / manual scatter, except you replace the manual gather with an AVX2 hardware gather. (Two 128-bit gathers cost more than one 256-bit gather, so you would want to take the instruction-level parallelism hit and gather into a single 256-bit register).

    Possibly a win on Skylake (where vpgatherqq ymm is 4 uops / 4c throughput, plus 1 uop of setup), but not even Broadwell (9 uops, one per 6c throughput) and definitely not Haswell (22 uops / 9c throughput). You do need the indices in scalar registers anyway, so you're only saving the manual-gather part of the work. That's pretty cheap.
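
A sketch of that variant, for comparison with the manual version above (again with made-up names; 'idx' holds the four qword indices, i.e. ans >> 6, 'bit_vec' the shifted bits, and the four indices must still be unique):

// Hedged sketch: AVX2 hardware gather + manual scatter.
static inline void gather_or_scatter4(uint64_t *sieveX, __m256i idx, __m256i bit_vec)
{
    // vpgatherqq, scale 8; the non-masked intrinsic uses an all-ones mask that
    // the compiler materializes with vpcmpeqd.
    __m256i old    = _mm256_i64gather_epi64((const long long *)sieveX, idx, 8);
    __m256i merged = _mm256_or_si256(old, bit_vec);              // vpor ymm

    // The indices are needed in scalar registers anyway for the manual scatter.
    __m128i i_lo = _mm256_castsi256_si128(idx);
    __m128i i_hi = _mm256_extracti128_si256(idx, 1);
    uint64_t i0 = (uint64_t)_mm_cvtsi128_si64(i_lo);
    uint64_t i1 = (uint64_t)_mm_extract_epi64(i_lo, 1);
    uint64_t i2 = (uint64_t)_mm_cvtsi128_si64(i_hi);
    uint64_t i3 = (uint64_t)_mm_extract_epi64(i_hi, 1);

    // Manual scatter with vmovq / vmovhps stores (indexed addressing here).
    __m128i m_lo = _mm256_castsi256_si128(merged);
    __m128i m_hi = _mm256_extracti128_si256(merged, 1);
    _mm_storel_epi64((__m128i *)&sieveX[i0], m_lo);
    _mm_storeh_pd((double *)&sieveX[i1], _mm_castsi128_pd(m_lo));
    _mm_storel_epi64((__m128i *)&sieveX[i2], m_hi);
    _mm_storeh_pd((double *)&sieveX[i3], _mm_castsi128_pd(m_hi));
}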


Total cost for each strategy on Skylake

It looks like this won't bottleneck badly on any one port. GP reg->xmm needs port 5, but xmm->int needs port 0 on SnB-family CPUs, so it's less likely to bottleneck on port 5 when mixed with the shuffles needed for extracting. (e.g. vpextrq rax, xmm0, 1 is a 2 uop instruction, one port 5 shuffle uop to grab the high qword, and a port 0 uop to send that data from SIMD to the integer domain.)

So your SIMD calculation where you need to frequently extract a vector to scalar is less bad than if you needed to frequently insert scalar calculation results into vectors. See also Loading an xmm from GP regs, but that's talking about data that starts in GP regs, not memory.

  • extract both / scalar OR: Total = 24 uops = 6 cycles of front-end throughput.

    • vpaddq + vpand address calc (2 uops for port 0/1/5 on Skylake)
    • 2x vextracti128 (2 uops for port 5)
    • 4x vmovq (4 p0)
    • 4x vpextrq (8: 4p0 4p5)
    • 4x or [r], r (2 front-end uops each, 8 total. backend: 4p0156 4p23 (load) 4p237 (store-address) 4p4 (store-data)). Non-indexed addressing mode.

    Total = 6 uops for p5, just barely fits. Store/reload for a data extract looks sensible, if you could get your compiler to do that. (But compilers don't typically model the pipeline in enough detail to use a mix of strategies in the same loop to balance port pressure.)

  • manual gather/scatter: 20 uops, 5 cycles of front-end throughput (Haswell / BDW / Skylake). Also good on Ryzen.

    • (optional, probably not worth it): vpaddq + vpand address calc (2 uops for port 0/1/5 on Skylake). Skip these if you could use non-VEX movhps for a 1-uop micro-fused indexed load. (But then the p237 stores become p23.)
    • vextracti128 pointers (1 uop for port 5)
    • 2x vmovq extract (2p0)
    • 2x vpextrq (4 = 2p0 2p5)
    • 2x vmovq load (2p23)
    • 2x vmovhps xmm, xmm, [r] non-indexed load (2 front-end uops micro-fused: 2p23 + 2p5)

    • vextracti128 split the data (p5)

    • 2x vpor xmm (2p015)
    • 2x vmovq store (2x 1 micro-fused uop, 2p237 + 2p4)
    • 2x vmovhps store (2x 1 micro-fused uop, 2p237 + 2p4)

    Port bottlenecks: 4 p0 and 4 p5 fit comfortably in 5 cycles, especially when you mix this with your loop which can run several of its uops on port 1. On Haswell paddq is only p15 (not p015), and shifts are only p0 (not p01). AVX2 _mm256_sllv_epi64 is 1 uop (p01) on Skylake, but on Haswell it's 3 uops = 2p0 + p5. So Haswell might be closer to a p0 or p5 bottleneck for this loop, in which case you might want to look at a store/reload extract strategy for one vector of indices.

    Skipping the SIMD address calc is probably good, because AGU pressure doesn't look like a problem unless you use a store/reload extract. And it means fewer instructions / smaller code-size and fewer uops in the uop cache. (Un-lamination doesn't happen until after the decoders / uop cache, so you still benefit from micro-fusion in the early parts of the front-end, just not at the issue bottleneck.)

  • Skylake AVX2 gather / manual scatter: Total = 18 uops, 4.5 cycles of front-end throughput. (Worse on any earlier uarch or on AMD).

    • vextracti128 indices (1 uop for port 5)
    • 2x vmovq extract (2p0)
    • 2x vpextrq (4 = 2p0 2p5)

    • vpcmpeqd ymm0,ymm0,ymm0 create an all-ones mask for vpgatherqq (p015)

    • vpgatherqq ymm1, [rdi + ymm2*8], ymm0 4 uops for some ports.

    • vpor ymm (p015)

    • vextracti128 on the OR result (p5)
    • 2x vmovq store (2x 1 micro-fused uop, 2p23 + 2p4). Note: no port 7, we're using indexed stores.
    • 2x vmovhps store (2x 1 micro-fused uop, 2p23 + 2p4).

So even with the best throughput choice, we're still only managing 4 loads / 4 stores per 4.5 cycles, and that's without considering the SIMD work in the loop which costs some front-end throughput. So we're not close to bottlenecking on AGU throughput and having to worry about using port 7.

We could maybe think about store/reload for one of the extracts (if we were the compiler), replacing the 7-uop, 5-instruction vextracti128 / 2x vmovq / 2x vpextrq sequence with a store + 4x reload (5 uops).
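
If we were hand-writing that, the store/reload half might look something like this sketch ('addr_vec' being the hypothetical vector of pointers from the earlier sketch; alignas(32) keeps the store inside one cache line, as mentioned above):

// Sketch: store/reload instead of ALU extracts for one vector of pointers.
static inline void extract_ptrs_via_stack(__m256i addr_vec, uint64_t *out[4])
{
    alignas(32) uint64_t bounce[4];                   // aligned bounce buffer
    _mm256_store_si256((__m256i *)bounce, addr_vec);  // one micro-fused store
    out[0] = (uint64_t *)(uintptr_t)bounce[0];        // 4 cheap reloads instead of
    out[1] = (uint64_t *)(uintptr_t)bounce[1];        // vextracti128 / vmovq / vpextrq
    out[2] = (uint64_t *)(uintptr_t)bounce[2];
    out[3] = (uint64_t *)(uintptr_t)bounce[3];
}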


Overall: One loop until we're done with conflicts, then a SIMD gather loop

You say that after a certain point, you don't have conflicts (overlap) between the indices like cur[0] == cur[2].

You definitely want a separate loop that doesn't check for conflicts at all to take advantage of this. Even if you had AVX512, Skylake's vpconflictq is microcoded and not fast. (KNL has single-uop vpconflictq, but it's still faster to avoid it entirely).

I'll leave it up to you (or a separate question) how to figure out for sure when you're done with conflicts and can leave the loop that accounts for that possibility.

You probably want the extract indices + data strategy while there can be conflicts. SIMD conflict checking is possible, but it's not cheap, 11 uops for 32-bit elements: Fallback implementation for conflict detection in AVX2. A qword version is obviously much cheaper than dword (fewer shuffles and compares to get all against all), but you probably still only want to do it every 10 iterations or so of your extract loop.
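
For reference, one possible shape for the cheaper qword check over a single vector of four indices (a sketch only, not benchmarked; the helper name is made up, and checking the question's two index vectors against each other would need a few more permutes and compares):

// Sketch: detect duplicates within one __m256i of 4 qword indices.
// Rotating by 1 and by 2 lanes and comparing covers all 6 index pairs.
static inline int has_conflict4(__m256i idx)
{
    __m256i rot1 = _mm256_permute4x64_epi64(idx, _MM_SHUFFLE(0, 3, 2, 1));
    __m256i rot2 = _mm256_permute4x64_epi64(idx, _MM_SHUFFLE(1, 0, 3, 2));
    __m256i eq   = _mm256_or_si256(_mm256_cmpeq_epi64(idx, rot1),
                                   _mm256_cmpeq_epi64(idx, rot2));
    return !_mm256_testz_si256(eq, eq);   // non-zero if any pair matched
}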

There's not a huge speedup from the best scalar-or version to the best gather version (6 cycles vs. 4.5, and that's not accounting for the other work in the loop, so the real ratio is even smaller). Leaving the slightly-slower loop as soon as possible isn't worth making it a lot slower with frequent conflict checks.

So if you can reliably detect when you're done with conflicts, use something like

int conflictcheck = 10;

do {

    if (--conflictcheck == 0) {
       vector stuff to check for conflicts
       if (no conflicts now or in the future)
           break;

       conflictcheck = 10;  // reset the down-counter
    }

    main loop body,  extract -> scalar OR strategy

} while(blah);


// then fall into the gather/scatter loop.
do {
    main loop body, gather + manual scatter strategy
} while();

That should compile to a dec / je which only costs 1 uop in the not-taken case.

Doing 9 extra iterations total of the slightly-slower loop is much better than doing thousands of extra expensive conflict checks.


Footnote 1:

If sieveX is static and you're building non-PIC code on Linux (not MacOS) then its address will fit in a disp32 as part of a [reg+disp32] addressing mode. In that case you can leave out the vpaddq. But getting a compiler to treat a uint64_t as an already-scaled array index (with its low bits cleared) would be ugly. Probably have to cast sieveX to uintptr_t and add, then cast back.

This isn't possible in a PIE executable or shared library (where 32-bit absolute addresses aren't allowed), or on OS X at all (where static addresses are always above 2^32). I'm not sure what Windows allows. Note that [disp32 + reg*8] only has 1 register, but is still an indexed addressing mode so all the SnB-family penalties apply. But if you don't need scaling, reg + disp32 is just base + disp32.
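
The cast-and-add from the previous paragraph might look roughly like this (a sketch; 'byte_offset' and 'bit' are hypothetical scalars extracted from the question's vectors, and whether the compiler actually folds sieveX into the addressing mode is up to it):

// Sketch for footnote 1: non-PIC build where &sieveX fits in a disp32.
// byte_offset is an already-scaled offset, i.e. (ans & ~63) >> 3.
uint64_t *slot = (uint64_t *)((uintptr_t)sieveX + byte_offset);
*slot |= bit;     // ideally: or [sieveX + rax], rdx  (base + disp32, non-indexed)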

Footnote 2: Fun fact: non-VEX movhps loads can stay micro-fused on Haswell. It won't cause an SSE/AVX stall on Skylake, but you won't get a compiler to emit the non-VEX version in the middle of an AVX2 function.

IACA (Intel's static analysis tool) gets this wrong, though. :( What is IACA and how do I use it?.

This is basically a missed-optimization for -mtune=skylake, but it would stall on Haswell: Why is this SSE code 6 times slower without VZEROUPPER on Skylake?.

The "penalty A" (execute SSE with dirty upper) on Skylake is merely a false dependency on that one register. (And a merging uop for instructions that would otherwise be write-only, but movhps is already a read-modify-write of its destination.) I tested this on Skylake with Linux perf to count uops, with this loop:

    mov     r15d, 100000000

.loop:
    vpaddq  ymm0, ymm1, ymm2      ; dirty the upper part
    vpaddq  ymm3, ymm1, ymm2      ; dirty another register for good measure

    vmovq  xmm0, [rdi+rbx*8]       ; zero the full register, breaking dependencies
    movhps xmm0, [rdi+rbx*8+8]     ; RMW the low 128 bits
                          ; fast on Skylake, will stall on Haswell

    dec r15d
    jnz .loop

The loop runs at ~1.25 cycles per iteration on Skylake (i7-6700k), maxing out the front-end throughput of 4 uops per clock. 5 total fused-domain uops (uops_issued.any), 6 unfused-domain uops (uops_executed.thread). So micro-fusion was definitely happening for movhps without any SSE/AVX problems.

Changing it to vmovhps xmm0, xmm0, [rdi+rbx*8+8] slowed it down to 1.50 cycles per iteration, now 6 fused-domain, but still the same 6 unfused-domain uops.

There's no extra uop if the upper half of ymm0 is dirty when movhps xmm0, [mem] runs. I tested by commenting out the vmovq. But changing vmovq to movq does result in an extra uop: movq becomes a micro-fused load+merge that replaces the low 64 bits (and still zeros the upper 64 bits of xmm0 so it's not quite movlps).


Also note that pinsrq xmm0, [mem], 1 can't micro-fuse even without VEX. But with VEX, you should prefer vmovhps for code-size reasons.

Your compiler may want to "optimize" the intrinsic for movhps on integer data into vpinsrq, though, I didn't check.
