How to use AVX/SIMD with nested loops and += format?

Problem description

I am writing a page rank program. I am writing a method for updating the rankings. I have successfully got it working with nested for loops, and also a threaded version. However, I would like to use SIMD/AVX instead.

This is the code I would like to change into a SIMD/AVX implementation.

#define IDX(a, b) ((a * npages) + b)   // 2D matrix indexing
for (size_t i = 0; i < npages; i++) {
    temp[i] = 0.0;
    for (size_t j = 0; j < npages; j++) {
        temp[i] += P[j] * matrix_cap[IDX(i,j)];
    }
}

For this code, P[] is of size npages and matrix_cap[] is of size npages * npages. P[] holds the ranks of the pages, and temp[] is used to store the next iteration's page ranks so as to be able to check convergence.

I don't know how to interpret += with AVX, or how I would get my data (two arrays/vectors of size npages and one matrix of size npages * npages, in row-major order) into a format that can be used with SIMD/AVX operations.

As far as AVX goes, this is what I have so far, though it's very much incorrect and was just a stab at what I would roughly like to do.

size_t g_mod = npages - (npages % 4);          // elements handled by the vector loop
double* res = malloc(sizeof(double) * npages); // scratch space for the 4 products
double sum = 0.0;
for (size_t i = 0; i < npages; i++) {
    for (size_t j = 0; j < g_mod; j += 4) {
        __m256d p = _mm256_loadu_pd(P + j);
        __m256d m = _mm256_loadu_pd(matrix_cap + IDX(i, j));
        __m256d pm = _mm256_mul_pd(p, m);
        _mm256_storeu_pd(res + j, pm);         // store the 4 products, then add them up scalar
        for (size_t k = 0; k < 4; k++) {
            sum += res[j + k];
        }
    }
    for (size_t j = g_mod; j < npages; j++) {  // scalar tail for the last npages % 4 columns
        sum += P[j] * matrix_cap[IDX(i, j)];
    }
    temp[i] = sum;
    sum = 0.0;
}

How can I format my data so I can use AVX/SIMD operations (add, mul) on it to optimise it, as it will be called a lot?

Answer

First, EOF is right, you should see how well gcc/clang/icc do at auto-vectorizing your scalar code. I can't check for you, because you only posted code-fragments, not anything I can throw on http://gcc.godbolt.org/.

You definitely don't need to malloc anything. Notice that your intrinsics version only ever uses 32B at a time of res[], and always overwrites whatever was there before. So you might as well use a single 32B array. Or better, use a better method to get a horizontal sum of your vector (http://stackoverflow.com/questions/6996764/fastest-way-to-do-horizontal-float-vector-sum-on-x86/35270026#35270026).
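
For reference, a minimal sketch of such a horizontal sum, using the usual extract-high-half-and-add pattern (hsum_pd256 is the same placeholder name used in the code further down):

#include <immintrin.h>

// Horizontal sum of a __m256d: add the two 128-bit halves, then the two remaining doubles.
static inline double hsum_pd256(__m256d v) {
    __m128d vlow   = _mm256_castpd256_pd128(v);      // low 128 bits
    __m128d vhigh  = _mm256_extractf128_pd(v, 1);    // high 128 bits
    vlow           = _mm_add_pd(vlow, vhigh);        // {v0+v2, v1+v3}
    __m128d high64 = _mm_unpackhi_pd(vlow, vlow);    // {v1+v3, v1+v3}
    return _mm_cvtsd_f64(_mm_add_sd(vlow, high64));  // (v0+v2) + (v1+v3)
}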

(See the bottom of this answer for a suggestion about a different data layout for the matrix.)

Calculating each temp[i] uses every P[j], so there is actually something to be gained from being smarter about vectorizing. For every load from P[j], use that vector with 4 different loads from matrix_cap[] for that j, but 4 different i values. You'll accumulate 4 different vectors, and have to hsum each of them down to a temp[i] value at the end.

So your inner loop will have 5 read streams (P[] and 4 different rows of matrix_cap). It will do 4 horizontal sums, and 4 scalar stores at the end, with the final result for 4 consecutive i values. (Or maybe do two shuffles and two 16B stores.) (Or maybe transpose-and-sum together (see http://stackoverflow.com/questions/36195356/most-efficient-way-to-get-a-m256-of-horizontal-sums-of-8-source-m256-vectors), which is actually a good use-case for the shuffling power of the expensive _mm256_hadd_pd (vhaddpd) instruction, but be careful of its in-lane operation.)
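
A hedged sketch of that transpose-and-sum idea, assuming four accumulators s0..s3 (as in the code further down) holding the partial sums for temp[i+0..i+3]:

// Turn 4 vector accumulators into one vector of their horizontal sums,
// then do a single 32B store instead of 4 scalar stores.
__m256d t0   = _mm256_hadd_pd(s0, s1);                 // {s0[0]+s0[1], s1[0]+s1[1], s0[2]+s0[3], s1[2]+s1[3]}
__m256d t1   = _mm256_hadd_pd(s2, s3);                 // same for s2, s3
__m256d swap = _mm256_permute2f128_pd(t0, t1, 0x21);   // {high half of t0, low half of t1}
__m256d blnd = _mm256_blend_pd(t0, t1, 0xC);           // {low half of t0, high half of t1}
_mm256_storeu_pd(&temp[i], _mm256_add_pd(swap, blnd)); // {hsum(s0), hsum(s1), hsum(s2), hsum(s3)}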

It's probably even better to accumulate 8 to 12 temp[i] values in parallel, so every load from P[j] is reused 8 to 12 times. (check the compiler output to make sure you aren't running out of vector regs and spilling __m256d vectors to memory, though.) This will leave more work for the cleanup loop.

FMA throughput and latency are such that you need 10 vector accumulators to keep 10 FMAs in flight to saturate the FMA unit on Haswell. Skylake reduced the latency to 4c, so you only need 8 vector accumulators to saturate it on SKL. (See the x86 tag wiki). Even if you're bottlenecked on memory, not execution-port throughput, you will want multiple accumulators, but they could all be for the same temp[i] (so you'd vertically sum them down to one vector, then hsum that).
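
A hedged sketch of that "multiple accumulators for the same temp[i]" variant, showing just the inner loop for one row i, assuming npages is a multiple of 8 and using the hsum_pd256 helper sketched earlier:

__m256d acc0 = _mm256_setzero_pd();   // two accumulators for the same temp[i],
__m256d acc1 = _mm256_setzero_pd();   // to keep more FMAs in flight
for (size_t j = 0; j < npages; j += 8) {
    acc0 = _mm256_fmadd_pd(_mm256_loadu_pd(P + j),     _mm256_loadu_pd(&matrix_cap[IDX(i, j)]),     acc0);
    acc1 = _mm256_fmadd_pd(_mm256_loadu_pd(P + j + 4), _mm256_loadu_pd(&matrix_cap[IDX(i, j + 4)]), acc1);
}
temp[i] = hsum_pd256(_mm256_add_pd(acc0, acc1));   // vertical add, then one horizontal sum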

However, accumulating results for multiple temp[i] at once has the large advantage of reusing P[j] multiple times after loading it. You also save the vertical adds at the end. Multiple read streams may actually help hide the latency of a cache miss in any one of the streams. (HW prefetchers in Intel CPUs can track one forward and one reverse stream per 4k page, IIRC). You might strike a balance, and use two or three vector accumulators for each of 4 temp[i] results in parallel, if you find that multiple read streams are a problem, but that would mean you'd have to load the same P[j] more times total.

#define IDX(a, b) (((a) * npages) + (b))   // 2D matrix indexing (operands parenthesized so IDX(i+1, j) expands correctly)
for (size_t i = 0; i < (npages & (~7ULL)); i+=8) {
    __m256d s0 = _mm256_setzero_pd(),
            s1 = _mm256_setzero_pd(),
            s2 = _mm256_setzero_pd(),
            ...
            s7 = _mm256_setzero_pd();   // 8 accumulators for 8 i values
    for (size_t j = 0; j < (npages & ~(3ULL)); j+=4) {
        __m256d Pj = _mm256_loadu_pd(P+j);        // reused 8 times after loading
        //temp[i] += P[j] * matrix_cap[IDX(i,j)];
        s0 = _mm256_fmadd_pd(Pj, _mm256_loadu_pd(&matrix_cap[IDX(i+0,j)]), s0);
        s1 = _mm256_fmadd_pd(Pj, _mm256_loadu_pd(&matrix_cap[IDX(i+1,j)]), s1);
        // ...
        s7 = _mm256_fmadd_pd(Pj, _mm256_loadu_pd(&matrix_cap[IDX(i+7,j)]), s7);
    }
    // or do this block with a hsum+transpose and do vector stores.
    // taking advantage of the power of vhaddpd to do 4 useful hsums with each instruction.
    temp[i+0] = hsum_pd256(s0);   // See the horizontal-sum link earlier for how to write this function
    temp[i+1] = hsum_pd256(s1);
    //...
    temp[i+7] = hsum_pd256(s7);

    // if npages isn't a multiple of 4, add the last couple scalar elements to the results of the hsum_pd256()s.
}
// TODO: cleanup for the last up-to-7 odd elements.  

You could probably write __m256d sums[8] and loop over your vector accumulators, but you'd have to check that the compiler fully unrolls it and still actually keeps everything live in registers.

How can I format my data so I can use AVX/SIMD operations (add, mul) on it to optimise it, as it will be called a lot?

I missed this part of the question earlier. First of all, obviously float would give you 2x the number of elements per vector (and per unit of memory bandwidth). The factor of 2 less memory / cache footprint might give more speedup than that, if the cache hit rate increases.

Ideally, the matrix would be "striped" to match the vector width. Every load from the matrix would get a vector of matrix_cap[IDX(i,j)] for 4 adjacent i values, but the next 32B would be the next j value for the same 4 i values. This means that each vector accumulator is accumulating the sum for a different i in each element, so no need for horizontal sums at the end.
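
A hedged sketch of building such a striped copy (the layout, the stripe_matrix name, and the striped array are assumptions for illustration; npages is assumed to be a multiple of 4):

// One-time re-pack: each group of 4 consecutive rows is interleaved column by column,
// so striped[i*npages + j*4 + k] holds matrix_cap[IDX(i+k, j)] and one 32B load gets
// column j of rows i..i+3 contiguously.
void stripe_matrix(double *striped, const double *matrix_cap, size_t npages) {
    for (size_t i = 0; i < npages; i += 4)
        for (size_t j = 0; j < npages; j++)
            for (size_t k = 0; k < 4; k++)
                striped[i * npages + j * 4 + k] = matrix_cap[(i + k) * npages + j];
}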

P[j] stays linear, but you broadcast-load each element of it, for use with 8 vectors of 4 i values each (or 8 vectors of 8 i values each for float). So you increase the reuse factor for P[j] loads by a factor of the vector width. Broadcast-loads are near-free on Haswell and later (they still only take a load-port uop), and plenty cheap for this on SnB/IvB, where they also take a shuffle-port uop.
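
Putting the striped layout and the broadcast loads together, the inner loop for one group of 4 i values might look roughly like this (a sketch under the assumptions above; in practice you would still unroll over several accumulators as discussed earlier):

__m256d acc = _mm256_setzero_pd();                               // element k accumulates temp[i+k]
for (size_t j = 0; j < npages; j++) {
    __m256d Pj  = _mm256_broadcast_sd(&P[j]);                    // broadcast-load of one rank
    __m256d col = _mm256_loadu_pd(&striped[i * npages + j * 4]); // rows i..i+3 at column j
    acc = _mm256_fmadd_pd(Pj, col, acc);
}
_mm256_storeu_pd(&temp[i], acc);                                 // 4 results, no horizontal sums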
