AVX intrinsics for tiled matrix multiplication

Problem description

I was trying to use AVX512 intrinsics to vectorize my matrix multiplication loop (tiled). I used __m256d variables to store intermediate results and then store them into my result matrix. However, somehow this triggers memory corruption. I've got no hint why this is the case, as the non-AVX version works fine. Also, another weird thing is that the tile size now somehow affects the result.

The matrix structs are attached in the following code section. The function takes two matrix pointers, m1 and m2, and an integer tileSize. Thanks to @harold's feedback, I've now replaced the _mm256_load_pd for matrix m1 with a broadcast. However, the memory corruption problem still persists. I've also attached the output of the memory corruption below.


__m256d rResult, rm1, rm2, rmult;

for (int bi = 0; bi < result->row; bi += tileSize) {
    for (int bj = 0; bj < result->col; bj += tileSize) {
        for (int bk = 0; bk < m1->col; bk += tileSize) {
            for (int i = 0; i < tileSize; i++) {
                for (int j = 0; j < tileSize; j += 4) {
                    rResult = _mm256_setzero_pd();
                    for (int k = 0; k < tileSize; k++) {

                        // result->val[bi+i][bj+j] += m1.val[bi+i][bk+k] * m2.val[bk+k][bj+j];

                        rm1 = _mm256_broadcast_pd((__m128d const *) &m1->val[bi+i][bk+k]);
                        rm2 = _mm256_load_pd(&m2->val[bk+k][bj+j]);
                        rmult = _mm256_mul_pd(rm1, rm2);
                        rResult = _mm256_add_pd(rResult, rmult);
                        _mm256_store_pd(&result->val[bi+i][bj+j], rResult);
                    }
                }
            }
        }
    }
}
return result;

*** Error in `./matrix': free(): invalid next size (fast): 0x0000000001880910 ***
======= Backtrace: =========
/lib64/libc.so.6(+0x81609)[0x2b04a26d0609]
./matrix[0x4016cc]
/lib64/libc.so.6(__libc_start_main+0xf5)[0x2b04a2671495]
./matrix[0x400e29]
======= Memory map: ========
00400000-0040c000 r-xp 00000000 00:2c 6981358608                         /home/matrix
0060b000-0060c000 r--p 0000b000 00:2c 6981358608                         /home/matrix
0060c000-0060d000 rw-p 0000c000 00:2c 6981358608                         /home/matrix
01880000-018a1000 rw-p 00000000 00:00 0                                  [heap]
2b04a1f13000-2b04a1f35000 r-xp 00000000 00:16 12900                      /usr/lib64/ld-2.17.so
2b04a1f35000-2b04a1f3a000 rw-p 00000000 00:00 0
2b04a1f4e000-2b04a1f52000 rw-p 00000000 00:00 0
2b04a2134000-2b04a2135000 r--p 00021000 00:16 12900                      /usr/lib64/ld-2.17.so
2b04a2135000-2b04a2136000 rw-p 00022000 00:16 12900                      /usr/lib64/ld-2.17.so
2b04a2136000-2b04a2137000 rw-p 00000000 00:00 0
2b04a2137000-2b04a2238000 r-xp 00000000 00:16 13188                      /usr/lib64/libm-2.17.so
2b04a2238000-2b04a2437000 ---p 00101000 00:16 13188                      /usr/lib64/libm-2.17.so
2b04a2437000-2b04a2438000 r--p 00100000 00:16 13188                      /usr/lib64/libm-2.17.so
2b04a2438000-2b04a2439000 rw-p 00101000 00:16 13188                      /usr/lib64/libm-2.17.so
2b04a2439000-2b04a244e000 r-xp 00000000 00:16 12867                      /usr/lib64/libgcc_s-4.8.5-20150702.so.1
2b04a244e000-2b04a264d000 ---p 00015000 00:16 12867                      /usr/lib64/libgcc_s-4.8.5-20150702.so.1
2b04a264d000-2b04a264e000 r--p 00014000 00:16 12867                      /usr/lib64/libgcc_s-4.8.5-20150702.so.1
2b04a264e000-2b04a264f000 rw-p 00015000 00:16 12867                      /usr/lib64/libgcc_s-4.8.5-20150702.so.1
2b04a264f000-2b04a2811000 r-xp 00000000 00:16 13172                      /usr/lib64/libc-2.17.so
2b04a2811000-2b04a2a11000 ---p 001c2000 00:16 13172                      /usr/lib64/libc-2.17.so
2b04a2a11000-2b04a2a15000 r--p 001c2000 00:16 13172                      /usr/lib64/libc-2.17.so
2b04a2a15000-2b04a2a17000 rw-p 001c6000 00:16 13172                      /usr/lib64/libc-2.17.so
2b04a2a17000-2b04a2a1c000 rw-p 00000000 00:00 0
2b04a2a1c000-2b04a2a1e000 r-xp 00000000 00:16 13184                      /usr/lib64/libdl-2.17.so
2b04a2a1e000-2b04a2c1e000 ---p 00002000 00:16 13184                      /usr/lib64/libdl-2.17.so
2b04a2c1e000-2b04a2c1f000 r--p 00002000 00:16 13184                      /usr/lib64/libdl-2.17.so
2b04a2c1f000-2b04a2c20000 rw-p 00003000 00:16 13184                      /usr/lib64/libdl-2.17.so
2b04a4000000-2b04a4021000 rw-p 00000000 00:00 0
2b04a4021000-2b04a8000000 ---p 00000000 00:00 0
7ffc8448e000-7ffc844b1000 rw-p 00000000 00:00 0                          [stack]
7ffc845ed000-7ffc845ef000 r-xp 00000000 00:00 0                          [vdso]
ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0                  [vsyscall]
Aborted

Recommended answer

That code loads a small row vector from m1 and a small row vector from m2 and multiplies them, which is not how matrix multiplication works; I assume it's a direct vectorization of the identical scalar loop. You can use a broadcast-load from m1; that way the product with the row vector from m2 results in a row vector of the result, which is convenient (the other way around, broadcasting from m2, you get a column vector of the result, which is tricky to store - unless of course you use a column-major matrix layout).
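
A minimal sketch of that inner step (the helper name madd_step and the unaligned load are illustrative assumptions, not code from the question): it computes acc += m1[i][k] * m2[k][j..j+3]. Note that _mm256_broadcast_sd replicates a single double into all four lanes, while the question's _mm256_broadcast_pd loads and duplicates a 128-bit pair of doubles, which does not match the scalar loop:

#include <immintrin.h>

// acc += m1[i][k] * m2[k][j..j+3], vectorized over 4 consecutive columns.
// _mm256_broadcast_sd replicates one double into all 4 lanes;
// _mm256_broadcast_pd would duplicate a 128-bit *pair* of doubles instead.
static inline __m256d madd_step(__m256d acc, const double *m1_ik, const double *m2_kj)
{
    __m256d a = _mm256_broadcast_sd(m1_ik); // one element of m1, broadcast
    __m256d b = _mm256_loadu_pd(m2_kj);     // a row vector of m2 (alignment-safe load)
    return _mm256_add_pd(acc, _mm256_mul_pd(a, b));
}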

Never resetting rResult is also wrong, and it takes extra care when using tiling, because tiling means that individual results are put aside and then picked up again later. It's convenient to implement C += A*B, because then you don't have to distinguish between the second time a result is worked on (loading rResult back out of the result matrix) and the first time (either zeroing the accumulator or, if you implement C += A*B, also just loading it out of the result).
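
A minimal sketch of that accumulator handling, assuming a row-major struct shaped like the question's (the real one is not shown), dimensions divisible by tileSize, and a result matrix zeroed before first use so that C += A*B is correct; the struct and function names here are mine:

#include <immintrin.h>

// Assumed shape of the question's matrix struct (the real one is not shown).
typedef struct { int row, col; double **val; } Matrix;

// C += A*B, tiled: the accumulator is loaded back out of the result matrix,
// so partial sums from earlier bk tiles are picked up instead of overwritten.
static void matmul_tiled(Matrix *result, const Matrix *m1, const Matrix *m2, int tileSize)
{
    for (int bi = 0; bi < result->row; bi += tileSize)
        for (int bj = 0; bj < result->col; bj += tileSize)
            for (int bk = 0; bk < m1->col; bk += tileSize)
                for (int i = 0; i < tileSize; i++)
                    for (int j = 0; j < tileSize; j += 4) {
                        // reload the partial sum, don't zero it per tile
                        __m256d acc = _mm256_loadu_pd(&result->val[bi+i][bj+j]);
                        for (int k = 0; k < tileSize; k++) {
                            __m256d a = _mm256_broadcast_sd(&m1->val[bi+i][bk+k]);
                            __m256d b = _mm256_loadu_pd(&m2->val[bk+k][bj+j]);
                            acc = _mm256_add_pd(acc, _mm256_mul_pd(a, b));
                        }
                        // one store per j-group, after the k loop, not inside it
                        _mm256_storeu_pd(&result->val[bi+i][bj+j], acc);
                    }
}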

There are also some performance bugs:

  • Using a single accumulator. This limits the inner loop to running once every 4 cycles (Skylake), because of the loop-carried dependency through the addition (or FMA). There should be 2 FMAs per cycle, but this way there is 1 FMA per 4 cycles: 1/8th speed.
  • Using a 2:1 ratio of loads to FMAs (assuming the mul and add are contracted into an FMA); it has to be 1:1 or better to avoid bottlenecking on load throughput. A 2:1 ratio is limited to half speed.

The solution for both of them is multiplying a small column vector from m1 with a small row vector from m2 in the inner loop, summing into a small matrix of accumulators rather than just one of them. For example, if you use a 3x16 region (3x4 vectors, with a vector length of 4, and with the vectors corresponding to loads from m2; from m1 you would do broadcast-loads), then there are 12 accumulators and therefore 12 independent dependency chains: enough to hide the high latency-throughput product of FMA (2 per cycle, but 4 cycles long on Skylake, so you need at least 8 independent chains, and at least 10 on Haswell). It also means there are 7 loads and 12 FMAs in the inner loop, even better than 1:1; it can even support Turbo frequencies without overclocking the cache.
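
A sketch of such a 3x16 micro-kernel, under the assumptions above (FMA support, compile with -mfma; the function name, signature, and row-pointer parameters are illustrative, matching the question's array-of-row-pointers layout):

#include <immintrin.h>

// 3 rows x 4 vectors of C give 12 independent accumulators; each k step
// issues 3 broadcast loads from A, 4 vector loads from B, and 12 FMAs,
// i.e. the 7-loads-to-12-FMAs ratio described above.
static void kernel_3x16(const double *a0, const double *a1, const double *a2, // 3 rows of A, K elements each
                        double *const *brow,                                   // K row pointers into B
                        double *c0, double *c1, double *c2, int K)             // 3 rows of C, 16 columns each
{
    __m256d acc0[4], acc1[4], acc2[4];
    for (int v = 0; v < 4; v++) {                 // C += A*B: start from C's current values
        acc0[v] = _mm256_loadu_pd(c0 + 4*v);
        acc1[v] = _mm256_loadu_pd(c1 + 4*v);
        acc2[v] = _mm256_loadu_pd(c2 + 4*v);
    }
    for (int k = 0; k < K; k++) {
        __m256d b[4];
        for (int v = 0; v < 4; v++)               // 4 vector loads from B
            b[v] = _mm256_loadu_pd(brow[k] + 4*v);
        __m256d x0 = _mm256_broadcast_sd(&a0[k]); // 3 broadcast loads from A
        __m256d x1 = _mm256_broadcast_sd(&a1[k]);
        __m256d x2 = _mm256_broadcast_sd(&a2[k]);
        for (int v = 0; v < 4; v++) {             // 12 FMAs, 12 independent chains
            acc0[v] = _mm256_fmadd_pd(x0, b[v], acc0[v]);
            acc1[v] = _mm256_fmadd_pd(x1, b[v], acc1[v]);
            acc2[v] = _mm256_fmadd_pd(x2, b[v], acc2[v]);
        }
    }
    for (int v = 0; v < 4; v++) {                 // write the 3x16 block back
        _mm256_storeu_pd(c0 + 4*v, acc0[v]);
        _mm256_storeu_pd(c1 + 4*v, acc1[v]);
        _mm256_storeu_pd(c2 + 4*v, acc2[v]);
    }
}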

I would also like to note that setting the tile size the same in every dimension is not necessarily best. Maybe it is, but probably not; the dimensions do act a little differently.

More advanced performance issues:

  • Not re-packing tiles. This means the tiles span more pages than necessary, which hurts TLB effectiveness. You can easily end up in a situation where your tiles fit in the cache, but are spread over too many pages to fit in the TLB. TLB thrashing is not good. (See the packing sketch below.)

Using asymmetric tile sizes, you can arrange for either the m1 tiles or the m2 tiles to be TLB-friendly, but not both at the same time.
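
As an illustration of the re-packing idea from the list above (the function name and buffer management are assumptions; packed would be a reusable, ideally 32-byte-aligned scratch buffer of rows x cols doubles):

// Copy one rows x cols tile starting at (r0, c0) into a dense buffer, so the
// kernel walks one contiguous block spanning few pages instead of many
// separately-allocated rows.
static void pack_tile(double *packed, double *const *val,
                      int r0, int c0, int rows, int cols)
{
    for (int r = 0; r < rows; r++)
        for (int c = 0; c < cols; c++)
            packed[r * cols + c] = val[r0 + r][c0 + c];
}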
