行大小大于向量宽度时,SIMD转置 [英] SIMD transpose when row size is greater than vector width

查看:104
本文介绍了行大小大于向量宽度时,SIMD转置的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

您可以找到很多

You can find a lot of good answers for transposing a matrix which falls with the natural size of the SIMD instruction set, in particular, where the size of one row is no more than the vector width. Examples would be a 4x4 float transpose in SSE, or a 4x4 double or 8x8 float transpose in AVX/AVX2 (double everything again for AVX-512).

但是,当矩阵大于矩阵时,有哪些选择呢?例如,使用AVX2的16x16 浮点矩阵?可以完全使用SIMD改组来加快处理速度,还是仅采用聚集+顺序写入方式?

However, what are the options when the matrix is larger than that? E.g., a 16x16 float matrix using AVX2? Can SIMD shuffles be used at all to speed things up, or is a gather + sequential write the only way?

推荐答案

如果所有矩阵尺寸都是数据包大小的倍数,则可以按块进行操作并根据需要交换块.使用SSE2的4x4双矩阵示例:

If all your matrix dimensions are a multiple of your packet-size you can do the operation block-wise and swap the blocks as needed. Example for 4x4 double matrix using SSE2:

// transpose vectors i0 and i1 and store the result to addresses r0 and r1
void transpose2x2(double *r0, double* r1, __m128d i0, __m128d i1)
{
    __m128d t0 = _mm_unpacklo_pd(i0,i1);
    __m128d t1 = _mm_unpackhi_pd(i0,i1);
    _mm_storeu_pd(r0, t0);
    _mm_storeu_pd(r1, t1);
}


void transpose(double mat[4][4])
{
    // transpose [00]-block in-place
    transpose2x2(mat[0]+0, mat[1]+0,_mm_loadu_pd(mat[0]+0),_mm_loadu_pd(mat[1]+0));

    // load [20]-block
    __m128d t20 = _mm_loadu_pd(mat[2]+0), t30 = _mm_loadu_pd(mat[3]+0);
    // transpose [02]-block and store it to [20] position
    transpose2x2(mat[2]+0,mat[3]+0, _mm_loadu_pd(mat[0]+2),_mm_loadu_pd(mat[1]+2));
    // transpose temp-block and store it to [02] position
    transpose2x2(mat[0]+2,mat[1]+2, t20, t30);

    // transpose [22]-block in-place
    transpose2x2(mat[2]+2, mat[3]+2,_mm_loadu_pd(mat[2]+2),_mm_loadu_pd(mat[3]+2));
}

这应该相对容易地扩展到其他平方矩阵,其他标量类型和其他体系结构.不是数据包大小倍数的矩阵可能会更复杂(如果它们足够大,那么进行矢量化的大部分工作,而只是手动处理最后几行/列,可能是值得的).

This should be relatively easy to extend to other square matrices, other scalar types and other architectures. Matrices which are not a multiple of packet sizes are perhaps more complicated (if they are large enough, it will probably be worth it to do most the work with vectorization and just do the last rows/columns manually).

对于某些尺寸,例如3x4或3x8矩阵有特殊的算法[1]-如果您有1003x1003矩阵,则可以在最后的行/列中利用它(并且可能还有其他奇数大小的算法).

For some sizes, e.g. 3x4 or 3x8 matrices there are special algorithms [1] -- if you have a 1003x1003 matrix, you could exploit that for the last rows/columns (and there are probably algorithms for other odd sizes as well).

您还可以花一些力气为矩形矩阵编写代码(必须考虑如何避免一次只缓存一个以上的块,但这是可能的).

With some effort you could also write this for rectangular matrices (some thoughts have to be made how to avoid having to cache more than one block at a time, but it is possible).

Godbolt演示: https://godbolt.org/z/tVk_Bc

Godbolt demo: https://godbolt.org/z/tVk_Bc

[1] 查看全文

登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆