reading/writing a matrix with a stride much larger than its width causes a big loss in performance

Question

I'm doing dense matrix multiplication on 1024x1024 matrices. I do this using loop blocking/tiling using 64x64 tiles. I have created a highly optimized 64x64 matrix multiplication function (see the end of my question for the code).

gemm64(float *a, float *b, float *c, int stride).  

Here is the code which runs over the tiles. A 1024x1024 matrix has 16x16 tiles.

for(int i=0; i<16; i++) {
    for(int j=0; j<16; j++) {
        for(int k=0; k<16; k++) {
            gemm64(&a[64*(i*1024 + k)], &b[64*(k*1024 + j)], &c[64*(i*1024 + j)], 1024);
        }
    }
}
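As a sanity check (my addition, not part of the original post), the tiled driver above should produce the same result as a naive triple-loop multiply, provided `c` starts zeroed:

```c
/* Naive reference multiply for validating the tiled gemm64 driver on
 * small inputs (my sketch): computes c += a * b for n x n row-major
 * matrices.  Zero c before calling if you want c = a * b. */
static void gemm_ref(const float *a, const float *b, float *c, int n) {
    for (int i = 0; i < n; i++)
        for (int k = 0; k < n; k++) {
            float aik = a[i * n + k];
            for (int j = 0; j < n; j++)
                c[i * n + j] += aik * b[k * n + j];
        }
}
```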

However, as a guess, I tried rearranging the memory of all the tiles (see the end of this question for the code) for the b matrix in a new matrix b2 so that the stride of each tile is 64 instead of 1024. This effectively makes an array of 64x64 matrices with stride=64.

float *b2 = (float*)_mm_malloc(1024*1024*sizeof(float), 64);
reorder(b, b2, 1024);
for(int i=0; i<16; i++) {
    for(int j=0; j<16; j++) {
        for(int k=0; k<16; k++) {
            gemm64_v2(&a[64*(i*1024 + k)], &b2[64*64*(k*16 + j)], &c[64*(i*1024 + j)], 64);
        }
    }
}
_mm_free(b2);

Notice how the offset for b changed from &b[64*(k*1024 + j)] to &b2[64*64*(k*16 + j)] and that the stride passed to gemm64 has changed from 1024 to 64.

The performance of my code jumped from less than 20% to 70% of the peak flops on my Sandy Bridge system!

Why does rearranging the tiles of the b matrix in this way make such a big difference?

The arrays a,b, b2, and c are 64 byte aligned.

extern "C" void gemm64(float *a, float *b, float *c, int stride) {
    for(int i=0; i<64; i++) {
        row_m64x64(&a[1024*i], b, &c[1024*i], stride);
    }
}

void row_m64x64(const float *a, const float *b, float *c, int stride) {
    __m256 tmp0, tmp1, tmp2, tmp3, tmp4, tmp5, tmp6, tmp7;
    tmp0 = _mm256_loadu_ps(&c[ 0]);
    tmp1 = _mm256_loadu_ps(&c[ 8]);
    tmp2 = _mm256_loadu_ps(&c[16]);
    tmp3 = _mm256_loadu_ps(&c[24]);
    tmp4 = _mm256_loadu_ps(&c[32]);
    tmp5 = _mm256_loadu_ps(&c[40]);
    tmp6 = _mm256_loadu_ps(&c[48]);
    tmp7 = _mm256_loadu_ps(&c[56]);

    for(int i=0; i<64; i++) {
        __m256 areg0 = _mm256_broadcast_ss(&a[i]);

        __m256 breg0 = _mm256_loadu_ps(&b[stride*i +  0]);
        tmp0 = _mm256_add_ps(_mm256_mul_ps(areg0,breg0), tmp0);    
        __m256 breg1 = _mm256_loadu_ps(&b[stride*i +  8]);
        tmp1 = _mm256_add_ps(_mm256_mul_ps(areg0,breg1), tmp1);
        __m256 breg2 = _mm256_loadu_ps(&b[stride*i + 16]);
        tmp2 = _mm256_add_ps(_mm256_mul_ps(areg0,breg2), tmp2);    
        __m256 breg3 = _mm256_loadu_ps(&b[stride*i + 24]);
        tmp3 = _mm256_add_ps(_mm256_mul_ps(areg0,breg3), tmp3);   
        __m256 breg4 = _mm256_loadu_ps(&b[stride*i + 32]);
        tmp4 = _mm256_add_ps(_mm256_mul_ps(areg0,breg4), tmp4);    
        __m256 breg5 = _mm256_loadu_ps(&b[stride*i + 40]);
        tmp5 = _mm256_add_ps(_mm256_mul_ps(areg0,breg5), tmp5);    
        __m256 breg6 = _mm256_loadu_ps(&b[stride*i + 48]);
        tmp6 = _mm256_add_ps(_mm256_mul_ps(areg0,breg6), tmp6);    
        __m256 breg7 = _mm256_loadu_ps(&b[stride*i + 56]);
        tmp7 = _mm256_add_ps(_mm256_mul_ps(areg0,breg7), tmp7);    
    }
    _mm256_storeu_ps(&c[ 0], tmp0);
    _mm256_storeu_ps(&c[ 8], tmp1);
    _mm256_storeu_ps(&c[16], tmp2);
    _mm256_storeu_ps(&c[24], tmp3);
    _mm256_storeu_ps(&c[32], tmp4);
    _mm256_storeu_ps(&c[40], tmp5);
    _mm256_storeu_ps(&c[48], tmp6);
    _mm256_storeu_ps(&c[56], tmp7);
}

The code to reorder the b matrix.

void reorder(float *b, float *b2, int stride) {
    //int k = 0;
    for(int i=0; i<16; i++) {
        for(int j=0; j<16; j++) {
            for(int i2=0; i2<64; i2++) {
                for(int j2=0; j2<64; j2++) {
                    //b2[k++] = b[1024*(64*i+i2) + 64*j + j2];
                    b2[64*64*(i*16 + j) + 64*i2+j2] = b[1024*(64*i+i2) + 64*j + j2];
                }
            }
        }
    }
}
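To see why the offset for b changes from `&b[64*(k*1024 + j)]` to `&b2[64*64*(k*16 + j)]`, it helps to write both layouts as index formulas. The two helpers below (my sketch, not part of the original post) give the flat offset of element (i2, j2) of tile (k, j) in each layout; `reorder()` is exactly the copy `b2[off_b2(...)] = b[off_b(...)]` over all indices:

```c
/* Flat offset of element (i2, j2) inside tile (k, j) of a 1024x1024 matrix.
 * off_b:  original layout, tiles start at 64*(k*1024 + j), row stride 1024.
 * off_b2: reordered layout, tiles are contiguous 64x64 blocks, row stride 64. */
static int off_b (int k, int j, int i2, int j2) { return 64 * (k * 1024 + j) + 1024 * i2 + j2; }
static int off_b2(int k, int j, int i2, int j2) { return 64 * 64 * (k * 16 + j) + 64 * i2 + j2; }
```

In `off_b2`, consecutive j2 and consecutive i2 both stay inside one contiguous 4096-float block, which is what makes each tile a stride-64 matrix.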

Answer

I think the problem lies in the innermost loop, and has to do with prefetching. You are doing the following:

  1. load 256 consecutive bytes (4 cache lines) from c
  2. load 256 consecutive bytes from b, 64 times (256 cache lines in total)
  3. store 256 consecutive bytes back to c
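To put a number on step 2 (my arithmetic, not from the original answer): each 256-byte row read covers 64 floats, so the gap to the next row read depends only on the stride passed to `gemm64`:

```c
#include <stddef.h>

/* Bytes skipped between the end of one 256-byte row read of b and the
 * start of the next, given the stride (in floats) passed to gemm64. */
static long gap_bytes(long stride_floats) {
    return stride_floats * (long)sizeof(float) - 64 * (long)sizeof(float);
}
```

With stride 64 the gap is 0, so the 64 reads form one contiguous 16 KiB block; with stride 1024 each 256-byte read is followed by a 3840-byte jump.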

In step 2, when stride is 64, you are effectively reading a long consecutive block of memory. When stride is 1024, you are reading memory in discontinuous steps, 256 bytes at a time.

As a result, when stride is 64, the prefetcher reads successive blocks into the cache ahead of time, and you get at most one cache miss per row. When stride is 1024, the prefetcher is confused by the discontinuous reads (you alternate between reading successive cache lines and widely separated cache lines) and there are 64 cache misses per row.
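The two access patterns can be reproduced in miniature (my sketch, not from the original answer). Both loops below read one 64x64 tile of floats; with stride 64 the reads form one contiguous 16 KiB block, while with stride 1024 every 256-byte row read is followed by a 3840-byte jump that the hardware prefetcher handles far less well. Time the two calls with a profiler to see the gap; the returned sums only confirm that the same amount of data is visited either way:

```c
/* Sum one 64x64 tile of floats whose rows are `stride` floats apart.
 * sum_rows(b, 64) walks memory contiguously; sum_rows(b, 1024) reads
 * 256 bytes, skips ahead, reads 256 bytes, and so on. */
static float sum_rows(const float *b, int stride) {
    float s = 0.0f;
    for (int i = 0; i < 64; i++)          /* 64 rows of one tile           */
        for (int j = 0; j < 64; j++)      /* 64 floats = 256 bytes per row */
            s += b[stride * i + j];
    return s;
}
```

Calling `sum_rows(b, 1024)` on the original layout versus `sum_rows(b2, 64)` on the reordered one touches the same tile data in the two patterns described above.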
