reading/writing a matrix with a stride much larger than its width causes a big loss in performance

Question

I'm doing dense matrix multiplication on 1024x1024 matrices. I do this using loop blocking/tiling using 64x64 tiles. I have created a highly optimized 64x64 matrix multiplication function (see the end of my question for the code).

gemm64(float *a, float *b, float *c, int stride).  

Here is the code which runs over the tiles. A 1024x1024 matrix has 16x16 tiles.

for(int i=0; i<16; i++) {
    for(int j=0; j<16; j++) {
        for(int k=0; k<16; k++) {
            gemm64(&a[64*(i*1024 + k)], &b[64*(k*1024 + j)], &c[64*(i*1024 + j)], 1024);
        }
    }
}
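As a sanity check (my addition, not part of the original post), the tiled driver above should produce the same result as a naive triple-loop multiply, provided `c` starts zeroed:

```c
/* Naive reference multiply for validating the tiled gemm64 driver on
 * small inputs (my sketch): computes c += a * b for n x n row-major
 * matrices.  Zero c before calling if you want c = a * b. */
static void gemm_ref(const float *a, const float *b, float *c, int n) {
    for (int i = 0; i < n; i++)
        for (int k = 0; k < n; k++) {
            float aik = a[i * n + k];
            for (int j = 0; j < n; j++)
                c[i * n + j] += aik * b[k * n + j];
        }
}
```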

However, as a guess, I tried rearranging the memory of all the tiles (see the end of this question for the code) for the b matrix in a new matrix b2 so that the stride of each tile is 64 instead of 1024. This effectively makes an array of 64x64 matrices with stride=64.

float *b2 = (float*)_mm_malloc(1024*1024*sizeof(float), 64);
reorder(b, b2, 1024);
for(int i=0; i<16; i++) {
    for(int j=0; j<16; j++) {
        for(int k=0; k<16; k++) {
            gemm64_v2(&a[64*(i*1024 + k)], &b2[64*64*(k*16 + j)], &c[64*(i*1024 + j)], 64);
        }
    }
}
_mm_free(b2);

Notice how the offset for b changed from &b[64*(k*1024 + j)] to &b2[64*64*(k*16 + j)] and that the stride passed to gemm64 has changed from 1024 to 64.

The performance of my code jumped from less than 20% to 70% of the peak flops on my Sandy Bridge system!

Why does rearranging the tiles of the b matrix in this way make such a big difference?

The arrays a,b, b2, and c are 64 byte aligned.

extern "C" void gemm64(float *a, float *b, float *c, int stride) {
    for(int i=0; i<64; i++) {
        row_m64x64(&a[1024*i], b, &c[1024*i], stride);
    }
}

void row_m64x64(const float *a, const float *b, float *c, int stride) {
    __m256 tmp0, tmp1, tmp2, tmp3, tmp4, tmp5, tmp6, tmp7;
    tmp0 = _mm256_loadu_ps(&c[ 0]);
    tmp1 = _mm256_loadu_ps(&c[ 8]);
    tmp2 = _mm256_loadu_ps(&c[16]);
    tmp3 = _mm256_loadu_ps(&c[24]);
    tmp4 = _mm256_loadu_ps(&c[32]);
    tmp5 = _mm256_loadu_ps(&c[40]);
    tmp6 = _mm256_loadu_ps(&c[48]);
    tmp7 = _mm256_loadu_ps(&c[56]);

    for(int i=0; i<64; i++) {
        __m256 areg0 = _mm256_broadcast_ss(&a[i]);

        __m256 breg0 = _mm256_loadu_ps(&b[stride*i +  0]);
        tmp0 = _mm256_add_ps(_mm256_mul_ps(areg0,breg0), tmp0);    
        __m256 breg1 = _mm256_loadu_ps(&b[stride*i +  8]);
        tmp1 = _mm256_add_ps(_mm256_mul_ps(areg0,breg1), tmp1);
        __m256 breg2 = _mm256_loadu_ps(&b[stride*i + 16]);
        tmp2 = _mm256_add_ps(_mm256_mul_ps(areg0,breg2), tmp2);    
        __m256 breg3 = _mm256_loadu_ps(&b[stride*i + 24]);
        tmp3 = _mm256_add_ps(_mm256_mul_ps(areg0,breg3), tmp3);   
        __m256 breg4 = _mm256_loadu_ps(&b[stride*i + 32]);
        tmp4 = _mm256_add_ps(_mm256_mul_ps(areg0,breg4), tmp4);    
        __m256 breg5 = _mm256_loadu_ps(&b[stride*i + 40]);
        tmp5 = _mm256_add_ps(_mm256_mul_ps(areg0,breg5), tmp5);    
        __m256 breg6 = _mm256_loadu_ps(&b[stride*i + 48]);
        tmp6 = _mm256_add_ps(_mm256_mul_ps(areg0,breg6), tmp6);    
        __m256 breg7 = _mm256_loadu_ps(&b[stride*i + 56]);
        tmp7 = _mm256_add_ps(_mm256_mul_ps(areg0,breg7), tmp7);    
    }
    _mm256_storeu_ps(&c[ 0], tmp0);
    _mm256_storeu_ps(&c[ 8], tmp1);
    _mm256_storeu_ps(&c[16], tmp2);
    _mm256_storeu_ps(&c[24], tmp3);
    _mm256_storeu_ps(&c[32], tmp4);
    _mm256_storeu_ps(&c[40], tmp5);
    _mm256_storeu_ps(&c[48], tmp6);
    _mm256_storeu_ps(&c[56], tmp7);
}

The code to reorder the b matrix.

void reorder(float *b, float *b2, int stride) {
    //int k = 0;
    for(int i=0; i<16; i++) {
        for(int j=0; j<16; j++) {
            for(int i2=0; i2<64; i2++) {
                for(int j2=0; j2<64; j2++) {
                    //b2[k++] = b[1024*(64*i+i2) + 64*j + j2];
                    b2[64*64*(i*16 + j) + 64*i2+j2] = b[1024*(64*i+i2) + 64*j + j2];
                }
            }
        }
    }
}
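To see why the offset for b changes from `&b[64*(k*1024 + j)]` to `&b2[64*64*(k*16 + j)]`, it helps to write both layouts as index formulas. The two helpers below (my sketch, not part of the original post) give the flat offset of element (i2, j2) of tile (k, j) in each layout; `reorder()` is exactly the copy `b2[off_b2(...)] = b[off_b(...)]` over all indices:

```c
/* Flat offset of element (i2, j2) inside tile (k, j) of a 1024x1024 matrix.
 * off_b:  original layout, tiles start at 64*(k*1024 + j), row stride 1024.
 * off_b2: reordered layout, tiles are contiguous 64x64 blocks, row stride 64. */
static int off_b (int k, int j, int i2, int j2) { return 64 * (k * 1024 + j) + 1024 * i2 + j2; }
static int off_b2(int k, int j, int i2, int j2) { return 64 * 64 * (k * 16 + j) + 64 * i2 + j2; }
```

In `off_b2`, consecutive j2 and consecutive i2 both stay inside one contiguous 4096-float block, which is what makes each tile a stride-64 matrix.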

Answer

I think the problem lies in the innermost loop, and has to do with prefetching. You are doing the following:

  1. load 256 consecutive bytes (4 cache lines) from c
  2. load 256 consecutive bytes from b, 64 times (256 cache lines in total)
  3. store 256 consecutive bytes back to c
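To put a number on step 2 (my arithmetic, not from the original answer): each 256-byte row read covers 64 floats, so the gap to the next row read depends only on the stride passed to `gemm64`:

```c
#include <stddef.h>

/* Bytes skipped between the end of one 256-byte row read of b and the
 * start of the next, given the stride (in floats) passed to gemm64. */
static long gap_bytes(long stride_floats) {
    return stride_floats * (long)sizeof(float) - 64 * (long)sizeof(float);
}
```

With stride 64 the gap is 0, so the 64 reads form one contiguous 16 KiB block; with stride 1024 each 256-byte read is followed by a 3840-byte jump.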

In step 2, when stride is 64, you are effectively reading a long consecutive block of memory. When stride is 1024, you are reading memory in discontinuous steps, 256 bytes at a time.

As a result, when stride is 64, the prefetcher reads successive blocks into the cache ahead of time, and you get at most one cache miss per row. When stride is 1024, the prefetcher is confused by the discontinuous reads (you alternate between reading successive cache lines and widely separated cache lines) and there are 64 cache misses per row.
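The two access patterns can be reproduced in miniature (my sketch, not from the original answer). Both loops below read one 64x64 tile of floats; with stride 64 the reads form one contiguous 16 KiB block, while with stride 1024 every 256-byte row read is followed by a 3840-byte jump that the hardware prefetcher handles far less well. Time the two calls with a profiler to see the gap; the returned sums only confirm that the same amount of data is visited either way:

```c
/* Sum one 64x64 tile of floats whose rows are `stride` floats apart.
 * sum_rows(b, 64) walks memory contiguously; sum_rows(b, 1024) reads
 * 256 bytes, skips ahead, reads 256 bytes, and so on. */
static float sum_rows(const float *b, int stride) {
    float s = 0.0f;
    for (int i = 0; i < 64; i++)          /* 64 rows of one tile           */
        for (int j = 0; j < 64; j++)      /* 64 floats = 256 bytes per row */
            s += b[stride * i + j];
    return s;
}
```

Calling `sum_rows(b, 1024)` on the original layout versus `sum_rows(b2, 64)` on the reordered one touches the same tile data in the two patterns described above.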
