Parallelizing matrix multiplication through threading and SIMD


Problem description

I am trying to speed up matrix multiplication on a multicore architecture. To this end, I try to use threads and SIMD at the same time, but my results are not good. I test the speedup over sequential matrix multiplication:

void sequentialMatMul(void* params)
{
    cout << "SequentialMatMul started.";
    int i, j, k;
    for (i = 0; i < N; i++)
    {
        for (k = 0; k < N; k++)
        {
            for (j = 0; j < N; j++)
            {
                X[i][j] += A[i][k] * B[k][j];
            }
        }
    }
    cout << "\nSequentialMatMul finished.";
}

I tried to add threading and SIMD to the matrix multiplication as follows:

void threadedSIMDMatMul(void* params)
{
    bounds *args = (bounds*)params;
    int lowerBound = args->lowerBound;
    int upperBound = args->upperBound;
    int idx = args->idx;

    int i, j, k;
    for (i = lowerBound; i < upperBound; i++)
    {
        for (k = 0; k < N; k++)
        {
            for (j = 0; j < N; j+=4)
            {
                mmx1 = _mm_loadu_ps(&X[i][j]);
                mmx2 = _mm_load_ps1(&A[i][k]);
                mmx3 = _mm_loadu_ps(&B[k][j]);
                mmx4 = _mm_mul_ps(mmx2, mmx3);
                mmx0 = _mm_add_ps(mmx1, mmx4);
                _mm_storeu_ps(&X[i][j], mmx0);
            }
        }
    }
    _endthread();
}

The following section is used for calculating the lower and upper bound of each thread:

bounds arg[CORES];
for (int part = 0; part < CORES; part++)
{
    arg[part].idx = part;
    arg[part].lowerBound = (N / CORES)*part;
    arg[part].upperBound = (N / CORES)*(part + 1);
}
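
(The bounds struct itself does not appear in the question; a minimal definition consistent with how it is used above would presumably be:)

// Assumed definition, not shown in the question: the work range handed to each thread.
struct bounds
{
    int lowerBound;   // first row this thread processes (inclusive)
    int upperBound;   // one past the last row (exclusive)
    int idx;          // thread index, 0 .. CORES-1
};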

And finally the threaded SIMD version is called like this:

HANDLE  handle[CORES];      
for (int part = 0; part < CORES; part++)
{
    handle[part] = (HANDLE)_beginthread(threadedSIMDMatMul, 0, (void*)&arg[part]);
}
for (int part = 0; part < CORES; part++)
{
    WaitForSingleObject(handle[part], INFINITE);
}

The results are as follows:

Test 1:

// arrays are defined as follows
float A[N][N];
float B[N][N];
float X[N][N];
N=2048
Core=1//just one thread

Sequential time: 11129 ms
Threaded SIMD matmul time: 14650 ms
Speedup = 0.75x

Test 2:

// arrays are defined as follows
float **A = (float**)_aligned_malloc(N * sizeof(float*), 16);
float **B = (float**)_aligned_malloc(N * sizeof(float*), 16);
float **X = (float**)_aligned_malloc(N * sizeof(float*), 16);
for (int k = 0; k < N; k++)
{
    A[k] = (float*)malloc(N * sizeof(float));
    B[k] = (float*)malloc(N * sizeof(float));
    X[k] = (float*)malloc(N * sizeof(float));
}
N=2048
Core=1//just one thread

Sequential time: 15907 ms
Threaded SIMD matmul time: 18578 ms
Speedup = 0.85x

Test 3:

// arrays are defined as follows
float A[N][N];
float B[N][N];
float X[N][N];
N=2048
Core=2

Sequential time: 10855 ms
Threaded SIMD matmul time: 27967 ms
Speedup = 0.38x

Test 4:

// arrays are defined as follows
float **A = (float**)_aligned_malloc(N * sizeof(float*), 16);
float **B = (float**)_aligned_malloc(N * sizeof(float*), 16);
float **X = (float**)_aligned_malloc(N * sizeof(float*), 16);
for (int k = 0; k < N; k++)
{
    A[k] = (float*)malloc(N * sizeof(float));
    B[k] = (float*)malloc(N * sizeof(float));
    X[k] = (float*)malloc(N * sizeof(float));
}
N=2048
Core=2

Sequential time: 16579 ms
Threaded SIMD matmul time: 30160 ms
Speedup = 0.51x

My question: why am I not getting any speedup?

Recommended answer

Here are the times I get building on your algorithm on my four-core i7 IVB processor:

sequential:         3.42 s
4 threads:          0.97 s
4 threads + SSE:    0.86 s

Here are the times on a 2-core P9600 @ 2.53 GHz, which is similar to the OP's E2200 @ 2.2 GHz:

sequential:         6.52 s
2 threads:          3.66 s
2 threads + SSE:    3.75 s

I used OpenMP because it makes this easy. Each thread in OpenMP effectively runs over

lowerBound = N*part/CORES;
upperBound = N*(part + 1)/CORES;

(Note that this is slightly different from your definition. Your definition can give the wrong result for some values of N due to rounding, since you divide by CORES first.)
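(To illustrate the rounding issue, here is a small standalone check; N = 1000 and CORES = 3 are example values chosen because the two formulas differ there, not values taken from the post:)

//c99 -Wall bounds_demo.c   (illustrative only)
#include <stdio.h>

int main(void) {
    int N = 1000, CORES = 3;   // example values where the two formulas disagree
    for (int part = 0; part < CORES; part++) {
        // Question's formula: divide first, so (N / CORES) truncates to 333
        int opLower  = (N / CORES) * part;
        int opUpper  = (N / CORES) * (part + 1);
        // Answer's formula: multiply first, then divide
        int ompLower = N * part / CORES;
        int ompUpper = N * (part + 1) / CORES;
        printf("part %d: question's bounds [%d,%d)   answer's bounds [%d,%d)\n",
               part, opLower, opUpper, ompLower, ompUpper);
    }
    // With the divide-first formula the last thread stops at row 999,
    // so row 999 of the result is never computed.
    return 0;
}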

As for the SIMD version: it's not much faster, probably because it is memory-bandwidth bound. It's probably not really faster because GCC already vectorizes the loop.

The most optimal solution is much more complicated. You need to use loop tiling and reorder the elements within the tiles to get optimal performance. I don't have time to do that today.

Here is the code I used:

//c99 -O3 -fopenmp -Wall foo.c
#include <stdio.h>
#include <string.h>
#include <x86intrin.h>
#include <omp.h>

void gemm(float * restrict a, float * restrict b, float * restrict c, int n) {
    for(int i=0; i<n; i++) {
        for(int k=0; k<n; k++) {
            for(int j=0; j<n; j++) {
                c[i*n+j] += a[i*n+k]*b[k*n+j];
            }
        }
    }
}

void gemm_tlp(float * restrict a, float * restrict b, float * restrict c, int n) {
    #pragma omp parallel for
    for(int i=0; i<n; i++) {
        for(int k=0; k<n; k++) {
            for(int j=0; j<n; j++) {
                c[i*n+j] += a[i*n+k]*b[k*n+j];
            }
        }
    }
}   

void gemm_tlp_simd(float * restrict a, float * restrict b, float * restrict c, int n) {
    #pragma omp parallel for
    for(int i=0; i<n; i++) {
        for(int k=0; k<n; k++) {
            __m128 a4 = _mm_set1_ps(a[i*n+k]);
            for(int j=0; j<n; j+=4) {
                __m128 c4 = _mm_load_ps(&c[i*n+j]);
                __m128 b4 = _mm_load_ps(&b[k*n+j]);
                c4 = _mm_add_ps(_mm_mul_ps(a4,b4),c4);
                _mm_store_ps(&c[i*n+j], c4);
            }
        }
    }
}

int main(void) {
    int n = 2048;
    float *a = _mm_malloc(n*n * sizeof *a, 64);
    float *b = _mm_malloc(n*n * sizeof *b, 64);
    float *c1 = _mm_malloc(n*n * sizeof *c1, 64);
    float *c2 = _mm_malloc(n*n * sizeof *c2, 64);
    float *c3 = _mm_malloc(n*n * sizeof *c3, 64);
    for(int i=0; i<n*n; i++) a[i] = 1.0*i;
    for(int i=0; i<n*n; i++) b[i] = 1.0*i;
    memset(c1, 0, n*n * sizeof *c1);
    memset(c2, 0, n*n * sizeof *c2);
    memset(c3, 0, n*n * sizeof *c3);
    double dtime;

    dtime = -omp_get_wtime();
    gemm(a,b,c1,n);
    dtime += omp_get_wtime();
    printf("time %f\n", dtime);

    dtime = -omp_get_wtime();
    gemm_tlp(a,b,c2,n);
    dtime += omp_get_wtime();
    printf("time %f\n", dtime);

    dtime = -omp_get_wtime();
    gemm_tlp_simd(a,b,c3,n);
    dtime += omp_get_wtime();
    printf("time %f\n", dtime);
    printf("error %d\n", memcmp(c1,c2, n*n*sizeof *c1));
    printf("error %d\n", memcmp(c1,c3, n*n*sizeof *c1));
}
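
For reference, the loop tiling mentioned above (but not implemented in the answer) could be sketched roughly as follows. This is only an illustration of the blocking idea, with an assumed, untuned block size BS; it does not include the reordering of elements within tiles that the answer also mentions.

//c99 -O3 -fopenmp -Wall tile_sketch.c   (illustrative sketch, not part of the original answer)
#define BS 64   // assumed block size; would need tuning for a given cache

// Tiled i-k-j multiply: work on BS x BS blocks so the parts of b and c
// being reused stay in cache. Parallelizing over row blocks keeps each
// thread's writes to c disjoint.
void gemm_tiled(float * restrict a, float * restrict b, float * restrict c, int n) {
    #pragma omp parallel for
    for (int ii = 0; ii < n; ii += BS) {
        for (int kk = 0; kk < n; kk += BS) {
            for (int jj = 0; jj < n; jj += BS) {
                int imax = ii + BS < n ? ii + BS : n;
                int kmax = kk + BS < n ? kk + BS : n;
                int jmax = jj + BS < n ? jj + BS : n;
                for (int i = ii; i < imax; i++) {
                    for (int k = kk; k < kmax; k++) {
                        float aik = a[i*n + k];
                        for (int j = jj; j < jmax; j++) {
                            c[i*n + j] += aik * b[k*n + j];
                        }
                    }
                }
            }
        }
    }
}

A fully optimized kernel would typically also pack each tile of b (and a) into a contiguous buffer before the inner loops, which is the "reorder the elements within tiles" step the answer refers to.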
