Does MKL optimize cblas for *major order?


Question


I am using mkl cblas_dgemm and currently call it with CblasRowMajor, CblasNoTrans, CblasNoTrans for my matrices.


I know that C is a row-major language, whereas dgemm is a column-major algorithm. I am interested to know whether switching the ordering of the matrices will have any effect on the cblas_dgemm algorithm if I am linking against mkl. Is mkl smart enough to do, behind the scenes, the things I would try to do to optimize the matrix multiplications? If not, what is the best way to perform matrix multiplications with mkl?

Answer


TL;DR: In short it does not matter whether you perform matrix-matrix multiplications using row-major or column-major ordering with MKL (and other BLAS implementations).


I know that c is a row major language, whereas dgemm is a column major algorithm.


DGEMM is not a column-major algorithm; it is the BLAS interface for computing a matrix-matrix product with general matrices. The common reference implementation for DGEMM (and most of BLAS) is Netlib's, which is written in Fortran. The only reason it assumes column-major ordering is that Fortran is a column-major order language. DGEMM (and the corresponding BLAS Level 3 functions) is not specifically for column-major data.


The basic mathematical algorithm for multiplying 2D matrices requires you to traverse one matrix along its rows and the other along its columns. To perform the matrix-matrix multiplication AB = C, we multiply the rows of A by the columns of B to produce C. Therefore, the storage ordering of the input matrices does not matter, as one matrix must be traversed along its rows and the other along its columns either way.


Intel MKL is smart enough to utilise this under the hood and provides the exact same performance for row-major and column-major data.

cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans, ...);

cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans, ...);


will execute with similar performance. We can test this with a relatively simple program:

#include <float.h>
#include <stdio.h>
#include <stdlib.h>

#include <mkl.h>
#include <omp.h>

void init_matrix(double *A, int n, int m, double d);
void test_dgemm(CBLAS_LAYOUT Layout, double *A, double *B, double *C, const MKL_INT m, const MKL_INT n, const MKL_INT k, int nSamples, double *timing);
void print_summary(const MKL_INT m, const MKL_INT n, const MKL_INT k, const int nSamples, const double *timing);

int main(int argc, char **argv) {
    MKL_INT n, k, m;
    double *a, *b, *c;
    double *timing;
    int nSamples = 1;

    if (argc != 5){
        fprintf(stderr, "Error: Wrong number of arguments!\n");
        fprintf(stderr, "usage: %s mMatrix nMatrix kMatrix NSamples\n", argv[0]);
        return -1;
    }

    m = atoi(argv[1]);
    n = atoi(argv[2]);
    k = atoi(argv[3]);

    nSamples = atoi(argv[4]);

    timing = malloc(nSamples * sizeof *timing);

    a = mkl_malloc(m*k * sizeof *a, 64);
    b = mkl_malloc(k*n * sizeof *b, 64);
    c = mkl_calloc(m*n, sizeof *c, 64);

    /** ROW-MAJOR ORDERING **/
    test_dgemm(CblasRowMajor, a, b, c, m, n, k, nSamples, timing);

    /** COLUMN-MAJOR ORDERING **/
    test_dgemm(CblasColMajor, a, b, c, m, n, k, nSamples, timing);

    mkl_free(a);
    mkl_free(b);
    mkl_free(c);
    free(timing);
}

void init_matrix(double *A, int n, int m, double d) {
    int i, j;
    // Fill an n x m row-major matrix with small values
    #pragma omp parallel for schedule(static) private(j)
    for (i = 0; i < n; ++i) {
        for (j = 0; j < m; ++j) {
            A[j + i*m] = d * (double) (i - j) / n;
        }
    }
}

void test_dgemm(CBLAS_LAYOUT Layout, double *A, double *B, double *C, const MKL_INT m, const MKL_INT n, const MKL_INT k, int nSamples, double *timing) {
    int i;
    // Leading dimensions depend on the storage order
    MKL_INT lda = (Layout == CblasRowMajor) ? k : m;
    MKL_INT ldb = (Layout == CblasRowMajor) ? n : k;
    MKL_INT ldc = (Layout == CblasRowMajor) ? n : m;
    double alpha = 1.0, beta = 0.0;

    if (CblasRowMajor == Layout) {
        printf("\n*****ROW-MAJOR ORDERING*****\n\n");
    } else if (CblasColMajor == Layout) {
        printf("\n*****COLUMN-MAJOR ORDERING*****\n\n");
    }

    init_matrix(A, m, k, 0.5);
    init_matrix(B, k, n, 0.75);
    init_matrix(C, m, n, 0);

    // First call performs any buffer/thread initialisation
    cblas_dgemm(Layout, CblasNoTrans, CblasNoTrans, m, n, k, alpha, A, lda, B, ldb, beta, C, ldc);

    double tmin = DBL_MAX, tmax = 0.0;
    for (i = 0; i < nSamples; ++i) {
        init_matrix(A, m, k, 0.5);
        init_matrix(B, k, n, 0.75);
        init_matrix(C, m, n, 0);

        timing[i] = dsecnd();
        cblas_dgemm(Layout, CblasNoTrans, CblasNoTrans, m, n, k, alpha, A, lda, B, ldb, beta, C, ldc);
        timing[i] = dsecnd() - timing[i];

        if (timing[i] < tmin) tmin = timing[i];
        if (timing[i] > tmax) tmax = timing[i];
    }

    print_summary(m, n, k, nSamples, timing);
}

void print_summary(const MKL_INT m, const MKL_INT n, const MKL_INT k, const int nSamples, const double *timing) {
    int i;

    double tavg = 0.0;
    for(i = 0; i < nSamples; i++) {
        tavg += timing[i];
    }
    tavg /= nSamples;

    printf("#Loop | Sizes  m   n   k  | Time (s)\n");
    for(i = 0; i < nSamples; i++) {
        printf("%4d %12d %3d %3d  %6.4f\n", i + 1, (int) m, (int) n, (int) k, timing[i]);
    }

    printf("Summary:\n");
    printf("Sizes  m   n   k  | Avg. Time (s)\n");
    printf(" %8d %3d %3d %12.8f\n", (int) m, (int) n, (int) k, tavg);
}

which on my system produces

$ ./benchmark_dgemm 1000 1000 1000 5
*****ROW-MAJOR ORDERING*****

#Loop | Sizes  m   n   k  | Time (s)
   1         1000 1000 1000  0.0589
   2         1000 1000 1000  0.0596
   3         1000 1000 1000  0.0603
   4         1000 1000 1000  0.0626
   5         1000 1000 1000  0.0584
Summary:
Sizes  m   n   k  | Avg. Time (s)
     1000 1000 1000   0.05995692

*****COLUMN-MAJOR ORDERING*****

#Loop | Sizes  m   n   k  | Time (s)
   1         1000 1000 1000  0.0597
   2         1000 1000 1000  0.0610
   3         1000 1000 1000  0.0581
   4         1000 1000 1000  0.0594
   5         1000 1000 1000  0.0596
Summary:
Sizes  m   n   k  | Avg. Time (s)
     1000 1000 1000   0.05955171


where we can see that there is very little difference between the column-major ordering time and the row-major ordering time. 0.0595 seconds for column-major versus 0.0599 seconds for row-major. Executing this again might produce the following, where the row-major calculation is faster by 0.00003 seconds.

$ ./benchmark_dgemm 1000 1000 1000 5
*****ROW-MAJOR ORDERING*****

#Loop | Sizes  m   n   k  | Time (s)
   1         1000 1000 1000  0.0674
   2         1000 1000 1000  0.0598
   3         1000 1000 1000  0.0595
   4         1000 1000 1000  0.0587
   5         1000 1000 1000  0.0584
Summary:
Sizes  m   n   k  | Avg. Time (s)
     1000 1000 1000   0.06075310

*****COLUMN-MAJOR ORDERING*****

#Loop | Sizes  m   n   k  | Time (s)
   1         1000 1000 1000  0.0634
   2         1000 1000 1000  0.0596
   3         1000 1000 1000  0.0582
   4         1000 1000 1000  0.0582
   5         1000 1000 1000  0.0645
Summary:
Sizes  m   n   k  | Avg. Time (s)
     1000 1000 1000   0.06078266
