Performance degradation of matrix multiplication of single vs double precision arrays on multi-core machine

Problem description

UPDATE

Unfortunately, due to my oversight, I had an older version of MKL (11.1) linked against numpy. The newer version of MKL (11.3.1) gives the same performance in C and when called from Python.

What was obscuring things was that even when the compiled shared libraries were explicitly linked against the newer MKL, with the LD_* variables pointing to them, doing import numpy in Python was somehow still making Python call the old MKL libraries. Only by replacing all libmkl_*.so files in the Python lib folder with the newer MKL was I able to match the performance of the Python and C calls.

Background / library info.

Matrix multiplication was done via Intel MKL's sgemm (single precision) and dgemm (double precision) library calls, through the numpy.dot function. The actual library functions being called can be verified with e.g. oprof.
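As a quick sanity check (not part of the original post), numpy itself can report which BLAS/LAPACK build it is linked against, and, if the installed mkl-service module (the same mkl module used in the benchmark below) exposes it, mkl.get_version_string() reports the MKL version actually loaded at runtime; a minimal sketch:

import numpy as np
import mkl   # mkl-service package, same module as in the benchmark below

# Show the BLAS/LAPACK libraries numpy was built against
# (the output should list the MKL libraries and include paths).
np.show_config()

# Version string of the MKL that is actually loaded in this process.
print(mkl.get_version_string())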

The machine used here has a 2x18-core CPU (E5-2699 v3), hence a total of 36 physical cores. KMP_AFFINITY=scatter. Running on Linux.

TL;DR

1) Why is numpy.dot, even though it calls the same MKL library functions, at best twice as slow as the compiled C code?

2) Why does performance via numpy.dot decrease with an increasing number of cores, whereas the same effect is not observed in C code (calling the same library functions)?

Problem

I've observed that doing matrix multiplication of single/double precision floats in numpy.dot, as well as calling cblas_sgemm/dgemm directly from a compiled C shared library, gives noticeably worse performance compared to calling the same MKL cblas_sgemm/dgemm functions from inside pure C code.

import numpy as np
import mkl
n = 10000
A = np.random.randn(n,n).astype('float32')
B = np.random.randn(n,n).astype('float32')
C = np.zeros((n,n)).astype('float32')

mkl.set_num_threads(3); %time np.dot(A, B, out=C)
11.5 seconds
mkl.set_num_threads(6); %time np.dot(A, B, out=C)
6 seconds
mkl.set_num_threads(12); %time np.dot(A, B, out=C)
3 seconds
mkl.set_num_threads(18); %time np.dot(A, B, out=C)
2.4 seconds
mkl.set_num_threads(24); %time np.dot(A, B, out=C)
3.6 seconds
mkl.set_num_threads(30); %time np.dot(A, B, out=C)
5 seconds
mkl.set_num_threads(36); %time np.dot(A, B, out=C)
5.5 seconds

Doing exactly the same as above, but with double-precision A, B and C, you get: 3 cores: 20 s, 6 cores: 10 s, 12 cores: 5 s, 18 cores: 4.3 s, 24 cores: 3 s, 30 cores: 2.8 s, 36 cores: 2.8 s.
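For reference, the thread-count sweep for both precisions can be scripted in one go; this is just a compact restatement of the measurements already quoted, assuming the same mkl-service module and enough RAM for the matrices (about 2.4 GB in double precision):

import time
import numpy as np
import mkl

n = 10000
for dtype in (np.float32, np.float64):
    A = np.random.randn(n, n).astype(dtype)
    B = np.random.randn(n, n).astype(dtype)
    C = np.zeros((n, n), dtype=dtype)
    for threads in (3, 6, 12, 18, 24, 30, 36):
        mkl.set_num_threads(threads)
        np.dot(A, B, out=C)                      # warm-up, not timed
        t0 = time.perf_counter()
        np.dot(A, B, out=C)
        print(dtype.__name__, threads, round(time.perf_counter() - t0, 2), 's')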

The degradation of single-precision speed at higher core counts seems to be associated with cache misses. For the 28-core run, here is the output of perf. For single precision:

perf stat -e task-clock,cycles,instructions,cache-references,cache-misses ./ptestf.py
631,301,854 cache-misses # 31.478 % of all cache refs

And for double precision:

93,087,703 cache-misses # 5.164 % of all cache refs

The C shared library was compiled with:

/opt/intel/bin/icc -o comp_sgemm_mkl.so -openmp -mkl sgem_lib.c -lm -lirc -O3 -fPIC -shared -std=c99 -vec-report1 -xhost -I/opt/intel/composer/mkl/include

#include <stdio.h>
#include <stdlib.h>
#include "mkl.h"

void comp_sgemm_mkl(int m, int n, int k, float *A, float *B, float *C);

/* C = A * B for row-major A (m x k), B (k x n), C (m x n) via MKL sgemm. */
void comp_sgemm_mkl(int m, int n, int k, float *A, float *B, float *C)
{
    float alpha = 1.0f, beta = 0.0f;

    cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                m, n, k, alpha, A, k, B, n, beta, C, n);
}

The Python wrapper function, calling the compiled library above:

from ctypes import CDLL, c_int, c_void_p
import numpy as np

def comp_sgemm_mkl(A, B, out=None):
    lib = CDLL(omplib)   # omplib holds the path to the compiled shared library
    lib.comp_sgemm_mkl.argtypes = [c_int, c_int, c_int,
                                   np.ctypeslib.ndpointer(dtype=np.float32, ndim=2),
                                   np.ctypeslib.ndpointer(dtype=np.float32, ndim=2),
                                   np.ctypeslib.ndpointer(dtype=np.float32, ndim=2)]
    lib.comp_sgemm_mkl.restype = c_void_p
    m = A.shape[0]
    n = B.shape[0]
    k = B.shape[1]
    if np.isfortran(A):
        raise ValueError('Fortran array')
    if m != n:
        raise ValueError('Wrong matrix dimensions')
    if out is None:
        out = np.empty((m, k), np.float32)
    lib.comp_sgemm_mkl(m, n, k, A, B, out)
    return out
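A minimal way to exercise this wrapper against the same kind of data as the numpy benchmark above (not in the original post; it assumes omplib points at the compiled comp_sgemm_mkl.so and that the definitions above are in scope):

n = 4096
A = np.random.randn(n, n).astype(np.float32)
B = np.random.randn(n, n).astype(np.float32)
C = np.zeros((n, n), dtype=np.float32)

comp_sgemm_mkl(A, B, out=C)        # C = A * B computed by the shared library
ref = np.dot(A, B)                 # reference result, also ends up in MKL sgemm
print(np.abs(C - ref).max())       # expected to be at or near zero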

However, explicit calls from a C-compiled binary calling MKL's cblas_sgemm / cblas_dgemm, with arrays allocated through malloc in C, give almost 2x better performance compared to the Python code, i.e. the numpy.dot call. Also, the effect of performance degradation with an increasing number of cores is NOT observed. The best performance was 900 ms for single-precision matrix multiplication, achieved when using all 36 physical cores via mkl_set_num_threads and running the C code with numactl --interleave=all.

Are there perhaps any fancy tools or advice for profiling/inspecting/understanding this situation further? Any reading material would be much appreciated as well.

UPDATE: Following @Hristo Iliev's advice, running numactl --interleave=all ./ipython did not change the timings (within noise), but it did improve the pure C binary runtimes.

Answer

I suspect this is due to unfortunate thread scheduling. I was able to reproduce an effect similar to yours. Python was running at ~2.2 s, while the C version was showing huge variations between 1.4 and 2.2 s.

Applying KMP_AFFINITY=scatter,granularity=thread ensures that each of the 28 threads always runs on the same processor thread.

This reduces both runtimes to a more stable ~1.24 s for C and ~1.26 s for Python.

This is on a 28-core dual-socket Xeon E5-2680 v3 system.

Interestingly, on a very similar 24-core dual-socket Haswell system, Python and C perform almost identically even without thread affinity / pinning.

Why does Python affect the scheduling? Well, I assume there is more runtime environment around it. The bottom line is that, without pinning, your performance results will be non-deterministic.

You also need to consider that the Intel OpenMP runtime spawns an extra management thread that can confuse the scheduler. There are more options for pinning, for instance KMP_AFFINITY=compact - but for some reason that is totally messed up on my system. You can add ,verbose to the variable to see how the runtime is pinning your threads.
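For the Python runs, one way to make sure these settings are actually in effect is to export them before numpy (and with it MKL and the Intel OpenMP runtime) is loaded; a sketch, assuming the 28-thread configuration discussed here:

import os

# The OpenMP runtime reads these when it initializes, so set them before
# importing numpy, which loads MKL and libiomp5.
os.environ["KMP_AFFINITY"] = "scatter,granularity=thread,verbose"
os.environ["OMP_NUM_THREADS"] = "28"

import numpy as np   # MKL now picks up the affinity settings above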

likwid-pin is a useful alternative, providing more convenient control.

In general, single precision should be at least as fast as double precision. Double precision can be slower because:

  • You need more memory / cache bandwidth for double precision.
  • You can build ALUs that have higher throughput for single precision, but that usually applies to GPUs rather than CPUs.

I would think that once you get rid of the performance anomaly, this will be reflected in your numbers.

When you scale up the number of threads for MKL/*gemm, consider that:

  • Memory / shared cache bandwidth may become a bottleneck, limiting the scalability.
  • Turbo mode will effectively decrease the core frequency as utilization increases. This applies even when you run at nominal frequency: on Haswell-EP processors, AVX instructions impose a lower "AVX base frequency" - but the processor is allowed to exceed it when fewer cores are utilized or thermal headroom is available, and in general even more for a short time. If you want perfectly neutral results, you would have to use the AVX base frequency, which for you is 1.9 GHz. It is documented here, and explained in one picture.

I don't think there is a really simple way to measure how your application is affected by bad scheduling. You can expose it with perf trace -e sched:sched_switch, and there is some software to visualize this, but it comes with a steep learning curve. And then again - for parallel performance analysis you should have the threads pinned anyway.
