Performance degradation of matrix multiplication of single vs double precision arrays on multi-core machine

Problem description

UPDATE

Unfortunately, due to my oversight, I had an older version of MKL (11.1) linked against numpy. The newer version of MKL (11.3.1) gives the same performance in C and when called from Python.

What was obscuring things was that even when the compiled shared libraries were explicitly linked against the newer MKL, with the LD_* variables pointing to them, doing import numpy in Python was somehow still making Python call the old MKL libraries. Only by replacing all libmkl_*.so files in the Python lib folder with the newer MKL was I able to match the performance of the Python and C calls.

Background / library info.

Matrix multiplication was done via Intel MKL's sgemm (single precision) and dgemm (double precision) library calls, through the numpy.dot function. The actual library functions being called can be verified with e.g. oprof.
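As a quick sanity check (not part of the original post), numpy itself can report which BLAS/LAPACK build it is linked against, and, if the installed mkl-service module (the same mkl module used in the benchmark below) exposes it, mkl.get_version_string() reports the MKL version actually loaded at runtime; a minimal sketch:

import numpy as np
import mkl   # mkl-service package, same module as in the benchmark below

# Show the BLAS/LAPACK libraries numpy was built against
# (the output should list the MKL libraries and include paths).
np.show_config()

# Version string of the MKL that is actually loaded in this process.
print(mkl.get_version_string())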

The machine used here has a 2x18-core CPU (E5-2699 v3), hence a total of 36 physical cores. KMP_AFFINITY=scatter. Running on Linux.

TL;DR

1) Why is numpy.dot, even though it calls the same MKL library functions, at best twice as slow as the compiled C code?

2) Why does performance via numpy.dot decrease with an increasing number of cores, whereas the same effect is not observed in C code (calling the same library functions)?

Problem

I've observed that doing matrix multiplication of single/double precision floats in numpy.dot, as well as calling cblas_sgemm/dgemm directly from a compiled C shared library, gives noticeably worse performance compared to calling the same MKL cblas_sgemm/dgemm functions from inside pure C code.

import numpy as np
import mkl
n = 10000
A = np.random.randn(n,n).astype('float32')
B = np.random.randn(n,n).astype('float32')
C = np.zeros((n,n)).astype('float32')

mkl.set_num_threads(3); %time np.dot(A, B, out=C)
11.5 seconds
mkl.set_num_threads(6); %time np.dot(A, B, out=C)
6 seconds
mkl.set_num_threads(12); %time np.dot(A, B, out=C)
3 seconds
mkl.set_num_threads(18); %time np.dot(A, B, out=C)
2.4 seconds
mkl.set_num_threads(24); %time np.dot(A, B, out=C)
3.6 seconds
mkl.set_num_threads(30); %time np.dot(A, B, out=C)
5 seconds
mkl.set_num_threads(36); %time np.dot(A, B, out=C)
5.5 seconds

Doing exactly the same as above, but with double-precision A, B and C, you get: 3 cores: 20 s, 6 cores: 10 s, 12 cores: 5 s, 18 cores: 4.3 s, 24 cores: 3 s, 30 cores: 2.8 s, 36 cores: 2.8 s.
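For reference, the thread-count sweep for both precisions can be scripted in one go; this is just a compact restatement of the measurements already quoted, assuming the same mkl-service module and enough RAM for the matrices (about 2.4 GB in double precision):

import time
import numpy as np
import mkl

n = 10000
for dtype in (np.float32, np.float64):
    A = np.random.randn(n, n).astype(dtype)
    B = np.random.randn(n, n).astype(dtype)
    C = np.zeros((n, n), dtype=dtype)
    for threads in (3, 6, 12, 18, 24, 30, 36):
        mkl.set_num_threads(threads)
        np.dot(A, B, out=C)                      # warm-up, not timed
        t0 = time.perf_counter()
        np.dot(A, B, out=C)
        print(dtype.__name__, threads, round(time.perf_counter() - t0, 2), 's')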

The degradation of single-precision speed at higher core counts seems to be associated with cache misses. For the 28-core run, here is the output of perf. For single precision:

perf stat -e task-clock,cycles,instructions,cache-references,cache-misses ./ptestf.py
631,301,854 cache-misses # 31.478 % of all cache refs

And for double precision:

93,087,703 cache-misses # 5.164 % of all cache refs

The C shared library was compiled with:

/opt/intel/bin/icc -o comp_sgemm_mkl.so -openmp -mkl sgem_lib.c -lm -lirc -O3 -fPIC -shared -std=c99 -vec-report1 -xhost -I/opt/intel/composer/mkl/include

#include <stdio.h>
#include <stdlib.h>
#include "mkl.h"

void comp_sgemm_mkl(int m, int n, int k, float *A, float *B, float *C);

/* C = A * B for row-major A (m x k), B (k x n), C (m x n) via MKL sgemm. */
void comp_sgemm_mkl(int m, int n, int k, float *A, float *B, float *C)
{
    float alpha = 1.0f, beta = 0.0f;

    cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                m, n, k, alpha, A, k, B, n, beta, C, n);
}

The Python wrapper function, calling the compiled library above:

from ctypes import CDLL, c_int, c_void_p
import numpy as np

def comp_sgemm_mkl(A, B, out=None):
    lib = CDLL(omplib)   # omplib holds the path to the compiled shared library
    lib.comp_sgemm_mkl.argtypes = [c_int, c_int, c_int,
                                   np.ctypeslib.ndpointer(dtype=np.float32, ndim=2),
                                   np.ctypeslib.ndpointer(dtype=np.float32, ndim=2),
                                   np.ctypeslib.ndpointer(dtype=np.float32, ndim=2)]
    lib.comp_sgemm_mkl.restype = c_void_p
    m = A.shape[0]
    n = B.shape[0]
    k = B.shape[1]
    if np.isfortran(A):
        raise ValueError('Fortran array')
    if m != n:
        raise ValueError('Wrong matrix dimensions')
    if out is None:
        out = np.empty((m, k), np.float32)
    lib.comp_sgemm_mkl(m, n, k, A, B, out)
    return out
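A minimal way to exercise this wrapper against the same kind of data as the numpy benchmark above (not in the original post; it assumes omplib points at the compiled comp_sgemm_mkl.so and that the definitions above are in scope):

n = 4096
A = np.random.randn(n, n).astype(np.float32)
B = np.random.randn(n, n).astype(np.float32)
C = np.zeros((n, n), dtype=np.float32)

comp_sgemm_mkl(A, B, out=C)        # C = A * B computed by the shared library
ref = np.dot(A, B)                 # reference result, also ends up in MKL sgemm
print(np.abs(C - ref).max())       # expected to be at or near zero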

However, explicit calls from a C-compiled binary calling MKL's cblas_sgemm / cblas_dgemm, with arrays allocated through malloc in C, give almost 2x better performance compared to the Python code, i.e. the numpy.dot call. Also, the effect of performance degradation with an increasing number of cores is NOT observed. The best performance was 900 ms for single-precision matrix multiplication, achieved when using all 36 physical cores via mkl_set_num_threads and running the C code with numactl --interleave=all.

Are there perhaps any fancy tools or advice for profiling/inspecting/understanding this situation further? Any reading material would be much appreciated as well.

UPDATE: Following @Hristo Iliev's advice, running numactl --interleave=all ./ipython did not change the timings (within noise), but it did improve the pure C binary runtimes.

Answer

I suspect this is due to unfortunate thread scheduling. I was able to reproduce an effect similar to yours. Python was running at ~2.2 s, while the C version was showing huge variations between 1.4 and 2.2 s.

Applying KMP_AFFINITY=scatter,granularity=thread ensures that each of the 28 threads always runs on the same processor thread.

This reduces both runtimes to a more stable ~1.24 s for C and ~1.26 s for Python.

This is on a 28-core dual-socket Xeon E5-2680 v3 system.

Interestingly, on a very similar 24-core dual-socket Haswell system, Python and C perform almost identically even without thread affinity / pinning.

Why does Python affect the scheduling? Well, I assume there is more runtime environment around it. The bottom line is that, without pinning, your performance results will be non-deterministic.

You also need to consider that the Intel OpenMP runtime spawns an extra management thread that can confuse the scheduler. There are more options for pinning, for instance KMP_AFFINITY=compact - but for some reason that is totally messed up on my system. You can add ,verbose to the variable to see how the runtime is pinning your threads.
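For the Python runs, one way to make sure these settings are actually in effect is to export them before numpy (and with it MKL and the Intel OpenMP runtime) is loaded; a sketch, assuming the 28-thread configuration discussed here:

import os

# The OpenMP runtime reads these when it initializes, so set them before
# importing numpy, which loads MKL and libiomp5.
os.environ["KMP_AFFINITY"] = "scatter,granularity=thread,verbose"
os.environ["OMP_NUM_THREADS"] = "28"

import numpy as np   # MKL now picks up the affinity settings above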

likwid-pin is a useful alternative, providing more convenient control.

In general, single precision should be at least as fast as double precision. Double precision can be slower because:

  • You need more memory / cache bandwidth for double precision.
  • You can build ALUs that have higher throughput for single precision, but that usually applies to GPUs rather than CPUs.

I would think that once you get rid of the performance anomaly, this will be reflected in your numbers.

When you scale up the number of threads for MKL/*gemm, consider that:

  • Memory / shared cache bandwidth may become a bottleneck, limiting the scalability.
  • Turbo mode will effectively decrease the core frequency as utilization increases. This applies even when you run at nominal frequency: on Haswell-EP processors, AVX instructions impose a lower "AVX base frequency" - but the processor is allowed to exceed it when fewer cores are utilized or thermal headroom is available, and in general even more for a short time. If you want perfectly neutral results, you would have to use the AVX base frequency, which for you is 1.9 GHz. It is documented here, and explained in one picture.

I don't think there is a really simple way to measure how your application is affected by bad scheduling. You can expose it with perf trace -e sched:sched_switch, and there is some software to visualize this, but it comes with a steep learning curve. And then again - for parallel performance analysis you should have the threads pinned anyway.
