Why is Numpy with Ryzen Threadripper so much slower than Xeon?


Problem description

I know that Numpy can use different backends like OpenBLAS or MKL. I have also read that MKL is heavily optimized for Intel, so people usually suggest using OpenBLAS on AMD, right?

I use the following test code:

import numpy as np

def testfunc(x):
    np.random.seed(x)
    X = np.random.randn(2000, 4000)
    np.linalg.eigh(X @ X.T)  # eigendecomposition of a symmetric 2000x2000 matrix

%timeit testfunc(0)

I have tested this code using different CPUs:

  • On Intel Xeon E5-1650 v3, this code performs in 0.7s using 6 out of 12 cores.
  • On AMD Ryzen 5 2600, this code performs in 1.45s using all 12 cores.
  • On AMD Ryzen Threadripper 3970X, this code performs in 1.55s using all 64 cores.

I am using the same Conda environment on all three systems. According to np.show_config(), the Intel system uses the MKL backend for Numpy (libraries = ['mkl_rt', 'pthread']), whereas the AMD systems use OpenBLAS (libraries = ['openblas', 'openblas']). The CPU core usage was determined by observing top in a Linux shell:

  • For the Intel Xeon E5-1650 v3 CPU (6 physical cores), it shows 12 cores (6 idling).
  • For the AMD Ryzen 5 2600 CPU (6 physical cores), it shows 12 cores (none idling).
  • For the AMD Ryzen Threadripper 3970X CPU (32 physical cores), it shows 64 cores (none idling).
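For reference, the backend identification above can be reproduced from within Python:

```python
import numpy as np

# Prints the BLAS/LAPACK configuration NumPy was built with; on the
# Intel system this lists mkl_rt, on the AMD systems openblas.
np.show_config()
```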

The above observations give rise to the following questions:

  1. Is it normal that linear algebra on up-to-date AMD CPUs using OpenBLAS is so much slower than on a six-year-old Intel Xeon? (also addressed in Update 3)
  2. Judging by the observations of the CPU load, it looks like Numpy utilizes the multi-core environment in all three cases. How can it be that the Threadripper is even slower than the Ryzen 5, even though it has almost six times as many physical cores? (also see Update 3)
  3. Is there anything that can be done to speed up the computations on the Threadripper? (partially answered in Update 2)


Update 1: The OpenBLAS version is 0.3.6. I read somewhere that upgrading to a newer version might help; however, with OpenBLAS updated to 0.3.10, the run time of testfunc is still 1.55s on the AMD Ryzen Threadripper 3970X.

Update 2: Using the MKL backend for Numpy in conjunction with the environment variable MKL_DEBUG_CPU_TYPE=5 (as described here) reduces the run time for testfunc on AMD Ryzen Threadripper 3970X to only 0.52s, which is actually more or less satisfying. FTR, setting this variable via ~/.profile did not work for me on Ubuntu 20.04. Also, setting the variable from within Jupyter did not work. So instead I put it into ~/.bashrc which works now. Anyways, performing 35% faster than an old Intel Xeon, is this all we get, or can we get more out of it?
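A likely reason the Jupyter attempt failed is that MKL reads MKL_DEBUG_CPU_TYPE when the library is loaded, so setting it from Python only works before the first import of NumPy. A sketch (note that this undocumented flag was reportedly removed in MKL 2020.1 and later):

```python
import os

# Must be set before NumPy (and thus MKL) is loaded, which is why
# setting it from an already-running Jupyter kernel has no effect.
os.environ["MKL_DEBUG_CPU_TYPE"] = "5"

import numpy as np  # noqa: E402  (import deliberately after the env var)
```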

Update 3: I played around with the number of threads used by MKL/OpenBLAS:
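One way to run such a sweep without rebuilding anything is to pin the thread count through environment variables before NumPy is imported (a sketch; 16 threads chosen as an example):

```python
import os

# BLAS libraries read these variables at load time, so they must be
# set before the first import of NumPy.
n_threads = "16"
os.environ["OPENBLAS_NUM_THREADS"] = n_threads  # honored by OpenBLAS
os.environ["MKL_NUM_THREADS"] = n_threads       # honored by MKL

import numpy as np  # noqa: E402
```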

The run times are reported in seconds. The best result of each column is underlined. I used OpenBLAS 0.3.6 for this test. The conclusions from this test:

  • The single-core performance of the Threadripper using OpenBLAS is a bit better than the single-core performance of the Xeon (11% faster), however, its single-core performance is even better when using MKL (34% faster).
  • The multi-core performance of the Threadripper using OpenBLAS is ridiculously worse than the multi-core performance of the Xeon. What is going on here?
  • The Threadripper performs overall better than the Xeon, when MKL is used (26% to 38% faster than Xeon). The overall best performance is achieved by the Threadripper using 16 threads and MKL (36% faster than Xeon).

Update 4: Just for clarification. No, I do not think that (a) this or (b) that answers this question. (a) suggests that "OpenBLAS does nearly as well as MKL", which strongly contradicts the numbers I observed. According to my numbers, OpenBLAS performs far worse than MKL. The question is why. (a) and (b) both suggest using MKL_DEBUG_CPU_TYPE=5 in conjunction with MKL to achieve maximum performance. This might be right, but it explains neither why OpenBLAS is so dead slow, nor why, even with MKL and MKL_DEBUG_CPU_TYPE=5, the 32-core Threadripper is only 36% faster than the six-year-old 6-core Xeon.

Recommended answer

Would it make sense to try the AMD-optimized BLIS library?

Maybe I am missing (misunderstanding) something, but I would assume you could use BLIS instead of OpenBLAS. The only potential problem could be that AMD BLIS is optimized for AMD EPYC (but you're using Ryzen). I'm VERY curious about the results, since I'm in the process of buying a server for work, and am considering AMD EPYC and Intel Xeon.

Here are the respective AMD BLIS libraries: https://developer.amd.com/amd-aocl/
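One way to try BLIS without rebuilding NumPy might be to preload its shared library so that its BLAS symbols shadow the OpenBLAS copy NumPy was linked against. This is a sketch only; the library path is hypothetical and depends on the AOCL version installed:

```shell
# Hypothetical path; adjust to wherever AOCL/BLIS is actually installed.
# LD_PRELOAD makes BLIS's BLAS symbols take precedence over OpenBLAS.
LD_PRELOAD=/opt/amd/aocl/lib/libblis-mt.so python -c "import numpy; numpy.show_config()"
```

Whether this actually routes the eigh workload through BLIS should be verified, e.g. by checking the run time or profiling which library the BLAS calls land in.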
