MKL Performance on Intel Phi
Problem description
I have a routine that performs a few MKL calls on small matrices (50-100 x 1000 elements) to fit a model, which I then call for different models. In pseudo-code:
double doModelFit(int model, ...) {
    ...
    while( !done ) {
        cblas_dgemm(...);
        cblas_dgemm(...);
        ...
        dgesv(...);
        ...
    }
    return result;
}
int main(int argc, char **argv) {
    ...
    c_start = 1; c_stop = nmodel;
    for(int c=c_start; c<c_stop; c++) {
        ...
        result = doModelFit(c, ...);
        ...
    }
}
Call the above version 1. Since the models are independent, I can use OpenMP threads to parallelize the model fitting, as follows (version 2):
int main(int argc, char **argv) {
    ...
    int numthreads = omp_get_max_threads();
    int c;
    #pragma omp parallel for private(c)
    for(int t=0; t<numthreads; t++) {
        // assuming nmodel divisible by numthreads...
        c_start = t*nmodel/numthreads+1;
        c_end = (t+1)*nmodel/numthreads;
        for(c=c_start; c<c_end; c++) {
            ...
            result = doModelFit(c, ...);
            ...
        }
    }
}
When I run version 1 on the host machine, it takes ~11 seconds and VTune reports poor parallelization with most of the time spent idle. Version 2 on the host machine takes ~5 seconds and VTune reports great parallelization (near 100% of the time is spent with 8 CPUs in use). Now, when I compile the code to run on the Phi card in native mode (with -mmic), versions 1 and 2 both take approximately 30 seconds when run from the command prompt on mic0. When I use VTune to profile it:
- Version 1 takes the same ~30 seconds, and the hotspot analysis shows that most of the time is spent in __kmp_wait_sleep and __kmp_static_yield. Out of 7710s of CPU time, 5804s are spent in spin time.
- Version 2 takes fooooorrrreevvvver... I killed it after running for a few minutes in VTune. The hotspot analysis shows that of 25254s of CPU time, 21585s are spent in [vmlinux].
Can anyone shed some light on what's going on here and why I'm getting such bad performance? I'm using the default for OMP_NUM_THREADS and set KMP_AFFINITY=compact,granularity=fine (as recommended by Intel). I'm new to MKL and OpenMP, so I'm certain I'm making rookie mistakes.
Thanks, Andrew
Answer
The most probable reason for this behavior, given that most of the time is spent in the OS (vmlinux), is over-subscription caused by the nested OpenMP parallel region inside the MKL implementations of cblas_dgemm() and dgesv(). E.g. see this example: https://software.intel.com/sites/products/documentation/doclib/mkl_sa/11/tutorials/mkl_mmx_c/GUID-8DB79DF7-B853-46C9-8F46-C3782E0CA401.htm
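A common fix for this kind of over-subscription is to restrict MKL to one thread per call, so that only the outer OpenMP loop over models is parallel. A sketch using environment variables (MKL_NUM_THREADS, OMP_NESTED, and KMP_AFFINITY are standard Intel/OpenMP runtime variables; the binary name here is hypothetical):

```shell
# Force MKL's internal parallel regions down to a single thread so the
# outer OpenMP loop over models supplies all of the parallelism.
export MKL_NUM_THREADS=1          # sequential BLAS/LAPACK inside each model fit
export OMP_NESTED=FALSE           # no nested OpenMP teams
export KMP_AFFINITY=compact,granularity=fine
# ./fit_models.mic                # then run the native-mode binary on mic0
```

The same effect can be achieved programmatically with mkl_set_num_threads(1) before entering the parallel region.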
This version is supported and explained by Jim Dempsey at the Intel forum.