MKL Performance on Intel Phi


Question

I have a routine that performs a few MKL calls on small matrices (50-100 x 1000 elements) to fit a model, which I then call for different models. In pseudo-code:

double doModelFit(int model, ...) {
   ...
   while( !done ) {
     cblas_dgemm(...);
     cblas_dgemm(...);
     ...
     dgesv(...);
     ...
   }
   return result;
}

int main(int argc, char **argv) {
  ...
  c_start = 1;  c_stop = nmodel;
  for(int c=c_start; c<c_stop; c++) {
    ...
    result = doModelFit(c, ...);
    ...
  }
}

Call the above version 1. Since the models are independent, I can use OpenMP threads to parallelize the model fitting, as follows (version 2):

int main(int argc, char **argv) {
  ...
  int numthreads = omp_get_max_threads();
#pragma omp parallel for
  for(int t=0; t<numthreads; t++) {
     // assuming nmodel divisible by numthreads...
     // chunk bounds declared locally so each thread has private copies
     int c_start = t*nmodel/numthreads + 1;
     int c_stop  = (t+1)*nmodel/numthreads;
     for(int c=c_start; c<c_stop; c++) {
        ...
        result = doModelFit(c, ...);
        ...
     }
  }
}

When I run version 1 on the host machine, it takes ~11 seconds and VTune reports poor parallelization with most of the time spent idle. Version 2 on the host machine takes ~5 seconds and VTune reports great parallelization (near 100% of the time is spent with 8 CPUs in use). Now, when I compile the code to run on the Phi card in native mode (with -mmic), versions 1 and 2 both take approximately 30 seconds when run on the command prompt on mic0. When I use VTune to profile it:


  • Version 1 takes the same ~30 seconds, and hotspot analysis shows that most of the time is spent in __kmp_wait_sleep and __kmp_static_yield. Out of 7710 s of CPU time, 5804 s are spent in spin time.

  • Version 2 takes fooooorrrreevvvver... I killed it after it had run for a few minutes in VTune. Hotspot analysis shows that out of 25254 s of CPU time, 21585 s are spent in [vmlinux].

Can anyone shed some light on what's going on here and why I'm getting such bad performance? I'm using the default for OMP_NUM_THREADS and set KMP_AFFINITY=compact,granularity=fine (as recommended by Intel). I'm new to MKL and OpenMP, so I'm certain I'm making rookie mistakes.


Thanks, Andrew

Answer

The most probable reason for this behavior, given that most of the time is spent in the OS (vmlinux), is over-subscription caused by the nested OpenMP parallel regions inside MKL's implementations of cblas_dgemm() and dgesv. E.g. see this example: https://software.intel.com/sites/products/documentation/doclib/mkl_sa/11/tutorials/mkl_mmx_c/GUID-8DB79DF7-B853-46C9-8F46-C3782E0CA401.htm
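A common remedy, sketched below under the assumption that the outer loop over models provides enough parallelism on its own, is to disable MKL's internal threading so that only one level of parallelism remains (the binary name is hypothetical; the environment variables are the standard MKL/OpenMP ones):

```shell
# Run with MKL's internal threading disabled, so only the outer
# OpenMP loop over models runs in parallel (avoids over-subscription).
export MKL_NUM_THREADS=1    # one thread per MKL call (dgemm, dgesv)
export OMP_NESTED=false     # do not spawn nested OpenMP teams
./model_fit                 # hypothetical binary built from version 2
```

Equivalently, the program can link against the sequential MKL layer (e.g. `-mkl=sequential` with the Intel compiler) or call `mkl_set_num_threads(1)` before the parallel region.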

This explanation is supported and elaborated by Jim Dempsey on the Intel forum.
