MKL Performance on Intel Phi


Question

I have a routine that performs a few MKL calls on small matrices (50-100 x 1000 elements) to fit a model, which I then call for different models. In pseudo-code:

double doModelFit(int model, ...) {
   ...
   while( !done ) {
     cblas_dgemm(...);
     cblas_dgemm(...);
     ...
     dgesv(...);
     ...
   }
   return result;
}

int main(int argc, char **argv) {
  ...
  c_start = 1;  c_stop = nmodel;
  for(int c=c_start; c<c_stop; c++) {
    ...
    result = doModelFit(c, ...);
    ...
  }
}

Call the above version 1. Since the models are independent, I can use OpenMP threads to parallelize the model fitting, as follows (version 2):

int main(int argc, char **argv) {
  ...
  int numthreads = omp_get_max_threads();
#pragma omp parallel for
  for(int t=0; t<numthreads; t++) {
     // assuming nmodel divisible by numthreads...
     // chunk bounds declared locally so each thread has private copies
     int c_start = t*nmodel/numthreads + 1;
     int c_stop  = (t+1)*nmodel/numthreads;
     for(int c=c_start; c<c_stop; c++) {
        ...
        result = doModelFit(c, ...);
        ...
     }
  }
}

When I run version 1 on the host machine, it takes ~11 seconds and VTune reports poor parallelization with most of the time spent idle. Version 2 on the host machine takes ~5 seconds and VTune reports great parallelization (near 100% of the time is spent with 8 CPUs in use). Now, when I compile the code to run on the Phi card in native mode (with -mmic), versions 1 and 2 both take approximately 30 seconds when run on the command prompt on mic0. When I use VTune to profile it:


  • Version 1 takes the same ~30 seconds, and hotspot analysis shows that most of the time is spent in __kmp_wait_sleep and __kmp_static_yield. Out of 7710 s of CPU time, 5804 s are spent in spin time.

  • Version 2 takes fooooorrrreevvvver... I killed it after it had run for a few minutes in VTune. Hotspot analysis shows that out of 25254 s of CPU time, 21585 s are spent in [vmlinux].

Can anyone shed some light on what's going on here and why I'm getting such bad performance? I'm using the default for OMP_NUM_THREADS and set KMP_AFFINITY=compact,granularity=fine (as recommended by Intel). I'm new to MKL and OpenMP, so I'm certain I'm making rookie mistakes.


Thanks, Andrew

Answer

The most probable reason for this behavior, given that most of the time is spent in the OS (vmlinux), is over-subscription caused by the nested OpenMP parallel regions inside MKL's implementations of cblas_dgemm() and dgesv. E.g. see this example: https://software.intel.com/sites/products/documentation/doclib/mkl_sa/11/tutorials/mkl_mmx_c/GUID-8DB79DF7-B853-46C9-8F46-C3782E0CA401.htm
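A common remedy, sketched below under the assumption that the outer loop over models provides enough parallelism on its own, is to disable MKL's internal threading so that only one level of parallelism remains (the binary name is hypothetical; the environment variables are the standard MKL/OpenMP ones):

```shell
# Run with MKL's internal threading disabled, so only the outer
# OpenMP loop over models runs in parallel (avoids over-subscription).
export MKL_NUM_THREADS=1    # one thread per MKL call (dgemm, dgesv)
export OMP_NESTED=false     # do not spawn nested OpenMP teams
./model_fit                 # hypothetical binary built from version 2
```

Equivalently, the program can link against the sequential MKL layer (e.g. `-mkl=sequential` with the Intel compiler) or call `mkl_set_num_threads(1)` before the parallel region.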

This explanation is supported and elaborated by Jim Dempsey on the Intel forum.
