Intel C++ compiler, ICC, seems to ignore SSE/AVX settings


Problem Description


I have recently downloaded and installed the Intel C++ compiler, Composer XE 2013, for Linux, which is free to use for non-commercial development: http://software.intel.com/en-us/non-commercial-software-development


I'm running on an Ivy Bridge system (which has AVX). I have two versions of a function that do the same thing. One does not use SSE/AVX. The other version uses AVX. With GCC the AVX code is about four times faster than the scalar code. However, with the Intel C++ compiler the performance is much worse. With GCC I compile like this

gcc m6.cpp -o m6_gcc -O3 -mavx -fopenmp -Wall -pedantic


With Intel I compile like this

icc m6.cpp -o m6_icc -O3 -mavx -fopenmp -Wall -pedantic


I'm only using OpenMP for timing (with omp_get_wtime()) at this point. The strange thing is that if I change the -mavx option to, say, -msse2, the code fails to compile with GCC but compiles just fine with ICC. In fact, I can drop -mavx altogether and it still compiles. It seems no matter what options I try it compiles but does not make optimal use of the AVX code. So I'm wondering if I'm doing something wrong in enabling/disabling SSE/AVX with ICC?
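For what it's worth, the timing is just wall-clock deltas around the kernel loop; here is a rough sketch of the pattern (the repeat count is a placeholder and the kernel call is elided):

#include <omp.h>
#include <cstdio>

int main() {
    const int repeat = 100000;  // placeholder matching the nvec 2000 run below
    double t0 = omp_get_wtime();
    for(int r=0; r<repeat; r++) {
        // ... run the scalar or AVX version over all nvec vectors ...
    }
    double t1 = omp_get_wtime();
    printf("time %f\n", t1 - t0);
    return 0;
}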


Here is the function with AVX that I'm using.

#include <immintrin.h>  // AVX intrinsics; same header works for gcc and icc

// For each of the four SIMD lanes, computes the quadratic form x^T M y,
// where x and y hold six packs of four doubles and M is a 6x6 matrix.
inline void prod_block4_unroll2_AVX(double *x, double *M, double *y, double *result) {
    // Two accumulators break the addition dependency chain (unroll by 2).
    __m256d sum4_1 = _mm256_set1_pd(0.0);  // 0.0, not 0.0f: these are doubles
    __m256d sum4_2 = _mm256_set1_pd(0.0);

    // Preload the six rows of y; _mm256_load_pd needs 32-byte alignment.
    __m256d yrow[6];
    for(int i=0; i<6; i++) {
        yrow[i] = _mm256_load_pd(&y[4*i]);
    }
    for(int i=0; i<6; i++) {
        __m256d x4 = _mm256_load_pd(&x[4*i]);
        for(int j=0; j<6; j+=2) {
            // Broadcast M[i][j] and M[i][j+1] across a register, then
            // accumulate x4 * M[i][j] * yrow[j] into the two sums.
            __m256d brod1 = _mm256_set1_pd(M[i*6 + j]);
            sum4_1 = _mm256_add_pd(sum4_1, _mm256_mul_pd(_mm256_mul_pd(x4, brod1), yrow[j]));
            __m256d brod2 = _mm256_set1_pd(M[i*6 + j+1]);
            sum4_2 = _mm256_add_pd(sum4_2, _mm256_mul_pd(_mm256_mul_pd(x4, brod2), yrow[j+1]));
        }
    }
    sum4_1 = _mm256_add_pd(sum4_1, sum4_2);
    _mm256_store_pd(result, sum4_1);  // result must also be 32-byte aligned
}
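In case it helps reproduce the setup, a minimal driver might look like the following (the buffer sizes and test values are mine, not from the original code). The key point is that _mm256_load_pd and _mm256_store_pd fault on unaligned pointers, hence _mm_malloc with 32-byte alignment:

#include <cstdio>

int main() {
    // 6 packs of 4 doubles for x and y, a 6x6 matrix M, 4 results
    double *x      = (double*)_mm_malloc(24 * sizeof(double), 32);
    double *y      = (double*)_mm_malloc(24 * sizeof(double), 32);
    double *M      = (double*)_mm_malloc(36 * sizeof(double), 32);
    double *result = (double*)_mm_malloc( 4 * sizeof(double), 32);

    for(int i=0; i<24; i++) { x[i] = 1.0; y[i] = 1.0; }
    for(int i=0; i<36; i++) { M[i] = 1.0; }

    prod_block4_unroll2_AVX(x, M, y, result);
    printf("%f %f %f %f\n", result[0], result[1], result[2], result[3]);
    // with all-ones input, every lane should print 36.0 (6 x 6 terms)

    _mm_free(x); _mm_free(y); _mm_free(M); _mm_free(result);
    return 0;
}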


Here is the timing information in seconds. I run over three sizes corresponding to the L1, L2, and L3 caches. I only get 4x in the L1 region. Note that ICC has much faster scalar code but slower AVX code.

GCC:
nvec 2000, repeat 100000
time scalar 5.847293
time SIMD 1.463820
time scalar/SIMD 3.994543

nvec 32000, repeat 10000
time scalar 9.529597
time SIMD 2.616296
time scalar/SIMD 3.642400
difference 0.000000

nvec 5000000, repeat 100
time scalar 15.105612
time SIMD 4.530891
time scalar/SIMD 3.333917
difference -0.000000

ICC:
nvec 2000, repeat 100000
time scalar 3.715568
time SIMD 2.025883
time scalar/SIMD 1.834049

nvec 32000, repeat 10000
time scalar 6.128615
time SIMD 3.509130
time scalar/SIMD 1.746477

nvec 5000000, repeat 100
time scalar 9.844096
time SIMD 5.782332
time scalar/SIMD 1.702444

Recommended Answer

Two points:


(1) It appears you are using Intel intrinsics in your code -- g++ and icpc do not necessarily implement the same intrinsics (though most of them overlap). Check the header files that need to be included (g++ may need a hint, such as a target flag, to define the intrinsics for you). Does g++ give an error message when it fails?
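For AVX intrinsics the umbrella header is immintrin.h on both compilers; this one-function check (the snippet is mine, not from the question) illustrates the difference in behavior:

#include <immintrin.h>  // umbrella header for SSE/AVX intrinsics on g++ and icpc

// g++ only defines __m256d and the _mm256_* intrinsics when AVX code
// generation is enabled (-mavx or a suitable -march), which would explain
// the failure under -msse2; icc accepts the intrinsics regardless of -m flags.
__m256d zero4() { return _mm256_setzero_pd(); }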


(2) The compiler flags do not mean that the instructions will be generated (from icc --help):

-msse3    May generate Intel(R) SSE3, SSE2, and SSE instructions


These flags are usually just hints to the compiler. You may want to look at -xHost and -fast.
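For example, something like this (same source file as above; the output name is mine). On an Ivy Bridge host, -xHost tells icc to target the highest instruction set the build machine supports, which includes AVX:

icc m6.cpp -o m6_icc -O3 -xHost -fopenmp -Wall -pedantic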


"It seems no matter what options I try it compiles but does not make optimal use of the AVX code."


How have you checked this? You may not see a 4x speedup if there are other bottlenecks (such as memory bandwidth).


EDIT (based on question edits):


It looks like icc scalar is faster than gcc scalar -- it is possible that icc is vectorizing the scalar code. If this is the case, I would not expect a 4x speedup from icc when manually coding the vectorization.


As for the difference between icc at 5.782332s and gcc at 4.530891s (for nvec 5000000): this is unexpected. Based on the information I have, I cannot tell why there is a difference in runtime between the two compilers. I would recommend looking at the emitted code (http://www.delorie.com/djgpp/v2faq/faq8_20.html) from both compilers. Also, make sure that your measurements are reproducible (e.g. memory layout on multi-socket machines, hot/cold caches, background processes, etc.).
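Both compilers can emit assembly directly with -S; for instance (the output file names are mine):

gcc -S -O3 -mavx -fopenmp m6.cpp -o m6_gcc.s
icc -S -O3 -xHost -fopenmp m6.cpp -o m6_icc.s

Searching the output for packed AVX instructions (vmulpd, vaddpd) versus scalar ones (mulsd, addsd) shows what each compiler actually generated.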
