Segmentation fault for `vmovaps`

Question

I wrote code to add two arrays using KNC instructions (512-bit vectors) on an Intel Xeon Phi coprocessor. However, I get a segmentation fault in the inline assembly part.

Here is my code:

int main(int argc, char* argv[])
{
    int i;
    const int length = 65536;
    const int AVXLength = length / 16;
    float *A = (float*) aligned_malloc(length * sizeof(float), 64);
    float *B = (float*) aligned_malloc(length * sizeof(float), 64);
    float *C = (float*) aligned_malloc(length * sizeof(float), 64);
    for(i=0; i<length; i++){
            A[i] = 1;
            B[i] = 2;
    }

    float * pA = A;
    float * pB = B;
    float * pC = C;
    for(i=0; i<AVXLength; i++ ){
         __asm__("vmovaps %1,%%zmm0\n"
                    "vmovaps %2,%%zmm1\n"
                    "vaddps %%zmm0,%%zmm0,%%zmm1\n"
                    "vmovaps %%zmm0,%0;"
            : "=m" (pC) : "m" (pA), "m" (pB));

            pA += 512;
            pB += 512;
            pC += 512;
    }
    return 0;
}

I am using gcc as the compiler (because I can't afford the Intel compiler). This is the command line I use to compile the code:

k1om-mpss-linux-gcc add.c -o add.out


The problem was in the inline assembly. The following inline assembly fixed it.

__asm__("vmovaps %1,%%zmm1\n"
        "vmovaps %2,%%zmm2\n"
        "vaddps %%zmm1,%%zmm2,%%zmm3\n"
        "vmovaps %%zmm3,%0;"
        : "=m" (*pC) : "m" (*pA), "m" (*pB));
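
The fix works because "m" (*pA) names the memory that pA points to, whereas "m" (pA) names the pointer variable itself, so the original vmovaps attempted an aligned 64-byte access on the small slot that holds the pointer. A separate thing worth checking in the original loop is the stride: pointer arithmetic on a float* counts elements, not bits, so one 512-bit vector corresponds to an advance of 16 floats, not 512. The sketch below is portable C rather than KNC code; the helper names end_index and stays_in_bounds are made up for illustration, and it assumes a 4-byte float.

```c
#include <assert.h>
#include <stddef.h>

/* Floats held by one 512-bit vector: 64 bytes / 4 bytes per float = 16
   (assumes sizeof(float) == 4). */
enum { FLOATS_PER_VECTOR = 64 / sizeof(float) };

/* Hypothetical helper: index one past the last float touched by a loop that
   runs `iterations` times, advances a float pointer by `stride` elements per
   iteration, and accesses FLOATS_PER_VECTOR floats each time. */
static size_t end_index(size_t iterations, size_t stride)
{
    assert(iterations > 0);
    return (iterations - 1) * stride + FLOATS_PER_VECTOR;
}

/* Returns 1 if such a loop stays inside an array of `length` floats. */
static int stays_in_bounds(size_t length, size_t iterations, size_t stride)
{
    return end_index(iterations, stride) <= length;
}
```

With the question's numbers (length 65536, 4096 iterations), stays_in_bounds reports that a stride of 16 ends exactly at the end of the array, while a stride of 512 runs far past it.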

Recommended answer

As already explained, Knights Corner (KNC) does not have AVX512. However, it does have something similar. It turns out that the KNC vs. AVX512 issue is a red herring here. The problem is in the OP's inline assembly.

Instead of using inline assembly, I suggest you use intrinsics. The KNC intrinsics are described in the online Intel Intrinsics Guide.

Additionally, Przemysław Karpiński at CERN extended Agner Fog's Vector Class Library to support KNC. You can find the git repository here. If you look in the file vectorf512_mic.h you can learn a lot about the KNC intrinsics.

I converted your code to use these intrinsics (which turn out in this case to be the same as the AVX512 intrinsics):

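#include <immintrin.h>  /* _mm512_* intrinsics */

/* aligned_malloc is not a standard function; it is assumed to be a helper
   that returns 64-byte-aligned memory. */
int main(int argc, char* argv[])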
{
    int i;
    const int length = 65536;
    const int AVXLength = length /16;
    float *A = (float*) aligned_malloc(length * sizeof(float), 64);
    float *B = (float*) aligned_malloc(length * sizeof(float), 64);
    float *C = (float*) aligned_malloc(length * sizeof(float), 64);
    for(i=0; i<length; i++){
        A[i] = 1;
        B[i] = 2;
    }
    for(i=0; i<AVXLength; i++ ){
        __m512 a16 = _mm512_load_ps(&A[16*i]);
        __m512 b16 = _mm512_load_ps(&B[16*i]);
        __m512 s16 = _mm512_add_ps(a16,b16);
        _mm512_store_ps(&C[16*i], s16);
    }
    return 0;
}
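
Since aligned_malloc is not a standard function and the vector loop only runs on hardware with 512-bit support, a plain C reference can help when checking results elsewhere. The following is a minimal sketch, assuming a C11 library: alloc_floats_aligned64, is_aligned64 and add_arrays_scalar are hypothetical names, and aligned_alloc is used here only as one possible stand-in for aligned_malloc.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for aligned_malloc(): returns `count` floats aligned
   to 64 bytes. C11 aligned_alloc wants the size to be a multiple of the
   alignment, so the byte count is rounded up. Release with free(). */
static float *alloc_floats_aligned64(size_t count)
{
    assert(count <= (SIZE_MAX - 63) / sizeof(float)); /* guard against overflow */
    size_t bytes = count * sizeof(float);
    size_t rounded = (bytes + 63) / 64 * 64;
    return (float *)aligned_alloc(64, rounded);
}

/* Returns 1 if p is suitably aligned for a 64-byte aligned load or store. */
static int is_aligned64(const void *p)
{
    return ((uintptr_t)p % 64) == 0;
}

/* Scalar reference for the vector loop: C[i] = A[i] + B[i]. */
static void add_arrays_scalar(const float *A, const float *B, float *C, size_t n)
{
    for (size_t i = 0; i < n; i++)
        C[i] = A[i] + B[i];
}
```

Filling A with 1 and B with 2 as in the question, every element of C should come out as 3, which gives a simple check against the output of the intrinsics version.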


The KNC intrinsics are supported only by ICC. However, KNC comes with the Manycore Platform Software Stack (MPSS), which includes a special version of gcc, k1om-mpss-linux-gcc, that can use the AVX512-like features of KNC through inline assembly.

The mnemonics for KNC and AVX512 are the same in this case. Therefore we can use AVX512 intrinsics to discover the assembly to use:

#include <immintrin.h>

void foo(float *A, float *B, float *C) {
    __m512 a16 = _mm512_load_ps(A);
    __m512 b16 = _mm512_load_ps(B);
    __m512 s16 = _mm512_add_ps(a16,b16);
    _mm512_store_ps(C, s16);
}

Compiling this with gcc -O3 -mavx512f knc.c produces:

vmovaps (%rdi), %zmm0
vaddps  (%rsi), %zmm0, %zmm0
vmovaps %zmm0, (%rdx)

From this, one solution using inline assembly would be:

__asm__ __volatile__ (
        "vmovaps   (%1), %%zmm0\n"
        "vaddps    (%2), %%zmm0, %%zmm0\n"
        "vmovaps   %%zmm0, (%0)"
        :
        : "r" (pC), "r" (pA), "r" (pB)
        : "memory"   /* the asm stores through pC, which the compiler cannot see */
);


With the previous code, GCC generates a separate add instruction to advance each array pointer. Here is a better solution using an index register, which generates only one add:

for(i=0; i<length; i+=16){
    __asm__ __volatile__ (
            "vmovaps   (%1,%3,4), %%zmm0\n"
            "vaddps    (%2,%3,4), %%zmm0, %%zmm0\n"
            "vmovaps   %%zmm0, (%0,%3,4)"
            :
            : "r" (C), "r" (A), "r" (B), "r" ((long)i)  /* index must be in a 64-bit register */
            : "memory"
    );
}
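
The memory operands in this loop use the AT&T form (base,index,scale), which addresses base + index * scale. With a scale of 4, the size of a float, the operand built from A and i refers to &A[i], which is why a single counter can serve all three arrays. As a rough illustration in portable C (effective_address is a hypothetical helper, not part of any library):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Address computed by an x86 operand written (base,index,scale) in AT&T
   syntax: base + index * scale. The hardware allows scale 1, 2, 4 or 8. */
static uintptr_t effective_address(const void *base, size_t index, size_t scale)
{
    assert(scale == 1 || scale == 2 || scale == 4 || scale == 8);
    return (uintptr_t)base + index * scale;
}
```

For a float array A, effective_address(A, 16, sizeof(float)) equals the address of A[16], the start of the second 16-float vector.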


The latest version of the MPSS (3.6) includes GCC 5.1.1, which supports AVX512 intrinsics. So I think you can use AVX512 intrinsics wherever they are the same as the KNC intrinsics, and use inline assembly only where they differ. The Intel Intrinsics Guide shows that they are the same in most cases.
