Segmentation fault for `vmovaps`

Problem description

I wrote some code that adds two arrays using KNC instructions (512-bit vectors) on an Intel Xeon Phi coprocessor. However, I get a segmentation fault in the inline assembly part.

Here is my code:

int main(int argc, char* argv[])
{
    int i;
    const int length = 65536;
    const int AVXLength = length / 16;
    float *A = (float*) aligned_malloc(length * sizeof(float), 64);
    float *B = (float*) aligned_malloc(length * sizeof(float), 64);
    float *C = (float*) aligned_malloc(length * sizeof(float), 64);
    for(i=0; i<length; i++){
            A[i] = 1;
            B[i] = 2;
    }

    float * pA = A;
    float * pB = B;
    float * pC = C;
    for(i=0; i<AVXLength; i++ ){
         __asm__("vmovaps %1,%%zmm0\n"
                    "vmovaps %2,%%zmm1\n"
                    "vaddps %%zmm0,%%zmm0,%%zmm1\n"
                    "vmovaps %%zmm0,%0;"
            : "=m" (pC) : "m" (pA), "m" (pB));

            pA += 512;
            pB += 512;
            pC += 512;
    }
    return 0;
}

I am using gcc as my compiler (because I cannot afford the Intel compiler). This is the command line I use to compile the code:

k1om-mpss-linux-gcc add.c -o add.out


The problem was in the inline assembly. The following inline assembly fixed it.

__asm__("vmovaps %1,%%zmm1\n"
        "vmovaps %2,%%zmm2\n"
        "vaddps %%zmm1,%%zmm2,%%zmm3\n"
        "vmovaps %%zmm3,%0;"
        : "=m" (*pC) : "m" (*pA), "m" (*pB));
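
As a hedged illustration of why the operands matter: in GCC extended asm, an "m" operand refers to the storage of the C expression in the parentheses. Writing "m" (pA) therefore names the 8-byte pointer variable itself, while "m" (*pA) names the float it points to. In the original loop this would mean that vmovaps read 64 bytes starting at the slot holding pA (which is generally not 64-byte aligned) and wrote 64 bytes over the slot holding pC, which is consistent with the crash. The sketch below shows the same distinction with ordinary scalar x86-64 instructions so that it can run on a host machine. The function names are made up for this example, and the fallback branch for other targets is plain C.

```c
#include <stdint.h>

/* Correct form: "m"(*p) names the float that p points to. */
float load_pointee(const float *p)
{
#if defined(__x86_64__) && defined(__GNUC__)
    float out;
    __asm__("movss %1, %0" : "=x"(out) : "m"(*p));
    return out;
#else
    return *p;                      /* portable fallback, no inline asm */
#endif
}

/* What "m"(p) names: the pointer variable's own storage.
 * Reading 4 bytes from it yields the low 32 bits of the address,
 * not the data the pointer refers to. */
uint32_t load_pointer_bits(const float *p)
{
#if defined(__x86_64__) && defined(__GNUC__)
    uint32_t out;
    __asm__("movl %1, %0" : "=r"(out) : "m"(p));
    return out;
#else
    return (uint32_t)(uintptr_t)p;  /* same value, computed in C */
#endif
}
```

Calling load_pointee(v) on an array v returns v[0], whereas load_pointer_bits(v) returns part of the address of v, which is the kind of value the original operands handed to vmovaps.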

Solution

As already explained, Knights Corner (KNC) does not have AVX512. However, it does have something similar. It turns out that the KNC vs AVX512 issue is a red herring here. The problem is in the OP's inline assembly.

Instead of using inline assembly, I suggest you use intrinsics. The KNC intrinsics are described in the online Intel Intrinsics Guide.

Additionally, Przemysław Karpiński at CERN extended Agner Fog's Vector Class Library to use KNC. The code is available in a git repository. If you look in the file vectorf512_mic.h (https://bitbucket.org/veclibknc/vclknc/src/3237cd189f019994975e6b73fd23faa715e53112/vectorf512_mic.h?at=master&fileviewer=file-view-default) you can learn a lot about the KNC intrinsics.

I converted your code to use these intrinsics (which in this case turn out to be the same as the AVX512 intrinsics):

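#include <immintrin.h>  /* __m512, _mm512_load_ps, _mm512_add_ps, _mm512_store_ps */

/* aligned_malloc() is the asker's own helper (not shown);
   it must return 64-byte aligned memory. */
int main(int argc, char* argv[])
{
    int i;
    const int length = 65536;
    const int AVXLength = length / 16;   /* 16 floats per 512-bit vector */
    float *A = (float*) aligned_malloc(length * sizeof(float), 64);
    float *B = (float*) aligned_malloc(length * sizeof(float), 64);
    float *C = (float*) aligned_malloc(length * sizeof(float), 64);
    for(i=0; i<length; i++){
        A[i] = 1;
        B[i] = 2;
    }
    for(i=0; i<AVXLength; i++ ){
        /* aligned 64-byte loads, one vector add, aligned 64-byte store */
        __m512 a16 = _mm512_load_ps(&A[16*i]);
        __m512 b16 = _mm512_load_ps(&B[16*i]);
        __m512 s16 = _mm512_add_ps(a16,b16);
        _mm512_store_ps(&C[16*i], s16);
    }
    return 0;
}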
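
The loop above relies on two details that are easy to get wrong: the buffers must be 64-byte aligned for the aligned load and store, and each step advances by 16 floats (512 bits / 32 bits), because pointer arithmetic on a float* counts elements, not bits. The following is a minimal host-side sketch of both points. aligned_malloc is not a standard function, so posix_memalign is used here as an assumed stand-in, and all function names are hypothetical. The scalar add_block16 only models what one 512-bit add computes, which makes it usable as a reference when checking a vectorized version.

```c
#define _POSIX_C_SOURCE 200112L   /* for posix_memalign */
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define VEC_FLOATS 16   /* 512 bits / 32 bits per float */
#define VEC_ALIGN  64   /* 512 bits = 64 bytes */

/* Assumed stand-in for the question's aligned_malloc(): returns
 * storage for `count` floats aligned to 64 bytes, or NULL on failure. */
float *alloc_floats_aligned(size_t count)
{
    void *p = NULL;
    if (posix_memalign(&p, VEC_ALIGN, count * sizeof(float)) != 0)
        return NULL;
    return (float *)p;
}

/* Scalar model of one 512-bit vector add: 16 independent lanes. */
void add_block16(float *c, const float *a, const float *b)
{
    for (int lane = 0; lane < VEC_FLOATS; lane++)
        c[lane] = a[lane] + b[lane];
}

/* Walks the arrays in steps of 16 floats, as the intrinsics loop does. */
void add_arrays(float *c, const float *a, const float *b, size_t length)
{
    assert(length % VEC_FLOATS == 0);
    assert((uintptr_t)a % VEC_ALIGN == 0);
    assert((uintptr_t)b % VEC_ALIGN == 0);
    assert((uintptr_t)c % VEC_ALIGN == 0);
    for (size_t i = 0; i < length; i += VEC_FLOATS)
        add_block16(&c[i], &a[i], &b[i]);
}
```

With buffers from alloc_floats_aligned filled with 1 and 2, add_arrays leaves 3 in every element of the output, which is also what the intrinsics version should produce.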


The KNC intrinsics are only supported by ICC. However, KNC comes with the Manycore Platform Software Stack (MPSS), which includes a special version of gcc, k1om-mpss-linux-gcc, that can use the AVX512-like features of KNC through inline assembly.


The mnemonics for KNC and AVX512 are the same in this case. Therefore we can use AVX512 intrinsics to discover the assembly to use:

void foo(float *A, float *B, float *C) {
    __m512 a16 = _mm512_load_ps(A);
    __m512 b16 = _mm512_load_ps(B);
    __m512 s16 = _mm512_add_ps(a16,b16);
    _mm512_store_ps(C, s16);
}

and gcc -O3 -mavx512f -S knc.c produces

vmovaps (%rdi), %zmm0
vaddps  (%rsi), %zmm0, %zmm0
vmovaps %zmm0, (%rdx)

From this, one solution using inline assembly would be:

__asm__("vmovaps   (%1), %%zmm0\n"
        "vaddps    (%2), %%zmm0, %%zmm0\n"
        "vmovaps   %%zmm0, (%0)"
        :
        : "r" (pC), "r" (pA), "r" (pB)
        : "memory"
);


With the previous code, GCC generates a separate add instruction to advance each of the three pointers. Here is a better solution that uses an index register, so only one add (for the index) is generated per iteration.

for(i=0; i<length; i+=16){
    __asm__ __volatile__ (
            "vmovaps   (%1,%3,4), %%zmm0\n"
            "vaddps    (%2,%3,4), %%zmm0, %%zmm0\n"
            "vmovaps   %%zmm0, (%0,%3,4)"
            :
            /* the index must be 64 bits wide to be used with 64-bit base registers */
            : "r" (C), "r" (A), "r" (B), "r" ((long) i)
            : "memory"
     );
 }
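
Since KNC instructions cannot run on an ordinary x86-64 host (and KNC itself does not implement SSE), the indexed addressing pattern can be tried out with an SSE analogue that handles 4 floats per step instead of 16. This is only a sketch under that assumption. The function name is hypothetical, the operand layout mirrors the loop above (base pointers in %0 to %2, a 64-bit index in %3, scale 4 for sizeof(float)), and targets other than x86-64 with a GNU-compatible compiler fall through to the scalar loop.

```c
#include <stddef.h>

/* c[i] = a[i] + b[i] using base + index*4 addressing in inline asm. */
void add_indexed(float *c, const float *a, const float *b, long n)
{
    long i = 0;
#if defined(__x86_64__) && defined(__GNUC__)
    /* i is a 64-bit long, so %3 expands to a 64-bit register name
     * and can be combined with the 64-bit base registers. */
    for (; i + 4 <= n; i += 4) {
        __asm__ __volatile__(
            "movups (%1,%3,4), %%xmm0\n\t"
            "movups (%2,%3,4), %%xmm1\n\t"
            "addps  %%xmm1, %%xmm0\n\t"
            "movups %%xmm0, (%0,%3,4)"
            :
            : "r"(c), "r"(a), "r"(b), "r"(i)
            : "xmm0", "xmm1", "memory");
    }
#endif
    /* Scalar tail, and the whole loop on other targets. */
    for (; i < n; i++)
        c[i] = a[i] + b[i];
}
```

The "memory" clobber tells the compiler that the asm reads and writes memory it cannot see through the operands, and the register clobbers declare the scratch registers the asm overwrites.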


The latest version of MPSS (3.6) includes GCC 5.1.1, which supports AVX512 intrinsics. So I think you can use AVX512 intrinsics wherever they are the same as the KNC intrinsics, and only fall back to inline assembly where they differ. Looking at the Intel Intrinsics Guide shows that they are the same in most cases.
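
One way to organize such a mixed approach is to select the implementation at compile time. The sketch below is an assumption about how that could look, not a tested KNC build. It keys on __AVX512F__, the macro that mainline GCC defines under -mavx512f. Whether the MPSS toolchain defines it when targeting KNC would need to be checked, and the function name is hypothetical. It uses only the aligned load, add and store intrinsics that appear earlier in this answer, and falls back to scalar code elsewhere.

```c
#include <stddef.h>
#if defined(__AVX512F__)
#include <immintrin.h>
#endif

/* c[i] = a[i] + b[i]; a, b and c must be 64-byte aligned
 * when the vector path is compiled in. */
void add_f32(float *c, const float *a, const float *b, size_t n)
{
    size_t i = 0;
#if defined(__AVX512F__)
    /* Vector path: 16 floats per iteration. */
    for (; i + 16 <= n; i += 16) {
        __m512 va = _mm512_load_ps(a + i);
        __m512 vb = _mm512_load_ps(b + i);
        _mm512_store_ps(c + i, _mm512_add_ps(va, vb));
    }
#endif
    /* Scalar path: remaining elements, or everything if no vector path. */
    for (; i < n; i++)
        c[i] = a[i] + b[i];
}
```

An inline assembly variant for instructions that differ between KNC and AVX512 could be added as a further preprocessor branch in the same function.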
