How to auto-vectorize an array comparison function

Question

This is a follow-up to this post. Disclaimer: I have done zero profiling and don't even have an application; this is purely for me to learn more about vectorization.

My code is below. I am compiling with gcc 4.9.4 on a machine with an i3 M370. The first loop vectorizes as I expect. However, as far as I can tell, the second loop, which checks each element of temp, is not vectorized: it compiles to a chain of scalar "andb" instructions. I expected it to be vectorized with something like _mm_test_all_ones. How can that loop be vectorized as well? Second question: I really want this to be part of a larger loop. If I uncomment the outer loop that is commented out below, nothing gets vectorized. How can I get that vectorized too?

#define ARR_LENGTH 4096
#define block_size 4
typedef float afloat __attribute__ ((__aligned__(16)));

char all_equal_2(afloat *a, afloat *b){
    unsigned int i, j;
    char r = 1;
    unsigned int temp[block_size] __attribute__((aligned(16)));
    //for (i=0; i<ARR_LENGTH; i+=block_size){

        for (j = 0; j < block_size; ++j) {
            temp[j] = (*a) == (*b);
            a++;
            b++;
        }

        for (j=0; j<block_size; j++){
            r &= temp[j];
        }

        /*if (r == 0){
            break;
        }
    }*/
    return r;
}

Key part of the resulting assembly:

.cfi_startproc
    movaps  (%rdi), %xmm0
    cmpeqps (%rsi), %xmm0
    movdqa  .LC0(%rip), %xmm1
    pand    %xmm0, %xmm1
    movaps  %xmm1, -24(%rsp)
    movl    -24(%rsp), %eax
    andl    $1, %eax
    andb    -20(%rsp), %al
    andb    -16(%rsp), %al
    andb    -12(%rsp), %al
    ret
    .cfi_endproc

Update: This post is similar to my first question. In that question the vector was a raw pointer, so segfaults were possible, but here that isn't a concern. Therefore, AFAIK, reordering the comparison operations is safe here but not there. The conclusion is probably the same, though.
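For reference, the kind of vectorized 4-element check the question is hoping for can be written by hand. The following is only a minimal sketch: `all_equal_4` is a hypothetical name that does not appear in the post, and it uses `_mm_movemask_ps` rather than `_mm_test_all_ones` (which is an SSE4.1 intrinsic on integer vectors) so that it builds with baseline SSE. A scalar fallback is included for targets without SSE.

```c
#include <stddef.h>
#if defined(__SSE__)
#include <xmmintrin.h>  /* SSE: _mm_loadu_ps, _mm_cmpeq_ps, _mm_movemask_ps */
#endif

/* Hypothetical helper, not from the original post: returns 1 when the four
   floats at a and b compare equal lane by lane, 0 otherwise. */
static int all_equal_4(const float *a, const float *b)
{
#if defined(__SSE__)
    __m128 va = _mm_loadu_ps(a);        /* unaligned loads, so no alignment assumption */
    __m128 vb = _mm_loadu_ps(b);
    __m128 eq = _mm_cmpeq_ps(va, vb);   /* each lane: all ones if equal, else zero */
    return _mm_movemask_ps(eq) == 0xF;  /* one sign bit per lane; 0xF means all four matched */
#else
    /* Scalar fallback for targets without SSE. */
    size_t j;
    int r = 1;
    for (j = 0; j < 4; ++j)
        r &= (a[j] == b[j]);
    return r;
#endif
}
```

Calling `all_equal_4(a, b)` on two 4-float blocks replaces both loops in `all_equal_2`: the compare produces a per-lane mask, and the movemask collapses it to a single integer test instead of a chain of byte-wise ANDs.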

Answer

Autovectorization really likes reduction operations, so the trick was to turn this into a reduction: count the matching elements with a sum, then compare the count against the block size.

#define ARR_LENGTH 4096
typedef float afloat __attribute__ ((__aligned__(16)));
int foo(afloat *a, afloat *b){
    unsigned int i, j;
    unsigned int result;
    unsigned int blocksize = 4;
    for (i=0; i<ARR_LENGTH; i+=blocksize){
        result = 0;
        for (j=0; j<blocksize; j++){
            result += (*a) == (*b);
            a++;
            b++;
        }
        if (result == blocksize){
            blocksize *= 2;
        } else {
            break;
        }
    }
    blocksize = ARR_LENGTH - i;
    for (i=0; i<blocksize; i++){
        result += (*a) == (*b);
        a++;
        b++;
    }
    return result == i;
}

This compiles into a nice vectorized loop:

.L3:
        movaps  (%rdi,%rax), %xmm1
        addl    $1, %ecx
        cmpeqps (%rsi,%rax), %xmm1
        addq    $16, %rax
        cmpl    %r8d, %ecx
        psubd   %xmm1, %xmm0
        jb      .L3
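As written, the listing above appears to let `i` and the pointers drift apart: `i` advances by the already doubled `blocksize` while `a` and `b` advance by the old one, and the tail loop starts from pointers that have already moved past the failing block, so it can read beyond `ARR_LENGTH`. The following is a bounds-safe sketch of the same reduction idea, not the answer's code. The fixed block size `BLOCK`, the function name `all_equal_blocked`, and the explicit length parameter `n` are all assumptions made for illustration, and whether a given GCC version vectorizes the inner loop has not been verified here.

```c
#include <stddef.h>

#define BLOCK 64  /* hypothetical block size; not from the original post */

/* Returns 1 when a[0..n) and b[0..n) compare equal element by element. */
static int all_equal_blocked(const float *a, const float *b, size_t n)
{
    size_t i = 0;
    while (i < n) {
        /* Clamp the last block so the loop never reads past n. */
        size_t len = (n - i < BLOCK) ? (n - i) : BLOCK;
        unsigned int matches = 0;
        size_t j;

        /* Branch-free sum reduction: the shape the answer relies on. */
        for (j = 0; j < len; ++j)
            matches += (a[i + j] == b[i + j]);

        if (matches != len)
            return 0;   /* early exit happens only between blocks */
        i += len;
    }
    return 1;
}
```

Keeping the early exit outside the inner loop is the design point: the inner loop has a single exit and a simple accumulator, while a mismatch still stops the scan after at most one extra block.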
