ç与VDSP与NEON - NEON怎么能像C一样慢? [英] C versus vDSP versus NEON - How could NEON be as slow as C?

查看:907
本文介绍了ç与VDSP与NEON - NEON怎么能像C一样慢?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

NEON怎么能慢为C?

How could NEON be as slow as C?

我一直在试图建立一个快速的直方图功能,将通过赋予它们的值斗传入值到范围 - 这就是范围门槛,他们是最接近。这一点是将被应用到图像,以便它必须要快(假设为640x480所以300000元件的图像阵列)。直方图范围数字的倍数(0,25,50,75,100)。输入将浮子和最终产出显然是整数

I have been trying to build a fast Histogram function that would bucket incoming values into ranges by assigning them a value - which is the range threshold they are closest to. This is something that would be applied to images so it would have to be fast (assume an image array of 640x480 so 300,000 elements) . The histogram range numbers are multiples (0,25,50,75,100) . Inputs would be float and final outputs would obviously be integers

我打开一个新的空项目(无应用程序委托),只是使用main.m文件测试,在X code以下的版本。我删除了所有链接库与加快除外。

I tested the following versions on xCode by opening a new empty project (no app delegate) and just using the main.m file. I removed all linked libraries with the exception of Accelerate.

下面是C实现:旧版本是足够的,如果再不过这里是最后的优化的逻辑。花了11S和300毫秒。

Here is the C implementation: the older version was plenty of if then but here is the final optimized logic. it took 11s and 300ms.

int main(int argc, char *argv[])
{
  NSLog(@"starting");

  int sizeOfArray=300000;

  float* inputArray=(float*) malloc(sizeof(float)*sizeOfArray);
  int* outputArray=(int*) malloc(sizeof(int)*sizeOfArray);

  for (int i=0; i<sizeOfArray; ++i)
  {
    inputArray[i]=88.5;
  }

  //Assume range is [0,25,50,75,100]
  int lcd=25;

  for (int j=0; j<1000; ++j)// just to get some good time interval
  {
    for (int i=0; i<sizeOfArray; ++i)
    {
        //a 60.5 would give a 50. An 88.5 would give 100
        outputArray[i]=roundf(inputArray[i]/lcd)*lcd;
    }
  }
NSLog(@"done");
}

下面是VDSP实施。即使有一些繁琐的浮动整数来回,只用了6秒!近50%的改善!

Here is the vDSP implementation. Even with some of the tedious floating to integer back and forth, it took only 6s! almost 50% improvement!

//vDSP implementation
 int main(int argc, char *argv[])
 {
   NSLog(@"starting");

   int sizeOfArray=300000;

   float* inputArray=(float*) malloc(sizeof(float)*sizeOfArray);
   float* outputArrayF=(float*) malloc(sizeof(float)*sizeOfArray);//vDSP requires matching of input output
   int* outputArray=(int*) malloc(sizeof(int)*sizeOfArray); //rounded value to the nearest integere
   float* finalOutputArrayF=(float*) malloc(sizeof(float)*sizeOfArray);
   int* finalOutputArray=(int*) malloc(sizeof(int)*sizeOfArray); //to compare apples to apples scenarios output


   for (int i=0; i<sizeOfArray; ++i)
   {
     inputArray[i]=37.0; //this will produce an final number of 25. On the other hand 37.5 would produce 50.
   }


   for (int j=0; j<1000; ++j)// just to get some good time interval
   {
     //Assume range is [0,25,50,75,100]
     float lcd=25.0f;

     //divide by lcd
     vDSP_vsdiv(inputArray, 1, &lcd, outputArrayF, 1,sizeOfArray);

     //Round to nearest integer
     vDSP_vfixr32(outputArrayF, 1,outputArray, 1, sizeOfArray);

     // MUST convert int to float (cannot just cast) then multiply by scalar - This step has the effect of rounding the number to the nearest lcd.
    vDSP_vflt32(outputArray, 1, outputArrayF, 1, sizeOfArray);
    vDSP_vsmul(outputArrayF, 1, &lcd, finalOutputArrayF, 1, sizeOfArray);
    vDSP_vfix32(finalOutputArrayF, 1, finalOutputArray, 1, sizeOfArray);
   }
  NSLog(@"done");
}

下面是霓虹灯的实现。这是我第一次这么玩好!比VDSP较慢,耗时9秒,300毫秒这没有道理给我。无论是VDSP优于NEON优化或我做错了什么。

Here is the Neon implementation. This is my first so play nice! it was slower than vDSP and took 9 sec and 300ms which did not make sense to me. Either vDSP is better optimized than NEON or I am doing something wrong.

//NEON implementation
int main(int argc, char *argv[])
{
NSLog(@"starting");

int sizeOfArray=300000;

float* inputArray=(float*) malloc(sizeof(float)*sizeOfArray);
float* finalOutputArrayF=(float*) malloc(sizeof(float)*sizeOfArray);

for (int i=0; i<sizeOfArray; ++i)
{
    inputArray[i]=37.0; //this will produce an final number of 25. On the other hand 37.5 would produce 50.
}



for (int j=0; j<1000; ++j)// just to get some good time interval
{
    float32x4_t c0,c1,c2,c3;
    float32x4_t e0,e1,e2,e3;
    float32x4_t f0,f1,f2,f3;

    //ranges of histogram buckets
    float32x4_t buckets0=vdupq_n_f32(0);
    float32x4_t buckets1=vdupq_n_f32(25);
    float32x4_t buckets2=vdupq_n_f32(50);
    float32x4_t buckets3=vdupq_n_f32(75);
    float32x4_t buckets4=vdupq_n_f32(100);

    //midpoints of ranges
    float32x4_t thresholds1=vdupq_n_f32(12.5);
    float32x4_t thresholds2=vdupq_n_f32(37.5);
    float32x4_t thresholds3=vdupq_n_f32(62.5);
    float32x4_t thresholds4=vdupq_n_f32(87.5);


    for (int i=0; i<sizeOfArray;i+=16)
    {
        c0= vld1q_f32(&inputArray[i]);//load
        c1= vld1q_f32(&inputArray[i+4]);//load
        c2= vld1q_f32(&inputArray[i+8]);//load
        c3= vld1q_f32(&inputArray[i+12]);//load


        f0=buckets0;
        f1=buckets0;
        f2=buckets0;
        f3=buckets0;

        //register0
        e0=vcgtq_f32(c0,thresholds1);
        f0=vbslq_f32(e0, buckets1, f0);

        e0=vcgtq_f32(c0,thresholds2);
        f0=vbslq_f32(e0, buckets2, f0);

        e0=vcgtq_f32(c0,thresholds3);
        f0=vbslq_f32(e0, buckets3, f0);

        e0=vcgtq_f32(c0,thresholds4);
        f0=vbslq_f32(e0, buckets4, f0);



        //register1
        e1=vcgtq_f32(c1,thresholds1);
        f1=vbslq_f32(e1, buckets1, f1);

        e1=vcgtq_f32(c1,thresholds2);
        f1=vbslq_f32(e1, buckets2, f1);

        e1=vcgtq_f32(c1,thresholds3);
        f1=vbslq_f32(e1, buckets3, f1);

        e1=vcgtq_f32(c1,thresholds4);
        f1=vbslq_f32(e1, buckets4, f1);


        //register2
        e2=vcgtq_f32(c2,thresholds1);
        f2=vbslq_f32(e2, buckets1, f2);

        e2=vcgtq_f32(c2,thresholds2);
        f2=vbslq_f32(e2, buckets2, f2);

        e2=vcgtq_f32(c2,thresholds3);
        f2=vbslq_f32(e2, buckets3, f2);

        e2=vcgtq_f32(c2,thresholds4);
        f2=vbslq_f32(e2, buckets4, f2);


        //register3
        e3=vcgtq_f32(c3,thresholds1);
        f3=vbslq_f32(e3, buckets1, f3);

        e3=vcgtq_f32(c3,thresholds2);
        f3=vbslq_f32(e3, buckets2, f3);

        e3=vcgtq_f32(c3,thresholds3);
        f3=vbslq_f32(e3, buckets3, f3);

        e3=vcgtq_f32(c3,thresholds4);
        f3=vbslq_f32(e3, buckets4, f3);


        vst1q_f32(&finalOutputArrayF[i], f0);
        vst1q_f32(&finalOutputArrayF[i+4], f1);
        vst1q_f32(&finalOutputArrayF[i+8], f2);
        vst1q_f32(&finalOutputArrayF[i+12], f3);
    }
}
NSLog(@"done");
}

PS:这是我对这个第一次大规模标杆所以我试图保持它的简单(大循环,设置code不变,使用的NSLog打印开始/结束时间,只有加快框架链接)。如果这些假设都显著影响的结果,请批评。

PS: this is my first benchmarking on this scale so I tried to keep it simple (large loops, setup code constant, using NSlog to print start/end time, only accelerate framework linked). If any of these assumptions are significantly impacting the outcome, please critique.

感谢

推荐答案

首先,这是不是NEON每本身。这是内部函数。这几乎是不可能得到使用下铛或GCC内在良好NEON性能。如果你认为你需要的内部函数,你应该手工编写汇编。

First, this is not "NEON" per-se. This is intrinsics. It is almost impossible to get good NEON performance using intrinsics under clang or gcc. If you think you need intrinsics, you should hand-write the assembler.

VDSP不是更好地优化比NEON。 VDSP iOS上使用的NEON处理器。 VDSP的使用NEON的比你使用NEON是更好的优化。

vDSP is not "better optimized" than NEON. vDSP on iOS uses the NEON processor. vDSP's use of the NEON is much better optimized than your use of the NEON.

我没有通过你的内在code挖还,但麻烦的最有可能的(事实上几乎可以肯定),原因是你创建的等待状态。在汇编写作(和内在都只是组装与焊接手套写的),没有像在C写你不循环一样。你不比较相同。你需要一种新的思维方式。在汇编你可以一次做多件事情(因为你有不同的逻辑单元),但你绝对要安排在这样所有的这些东西可以并行运行的方式事情。组装好保持所有这些管道满。如果你可以看到你的code和它非常有意义,它可能是废话组装code。如果你从不重复自己,它可能是废话组装code。你需要仔细考虑到底是怎么回事成什么寄存器和多少个周期有,直到你被允许阅读。

I haven't dug through your intrinsics code yet, but the most likely (in fact almost certain) cause of trouble is that you're creating wait states. Writing in assembler (and intrinsics are just assembler written with welding gloves on), is nothing like writing in C. You don't loop the same. You don't compare the same. You need a new way of thinking. In assembly you can do more than one thing at a time (because you have different logic units), but you absolutely have to schedule things in such a way that all those things can run in parallel. Good assembly keeps all those pipelines full. If you can read your code and it makes perfect sense, it's probably crap assembly code. If you never repeat yourself, it's probably crap assembly code. You need to carefully consider what is going into what register and at how many cycles there are until you're allowed to read it.

如果它是作为音译C作为简单的,那么编译器会为你做的。你说我要去NEON写这篇那一刻你说我想我可以比编译器写出更好的NEON,因为编译器将使用它。这就是说,它通常可以写出更好的NEON比编译器(尤其是gcc和铛)。

If it were as easy as transliterating C, then the compiler would do that for you. The moment you say "I'm going to write this in NEON" you're saying "I think I can write better NEON than the compiler," because the compiler uses it too. That said, it often is possible to write better NEON than the compiler (particularly gcc and clang).

如果你准备去潜水进入那个世界(这是一个pretty清凉世界),你有一些阅读在你前面。下面是一些地方我建议:

If you're ready to go diving into that world (and it's a pretty cool world), you have some reading ahead of you. Here's some places I recommend:

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆