ç与VDSP与NEON - NEON怎么能像C一样慢? [英] C versus vDSP versus NEON - How could NEON be as slow as C?
问题描述
NEON怎么能慢为C?
How could NEON be as slow as C?
我一直在试图建立一个快速的直方图功能,将通过赋予它们的值斗传入值到范围 - 这就是范围门槛,他们是最接近。这一点是将被应用到图像,以便它必须要快(假设为640x480所以300000元件的图像阵列)。直方图范围数字的倍数(0,25,50,75,100)。输入将浮子和最终产出显然是整数
I have been trying to build a fast Histogram function that would bucket incoming values into ranges by assigning them a value - which is the range threshold they are closest to. This is something that would be applied to images so it would have to be fast (assume an image array of 640x480 so 300,000 elements) . The histogram range numbers are multiples (0,25,50,75,100) . Inputs would be float and final outputs would obviously be integers
我打开一个新的空项目(无应用程序委托),只是使用main.m文件测试,在X code以下的版本。我删除了所有链接库与加快除外。
I tested the following versions on xCode by opening a new empty project (no app delegate) and just using the main.m file. I removed all linked libraries with the exception of Accelerate.
下面是C实现:旧版本是足够的,如果再不过这里是最后的优化的逻辑。花了11S和300毫秒。
Here is the C implementation: the older version was plenty of if then but here is the final optimized logic. it took 11s and 300ms.
int main(int argc, char *argv[])
{
NSLog(@"starting");
int sizeOfArray=300000;
float* inputArray=(float*) malloc(sizeof(float)*sizeOfArray);
int* outputArray=(int*) malloc(sizeof(int)*sizeOfArray);
for (int i=0; i<sizeOfArray; ++i)
{
inputArray[i]=88.5;
}
//Assume range is [0,25,50,75,100]
int lcd=25;
for (int j=0; j<1000; ++j)// just to get some good time interval
{
for (int i=0; i<sizeOfArray; ++i)
{
//a 60.5 would give a 50. An 88.5 would give 100
outputArray[i]=roundf(inputArray[i]/lcd)*lcd;
}
}
NSLog(@"done");
}
下面是VDSP实施。即使有一些繁琐的浮动整数来回,只用了6秒!近50%的改善!
Here is the vDSP implementation. Even with some of the tedious floating to integer back and forth, it took only 6s! almost 50% improvement!
//vDSP implementation
int main(int argc, char *argv[])
{
NSLog(@"starting");
int sizeOfArray=300000;
float* inputArray=(float*) malloc(sizeof(float)*sizeOfArray);
float* outputArrayF=(float*) malloc(sizeof(float)*sizeOfArray);//vDSP requires matching of input output
int* outputArray=(int*) malloc(sizeof(int)*sizeOfArray); //rounded value to the nearest integere
float* finalOutputArrayF=(float*) malloc(sizeof(float)*sizeOfArray);
int* finalOutputArray=(int*) malloc(sizeof(int)*sizeOfArray); //to compare apples to apples scenarios output
for (int i=0; i<sizeOfArray; ++i)
{
inputArray[i]=37.0; //this will produce an final number of 25. On the other hand 37.5 would produce 50.
}
for (int j=0; j<1000; ++j)// just to get some good time interval
{
//Assume range is [0,25,50,75,100]
float lcd=25.0f;
//divide by lcd
vDSP_vsdiv(inputArray, 1, &lcd, outputArrayF, 1,sizeOfArray);
//Round to nearest integer
vDSP_vfixr32(outputArrayF, 1,outputArray, 1, sizeOfArray);
// MUST convert int to float (cannot just cast) then multiply by scalar - This step has the effect of rounding the number to the nearest lcd.
vDSP_vflt32(outputArray, 1, outputArrayF, 1, sizeOfArray);
vDSP_vsmul(outputArrayF, 1, &lcd, finalOutputArrayF, 1, sizeOfArray);
vDSP_vfix32(finalOutputArrayF, 1, finalOutputArray, 1, sizeOfArray);
}
NSLog(@"done");
}
下面是霓虹灯的实现。这是我第一次这么玩好!比VDSP较慢,耗时9秒,300毫秒这没有道理给我。无论是VDSP优于NEON优化或我做错了什么。
Here is the Neon implementation. This is my first so play nice! it was slower than vDSP and took 9 sec and 300ms which did not make sense to me. Either vDSP is better optimized than NEON or I am doing something wrong.
//NEON implementation
int main(int argc, char *argv[])
{
NSLog(@"starting");
int sizeOfArray=300000;
float* inputArray=(float*) malloc(sizeof(float)*sizeOfArray);
float* finalOutputArrayF=(float*) malloc(sizeof(float)*sizeOfArray);
for (int i=0; i<sizeOfArray; ++i)
{
inputArray[i]=37.0; //this will produce an final number of 25. On the other hand 37.5 would produce 50.
}
for (int j=0; j<1000; ++j)// just to get some good time interval
{
float32x4_t c0,c1,c2,c3;
float32x4_t e0,e1,e2,e3;
float32x4_t f0,f1,f2,f3;
//ranges of histogram buckets
float32x4_t buckets0=vdupq_n_f32(0);
float32x4_t buckets1=vdupq_n_f32(25);
float32x4_t buckets2=vdupq_n_f32(50);
float32x4_t buckets3=vdupq_n_f32(75);
float32x4_t buckets4=vdupq_n_f32(100);
//midpoints of ranges
float32x4_t thresholds1=vdupq_n_f32(12.5);
float32x4_t thresholds2=vdupq_n_f32(37.5);
float32x4_t thresholds3=vdupq_n_f32(62.5);
float32x4_t thresholds4=vdupq_n_f32(87.5);
for (int i=0; i<sizeOfArray;i+=16)
{
c0= vld1q_f32(&inputArray[i]);//load
c1= vld1q_f32(&inputArray[i+4]);//load
c2= vld1q_f32(&inputArray[i+8]);//load
c3= vld1q_f32(&inputArray[i+12]);//load
f0=buckets0;
f1=buckets0;
f2=buckets0;
f3=buckets0;
//register0
e0=vcgtq_f32(c0,thresholds1);
f0=vbslq_f32(e0, buckets1, f0);
e0=vcgtq_f32(c0,thresholds2);
f0=vbslq_f32(e0, buckets2, f0);
e0=vcgtq_f32(c0,thresholds3);
f0=vbslq_f32(e0, buckets3, f0);
e0=vcgtq_f32(c0,thresholds4);
f0=vbslq_f32(e0, buckets4, f0);
//register1
e1=vcgtq_f32(c1,thresholds1);
f1=vbslq_f32(e1, buckets1, f1);
e1=vcgtq_f32(c1,thresholds2);
f1=vbslq_f32(e1, buckets2, f1);
e1=vcgtq_f32(c1,thresholds3);
f1=vbslq_f32(e1, buckets3, f1);
e1=vcgtq_f32(c1,thresholds4);
f1=vbslq_f32(e1, buckets4, f1);
//register2
e2=vcgtq_f32(c2,thresholds1);
f2=vbslq_f32(e2, buckets1, f2);
e2=vcgtq_f32(c2,thresholds2);
f2=vbslq_f32(e2, buckets2, f2);
e2=vcgtq_f32(c2,thresholds3);
f2=vbslq_f32(e2, buckets3, f2);
e2=vcgtq_f32(c2,thresholds4);
f2=vbslq_f32(e2, buckets4, f2);
//register3
e3=vcgtq_f32(c3,thresholds1);
f3=vbslq_f32(e3, buckets1, f3);
e3=vcgtq_f32(c3,thresholds2);
f3=vbslq_f32(e3, buckets2, f3);
e3=vcgtq_f32(c3,thresholds3);
f3=vbslq_f32(e3, buckets3, f3);
e3=vcgtq_f32(c3,thresholds4);
f3=vbslq_f32(e3, buckets4, f3);
vst1q_f32(&finalOutputArrayF[i], f0);
vst1q_f32(&finalOutputArrayF[i+4], f1);
vst1q_f32(&finalOutputArrayF[i+8], f2);
vst1q_f32(&finalOutputArrayF[i+12], f3);
}
}
NSLog(@"done");
}
PS:这是我对这个第一次大规模标杆所以我试图保持它的简单(大循环,设置code不变,使用的NSLog打印开始/结束时间,只有加快框架链接)。如果这些假设都显著影响的结果,请批评。
PS: this is my first benchmarking on this scale so I tried to keep it simple (large loops, setup code constant, using NSlog to print start/end time, only accelerate framework linked). If any of these assumptions are significantly impacting the outcome, please critique.
感谢
推荐答案
首先,这是不是NEON每本身。这是内部函数。这几乎是不可能得到使用下铛或GCC内在良好NEON性能。如果你认为你需要的内部函数,你应该手工编写汇编。
First, this is not "NEON" per-se. This is intrinsics. It is almost impossible to get good NEON performance using intrinsics under clang or gcc. If you think you need intrinsics, you should hand-write the assembler.
VDSP不是更好地优化比NEON。 VDSP iOS上使用的NEON处理器。 VDSP的使用NEON的比你使用NEON是更好的优化。
vDSP is not "better optimized" than NEON. vDSP on iOS uses the NEON processor. vDSP's use of the NEON is much better optimized than your use of the NEON.
我没有通过你的内在code挖还,但麻烦的最有可能的(事实上几乎可以肯定),原因是你创建的等待状态。在汇编写作(和内在都只是组装与焊接手套写的),没有像在C写你不循环一样。你不比较相同。你需要一种新的思维方式。在汇编你可以一次做多件事情(因为你有不同的逻辑单元),但你绝对要安排在这样所有的这些东西可以并行运行的方式事情。组装好保持所有这些管道满。如果你可以看到你的code和它非常有意义,它可能是废话组装code。如果你从不重复自己,它可能是废话组装code。你需要仔细考虑到底是怎么回事成什么寄存器和多少个周期有,直到你被允许阅读。
I haven't dug through your intrinsics code yet, but the most likely (in fact almost certain) cause of trouble is that you're creating wait states. Writing in assembler (and intrinsics are just assembler written with welding gloves on), is nothing like writing in C. You don't loop the same. You don't compare the same. You need a new way of thinking. In assembly you can do more than one thing at a time (because you have different logic units), but you absolutely have to schedule things in such a way that all those things can run in parallel. Good assembly keeps all those pipelines full. If you can read your code and it makes perfect sense, it's probably crap assembly code. If you never repeat yourself, it's probably crap assembly code. You need to carefully consider what is going into what register and at how many cycles there are until you're allowed to read it.
如果它是作为音译C作为简单的,那么编译器会为你做的。你说我要去NEON写这篇那一刻你说我想我可以比编译器写出更好的NEON,因为编译器将使用它。这就是说,它通常可以写出更好的NEON比编译器(尤其是gcc和铛)。
If it were as easy as transliterating C, then the compiler would do that for you. The moment you say "I'm going to write this in NEON" you're saying "I think I can write better NEON than the compiler," because the compiler uses it too. That said, it often is possible to write better NEON than the compiler (particularly gcc and clang).
如果你准备去潜水进入那个世界(这是一个pretty清凉世界),你有一些阅读在你前面。下面是一些地方我建议:
If you're ready to go diving into that world (and it's a pretty cool world), you have some reading ahead of you. Here's some places I recommend:
- http://www.coranac.com/tonc/text/asm.htm (你真的想花一些时间与这一个)
- http://hilbert-space.de/ (整个网站。他停下书面方式方法太快。)
- 要您的具体问题,他在这里解释它:的 http://hilbert-space.de/?p=22
- http://www.coranac.com/tonc/text/asm.htm (You really want to spend some time with this one)
- http://hilbert-space.de/ (The whole site. He stopped writting way too soon.)
- To your specific question, he explains it here: http://hilbert-space.de/?p=22
所有的说 ...永远永远永远重新考虑你的算法开始。通常情况下,答案不是如何让你的循环快速计算,这如何不叫循环经常。
ALL THAT SAID... Always always always start by reconsidering your algorithm. Often the answer is not how to make your loop calculate quickly, it's how to not call the loop so often.
这篇关于ç与VDSP与NEON - NEON怎么能像C一样慢?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!