快速查找C数组中是否存在值? [英] Quickly find whether a value is present in a C array?

查看:167
本文介绍了快速查找C数组中是否存在值?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有一个嵌入式应用程序,它具有对时间要求严格的ISR,它需要遍历大小为256(最好是1024,但最小为256)的数组,并检查值是否与数组内容匹配.在这种情况下,bool将设置为true.

I have an embedded application with a time-critical ISR that needs to iterate through an array of size 256 (preferably 1024, but 256 is the minimum) and check if a value matches the arrays contents. A bool will be set to true is this is the case.

微控制器是NXP LPC4357,ARM Cortex M4内核,而编译器是GCC.我已经组合了优化级别2(速度较慢3),并将函数放置在RAM中而不是闪存中.我还使用指针算术和for循环,该循环执行递减计数而不是递增计数(检查i!=0是否比检查i<256更快).总而言之,我的持续时间最终为12.5µs,必须将其大幅缩短以确保可行.这是我现在使用的(伪)代码:

The microcontroller is an NXP LPC4357, ARM Cortex M4 core, and the compiler is GCC. I already have combined optimisation level 2 (3 is slower) and placing the function in RAM instead of flash. I also use pointer arithmetic and a for loop, which does down-counting instead of up (checking if i!=0 is faster than checking if i<256). All in all, I end up with a duration of 12.5 µs which has to be reduced drastically to be feasible. This is the (pseudo) code I use now:

uint32_t i;
uint32_t *array_ptr = &theArray[0];
uint32_t compareVal = 0x1234ABCD;
bool validFlag = false;

for (i=256; i!=0; i--)
{
    if (compareVal == *array_ptr++)
    {
         validFlag = true;
         break;
     }
}

绝对最快的方法是什么?允许使用内联汇编.也可以使用其他不太优雅"的技巧.

What would be the absolute fastest way to do this? Using inline assembly is allowed. Other 'less elegant' tricks are also allowed.

推荐答案

在性能至关重要的情况下,与手动调试汇编语言相比,C编译器很可能不会生成最快的代码.我倾向于走阻力最小的路径-对于像这样的小型例程,我只编写了asm代码,并且很好地知道执行一个循环需要多少个周期.您可能可以摆弄C代码并让编译器生成良好的输出,但是最终可能会浪费大量时间来调整输出.编译器(尤其是来自Microsoft的编译器)在过去的几年中取得了长足的进步,但是它们仍不如您耳边的编译器那么聪明,因为您正在处理的是特定情况,而不仅是一般情况.编译器可能未使用某些指令(例如LDM)来加快速度,因此不太可能聪明地展开循环.这是一种结合了我在注释中提到的3个想法的方法:循环展开,缓存预取和使用多重加载(ldm)指令.每个数组元素的指令周期数约为3个时钟,但这并未考虑内存延迟.

In situations where performance is of utmost importance, the C compiler will most likely not produce the fastest code compared to what you can do with hand tuned assembly language. I tend to take the path of least resistance - for small routines like this, I just write asm code and have a good idea how many cycles it will take to execute. You may be able to fiddle with the C code and get the compiler to generate good output, but you may end up wasting lots of time tuning the output that way. Compilers (especially from Microsoft) have come a long way in the last few years, but they are still not as smart as the compiler between your ears because you're working on your specific situation and not just a general case. The compiler may not make use of certain instructions (e.g. LDM) that can speed this up, and it's unlikely to be smart enough to unroll the loop. Here's a way to do it which incorporates the 3 ideas I mentioned in my comment: Loop unrolling, cache prefetch and making use of the multiple load (ldm) instruction. The instruction cycle count comes out to about 3 clocks per array element, but this doesn't take into account memory delays.

工作原理: ARM的CPU设计在一个时钟周期内执行大多数指令,但这些指令是在流水线中执行的. C编译器将尝试通过在其间插入其他指令来消除流水线延迟.当出现像原始C代码这样的紧密循环时,编译器将很难隐藏延迟,因为必须立即比较从内存读取的值.我下面的代码在2组4个寄存器之间切换,以显着减少内存本身和流水线获取数据的延迟.通常,在处理大型数据集时,如果您的代码没有利用大多数或所有可用寄存器,那么您将无法获得最佳性能.

Theory of operation: ARM's CPU design executes most instructions in one clock cycle, but the instructions are executed in a pipeline. C compilers will try to eliminate the pipeline delays by interleaving other instructions in between. When presented with a tight loop like the original C code, the compiler will have a hard time hiding the delays because the value read from memory must be immediately compared. My code below alternates between 2 sets of 4 registers to significantly reduce the delays of the memory itself and the pipeline fetching the data. In general, when working with large data sets and your code doesn't make use of most or all of the available registers, then you're not getting maximum performance.

; r0 = count, r1 = source ptr, r2 = comparison value

   stmfd sp!,{r4-r11}   ; save non-volatile registers
   mov r3,r0,LSR #3     ; loop count = total count / 8
   pld [r1,#128]
   ldmia r1!,{r4-r7}    ; pre load first set
loop_top:
   pld [r1,#128]
   ldmia r1!,{r8-r11}   ; pre load second set
   cmp r4,r2            ; search for match
   cmpne r5,r2          ; use conditional execution to avoid extra branch instructions
   cmpne r6,r2
   cmpne r7,r2
   beq found_it
   ldmia r1!,{r4-r7}    ; use 2 sets of registers to hide load delays
   cmp r8,r2
   cmpne r9,r2
   cmpne r10,r2
   cmpne r11,r2
   beq found_it
   subs r3,r3,#1        ; decrement loop count
   bne loop_top
   mov r0,#0            ; return value = false (not found)
   ldmia sp!,{r4-r11}   ; restore non-volatile registers
   bx lr                ; return
found_it:
   mov r0,#1            ; return true
   ldmia sp!,{r4-r11}
   bx lr

更新: 评论中有很多怀疑者认为我的经历是轶事/毫无价值,需要证明.我使用GCC 4.8(来自Android NDK 9C)通过优化-O2(打开了所有优化包括循环展开)来生成以下输出.我编译了上面问题中显示的原始C代码.这是GCC产生的:

Update: There are a lot of skeptics in the comments who think that my experience is anecdotal/worthless and require proof. I used GCC 4.8 (from the Android NDK 9C) to generate the following output with optimization -O2 (all optimizations turned on including loop unrolling). I compiled the original C code presented in the question above. Here's what GCC produced:

.L9: cmp r3, r0
     beq .L8
.L3: ldr r2, [r3, #4]!
     cmp r2, r1
     bne .L9
     mov r0, #1
.L2: add sp, sp, #1024
     bx  lr
.L8: mov r0, #0
     b .L2

GCC的输出不仅不会展开循环,还浪费了LDR之后的停顿时钟.每个阵列元素至少需要8个时钟.使用地址知道何时退出循环的确很不错,但是在此代码中找不到编译器能够执行的所有神奇操作.我没有在目标平台上运行代码(我没有代码),但是任何有ARM代码性能经验的人都可以看到我的代码更快.

GCC's output not only doesn't unroll the loop, but also wastes a clock on a stall after the LDR. It requires at least 8 clocks per array element. It does a good job of using the address to know when to exit the loop, but all of the magical things compilers are capable of doing are nowhere to be found in this code. I haven't run the code on the target platform (I don't own one), but anyone experienced in ARM code performance can see that my code is faster.

更新2: 我给了Microsoft的Visual Studio 2013 SP2一个机会,以使代码做得更好.它能够使用NEON指令对数组初始化进行矢量化处理,但由OP编写的线性值搜索与GCC生成的内容相似(我将标签重命名以使其更具可读性):

Update 2: I gave Microsoft's Visual Studio 2013 SP2 a chance to do better with the code. It was able to use NEON instructions to vectorize my array initialization, but the linear value search as written by the OP came out similar to what GCC generated (I renamed the labels to make it more readable):

loop_top:
   ldr  r3,[r1],#4  
   cmp  r3,r2  
   beq  true_exit
   subs r0,r0,#1 
   bne  loop_top
false_exit: xxx
   bx   lr
true_exit: xxx
   bx   lr

正如我所说,我不拥有OP的确切硬件,但是我将在3种不同版本的nVidia Tegra 3和Tegra 4上测试性能,并将结果发布在这里.

As I said, I don't own the OP's exact hardware, but I will be testing the performance on an nVidia Tegra 3 and Tegra 4 of the 3 different versions and post the results here soon.

更新3: 我在Tegra 3和Tegra 4(Surface RT,Surface RT 2)上运行了我的代码和Microsoft编译的ARM代码.我对一个循环进行了1000000次迭代,但找不到匹配项,因此所有内容都在缓存中,并且很容易测量.

Update 3: I ran my code and Microsoft's compiled ARM code on a Tegra 3 and Tegra 4 (Surface RT, Surface RT 2). I ran 1000000 iterations of a loop which fails to find a match so that everything is in cache and it's easy to measure.

             My Code       MS Code
Surface RT    297ns         562ns
Surface RT 2  172ns         296ns  

在两种情况下,我的代码运行速度几乎都快一倍.大多数现代ARM CPU可能会给出相似的结果.

In both cases my code runs almost twice as fast. Most modern ARM CPUs will probably give similar results.

这篇关于快速查找C数组中是否存在值?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆