快速查找 C 数组中是否存在值? [英] Quickly find whether a value is present in a C array?

查看:23
本文介绍了快速查找 C 数组中是否存在值?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有一个带有时间关键型 ISR 的嵌入式应用程序,它需要遍历大小为 256(最好是 1024,但 256 是最小值)的数组,并检查值是否与数组内容匹配.在这种情况下,bool 将设置为 true.

I have an embedded application with a time-critical ISR that needs to iterate through an array of size 256 (preferably 1024, but 256 is the minimum) and check if a value matches the arrays contents. A bool will be set to true is this is the case.

微控制器为NXP LPC4357,ARM Cortex M4内核,编译器为GCC.我已经结合了优化级别 2(3 级较慢)并将该函数放置在 RAM 中而不是闪存中.我还使用指针算法和一个 for 循环,它进行向下计数而不是向上计数(检查 i!=0 是否比检查 i<256).总而言之,我最终的持续时间为 12.5 微秒,必须大幅减少才能可行.这是我现在使用的(伪)代码:

The microcontroller is an NXP LPC4357, ARM Cortex M4 core, and the compiler is GCC. I already have combined optimisation level 2 (3 is slower) and placing the function in RAM instead of flash. I also use pointer arithmetic and a for loop, which does down-counting instead of up (checking if i!=0 is faster than checking if i<256). All in all, I end up with a duration of 12.5 µs which has to be reduced drastically to be feasible. This is the (pseudo) code I use now:

uint32_t i;
uint32_t *array_ptr = &theArray[0];
uint32_t compareVal = 0x1234ABCD;
bool validFlag = false;

for (i=256; i!=0; i--)
{
    if (compareVal == *array_ptr++)
    {
         validFlag = true;
         break;
     }
}

执行此操作的绝对最快方法是什么?允许使用内联汇编.其他不太优雅"的技巧也是允许的.

What would be the absolute fastest way to do this? Using inline assembly is allowed. Other 'less elegant' tricks are also allowed.

推荐答案

在性能至关重要的情况下,与手动调整的汇编语言相比,C 编译器很可能不会生成最快的代码.我倾向于走阻力最小的路径——对于像这样的小程序,我只编写 asm 代码,并且很清楚执行需要多少个周期.您也许可以摆弄 C 代码并让编译器生成良好的输出,但最终可能会浪费大量时间以这种方式调整输出.编译器(尤其是来自 Microsoft 的)在过去几年中取得了长足的进步,但它们仍然不如您耳边的编译器那么聪明,因为您正在处理您的特定情况,而不仅仅是一般情况.编译器可能不会使用某些可以加速这一过程的指令(例如 LDM),并且它不太可能足够聪明来展开循环.这是一种方法,它结合了我在评论中提到的 3 个想法:循环展开、缓存预取和使用多重加载 (ldm) 指令.每个数组元素的指令周期数约为 3 个时钟,但这并未考虑内存延迟.

In situations where performance is of utmost importance, the C compiler will most likely not produce the fastest code compared to what you can do with hand tuned assembly language. I tend to take the path of least resistance - for small routines like this, I just write asm code and have a good idea how many cycles it will take to execute. You may be able to fiddle with the C code and get the compiler to generate good output, but you may end up wasting lots of time tuning the output that way. Compilers (especially from Microsoft) have come a long way in the last few years, but they are still not as smart as the compiler between your ears because you're working on your specific situation and not just a general case. The compiler may not make use of certain instructions (e.g. LDM) that can speed this up, and it's unlikely to be smart enough to unroll the loop. Here's a way to do it which incorporates the 3 ideas I mentioned in my comment: Loop unrolling, cache prefetch and making use of the multiple load (ldm) instruction. The instruction cycle count comes out to about 3 clocks per array element, but this doesn't take into account memory delays.

工作原理: ARM 的 CPU 设计在一个时钟周期内执行大部分指令,但指令是在流水线中执行的.C 编译器将尝试通过在其间交错其他指令来消除流水线延迟.当出现像原始 C 代码这样的紧密循环时,编译器将很难隐藏延迟,因为必须立即比较从内存中读取的值.我下面的代码在 2 组 4 个寄存器之间交替,以显着减少内存本身和获取数据的管道的延迟.一般来说,当处理大型数据集并且您的代码没有使用大部分或全部可用寄存器时,您将无法获得最佳性能.

Theory of operation: ARM's CPU design executes most instructions in one clock cycle, but the instructions are executed in a pipeline. C compilers will try to eliminate the pipeline delays by interleaving other instructions in between. When presented with a tight loop like the original C code, the compiler will have a hard time hiding the delays because the value read from memory must be immediately compared. My code below alternates between 2 sets of 4 registers to significantly reduce the delays of the memory itself and the pipeline fetching the data. In general, when working with large data sets and your code doesn't make use of most or all of the available registers, then you're not getting maximum performance.

; r0 = count, r1 = source ptr, r2 = comparison value

   stmfd sp!,{r4-r11}   ; save non-volatile registers
   mov r3,r0,LSR #3     ; loop count = total count / 8
   pld [r1,#128]
   ldmia r1!,{r4-r7}    ; pre load first set
loop_top:
   pld [r1,#128]
   ldmia r1!,{r8-r11}   ; pre load second set
   cmp r4,r2            ; search for match
   cmpne r5,r2          ; use conditional execution to avoid extra branch instructions
   cmpne r6,r2
   cmpne r7,r2
   beq found_it
   ldmia r1!,{r4-r7}    ; use 2 sets of registers to hide load delays
   cmp r8,r2
   cmpne r9,r2
   cmpne r10,r2
   cmpne r11,r2
   beq found_it
   subs r3,r3,#1        ; decrement loop count
   bne loop_top
   mov r0,#0            ; return value = false (not found)
   ldmia sp!,{r4-r11}   ; restore non-volatile registers
   bx lr                ; return
found_it:
   mov r0,#1            ; return true
   ldmia sp!,{r4-r11}
   bx lr

更新:评论中有很多怀疑论者认为我的经历是轶事/毫无价值并且需要证据.我使用 GCC 4.8(来自 Android NDK 9C)通过优化 -O2 生成以下输出(所有优化都打开包括循环展开).我编译了上面问题中提供的原始 C 代码.以下是 GCC 生成的内容:

Update: There are a lot of skeptics in the comments who think that my experience is anecdotal/worthless and require proof. I used GCC 4.8 (from the Android NDK 9C) to generate the following output with optimization -O2 (all optimizations turned on including loop unrolling). I compiled the original C code presented in the question above. Here's what GCC produced:

.L9: cmp r3, r0
     beq .L8
.L3: ldr r2, [r3, #4]!
     cmp r2, r1
     bne .L9
     mov r0, #1
.L2: add sp, sp, #1024
     bx  lr
.L8: mov r0, #0
     b .L2

GCC 的输出不仅没有展开循环,而且在 LDR 之后的停顿上浪费了一个时钟.每个阵列元素至少需要 8 个时钟.它在使用地址来知道何时退出循环方面做得很好,但是在这段代码中找不到编译器能够做的所有神奇的事情.我没有在目标平台上运行代码(我没有自己的),但是任何在 ARM 代码性能方面有经验的人都可以看到我的代码更快.

GCC's output not only doesn't unroll the loop, but also wastes a clock on a stall after the LDR. It requires at least 8 clocks per array element. It does a good job of using the address to know when to exit the loop, but all of the magical things compilers are capable of doing are nowhere to be found in this code. I haven't run the code on the target platform (I don't own one), but anyone experienced in ARM code performance can see that my code is faster.

更新 2:我让 Microsoft 的 Visual Studio 2013 SP2 有机会用代码做得更好.它能够使用 NEON 指令来向量化我的数组初始化,但是 OP 编写的线性值搜索结果类似于 GCC 生成的内容(我重命名了标签以使其更具可读性):

Update 2: I gave Microsoft's Visual Studio 2013 SP2 a chance to do better with the code. It was able to use NEON instructions to vectorize my array initialization, but the linear value search as written by the OP came out similar to what GCC generated (I renamed the labels to make it more readable):

loop_top:
   ldr  r3,[r1],#4  
   cmp  r3,r2  
   beq  true_exit
   subs r0,r0,#1 
   bne  loop_top
false_exit: xxx
   bx   lr
true_exit: xxx
   bx   lr

正如我所说,我不拥有 OP 的确切硬件,但我将在 3 个不同版本的 nVidia Tegra 3 和 Tegra 4 上测试性能,并很快在此处发布结果.

As I said, I don't own the OP's exact hardware, but I will be testing the performance on an nVidia Tegra 3 and Tegra 4 of the 3 different versions and post the results here soon.

更新 3:我在 Tegra 3 和 Tegra 4(Surface RT、Surface RT 2)上运行我的代码和 Microsoft 编译的 ARM 代码.我运行了 1000000 次循环迭代,但未能找到匹配项,因此所有内容都在缓存中并且易于测量.

Update 3: I ran my code and Microsoft's compiled ARM code on a Tegra 3 and Tegra 4 (Surface RT, Surface RT 2). I ran 1000000 iterations of a loop which fails to find a match so that everything is in cache and it's easy to measure.

             My Code       MS Code
Surface RT    297ns         562ns
Surface RT 2  172ns         296ns  

在这两种情况下,我的代码运行速度几乎是原来的两倍.大多数现代 ARM CPU 可能会给出类似的结果.

In both cases my code runs almost twice as fast. Most modern ARM CPUs will probably give similar results.

这篇关于快速查找 C 数组中是否存在值?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆