Optimizing ARM Cortex M3 code
Problem description
I have a C function that copies a framebuffer to FSMC RAM.
This function drags the frame rate of the game loop down to 10 FPS. I would like to know how to analyze the disassembled function: should I count the cycles of each instruction? I want to know where the CPU spends its time, and in which part of the code. I'm sure the algorithm is also a problem, because it is O(N^2).
The C function is:
void LCD_Flip()
{
    u8 i,j;
    LCD_SetCursor(0x00, 0x0000);
    LCD_WriteRegister(0x0050,0x00);//GRAM horizontal start position
    LCD_WriteRegister(0x0051,239);//GRAM horizontal end position
    LCD_WriteRegister(0x0052,0);//Vertical GRAM Start position
    LCD_WriteRegister(0x0053,319);//Vertical GRAM end position
    LCD_WriteIndex(0x0022);
    for(j=0;j<fbHeight;j++)
    {
        for(i=0;i<240;i++)
        {
            u16 color = frameBuffer[i+j*fbWidth];
            LCD_WriteData(color);
        }
    }
}
The disassembled function:
08000fd0 <LCD_Flip>:
8000fd0: b580 push {r7, lr}
8000fd2: b082 sub sp, #8
8000fd4: af00 add r7, sp, #0
8000fd6: 2000 movs r0, #0
8000fd8: 2100 movs r1, #0
8000fda: f7ff fde9 bl 8000bb0 <LCD_SetCursor>
8000fde: 2050 movs r0, #80 ; 0x50
8000fe0: 2100 movs r1, #0
8000fe2: f7ff feb5 bl 8000d50 <LCD_WriteRegister>
8000fe6: 2051 movs r0, #81 ; 0x51
8000fe8: 21ef movs r1, #239 ; 0xef
8000fea: f7ff feb1 bl 8000d50 <LCD_WriteRegister>
8000fee: 2052 movs r0, #82 ; 0x52
8000ff0: 2100 movs r1, #0
8000ff2: f7ff fead bl 8000d50 <LCD_WriteRegister>
8000ff6: 2053 movs r0, #83 ; 0x53
8000ff8: f240 113f movw r1, #319 ; 0x13f
8000ffc: f7ff fea8 bl 8000d50 <LCD_WriteRegister>
8001000: 2022 movs r0, #34 ; 0x22
8001002: f7ff fe87 bl 8000d14 <LCD_WriteIndex>
8001006: 2300 movs r3, #0
8001008: 71bb strb r3, [r7, #6]
800100a: e01b b.n 8001044 <LCD_Flip+0x74>
800100c: 2300 movs r3, #0
800100e: 71fb strb r3, [r7, #7]
8001010: e012 b.n 8001038 <LCD_Flip+0x68>
8001012: 79f9 ldrb r1, [r7, #7]
8001014: 79ba ldrb r2, [r7, #6]
8001016: 4613 mov r3, r2
8001018: 011b lsls r3, r3, #4
800101a: 1a9b subs r3, r3, r2
800101c: 011b lsls r3, r3, #4
800101e: 1a9b subs r3, r3, r2
8001020: 18ca adds r2, r1, r3
8001022: 4b0b ldr r3, [pc, #44] ; (8001050 <LCD_Flip+0x80>)
8001024: f833 3012 ldrh.w r3, [r3, r2, lsl #1]
8001028: 80bb strh r3, [r7, #4]
800102a: 88bb ldrh r3, [r7, #4]
800102c: 4618 mov r0, r3
800102e: f7ff fe7f bl 8000d30 <LCD_WriteData>
8001032: 79fb ldrb r3, [r7, #7]
8001034: 3301 adds r3, #1
8001036: 71fb strb r3, [r7, #7]
8001038: 79fb ldrb r3, [r7, #7]
800103a: 2bef cmp r3, #239 ; 0xef
800103c: d9e9 bls.n 8001012 <LCD_Flip+0x42>
800103e: 79bb ldrb r3, [r7, #6]
8001040: 3301 adds r3, #1
8001042: 71bb strb r3, [r7, #6]
8001044: 79bb ldrb r3, [r7, #6]
8001046: 2b63 cmp r3, #99 ; 0x63
8001048: d9e0 bls.n 800100c <LCD_Flip+0x3c>
800104a: 3708 adds r7, #8
800104c: 46bd mov sp, r7
800104e: bd80 pop {r7, pc}
Recommended answer
This does not exactly answer your question, but I see that you want the loops to execute fast.
Here are some tips from the book 'ARM System Developer's Guide: Designing and Optimizing System Software (The Morgan Kaufmann Series in Computer Architecture and Design)' http://www.amazon.com/ARM-System-Developers-Guide-Architecture/dp/1558608745
Chapter 5 contains a section named 'C looping structures'. Here is the summary of that section:
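Writing loops efficiently
- Use loops that count down to zero. Then the compiler does not need to allocate a register to hold the termination value, and the comparison with zero is free.
- Use unsigned loop counters by default and the continuation condition i!=0 rather than i>0. This ensures that the loop overhead is only two instructions.
- Use do-while loops rather than for loops when you know the loop will iterate at least once. This saves the compiler from checking whether the loop count is zero.
- Unroll important loops to reduce the loop overhead. Do not over-unroll. If the loop overhead is small as a proportion of the total, unrolling will increase code size and hurt cache performance.
- Try to arrange for the number of elements in arrays to be a multiple of four or eight. You can then easily unroll loops two, four, or eight times without worrying about leftover array elements.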
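As a generic illustration of the tips above (not taken from the book), the sketch below sums an array twice: once with a plain count-up for loop and once with an unsigned counter that runs down to zero in a do-while loop unrolled four times. The function names are hypothetical, and the unrolled version assumes the element count is nonzero and a multiple of four.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Baseline: count-up for loop with a compare against n on every pass. */
uint32_t sum_count_up(const uint16_t *data, size_t n)
{
    uint32_t sum = 0;
    for (size_t i = 0; i < n; i++) {
        sum += data[i];
    }
    return sum;
}

/* Count-down do-while loop, unrolled four times.
 * Assumption of this sketch: n is nonzero and a multiple of 4. */
uint32_t sum_count_down_unrolled(const uint16_t *data, size_t n)
{
    assert(n != 0 && n % 4 == 0);

    uint32_t sum = 0;
    size_t blocks = n / 4;          /* unsigned counter that runs down to zero */
    do {
        sum += data[0];
        sum += data[1];
        sum += data[2];
        sum += data[3];
        data += 4;                  /* advance to the next group of four */
    } while (--blocks != 0);        /* continuation condition is != 0 */
    return sum;
}
```

Whether the second form is actually faster depends on the compiler and optimization level, so it is worth comparing the generated assembly for both.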
Based on the summary, your inner loop might look like the one below.
unsigned int i = 240;   // Use unsigned loop counters by default
                        // and the continuation condition i!=0
do
{
    // Unroll important loops to reduce the loop overhead.
    // Each pass writes four pixels; i is decremented before use, so the
    // row is walked backwards from index 239 down to 0.
    LCD_WriteData( (u16)frameBuffer[ (--i) + (j*fbWidth) ] );
    LCD_WriteData( (u16)frameBuffer[ (--i) + (j*fbWidth) ] );
    LCD_WriteData( (u16)frameBuffer[ (--i) + (j*fbWidth) ] );
    LCD_WriteData( (u16)frameBuffer[ (--i) + (j*fbWidth) ] );
}
while ( i != 0 );       // Use do-while loops rather than for
                        // loops when you know the loop will
                        // iterate at least once
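If the pixels have to be sent in their original forward order, one possible variation keeps a separate count-down counter and walks the row with a pointer, which also moves the j*fbWidth multiplication out of the inner loop. The sketch below is host-testable and rests on several assumptions: FB_WIDTH, FB_HEIGHT and ROW_PIXELS are placeholder values (the real fbWidth and fbHeight are not given in the question), LCD_FlipPixels is a hypothetical name, and LCD_WriteData is replaced by a stand-in that records what was written. The register setup calls from LCD_Flip are omitted.

```c
#include <assert.h>
#include <stdint.h>

typedef uint16_t u16;

#define ROW_PIXELS 240u   /* pixels written per row, as in the question */
#define FB_WIDTH   240u   /* assumed row stride; placeholder for fbWidth */
#define FB_HEIGHT  100u   /* assumed row count; placeholder for fbHeight */

u16 frameBuffer[FB_WIDTH * FB_HEIGHT];

/* Host-side stand-in for the real LCD_WriteData: it only logs the pixel. */
u16 lcdLog[ROW_PIXELS * FB_HEIGHT];
uint32_t lcdCount;

void LCD_WriteData(u16 color)
{
    assert(lcdCount < sizeof lcdLog / sizeof lcdLog[0]);
    lcdLog[lcdCount++] = color;
}

/* Copies the frame in forward order with count-down, unrolled loops. */
void LCD_FlipPixels(void)
{
    const u16 *row = frameBuffer;
    unsigned int rows = FB_HEIGHT;          /* counts down to zero */
    do {
        const u16 *src = row;               /* row start computed once per row */
        unsigned int n = ROW_PIXELS / 4u;   /* 240 is a multiple of 4 */
        do {
            LCD_WriteData(src[0]);
            LCD_WriteData(src[1]);
            LCD_WriteData(src[2]);
            LCD_WriteData(src[3]);
            src += 4;
        } while (--n != 0);
        row += FB_WIDTH;                    /* step to the next row */
    } while (--rows != 0);
}
```

On the target, the stand-in would be dropped in favour of the real LCD_WriteData. Since every pixel still costs one function call, making that function inline may matter at least as much as the loop shape.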
You might also want to experiment with pragmas, e.g.:
#pragma Otime
http://www.keil.com/support/man/docs/armcc/armcc_chr1359124989673.htm
#pragma unroll(n)
http://www.keil.com/support/man/docs/armcc/armcc_chr1359124992247.htm
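The two pragmas above are documented for the Keil armcc compiler. If the project is built with GCC instead (an assumption; the question does not name the toolchain), a comparable loop hint is #pragma GCC unroll n, available in GCC 8 and later. A minimal sketch with a hypothetical checksum_row function:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Sums one row of pixels; the pragma asks GCC to unroll the loop 4 times. */
uint32_t checksum_row(const uint16_t *row, size_t n)
{
    assert(row != NULL);

    uint32_t sum = 0;
#pragma GCC unroll 4
    for (size_t i = 0; i < n; i++) {
        sum += row[i];
    }
    return sum;
}
```

The pragma is only a request to the compiler and does not change the result of the loop, so the effect has to be checked in the generated assembly with optimization enabled.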
And since this is a Cortex-M3, try to find out whether the MCU hardware lets you arrange code and data to take advantage of its Harvard architecture (I saw a 30% speed increase from doing so).
Not everything here may be applicable to your application (for example, filling a buffer in reverse order). I just wanted to draw your attention to the book and to possible points for optimization.