Optimizing ARM Cortex M3 code

Views: 168 | Published: 2016/5/29 14:50:25 | Tags: assembly, arm, disassembly, stm32

Problem description

I have a C function that copies a framebuffer to FSMC RAM.

This function drags the frame rate of the game loop down to 10 FPS. I would like to know how to analyze the disassembled function: should I count the cycles of each instruction? I want to know where the CPU spends its time, and in which part of the function. I'm sure the algorithm is also a problem, because it's O(N^2).

The C function is:

void LCD_Flip()
{

    u8  i,j;


    LCD_SetCursor(0x00, 0x0000);
    LCD_WriteRegister(0x0050,0x00);//GRAM horizontal start position
    LCD_WriteRegister(0x0051,239);//GRAM horizontal end position
    LCD_WriteRegister(0x0052,0);//Vertical GRAM Start position
    LCD_WriteRegister(0x0053,319);//Vertical GRAM end position
    LCD_WriteIndex(0x0022);

    for(j=0;j<fbHeight;j++)
    {
        for(i=0;i<240;i++)
        {
            u16 color = frameBuffer[i+j*fbWidth];
            LCD_WriteData(color);

        }
    }

}

Disassembled function:

08000fd0 <LCD_Flip>:
 8000fd0:   b580        push    {r7, lr}
 8000fd2:   b082        sub sp, #8
 8000fd4:   af00        add r7, sp, #0
 8000fd6:   2000        movs    r0, #0
 8000fd8:   2100        movs    r1, #0
 8000fda:   f7ff fde9   bl  8000bb0 <LCD_SetCursor>
 8000fde:   2050        movs    r0, #80 ; 0x50
 8000fe0:   2100        movs    r1, #0
 8000fe2:   f7ff feb5   bl  8000d50 <LCD_WriteRegister>
 8000fe6:   2051        movs    r0, #81 ; 0x51
 8000fe8:   21ef        movs    r1, #239    ; 0xef
 8000fea:   f7ff feb1   bl  8000d50 <LCD_WriteRegister>
 8000fee:   2052        movs    r0, #82 ; 0x52
 8000ff0:   2100        movs    r1, #0
 8000ff2:   f7ff fead   bl  8000d50 <LCD_WriteRegister>
 8000ff6:   2053        movs    r0, #83 ; 0x53
 8000ff8:   f240 113f   movw    r1, #319    ; 0x13f
 8000ffc:   f7ff fea8   bl  8000d50 <LCD_WriteRegister>
 8001000:   2022        movs    r0, #34 ; 0x22
 8001002:   f7ff fe87   bl  8000d14 <LCD_WriteIndex>
 8001006:   2300        movs    r3, #0
 8001008:   71bb        strb    r3, [r7, #6]
 800100a:   e01b        b.n 8001044 <LCD_Flip+0x74>
 800100c:   2300        movs    r3, #0
 800100e:   71fb        strb    r3, [r7, #7]
 8001010:   e012        b.n 8001038 <LCD_Flip+0x68>
 8001012:   79f9        ldrb    r1, [r7, #7]
 8001014:   79ba        ldrb    r2, [r7, #6]
 8001016:   4613        mov r3, r2
 8001018:   011b        lsls    r3, r3, #4
 800101a:   1a9b        subs    r3, r3, r2
 800101c:   011b        lsls    r3, r3, #4
 800101e:   1a9b        subs    r3, r3, r2
 8001020:   18ca        adds    r2, r1, r3
 8001022:   4b0b        ldr r3, [pc, #44]   ; (8001050 <LCD_Flip+0x80>)
 8001024:   f833 3012   ldrh.w  r3, [r3, r2, lsl #1]
 8001028:   80bb        strh    r3, [r7, #4]
 800102a:   88bb        ldrh    r3, [r7, #4]
 800102c:   4618        mov r0, r3
 800102e:   f7ff fe7f   bl  8000d30 <LCD_WriteData>
 8001032:   79fb        ldrb    r3, [r7, #7]
 8001034:   3301        adds    r3, #1
 8001036:   71fb        strb    r3, [r7, #7]
 8001038:   79fb        ldrb    r3, [r7, #7]
 800103a:   2bef        cmp r3, #239    ; 0xef
 800103c:   d9e9        bls.n   8001012 <LCD_Flip+0x42>
 800103e:   79bb        ldrb    r3, [r7, #6]
 8001040:   3301        adds    r3, #1
 8001042:   71bb        strb    r3, [r7, #6]
 8001044:   79bb        ldrb    r3, [r7, #6]
 8001046:   2b63        cmp r3, #99 ; 0x63
 8001048:   d9e0        bls.n   800100c <LCD_Flip+0x3c>
 800104a:   3708        adds    r7, #8
 800104c:   46bd        mov sp, r7
 800104e:   bd80        pop {r7, pc}

Solution

This doesn't exactly answer your question, but I see you are aiming for fast execution of the loops.

Here are some tips from the book:

ARM System Developer's Guide: Designing and Optimizing System Software (The Morgan Kaufmann Series in Computer Architecture and Design) http://www.amazon.com/ARM-System-Developers-Guide-Architecture/dp/1558608745

Chapter 5 contains a section named 'C Looping Structures'. Here is the summary of that section:

Writing Loops Efficiently

Use loops that count down to zero. Then the compiler does not need to allocate a register to hold the termination value, and the comparison with zero is free.
Use unsigned loop counters by default and the continuation condition i!=0 rather than i>0. This will ensure that the loop overhead is only two instructions.
Use do-while loops rather than for loops when you know the loop will iterate at least once. This saves the compiler checking to see if the loop count is zero.
Unroll important loops to reduce the loop overhead. Do not overunroll. If the loop overhead is small as a proportion of the total, then unrolling will increase code size and hurt the performance of the cache.
Try to arrange that the number of elements in arrays are multiples of four or eight. You can then unroll loops easily by two, four, or eight times without worrying about the leftover array elements.
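
As a minimal sketch of the first three tips, the two functions below sum the same array. The names `sum_count_up` and `sum_count_down` are hypothetical and do not come from the book or from the question. The second function assumes `n` is at least 1, which is what makes the do-while form safe.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: conventional count-up for loop.
 * The loop has to keep both the counter and the limit n around. */
static uint32_t sum_count_up(const uint16_t *data, unsigned int n)
{
    uint32_t sum = 0;
    for (unsigned int i = 0; i < n; i++) {
        sum += data[i];
    }
    return sum;
}

/* Hypothetical helper: unsigned counter that counts down to zero
 * inside a do-while loop. Assumes n >= 1 (checked by the assert). */
static uint32_t sum_count_down(const uint16_t *data, unsigned int n)
{
    uint32_t sum = 0;
    unsigned int i = n;

    assert(n != 0u);
    do {
        sum += *data++;      /* walk the array with a pointer */
    } while (--i != 0u);     /* decrement and test against zero */
    return sum;
}
```

Both functions return the same result for any non-empty array. Whether the count-down form is actually faster depends on the compiler and the optimization level, so it is worth comparing the generated assembly for both.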

Based on this summary, your inner loop might look like the following.

unsigned int i = 240;     // Use unsigned loop counters by default
                          // and the continuation condition i!=0.
                          // 240 is a multiple of 4, so the unrolled
                          // loop ends exactly at i == 0.

do
{
    // Unroll important loops to reduce the loop overhead
    LCD_WriteData( (u16)frameBuffer[ (--i) + (j*fbWidth) ] );
    LCD_WriteData( (u16)frameBuffer[ (--i) + (j*fbWidth) ] );
    LCD_WriteData( (u16)frameBuffer[ (--i) + (j*fbWidth) ] );
    LCD_WriteData( (u16)frameBuffer[ (--i) + (j*fbWidth) ] );
}
while ( i != 0 ); // Use do-while loops rather than for
                  // loops when you know the loop will
                  // iterate at least once

You might also want to experiment with pragmas, e.g.:

#pragma Otime

http://www.keil.com/support/man/docs/armccref/armccref_bcfcdbic.htm

#pragma unroll(n)

http://www.keil.com/support/man/docs/armccref/armccref_cjacacfe.htm

Not everything may be applicable to your application (for example, filling a buffer in reverse order). I just wanted to draw your attention to the book and to possible points for optimization.
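
If the pixels have to be sent in forward order, one possible variant keeps the count-down counter but walks a pointer forward through the row. This is a sketch, not the book's code. `LCD_WriteRow` is a hypothetical helper name, and `LCD_WriteData` is replaced here by a stub that records its arguments so the sketch can be compiled and checked on a host. On the target, the real driver function would be used instead. The sketch assumes the row width is a nonzero multiple of four.

```c
#include <assert.h>
#include <stdint.h>

typedef uint16_t u16;

/* Host-side stub standing in for the real LCD_WriteData (assumption):
 * it records the first 240 pixels so the write order can be inspected. */
static u16 lcd_log[240];
static unsigned int lcd_log_count = 0;

static void LCD_WriteData(u16 color)
{
    if (lcd_log_count < 240u) {
        lcd_log[lcd_log_count] = color;
    }
    lcd_log_count++;
}

/* Hypothetical helper: sends one row of pixels in forward order.
 * width must be a nonzero multiple of 4 (checked by the assert). */
static void LCD_WriteRow(const u16 *row, unsigned int width)
{
    unsigned int blocks = width / 4u;   /* counter that runs down to zero */

    assert(width != 0u && (width % 4u) == 0u);
    do {
        /* Unrolled by four; the pointer advances forward through the row. */
        LCD_WriteData(*row++);
        LCD_WriteData(*row++);
        LCD_WriteData(*row++);
        LCD_WriteData(*row++);
    } while (--blocks != 0u);
}
```

With this helper, the inner loop of the function in the question could become a single call such as `LCD_WriteRow(&frameBuffer[j * fbWidth], 240);`, which also moves the `j * fbWidth` computation out of the per-pixel path.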
