Optimising this C (AVR) code

Views: 203  Published: 2016/7/18 20:53:14  Tags: c optimization assembly avr

Question

I have an interrupt handler that just isn't running fast enough for what I want to do. Basically, I'm using it to generate sine waves by outputting a value from a lookup table to a PORT on an AVR microcontroller but, unfortunately, this isn't happening fast enough for me to get the wave frequency that I want. I was told that I should look at implementing it in assembly, as the compiler-generated assembly might be slightly inefficient and could perhaps be optimised, but after looking at the assembly code I really can't see what I could do any better.

This is the C code:

const uint8_t amplitudes60[60] = {127, 140, 153, 166, 176, 191, 202, 212, 221, 230, 237, 243, 248, 251, 253, 254, 253, 251, 248, 243, 237, 230, 221, 212, 202, 191, 179, 166, 153, 140, 127, 114, 101, 88, 75, 63, 52, 42, 33, 24, 17, 11, 6, 3, 1, 0, 1, 3, 6, 11, 17, 24, 33, 42, 52, 63, 75, 88, 101, 114};
const uint8_t amplitudes13[13] = {127, 176,  221, 248,  202, 153, 101, 52, 17,  1, 6,  33,  75};
const uint8_t amplitudes10[10] = {127, 176,   248,  202, 101, 52, 17,  1,  33,  75};

volatile uint8_t numOfAmps = 60;
volatile uint8_t *amplitudes = amplitudes60;
volatile uint8_t amplitudePlace = 0; 

ISR(TIMER1_COMPA_vect) 
{
    PORTD = amplitudes[amplitudePlace];

    amplitudePlace++; 

    if(amplitudePlace == numOfAmps)
    {
        amplitudePlace = 0;
    }

}

amplitudes and numOfAmps are both changed by another interrupt routine that runs much more slowly than this one (it basically runs to change the frequency being played). At the end of the day I won't be using those exact arrays, but it will be a very similar setup. I'll most likely have one array with 60 values and another with just 30. This is because I'm building a frequency sweeper: at the lower frequencies I can afford to give it more samples, as I have more clock cycles to play with, but at the higher frequencies I'm very much strapped for time.

I do realise that I can get it to work with a lower sampling rate, but I don't want to go under 30 samples per period. I don't think having the pointer to the array makes it any slower, as the assembly to get a value from an array and the assembly to get a value from a pointer to an array seem to be the same (which makes sense).

At the highest frequency that I have to produce, I've been told I should be able to get it working with about 30 samples per sine wave period. At the moment, with 30 samples, the fastest it will run is about half the required maximum frequency, which I think means that my interrupt needs to run twice as fast.

So that code, when simulated, takes 65 cycles to complete. Again, I've been told I should be able to get it down to about 30 cycles at best.
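To put those cycle counts in perspective: the highest reachable output frequency is bounded by the clock rate divided by the product of cycles per sample and samples per period. The sketch below works that out. The 16 MHz clock is purely an assumed figure (the question never states the clock speed), and `max_sine_hz` is a hypothetical helper name.

```c
#include <stdint.h>

/* Upper bound on the sine frequency when every sample costs
 * cycles_per_sample CPU cycles and one period uses samples_per_period
 * samples. Integer division rounds down. */
static uint32_t max_sine_hz(uint32_t f_cpu_hz,
                            uint32_t cycles_per_sample,
                            uint32_t samples_per_period)
{
    return f_cpu_hz / (cycles_per_sample * samples_per_period);
}

/* ASSUMPTION: 16 MHz is an illustrative clock, not a value from the question. */
#define ASSUMED_F_CPU_HZ 16000000UL
```

With the assumed clock, `max_sine_hz(ASSUMED_F_CPU_HZ, 65, 30)` gives 8205 Hz and `max_sine_hz(ASSUMED_F_CPU_HZ, 30, 30)` gives 17777 Hz, so roughly halving the cycles per sample roughly doubles the frequency that can be reached.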

This is the ASM code produced, with my thinking of what each line does next to it:

ISR(TIMER1_COMPA_vect) 
{
push    r1
push    r0
in      r0, 0x3f        ; save status reg
push    r0
eor     r1, r1      ; generates a 0 in r1, used much later
push    r24
push    r25
push    r30
push    r31         ; all regs saved


PORTD = amplitudes[amplitudePlace];
lds     r24, 0x00C8     ; r24 <- amplitudePlace I’m pretty sure
lds     r30, 0x00B4 ; these two lines load in the address of the 
lds     r31, 0x00B5 ; array, which would explain why it'd be a 16 bit number
                    ; if the atmega8 uses 16 bit addresses


add     r30, r24            ; aha, this must be getting the ADDRESS OF THE element 
adc     r31, r1             ; at amplitudePlace in the array.  

ld      r24, Z              ; Z low is r30, makes sense. I think this is loading
                            ; the memory located at the address in r30/r31 and
                            ; putting it into r24

out     0x12, r24           ; fairly sure this is putting the amplitude into PORTD

amplitudePlace++; 
lds     r24, 0x011C     ; r24 <- amplitudePlace
subi    r24, 0xFF       ; subi is subtract immediate. 0xFF = 255, so I'm
                        ; thinking with an 8 bit value x, x + 1 = x - 255 (mod 256);
                        ; I might just trust that the compiler knows what it's 
                        ; doing here rather than try to change it to an ADDI 

sts     0x011C, r24     ; puts the new value back to the address of the
                        ; variable

if(amplitudePlace == numOfAmps)
lds     r25, 0x00C8 ; r25 <- amplitudePlace
lds     r24, 0x00B3 ; r24 <- numOfAmps 

cp      r25, r24        ; compares them 
brne    .+4             ; 0xdc <__vector_6+0x54>
        {
                amplitudePlace = 0;
                    sts     0x011C, r1 ; oh, this is why r1 was set to 0 earlier
        }


}

pop     r31             ; restores the registers
pop     r30
pop     r25
pop     r24
pop     r19
pop     r18
pop     r0
out     0x3f, r0        ; 63
pop     r0
pop     r1
reti

Apart from maybe using fewer registers in the interrupt, so that I have fewer push/pops, I really can't see where this assembly code is inefficient.

My only other thought is that maybe the if statement could be gotten rid of if I could work out how to get an n-bit int datatype in C, so that the number wraps around when it reaches the end. By this I mean I would have 2^n samples and then have the amplitudePlace variable just keep counting up, so that when it reaches 2^n it overflows and is reset to zero.
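One way the n-bit counter idea could be written in standard C is an unsigned bit-field, which is reduced modulo 2^n whenever a value is stored into it. This is only a sketch: the 32-entry table size (n = 5) and the names `wrap5` and `next_index` are illustrative assumptions, not something from the code above.

```c
#include <stdint.h>

/* ASSUMPTION: a 32-entry table (n = 5); the type and function names are
 * illustrative only. */
struct wrap5 {
    unsigned int idx : 5;   /* holds 0..31; storing 32 wraps to 0 */
};

/* Return the index to use for this sample, then advance the counter. */
static uint8_t next_index(struct wrap5 *w)
{
    uint8_t current = (uint8_t)w->idx;
    w->idx++;               /* 31 + 1 is reduced modulo 32 on store */
    return current;
}
```

A compiler will typically implement the store as an increment followed by an AND with 31, so this is unlikely to be cheaper than masking the index by hand; it mainly documents the intent in the type.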

I did try simulating the code without the if part at all. While that did improve the speed, it only took about 10 cycles off, leaving about 55 cycles for one execution, which unfortunately still isn't quite fast enough. So I do need to optimise the code even further, which is hard considering that without the if it's only 2 lines!

My only other real thought is to see whether I can store the static lookup tables somewhere that takes fewer clock cycles to access. I think the LDS instructions used to access the array all take 2 cycles, so I probably wouldn't save much time there, but at this stage I'm willing to try anything.

I'm totally at a loss as to where to go from here. I can't see how I could make my C code any more efficient, but I'm fairly new to this sort of thing, so I could be missing something. I would love any sort of help. I realise this is a pretty particular and involved problem, and normally I'd try to avoid asking that sort of question here, but I've been working on this for ages and am at a total loss, so I'll really take any help that I can get.

Recommended answer

I can see a few areas to start working on, listed in no particular order:

1. Reduce the number of registers to push, as each push/pop pair takes four cycles. For example, avr-gcc allows you to remove a few registers from its register allocator, so you can use them for register variables in that single ISR and be sure they still contain the value from last time. You might also get rid of the pushing of r1 and the eor r1,r1 if your program never sets r1 to anything but 0.

2. Use a local temporary variable for the new value of the array index, to save unnecessary load and store instructions on that volatile variable. Something like this:

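volatile uint8_t amplitudePlace;

ISR(TIMER1_COMPA_vect) {
    uint8_t place = amplitudePlace;   /* single load of the volatile */
    /* ... do all your work with place, so that amplitudePlace
       is not read from or written to memory again ... */
    amplitudePlace = place;           /* single store back */
}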

3. Count backwards from 59 to 0 instead of from 0 to 59, to avoid the separate comparison instruction (the comparison with 0 happens anyway in the subtraction). Pseudo code:

     sub  rXX,1
     goto Foo if no borrow (result >= 0)
     movi rXX, 59
Foo:

instead of

     add  rXX,1
     compare rXX with 60
     goto Foo if <
     movi rXX, 0
Foo:

4. Perhaps use pointers and pointer comparisons (with precalculated values!) instead of array indexes. Whether this or counting backwards is more efficient needs to be checked. Maybe align the arrays to 256-byte boundaries and use only 8-bit registers for the pointers, to save on loading and storing the upper 8 bits of the addresses. (If you are running out of SRAM, you can still fit the contents of four of those 60-byte arrays into one 256-byte array and keep the advantage of every address consisting of 8 constant high bits and 8 variable low bits.)

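uint8_t array[60];
uint8_t *idx = array; /* shortcut for &array[0] */
const uint8_t *max_idx = &array[59]; /* precalculated address of the last element */

ISR(TIMER1_COMPA_vect) {
    PORTD = *idx;
    ++idx;
    if (idx > max_idx) {
        idx = array; /* wrap back to the first element */
    }
}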

The problem is that pointers are 16 bits wide, whereas your simple array index was only 8 bits wide. One trick that might help: lay out your arrays so that the upper 8 bits of their addresses are constant (in assembly code, hi8(array)), and deal only with the lower 8 bits, which are the ones that actually change, in the ISR. That does mean writing assembly code, though. The generated assembly code from above might be a good starting point for writing that version of the ISR in assembly.

5. If feasible from a timing point of view, adjust the sample buffer size to a power of 2, so that the if-reset-to-zero part can be replaced with a simple i = (i+1) & ((1 << POWER)-1);. If you want to go with the 8-bit/8-bit address split proposed in 4., going all the way to 256 for the power of two (and duplicating sample data as necessary to fill the 256-byte buffer) will even save you the AND instruction after the ADD.

6. If the ISR only uses instructions that do not affect the status register, stop pushing and popping SREG.
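Points 2 and 3 can be combined in plain C. The following is a minimal sketch, not the code that was actually measured: the ISR body is written as an ordinary function so it can be exercised off-target, and `port_shadow`, `sample_place`, `isr_body` and `NUM_SAMPLES` are hypothetical names, with `port_shadow` standing in for PORTD.

```c
#include <stdint.h>

#define NUM_SAMPLES 60u

/* ASSUMPTION: all names here are illustrative; port_shadow stands in
 * for PORTD so the logic can be compiled and checked on a host. */
static volatile uint8_t sample_place = NUM_SAMPLES - 1u; /* counts 59..0 */
static volatile uint8_t port_shadow;

/* Body of the timer ISR: one load and one store of the volatile index,
 * and a wrap test against zero instead of against numOfAmps. */
static void isr_body(const uint8_t *table)
{
    uint8_t place = sample_place;      /* single load of the volatile */

    port_shadow = table[place];

    if (place == 0u) {
        place = NUM_SAMPLES - 1u;      /* wrap back to the top */
    } else {
        place--;
    }

    sample_place = place;              /* single store back */
}
```

This walks the table from the last entry to the first, so the table would need to be stored in reverse if the playback order matters. Whether the compiler folds the test against zero into the decrement still has to be checked in the generated listing.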

General

The following might come in handy, especially for manually checking the rest of the assembly code against your assumptions:

firmware-%.lss: firmware-%.elf
        $(OBJDUMP) -h -S $< > $@

This generates a complete, commented assembly language listing of the whole firmware image. You can use it to verify register (non-)usage. Note that startup code which runs only once, long before you first enable interrupts, will not interfere with your ISR's later exclusive use of registers.

If you decide not to write that ISR directly in assembly code, I would recommend that you write the C code and check the generated assembly code after every compilation, so that you immediately see what your changes end up generating.

You might end up writing a dozen or so variants of the ISR in C and assembly, adding up the cycles for each variant, and then choosing the best one.

Note: Without doing any register reservation, I end up with something around 31 cycles for the ISR (excluding entering and leaving, which adds another 8 or 10 cycles). Completely getting rid of the register pushing would get the ISR down to 15 cycles. Changing to a sample buffer with a constant size of 256 bytes and giving the ISR exclusive use of four registers allows getting down to 6 cycles spent in the ISR (plus 8 or 10 to enter/leave).
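The 256-byte buffer mentioned in point 5 and in the note can be sketched in C as follows. This is an illustration of the wrap-for-free idea only, not the 6-cycle assembly version: the names are hypothetical, `port_shadow_256` stands in for PORTD, and the period length is assumed to divide 256 (for example 32 or 64 samples rather than 30 or 60).

```c
#include <stdint.h>

/* ASSUMPTION: all names are illustrative. On the target the table would
 * ideally sit on a 256-byte boundary so that only the low address byte
 * changes. */
static uint8_t table256[256];
static uint8_t table_pos;                 /* wraps 255 -> 0 on its own */
static volatile uint8_t port_shadow_256;  /* stand-in for PORTD */

/* Fill the 256-byte buffer by repeating one period of period_len
 * samples; period_len should divide 256 so the wrap is seamless. */
static void fill_table(const uint8_t *period, uint16_t period_len)
{
    for (uint16_t i = 0; i < 256u; i++) {
        table256[i] = period[i % period_len];
    }
}

/* ISR body: no compare, no AND and no reset, because the 8-bit index
 * overflows back to zero by itself. */
static void isr_body_256(void)
{
    port_shadow_256 = table256[table_pos];
    table_pos++;
}
```

Calling `fill_table` once at start-up (or whenever the frequency changes) moves all of the wrap handling out of the time-critical ISR.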
