如何说服AVR-GCC全局字节数组的内存位置是一个常量 [英] How to convince avr-gcc, that the memory position of a global byte array is a constant

查看:152
本文介绍了如何说服AVR-GCC全局字节数组的内存位置是一个常量的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我为具有 8位反向"例程. ="nofollow"> ATmega2560 处理器. 我正在使用

  • GNU C( WinAVR 20100110)版本4.3.3(avr)/由GNU C版本3.4编译. 5(mingw-vista special r3),GMP版本4.2.3,MPFR版本2.4.1.

首先,我创建了一个反向字节的全局查找表(大小:0x100):

uint8_t BitReverseTable[]
        __attribute__((__progmem__, aligned(0x100))) = {
    0x00,0x80,0x40,0xC0,0x20,0xA0,0x60,0xE0,
    0x10,0x90,0x50,0xD0,0x30,0xB0,0x70,0xF0,
    [...]
    0x1F,0x9F,0x5F,0xDF,0x3F,0xBF,0x7F,0xFF
};

这按预期工作.那是我打算使用的宏,应该只用5个循环即可.

#define BITREVERSE(x) (__extension__({                                        \
    register uint8_t b=(uint8_t)x;                                            \
    __asm__ __volatile__ (                                                    \
        "ldi r31, hi8(table)"                                          "\n\t" \
        "mov r30, ioRegister"                                          "\n\t" \
        "lpm ioRegister, z"                                            "\n\t" \
        :[ioRegister] "+r" (b)                                                \
        :[table] "g" (BitReverseTable)                                        \
        :"r30", "r31"                                                         \
    );                                                                        \
}))

用于编译(或不编译)的代码.

int main() /// Test for bitreverse
{
    BITREVERSE(25);
    return 0;
}

那是我从编译器中得到的错误:

c:/winavr-20100110/bin/../lib/gcc/avr/4.3.3/../../../../avr/bin/as.exe -mmcu=atmega2560 -o bitreverse.o C:\Users\xxx\AppData\Local\Temp/ccCefE75.s
C:\Users\xxx\AppData\Local\Temp/ccCefE75.s: Assembler messages:
C:\Users\xxx\AppData\Local\Temp/ccCefE75.s:349: Error: constant value required
C:\Users\xxx\AppData\Local\Temp/ccCefE75.s:350: Error: constant value required

我想问题出在这里

    :[table] "g" (BitReverseTable)                                        \

从我的角度来看,BitReverseTable是数组的内存位置,该位置在编译时是固定的,并且是已知的.因此,它是恒定的. 也许我需要将BitReverseTable转换成某种东西(我尝试了我能想到的一切).也许我需要另一个约束("g"是我的最后一个测试).我确定我使用了所有可能和不可能的东西. 我编写了一个很好的汇编程序版本,但不是内联汇编代码,而是一个适当的函数,它增加了6个循环(用于调用和退出).

任何意见或建议都非常欢迎!

pastebin 上的bitreverse.c的完整源. 详细的编译器输出也在 pastebin

解决方案

以下内容似乎在avr-gcc(GCC)4.8.2上有效,但对我来说确实有明显的回味.

经过修改,可解决OP(托马斯)在评论中指出的问题:

  • Z寄存器的高字节是r31(我交换了r30r31)
  • 像ATmega2560这样的较新的AVR也支持lpm r,Z(仅较旧的AVR lpm r0,Z)

谢谢修复,托马斯!我确实有一个ATmega2560开发板,但是我更喜欢Teensies(部分是由于本地USB),所以我只对代码进行了编译测试,没有运行它来进行验证.我应该提到那件事;道歉.

const unsigned char reverse_bits_table[256] __attribute__((progmem, aligned (256))) = {
      0, 128,  64, 192,  32, 160,  96, 224,  16, 144,  80, 208,  48, 176, 112, 240,
      8, 136,  72, 200,  40, 168, 104, 232,  24, 152,  88, 216,  56, 184, 120, 248,
      4, 132,  68, 196,  36, 164, 100, 228,  20, 148,  84, 212,  52, 180, 116, 244,
     12, 140,  76, 204,  44, 172, 108, 236,  28, 156,  92, 220,  60, 188, 124, 252,
      2, 130,  66, 194,  34, 162,  98, 226,  18, 146,  82, 210,  50, 178, 114, 242,
     10, 138,  74, 202,  42, 170, 106, 234,  26, 154,  90, 218,  58, 186, 122, 250,
      6, 134,  70, 198,  38, 166, 102, 230,  22, 150,  86, 214,  54, 182, 118, 246,
     14, 142,  78, 206,  46, 174, 110, 238,  30, 158,  94, 222,  62, 190, 126, 254,
      1, 129,  65, 193,  33, 161,  97, 225,  17, 145,  81, 209,  49, 177, 113, 241,
      9, 137,  73, 201,  41, 169, 105, 233,  25, 153,  89, 217,  57, 185, 121, 249,
      5, 133,  69, 197,  37, 165, 101, 229,  21, 149,  85, 213,  53, 181, 117, 245,
     13, 141,  77, 205,  45, 173, 109, 237,  29, 157,  93, 221,  61, 189, 125, 253,
      3, 131,  67, 195,  35, 163,  99, 227,  19, 147,  83, 211,  51, 179, 115, 243,
     11, 139,  75, 203,  43, 171, 107, 235,  27, 155,  91, 219,  59, 187, 123, 251,
      7, 135,  71, 199,  39, 167, 103, 231,  23, 151,  87, 215,  55, 183, 119, 247,
     15, 143,  79, 207,  47, 175, 111, 239,  31, 159,  95, 223,  63, 191, 127, 255,
};

#define USING_REVERSE_BITS \
    register unsigned char r31 asm("r31"); \
    asm volatile ( "ldi r31,hi8(reverse_bits_table)\n\t" : [r31] "=d" (r31) )

#define REVERSE_BITS(v) \
    ({ register unsigned char r30 asm("r30") = v; \
       register unsigned char ret; \
       asm volatile ( "lpm %[ret],Z\n\t" : [ret] "=r" (ret) : [r30] "d" (r30), [r31] "d" (r31) ); \
       ret; })

unsigned char reverse_bits(const unsigned char value)
{
    USING_REVERSE_BITS;
    return REVERSE_BITS(value);
}

void reverse_bits_in(unsigned char *string, unsigned char length)
{
    USING_REVERSE_BITS;

    while (length-->0) {
        *string = REVERSE_BITS(*string);
        string++;
    }
}

对于仅支持lpm r0,Z的旧版AVR,请使用

#define REVERSE_BITS(v) \
    ({ register unsigned char r30 asm("r30") = v; \
       register unsigned char ret asm("r0"); \
       asm volatile ( "lpm %[ret],Z\n\t" : [ret] "=t" (ret) : [r30] "d" (r30), [r31] "d" (r31) ); \
       ret; })

我们的想法是,我们使用本地reg var r31,用于保留Z寄存器对的高字节. USING_REVERSE_BITS;宏在当前范围内对其进行定义,使用内联汇编有两个目的:避免不必要地将表地址的低位部分加载到寄存器中,并确保GCC知道我们已将值存储在其中(因为它是一个输出操作数)而没有任何方法知道该值是什么,因此希望将其保留在整个作用域中.

REVERSE_BITS()宏产生结果,告诉编译器它需要寄存器r30中的参数以及r31USING_REVERSE_BITS;设置的表地址高字节.

听起来有点复杂,但这只是因为我不知道如何更好地解释它.真的很简单.

使用avr-gcc-4.8.2 -O2 -fomit-frame-pointer -mmcu=atmega2560 -S编译上述内容即可产生汇编源. (我建议使用-O2 -fomit-frame-pointer.) 省略注释和常规指令:

    .text

reverse_bits:
    ldi r31,hi8(reverse_bits_table)
    mov r30,r24
    lpm r24,Z
    ret

reverse_bits_in:
    mov r26,r24
    mov r27,r25
    ldi r31,hi8(reverse_bits_table)
    ldi r24,lo8(-1)
    add r24,r22
    tst r22
    breq .L2
.L8:
    ld r30,X
    lpm r30,Z
    st X+,r30
    subi r24,1
    brcc .L8
.L2:
    ret

    .section    .progmem.data,"a",@progbits
    .p2align    8
reverse_bits_table:
    .byte    0
    .byte    -128
    ; Rest of data omitted for brevity

如果您想知道,GCC在ATmega2560上将第一个8位参数和8位函数结果都放在寄存器r24中.

据我所知,第一个功能是最佳的. (在仅支持lpm r0,Z的旧版AVR上,您可以采取其他措施将结果从r0复制到r24.)

对于第二个功能,设置部分可能不是最佳选择(对于其中一个,您可以先做tst r22 breq .L2来加快零长度数组检查的速度),但是我不确定如果我自己能写得更快/更短;我当然可以接受.

第二个函数中的循环对我而言似乎是最佳的.它使用r30的方式最初让我感到奇怪和恐惧,但是后来我意识到这很合理-使用了更少的寄存器,并且以这种方式重用r30也没有害处(即使它是寄存器),因为它将在下一次迭代开始时从string中加载新值.

请注意,在我先前的编辑中,我提到交换函数参数的顺序可以产生更好的代码,但是有了Thomas的添加,情况已不再如此.寄存器改变了,就是这样.

如果您确定始终使用大于零的length,请使用

void reverse_bits_in(unsigned char *string, unsigned char length)
{
    USING_REVERSE_BITS;
    do {
        *string = REVERSE_BITS(*string);
        string++;
    } while (--length);
}

收益

reverse_bits_in:
    mov r26,r24                      ; 1 cycle
    mov r27,r25                      ; 1 cycle
    ldi r31,hi8(reverse_bits_table)  ; 2 cycles
.L4:
    ld r30,X                         ; 2 cycles
    lpm r30,Z                        ; 3 cycles
    st X+,r30                        ; 2 cycles
    subi r22,lo8(-(-1))              ; 1 cycle
    brne .L4                         ; 2 cycles
    ret                              ; 4 cycles

看起来对我印象非常深刻:每字节10个周期,4个设置周期和3个清理周期(brne如果没有跳转,则只需一个周期).我从头顶上列出了周期的计数,因此在em中可能有一些小错误(这里或那里的周期). r26:r27X,并且在r24:r25中提供了该函数的第一个指针参数,在r22中提供了长度.

reverse_bits_table在正确的部分中,并且正确对齐. (.p2align 8确实对齐到256个字节;它指定了低8位为零的对齐方式.)

尽管GCC因多余的寄存器移动而臭名昭著,但我真的很喜欢它上面生成的代码.当然,总是有罚款的余地;对于重要的代码序列,我建议尝试不同的变体,甚至更改功能参数的顺序(或在本地范围内声明循环变量),依此类推,然后使用-S进行编译以查看生成的代码. AVR 指令时序很简单,因此比较代码序列很容易如果一个明显更好.我喜欢先删除指令和注释.这样可以更容易阅读程序集.


回味浓郁的原因是 GCC文档明确表示定义这样的寄存器变量不会保留寄存器;在流量控制确定该变量的值无效的地方,它仍可用于其他用途" ,我只是不信任这对GCC开发人员和我的意义相同.即使现在就这样做,将来也可能不会.没有标准的GCC开发人员应该遵守这里,因为这是GCC特有的功能.

另一方面,我只依靠上面记录的GCC行为,尽管它很"hacky",但它确实可以从简单的C代码生成有效的汇编.

就个人而言,无论何时更新avr-gcc,我都建议重新编译上面的测试代码,并查看生成的程序集(也许使用sed删除注释和标签,并与已知的良好版本进行比较?)./p>

问题?

I writing a fast "8 bit reverse"-routine for an avr-project with an ATmega2560 processor. I'm using

  • GNU C (WinAVR 20100110) version 4.3.3 (avr) / compiled by GNU C version 3.4.5 (mingw-vista special r3), GMP version 4.2.3, MPFR version 2.4.1.

First I created a global lookup-table of reversed bytes (size: 0x100):

uint8_t BitReverseTable[]
        __attribute__((__progmem__, aligned(0x100))) = {
    0x00,0x80,0x40,0xC0,0x20,0xA0,0x60,0xE0,
    0x10,0x90,0x50,0xD0,0x30,0xB0,0x70,0xF0,
    [...]
    0x1F,0x9F,0x5F,0xDF,0x3F,0xBF,0x7F,0xFF
};

This works as expected. That is the macro I intend to use, which should cost me only 5 cylces:

#define BITREVERSE(x) (__extension__({                                        \
    register uint8_t b=(uint8_t)x;                                            \
    __asm__ __volatile__ (                                                    \
        "ldi r31, hi8(table)"                                          "\n\t" \
        "mov r30, ioRegister"                                          "\n\t" \
        "lpm ioRegister, z"                                            "\n\t" \
        :[ioRegister] "+r" (b)                                                \
        :[table] "g" (BitReverseTable)                                        \
        :"r30", "r31"                                                         \
    );                                                                        \
}))

The code to get it compiled (or not).

int main() /// Test for bitreverse
{
    BITREVERSE(25);
    return 0;
}

That's the error I get from the compiler:

c:/winavr-20100110/bin/../lib/gcc/avr/4.3.3/../../../../avr/bin/as.exe -mmcu=atmega2560 -o bitreverse.o C:\Users\xxx\AppData\Local\Temp/ccCefE75.s
C:\Users\xxx\AppData\Local\Temp/ccCefE75.s: Assembler messages:
C:\Users\xxx\AppData\Local\Temp/ccCefE75.s:349: Error: constant value required
C:\Users\xxx\AppData\Local\Temp/ccCefE75.s:350: Error: constant value required

I guess the problem is here:

    :[table] "g" (BitReverseTable)                                        \

From my point of view BitReverseTable is the memory position of the array, which is fixed and known at compile time. Therefor it is constant. Maybe I need to cast BitReverseTable into something (i tried anything I could think of). Maybe I need another constraint ("g" was my last test). I'm sure I used anything possible and impossible. I coded an assembler version, which works fine, but instead of being an inline assembly code, this is a proper function which adds another 6 cycles (for call and ret).

Any advice or suggestions are very welcome!

Full source of bitreverse.c on pastebin. Verbose compiler output also on pastebin

解决方案

The following does seem to work on avr-gcc (GCC) 4.8.2, but it does have a distinct hacky aftertaste to me.

Edited to fix the issues pointed out by the OP (Thomas) in the comments:

  • The high byte of Z register is r31 (I had r30 and r31 swapped)
  • Newer AVR's like ATmega2560 support also lpm r,Z (older AVRs only lpm r0,Z)

Thanks for the fixes, Thomas! I do have an ATmega2560 board, but I prefer Teensies (in part because of the native USB), so I only compile-tested the code, didn't run it to verify. I should have mentioned that; apologies.

const unsigned char reverse_bits_table[256] __attribute__((progmem, aligned (256))) = {
      0, 128,  64, 192,  32, 160,  96, 224,  16, 144,  80, 208,  48, 176, 112, 240,
      8, 136,  72, 200,  40, 168, 104, 232,  24, 152,  88, 216,  56, 184, 120, 248,
      4, 132,  68, 196,  36, 164, 100, 228,  20, 148,  84, 212,  52, 180, 116, 244,
     12, 140,  76, 204,  44, 172, 108, 236,  28, 156,  92, 220,  60, 188, 124, 252,
      2, 130,  66, 194,  34, 162,  98, 226,  18, 146,  82, 210,  50, 178, 114, 242,
     10, 138,  74, 202,  42, 170, 106, 234,  26, 154,  90, 218,  58, 186, 122, 250,
      6, 134,  70, 198,  38, 166, 102, 230,  22, 150,  86, 214,  54, 182, 118, 246,
     14, 142,  78, 206,  46, 174, 110, 238,  30, 158,  94, 222,  62, 190, 126, 254,
      1, 129,  65, 193,  33, 161,  97, 225,  17, 145,  81, 209,  49, 177, 113, 241,
      9, 137,  73, 201,  41, 169, 105, 233,  25, 153,  89, 217,  57, 185, 121, 249,
      5, 133,  69, 197,  37, 165, 101, 229,  21, 149,  85, 213,  53, 181, 117, 245,
     13, 141,  77, 205,  45, 173, 109, 237,  29, 157,  93, 221,  61, 189, 125, 253,
      3, 131,  67, 195,  35, 163,  99, 227,  19, 147,  83, 211,  51, 179, 115, 243,
     11, 139,  75, 203,  43, 171, 107, 235,  27, 155,  91, 219,  59, 187, 123, 251,
      7, 135,  71, 199,  39, 167, 103, 231,  23, 151,  87, 215,  55, 183, 119, 247,
     15, 143,  79, 207,  47, 175, 111, 239,  31, 159,  95, 223,  63, 191, 127, 255,
};

#define USING_REVERSE_BITS \
    register unsigned char r31 asm("r31"); \
    asm volatile ( "ldi r31,hi8(reverse_bits_table)\n\t" : [r31] "=d" (r31) )

#define REVERSE_BITS(v) \
    ({ register unsigned char r30 asm("r30") = v; \
       register unsigned char ret; \
       asm volatile ( "lpm %[ret],Z\n\t" : [ret] "=r" (ret) : [r30] "d" (r30), [r31] "d" (r31) ); \
       ret; })

unsigned char reverse_bits(const unsigned char value)
{
    USING_REVERSE_BITS;
    return REVERSE_BITS(value);
}

void reverse_bits_in(unsigned char *string, unsigned char length)
{
    USING_REVERSE_BITS;

    while (length-->0) {
        *string = REVERSE_BITS(*string);
        string++;
    }
}

For older AVRs that only support lpm r0,Z, use

#define REVERSE_BITS(v) \
    ({ register unsigned char r30 asm("r30") = v; \
       register unsigned char ret asm("r0"); \
       asm volatile ( "lpm %[ret],Z\n\t" : [ret] "=t" (ret) : [r30] "d" (r30), [r31] "d" (r31) ); \
       ret; })

The idea is that we use a local reg var r31, to keep the high byte of the Z register pair. The USING_REVERSE_BITS; macro defines it in the current scope, using inline assembly for two purposes: to avoid an unnecessary load of the low part of the table address into a register, and to make sure GCC knows we have stored a value into it (because it is an output operand) without having any way of knowing what the value should be, thus hopefully retaining it throughout the scope.

The REVERSE_BITS() macro yields the result, telling the compiler it needs the argument in register r30, and the table address high byte set by USING_REVERSE_BITS; in r31.

Sounds a bit complicated, but that's just because I don't know how to explain it better. It really is quite simple.

Compiling the above with avr-gcc-4.8.2 -O2 -fomit-frame-pointer -mmcu=atmega2560 -S yields the assembly source. (I do recommend using -O2 -fomit-frame-pointer.) Omitting comments and the normal directives:

    .text

reverse_bits:
    ldi r31,hi8(reverse_bits_table)
    mov r30,r24
    lpm r24,Z
    ret

reverse_bits_in:
    mov r26,r24
    mov r27,r25
    ldi r31,hi8(reverse_bits_table)
    ldi r24,lo8(-1)
    add r24,r22
    tst r22
    breq .L2
.L8:
    ld r30,X
    lpm r30,Z
    st X+,r30
    subi r24,1
    brcc .L8
.L2:
    ret

    .section    .progmem.data,"a",@progbits
    .p2align    8
reverse_bits_table:
    .byte    0
    .byte    -128
    ; Rest of data omitted for brevity

In case you are wondering, on ATmega2560 GCC puts the first 8-bit parameter and the 8-bit function result both in register r24.

The first function is optimal, as far as I can tell. (On older AVRs that only support lpm r0,Z, you get an added move to copy the result from r0 to r24.)

For the second function, the setup part might not be exactly optimal (for one, you could do the tst r22 breq .L2 first thing to speed up the zero-length-array check), but I'm not sure if I could write a faster/shorter one myself; it's certainly acceptable to me.

The loop in the second function looks optimal to me. The way it uses r30 I found strange and scary at first, but then I realized it makes perfect sense -- fewer registers used, and there is no harm in reusing r30 this way (even if it is low part of Z register too), because it will be loaded with a new value from string at the start of the next iteration.

Note that in my previous edit, I mentioned that swapping the order of the function parameters yielded better code, but with Thomas's additions, that is no longer the case. The registers change, that's it.

If you are sure you always supply a larger-than-zero length, using

void reverse_bits_in(unsigned char *string, unsigned char length)
{
    USING_REVERSE_BITS;
    do {
        *string = REVERSE_BITS(*string);
        string++;
    } while (--length);
}

yields

reverse_bits_in:
    mov r26,r24                      ; 1 cycle
    mov r27,r25                      ; 1 cycle
    ldi r31,hi8(reverse_bits_table)  ; 2 cycles
.L4:
    ld r30,X                         ; 2 cycles
    lpm r30,Z                        ; 3 cycles
    st X+,r30                        ; 2 cycles
    subi r22,lo8(-(-1))              ; 1 cycle
    brne .L4                         ; 2 cycles
    ret                              ; 4 cycles

which starts to look downright impressive to me: ten cycles per byte, four cycles for setup, and three cycles cleanup (brne takes just one cycle if no jump). The cycle counts I listed off the top of my head, so there are likely small errors in 'em (a cycle here or there). r26:r27 is X, and the first pointer parameter to the function is supplied in r24:r25, with length in r22.

The reverse_bits_table is in the correct section, and correctly aligned. (.p2align 8 does align to 256 bytes; it specifies an alignment where the low 8 bits are zero.)

Although GCC is notorious for superfluous register moves, I really like the code it generates above. Sure, there is always room for finessing; for the important code sequences I recommend trying different variants, even changing the order of function parameters (or declaring loop variables in local scopes), and so on, then compile using -S to see the generated code. The AVR instruction timings are simple, so it is pretty easy to compare code sequences, to see if one is clearly better. I like to remove the directives and comments first; it makes it easier to read the assembly.


The reason for the hacky aftertaste is that the GCC documentation explicitly says that "Defining such a register variable does not reserve the register; it remains available for other uses in places where flow control determines the variable's value is not live", and I just don't trust that this means the same to the GCC developers as it means to me. Even if it did right now, it might not in the future; there is no standard GCC developers ought to adhere to here, since this is a GCC-specific feature.

On the other hand, I do only rely on documented GCC behaviour, above, and although "hacky", it does generate efficient assembly from straightforward C code.

Personally, I would recommend recompiling the above test code, and looking at the generated assembly (perhaps use sed to strip out the comments and labels, and compare to a known good version?), whenever you update avr-gcc.

Questions?

这篇关于如何说服AVR-GCC全局字节数组的内存位置是一个常量的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆