How to prevent LDM/STM instruction expansion in the ARM Compiler 5 armcc inline assembler?

Question

I'm trying to generate AXI bus burst accesses using STM/LDM instructions in inline assembly, in a .c file compiled with ARM Compiler 5 armcc.

inline void STMIA2(uint32_t addr, uint32_t w0, uint32_t w1)
{
    __asm {
        STMIA addr!, { w0, w1 }
    }
}

But the ARM Compiler armcc User Guide, section 7.18, says: "All LDM and STM instructions are expanded into a sequence of LDR and STR instructions with equivalent effect. However, the compiler might subsequently recombine the separate instructions into an LDM or STM during optimization."

And that is what happens in practice: in some cases LDM/STM are expanded into a set of LDR/STR instructions, and the order of these instructions is arbitrary. This hurts performance, since the hardware we use is optimized for burst processing. It also breaks functional correctness, because our hardware relies on the sequence of words and ignores offsets (but the compiler thinks it is safe to change the order of the instructions).

To work around this it is possible to use the embedded assembler instead of the inline assembler, but that adds a function call and return for each access, which affects performance.

So I'm wondering whether there is a way to generate LDM/STM properly without losing performance. We were able to do this in GCC, but didn't find any solution for armcc.

Target CPU: Cortex-M0+ (ARMv6-M).

The slave devices are all on-chip, and most of them are non-memory devices. For every register of a non-memory slave that supports burst access, a region of address space is reserved (for example [0x10000..0x10100]). I'm not completely sure why; maybe the CPU or bus doesn't support fixed (non-incrementing) addresses. The hardware ignores offsets within this region. A full request can be 16 bytes, for example, and the first word of the full request is the first word written (even if its offset is non-zero).
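To make the ordering problem concrete, here is a small host-side C model of such a slave. It is only a sketch under assumptions: the `burst_slave` type, the 4-word request size, and all function names are hypothetical and are not taken from the actual hardware or its driver.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of a burst-capable register: the slave ignores the
 * offset inside its reserved window and only looks at arrival order. */
#define REQ_WORDS 4u   /* assumed: one full request is 16 bytes */

typedef struct {
    uint32_t words[REQ_WORDS];
    size_t   count;            /* words received so far */
} burst_slave;

/* One bus write: the offset is accepted but deliberately unused. */
static void slave_write(burst_slave *s, uint32_t offset, uint32_t data)
{
    assert(s != NULL);
    (void)offset;
    if (s->count < REQ_WORDS) {
        s->words[s->count++] = data;
    }
}

/* What an STMIA of {w0, w1} produces: ascending addresses, in order. */
static void write_as_burst(burst_slave *s, uint32_t w0, uint32_t w1)
{
    slave_write(s, 0u, w0);
    slave_write(s, 4u, w1);
}

/* What a reordered STR pair produces: same addresses, swapped order. */
static void write_reordered(burst_slave *s, uint32_t w0, uint32_t w1)
{
    slave_write(s, 4u, w1);
    slave_write(s, 0u, w0);
}
```

For ordinary memory both sequences leave the same contents behind, which is why a compiler may treat them as equivalent. For a slave modelled like this one, the second sequence delivers the words swapped.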

Recommended answer

"So I'm wondering if there is a way to generate LDM/STM properly without losing performance? We were able to do this in GCC, but didn't find any solution for armcc."

A little bit about compiler optimizations. Register allocation is one of a compiler's toughest jobs. The heart of any compiler's code generation is probably the point where it allocates physical CPU registers. Most compilers use static single assignment (SSA) form to rename your C variables into a set of pseudo-variables (or time-ordered variables).

In order for your STMIA and LDMIA to work, the loads and stores need to be consistent. That is, a store like stmia rx, {r3, r7} and a restore like ldmia rx, {r4, r8}, with the stored r3 mapping to the new r4 and the stored r7 mapping to the restored r8. This is not simple for any compiler to implement generically, because C variables are assigned to registers according to need. Different versions of the same variable may be in different registers. To make stm/ldm work, those variables must be assigned so that the register numbers increase in the right order. For the ldmia above, if the compiler wants the stored r7 in r0 (maybe as a return value?), there is no way for it to create a good ldm instruction without generating additional code.
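The ordering constraint can be illustrated with a small host-side model of STMIA. This is a sketch for illustration only: `stmia_model` and its register-file array are hypothetical names, and the model covers nothing beyond the store order (the lowest-numbered register in the list goes to the lowest address).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Model of STMIA: bit n of 'reglist' is set if register rn is in the list.
 * Registers are always stored in ascending register-number order at
 * ascending addresses; the order written in the source does not matter.
 * Returns the number of words stored. */
static size_t stmia_model(const uint32_t regs[16], uint16_t reglist,
                          uint32_t *mem)
{
    size_t n = 0;
    assert(regs != NULL && mem != NULL);
    for (unsigned r = 0; r < 16u; ++r) {
        if (reglist & (1u << r)) {
            mem[n++] = regs[r];
        }
    }
    return n;
}
```

If the register allocator happens to hold w0 in r7 and w1 in r3, a single store-multiple of those two registers emits w1 first, so the compiler has to insert moves or fall back to separate stores to get w0 out first.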

You may have gotten gcc to generate this, but it was probably luck. If you proceed with gcc only, you will probably find it doesn't always work as well.

See "ldm/stm and gcc" for issues with GCC and stm/ldm.
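For the GCC side, one way to depend less on luck is to pin the data words to specific registers with local register variables, so that the register list of the store-multiple is in a known ascending order. The following is a sketch under assumptions, not the code the question's author used: the name `stmia2_gcc` and the choice of r1/r2 are illustrative, and the non-ARM branch exists only so the snippet can be compiled and exercised on a host.

```c
#include <assert.h>
#include <stdint.h>

static inline void stmia2_gcc(volatile uint32_t *addr, uint32_t w0, uint32_t w1)
{
    /* Store-multiple needs a word-aligned base address. */
    assert(((uintptr_t)addr & 3u) == 0u);
#if defined(__GNUC__) && defined(__arm__) && !defined(__CC_ARM)
    /* Pin the words so that w0 is in the lower-numbered register and is
     * therefore written first (to the lower address). */
    register uint32_t r_w0 __asm__("r1") = w0;
    register uint32_t r_w1 __asm__("r2") = w1;
    __asm__ volatile ("stmia %0!, {%1, %2}"
                      : "+l"(addr)              /* base, updated by writeback */
                      : "l"(r_w0), "l"(r_w1)    /* low registers (Thumb) */
                      : "memory");
#else
    /* Host fallback: two ordered volatile stores, no burst guarantee. */
    addr[0] = w0;
    addr[1] = w1;
#endif
}
```

The `volatile` qualifier on the asm statement keeps GCC from removing it, and the "memory" clobber keeps other memory accesses from being moved across it. Whether the resulting bus transfer is actually a burst still depends on the core and the interconnect.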

Taking your example,

inline void STMIA2(uint32_t addr, uint32_t w0, uint32_t w1)
{
    __asm {
        STMIA addr!, { w0, w1 }
    }
}

The value of inline is that the whole function body may be placed directly in the calling code. The caller might have w0 and w1 in registers R8 and R4. If the function is not inlined, then the compiler must place them in R1 and R2, which may require extra moves. It is difficult for any compiler to fulfil the requirements of ldm/stm generically.

"This affects performance since HW we use optimized for bursts processing. Also this breaks functional correctness because HW we use takes into consideration sequence of words and ignores offsets (but compiler think that it's safe to change the order of instructions)."

If the hardware is a particular non-memory slave peripheral on the bus, then you can wrap the code that writes to this slave in an external (non-inlined) wrapper and force the register allocation (see the AAPCS) so that ldm/stm will work. This results in a performance hit, which could be mitigated by some custom assembler in the driver for the device.
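A minimal sketch of such a wrapper, using the armcc embedded assembler so that the STMIA is assembled exactly as written. The function name `burst_write2` is hypothetical. The register use relies on the AAPCS placing the first three integer arguments in r0, r1 and r2, and the non-armcc branch is only a host-side stand-in with the same store order.

```c
#include <assert.h>
#include <stdint.h>

#if defined(__CC_ARM)
/* armcc embedded assembler: the body is passed to the assembler unchanged,
 * so the STMIA is not expanded into separate STRs.
 * AAPCS: addr arrives in r0, w0 in r1, w1 in r2. */
__asm void burst_write2(volatile uint32_t *addr, uint32_t w0, uint32_t w1)
{
    STMIA   r0!, {r1, r2}   ; w0 first, then w1, at ascending addresses
    BX      lr
}
#else
/* Host stand-in (assumption): ordered volatile stores, no burst. */
void burst_write2(volatile uint32_t *addr, uint32_t w0, uint32_t w1)
{
    assert(((uintptr_t)addr & 3u) == 0u);   /* word-aligned base */
    addr[0] = w0;
    addr[1] = w1;
}
#endif
```

Each write costs a call and a return, which is the performance hit mentioned above. Moving a whole transfer loop into the embedded-assembler function, rather than a single store, is one way a driver could amortize that cost.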

However, it sounds like the device might be memory? In that case, you have a problem. Normally, memory devices like this are only accessed through a cache. If your CPU has an MPU (memory protection unit) and can enable both data and code caches, then you might resolve this issue: cache line transfers will always be burst accesses. Care only needs to be taken in the code to set up the MPU and the data cache. The OP's Cortex-M0+ has no cache and the devices are non-memory, so this is not possible (nor needed).

If your device is memory and you have no data cache, then your issue is probably unresolvable (without massive effort) and you need different hardware. Or you can wrap it like the peripheral device and take a performance hit, losing the benefits of random access to the memory device.
