How to prevent LDM/STM instruction expansion in the ARM Compiler 5 armcc inline assembler?


Question

I'm trying to generate AXI bus burst accesses using STM/LDM instructions in inline assembly, in a .c file compiled with ARM Compiler 5 armcc.

inline void STMIA2(uint32_t addr, uint32_t w0, uint32_t w1)
{
    __asm {
        STMIA addr!, { w0, w1 }
    }
}

But the ARM Compiler armcc User Guide, section 7.18, says: "All LDM and STM instructions are expanded into a sequence of LDR and STR instructions with equivalent effect. However, the compiler might subsequently recombine the separate instructions into an LDM or STM during optimization."

And that is what really happens in practice: in some cases LDM/STM are expanded into a set of LDR/STR instructions, and the order of these instructions is arbitrary. This affects performance, since the HW we use is optimized for burst processing. It also breaks functional correctness, because the HW we use takes the sequence of words into account and ignores offsets (but the compiler thinks it is safe to change the order of the instructions).

To resolve this it's possible to use the embedded assembler instead of the inline assembler, but this leads to an extra function call and return, which affects performance.

So I'm wondering if there is a way to generate LDM/STM properly without losing performance? We were able to do this in GCC, but didn't find any solution for armcc.

Target CPU: Cortex-M0+ (ARMv6-M).

Edit:
The slave devices are all on-chip devices, and most of them are non-memory devices. For every register of a non-memory slave that supports burst access, a region of the address space is reserved (for example [0x10000..0x10100]). I'm not completely sure why; maybe the CPU or bus doesn't support fixed (non-incrementing) addresses. The HW ignores offsets within this region. A full request can be 16 bytes, for example, and the first word of the full request is the first word written (even if the offset is non-zero).

Recommended answer


"So I'm wondering if there is a way to generate LDM/STM properly without losing performance? We were able to do this in GCC, but didn't find any solution for armcc."

A little bit about compiler optimizations. Register allocation is one of a compiler's toughest jobs. The heart of any compiler's code generation is probably the point where it allocates physical CPU registers. Most compilers use static single assignment (SSA) form to rename your 'C' variables into a bunch of pseudo-variables (or time-ordered variables).

In order for your STMIA and LDMIA to work, the loads and stores need to be consistent. For example, a store such as stmia rx, {r3, r7} paired with a restore such as ldmia rx, {r4, r8} maps the stored 'r3' to the new 'r4' and the stored 'r7' to the restored 'r8'. This is not simple for any compiler to implement generically, as 'C' variables are assigned to registers according to need. Different versions of the same variable may be in different registers. To make stm/ldm work, those variables must be assigned so that the register numbers increase in the right order. For the ldmia above, if the compiler wants the stored r7 in r0 (maybe as a return value?), there is no way for it to create a good ldm instruction without generating additional code.
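The ordering constraint can be made concrete with a small host-side model. This is only a sketch under stated assumptions: `model_stmia` is a hypothetical helper written for illustration, not an armcc or CMSIS facility, and it models just one architectural rule, namely that a store-multiple always writes the lowest-numbered register to the lowest address, whatever order the registers are listed in.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical host-side model of STMIA (illustration only).
 * regs[]   : the 16 core register values, regs[n] standing for rn
 * reg_mask : bit n set means rn is in the register list
 * dst      : destination buffer, written at increasing word addresses
 * Returns the number of words written. */
size_t model_stmia(const uint32_t regs[16], uint16_t reg_mask, uint32_t *dst)
{
    size_t n = 0;

    assert(reg_mask != 0); /* the model requires a non-empty register list */

    /* Registers are always transferred in ascending register-number
     * order; the order written in the source text is irrelevant. */
    for (unsigned r = 0; r < 16; ++r) {
        if (reg_mask & (1u << r)) {
            dst[n++] = regs[r];
        }
    }
    return n;
}
```

With r3 and r7 in the list, the value of r3 always lands in the first word and r7 in the second. If the word that must go out first happens to live in the higher-numbered register, a move is needed before the store, which is the extra code the paragraph above refers to.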

You may have gotten gcc to generate this, but it was probably luck. If you proceed with gcc only, you will probably find that it doesn't always work either.

See: "ldm/stm and gcc" for issues with GCC stm/ldm.

Taking your example,

inline void STMIA2(uint32_t addr, uint32_t w0, uint32_t w1)
{
    __asm {
        STMIA addr!, { w0, w1 }
    }
}

The value of inline is that the whole function body may be placed directly in the calling code. The caller might have w0 and w1 in registers R8 and R4. If the function is not inline, the compiler must place them in R1 and R2, and may have to generate extra moves to do so. It is difficult for any compiler to fulfil the requirements of ldm/stm generically.


"This affects performance since HW we use optimized for bursts processing. Also this breaks functional correctness because HW we use takes into consideration sequence of words and ignores offsets (but compiler think that it's safe to change the order of instructions)."

If the hardware is a particular non-memory slave peripheral on the bus, then you can wrap the functionality that writes to this slave in an external (non-inline) wrapper and force the register allocation (see the AAPCS) so that ldm/stm will work. This will result in a performance hit, which could be mitigated by some custom assembler in the driver for the device.
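A minimal sketch of such a wrapper follows, assuming the armcc embedded assembler and the AAPCS argument registers (first argument in r0, second in r1, third in r2). The name `burst_write2` is hypothetical, and the portable branch exists only so the sketch can be compiled and exercised on a host; it makes no claim about bus behaviour there.

```c
#include <assert.h>
#include <stdint.h>

#if defined(__CC_ARM)
/* armcc embedded assembler: the body is emitted as written, so the
 * STMIA is not expanded into separate STR instructions.
 * Per the AAPCS, dst arrives in r0, w0 in r1 and w1 in r2. */
__asm void burst_write2(volatile uint32_t *dst, uint32_t w0, uint32_t w1)
{
    STMIA r0!, {r1, r2}
    BX    lr
}
#else
/* Host fallback (assumption: used for testing only). It performs two
 * ordered volatile word stores and is not guaranteed to be a burst. */
void burst_write2(volatile uint32_t *dst, uint32_t w0, uint32_t w1)
{
    assert(((uintptr_t)dst & 3u) == 0); /* word-aligned destination */
    dst[0] = w0;
    dst[1] = w1;
}
#endif
```

Because the wrapper is a real function, every use pays for a call and a return. Batching several bursts inside one assembler routine in the device driver is one way to spread that cost.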

However, it sounds like the device might be memory? In that case, you have a problem. Normally, would memory devices like this be used only through a cache? If your CPU has an MPU (memory protection unit) and can enable both data and code caches, then you might resolve this issue, because cache line transfers are always burst accesses. Care only needs to be taken in the code that sets up the MPU and the data cache. The OP's Cortex-M0+ has no cache and the devices are non-memory, so this is not possible (nor needed).

If your device is memory and you have no data cache, then your issue is probably unresolvable (without massive effort) and you need different hardware. Or you can wrap it like the peripheral device and take a performance hit, losing the benefits of random access to the memory device.
