Take advantage of ARM unaligned memory access while writing clean C code

Problem description

It used to be that ARM processors (ARMv5 and below) were unable to properly handle unaligned memory access. Something like u32 var32 = *(u32*)ptr; would just fail (raise an exception) if ptr was not properly aligned on 4 bytes.

Writing such a statement would work fine on x86/x64 though, since these CPUs have always handled this situation very efficiently. But according to the C standard, this is not a "proper" way to write it. u32 is apparently equivalent to a structure of 4 bytes which must be aligned on 4 bytes.

A proper way to achieve the same result while staying strictly correct and ensuring full compatibility with any CPU is:

u32 read32(const void* ptr) 
{ 
    u32 result; 
    memcpy(&result, ptr, 4); 
    return result; 
}

This one is correct and will generate proper code for any CPU, whether or not it can read at unaligned positions. Even better, on x86/x64 it is properly optimized to a single read operation, hence has the same performance as the first statement. It's portable, safe, and fast. Who could ask for more?

Well, the problem is that on ARM we are not so lucky.

Writing the memcpy version is indeed safe, but it seems to result in systematically cautious operations, which are very slow on ARMv6 and ARMv7 (basically, any smartphone).

In a performance-oriented application which relies heavily on read operations, the difference between the 1st and 2nd versions can be measured: it stands at more than 5x at gcc -O2 settings. This is way too much to be ignored.

Trying to find a way to use ARMv6/v7 capabilities, I've looked for guidance in a few code examples. Unfortunately, they seem to select the first statement (direct u32 access), which is not supposed to be correct.

That's not all: new GCC versions are now trying to implement auto-vectorization. On x64 that means SSE/AVX, on ARMv7 that means NEON. ARMv7 also supports "Load Multiple" (LDM) and "Store Multiple" (STM) opcodes, which require the pointer to be aligned.

What does that mean? Well, the compiler is free to use these advanced instructions, even if they were not specifically requested from the C code (no intrinsics). To make such a decision, it relies on the fact that a u32* pointer is supposed to be aligned on 4 bytes. If it's not, then all bets are off: undefined behavior, crashes.

What that means is that even on CPUs which support unaligned memory access, it's now dangerous to use direct u32 access, as it can lead to buggy code generation at high optimization settings (-O3).

So now, this is a dilemma: how to get the native performance of ARMv6/v7 on unaligned memory access without writing the incorrect direct u32 access version?

PS: I've also tried __packed() directives, and from a performance perspective they seem to work exactly the same as the memcpy method.

[Edit] Thanks for the excellent input received so far.

Looking at the generated assembly, I can confirm @Notlikethat's finding that the memcpy version does indeed generate a proper ldr opcode (unaligned load). However, I also found that the generated assembly needlessly emits a str (store). So the complete operation is now an unaligned load, an aligned store, and then a final aligned load. That's a lot more work than necessary.

Answering @haneefmubarak: yes, the code is properly inlined. And no, memcpy is very far from providing the best possible speed, since forcing the code to accept direct u32 access translates into huge performance gains. So some better possibility must exist.

A big thank you to @artless_noise. The link to the godbolt service is invaluable. I've never been able to see so clearly the equivalence between C source code and its assembly representation. This is highly inspiring.

I completed one of @artless_noise's examples, and it gives the following:

#include <string.h>   /* memcpy (standard header, replaces the non-standard <memory.h>) */
typedef unsigned int u32;

u32 reada32(const void* ptr) { return *(const u32*) ptr; }

u32 readu32(const void* ptr) 
{ 
    u32 result; 
    memcpy(&result, ptr, 4); 
    return result; 
}

Once compiled using ARM GCC 4.8.2 at -O3 or -O2:

reada32(void const*):
    ldr r0, [r0]
    bx  lr
readu32(void const*):
    ldr r0, [r0]    @ unaligned
    sub sp, sp, #8
    str r0, [sp, #4]    @ unaligned
    ldr r0, [sp, #4]
    add sp, sp, #8
    bx  lr

Quite telling ....

Recommended answer

OK, the situation is more confusing than one would like. So, in an effort to clarify, here are the findings from this journey:

  1. The only portable, standard C solution for accessing unaligned memory is the memcpy one. I was hoping to get another one through this question, but apparently it's the only one found so far.

Example code:

u32 read32(const void* ptr)
{
    u32 value;
    memcpy(&value, ptr, sizeof(value));
    return value;
}

This solution is safe in all circumstances. It also compiles into a trivial load register operation on x86 targets using GCC.
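The same idea extends to stores. The following is a minimal sketch, not part of the original answer: write32 is a hypothetical counterpart to read32, and u32 is assumed to be a 32-bit unsigned int as in the question.

```c
#include <string.h>

typedef unsigned int u32;   /* assumption: unsigned int is 32 bits wide, as in the question */

/* Read a 32-bit value from any address, aligned or not. */
static u32 read32(const void *ptr)
{
    u32 value;
    memcpy(&value, ptr, sizeof(value));
    return value;
}

/* Hypothetical counterpart (not in the original answer):
 * store a 32-bit value at any address, aligned or not. */
static void write32(void *ptr, u32 value)
{
    memcpy(ptr, &value, sizeof(value));
}
```

Calling write32(buf + 1, x) followed by read32(buf + 1) on a byte buffer returns x regardless of the buffer's alignment or the target's endianness, since both directions copy the same native byte representation.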

However, on ARM targets using GCC, it translates into an overly large and useless assembly sequence, which bogs down performance.

Using Clang on ARM targets, memcpy works fine (see @notlikethat's comment below). It would be easy to blame GCC at large, but it's not that simple: the memcpy solution works fine on GCC with x86/x64, PPC and ARM64 targets. Lastly, trying another compiler, icc13, the memcpy version is surprisingly heavier on x86/x64 (4 instructions, while one should be enough). And those are just the combinations I could test so far.

I have to thank the godbolt project for making such observations easy.

  2. The second solution is to use __packed structures. This solution is not standard C and depends entirely on compiler extensions. As a consequence, the way to write it depends on the compiler, and sometimes on its version. This is a mess for the maintenance of portable code.

That being said, in most circumstances it leads to better code generation than memcpy. In most circumstances only ...

For example, regarding the above cases where the memcpy solution does not work well, here are the findings:

  • on x86 with ICC: the __packed solution works
  • on ARMv7 with GCC: the __packed solution works
  • on ARMv6 with GCC: it does not work. The assembly looks even uglier than with memcpy.
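For reference, here is a minimal sketch of what the packed approach can look like with the GCC/Clang attribute spelling. The names unalign32 and read32_packed are illustrative, not from the original answer, and the may_alias attribute is an addition assumed here to keep the cast clear of type-based alias analysis. Other compilers spell the extension differently.

```c
typedef unsigned int u32;   /* assumption: unsigned int is 32 bits wide, as in the question */

/* GCC/Clang extension, not standard C.
 * packed:    the struct (and its member) gets an alignment requirement of 1,
 *            so the compiler must emit a load that tolerates any address.
 * may_alias: asks the compiler not to apply type-based alias analysis to
 *            accesses made through this type. */
typedef struct __attribute__((packed, may_alias)) { u32 v; } unalign32;

/* Hypothetical helper: read a 32-bit value through the packed wrapper. */
static u32 read32_packed(const void *ptr)
{
    return ((const unalign32 *)ptr)->v;
}
```

On any target, read32_packed(p) should return the same value as the memcpy version for the same p; only the generated instructions may differ.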

  3. The last solution is to use direct u32 access to unaligned memory positions. This solution worked for decades on x86 CPUs, but it is not recommended, as it violates C standard principles: the compiler is allowed to treat this statement as a guarantee that the data is properly aligned, which can lead to buggy code generation.

Unfortunately, in at least one case, it is the only solution able to extract performance from the target: namely, GCC on ARMv6.

Do not use this solution on ARMv7 though: GCC can generate instructions which are reserved for aligned memory accesses, namely LDM (Load Multiple), leading to a crash.

Even on x86/x64, it is becoming dangerous to write code this way nowadays, as new-generation compilers may try to auto-vectorize some compatible loops, generating SSE/AVX code based on the assumption that these memory positions are properly aligned, and crashing the program.

As a recap, here are the results summarized as a table. Each cell gives the method to use for that compiler and target, following the order of preference memcpy > packed > direct.

| compiler  | x86/x64 | ARMv7  | ARMv6  | ARM64  |  PPC   |
|-----------|---------|--------|--------|--------|--------|
| GCC 4.8   | memcpy  | packed | direct | memcpy | memcpy |
| clang 3.6 | memcpy  | memcpy | memcpy | memcpy |   ?    |
| icc 13    | packed  | N/A    | N/A    | N/A    | N/A    |
