Take advantage of ARM unaligned memory access while writing clean C code


Question

It used to be that ARM processors (ARMv5 and below) were unable to handle unaligned memory access properly. Something like u32 var32 = *(u32*)ptr; would simply fail (raise an exception) if ptr was not aligned on 4 bytes.

Writing such a statement works fine on x86/x64 though, since these CPUs have always handled this situation very efficiently. But according to the C standard, this is not a "proper" way to write it: a u32 is effectively a 4-byte object that must be aligned on 4 bytes.

A proper way to achieve the same result, while staying strictly correct and fully compatible with any CPU, is:

u32 read32(const void* ptr) 
{ 
    u32 result; 
    memcpy(&result, ptr, 4); 
    return result; 
}

This one is correct and will generate proper code for any CPU, whether or not it can read at unaligned positions. Even better, on x86/x64 it is optimized into a single read operation, and hence has the same performance as the first statement. It's portable, safe, and fast. Who could ask for more?
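As a minimal sketch of how this function is used on genuinely unaligned data, the block below reads a 32-bit value that starts at an odd offset inside a byte buffer. The helper name read32_at and the use of uint32_t for u32 are assumptions made for illustration; they do not come from the original code.

```c
#include <stdint.h>
#include <string.h>

typedef uint32_t u32;   /* assumption: the original uses "unsigned int" */

/* Portable unaligned read: copy 4 bytes into a properly aligned local. */
static u32 read32(const void* ptr)
{
    u32 result;
    memcpy(&result, ptr, sizeof(result));
    return result;
}

/* Hypothetical helper: reads the u32 stored `offset` bytes into `buf`.
   buf + offset may be misaligned, so a direct *(const u32*) cast
   would not be valid here. */
static u32 read32_at(const unsigned char* buf, size_t offset)
{
    return read32(buf + offset);
}
```

The value comes back in the host's byte order, exactly as a direct load would return it; only the alignment requirement is removed.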

Well, the problem is that on ARM we are not so lucky.

Writing the memcpy version is indeed safe, but it seems to result in systematically cautious operations, which are very slow on ARMv6 and ARMv7 (basically, any smartphone).

In a performance-oriented application that relies heavily on read operations, the difference between the first and second versions can be measured: it stands at more than 5x with gcc -O2. This is far too much to be ignored.

Trying to find a way to use ARMv6/v7 capabilities, I looked for guidance in a few pieces of example code. Unfortunately, they all seem to select the first statement (direct u32 access), which is not supposed to be correct.

That's not all: new GCC versions now try to auto-vectorize. On x64 that means SSE/AVX; on ARMv7 that means NEON. ARM also has "Load Multiple" (LDM) and "Store Multiple" (STM) opcodes, which require the pointer to be aligned even on ARMv7.

What does that mean? Well, the compiler is free to use these advanced instructions even if they were not explicitly requested from the C code (no intrinsics). To make that decision, it relies on the fact that a u32* pointer is supposed to be aligned on 4 bytes. If it is not, then all bets are off: undefined behavior, crashes.

What that means is that even on CPUs which support unaligned memory access, it is now dangerous to use direct u32 access, as it can lead to buggy code generation at high optimization settings (-O3).
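To make the alignment assumption concrete, here is a small sketch of a check that tells whether a given pointer satisfies the 4-byte requirement the compiler takes for granted when it sees a u32*. The function name is_aligned is hypothetical, and the check relies on converting a pointer to uintptr_t, which is implementation-defined but yields the address on mainstream flat-memory targets.

```c
#include <stddef.h>
#include <stdint.h>

/* Returns 1 if ptr is a multiple of `alignment`, else 0.
   Assumption: `alignment` is a power of two (4 for a u32). */
static int is_aligned(const void* ptr, size_t alignment)
{
    return ((uintptr_t)ptr & (alignment - 1)) == 0;
}
```

A check like this is mostly useful in assertions or tests. It documents the precondition of a direct access but does not make an unaligned direct access legal.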

So now this is a dilemma: how can one get the native performance of ARMv6/v7 on unaligned memory access without writing the incorrect direct u32 access?

PS: I've also tried __packed() instructions, and from a performance perspective they seem to work exactly the same as the memcpy method.

Edit: Thanks for the excellent input received so far.

Looking at the generated assembly, I can confirm @Notlikethat's finding that the memcpy version does indeed generate a proper ldr opcode (unaligned load). However, I also found that the generated assembly uselessly invokes str (store). So the complete operation is now an unaligned load, an aligned store, and then a final aligned load. That's a lot more work than necessary.

Answering @haneefmubarak: yes, the code is properly inlined. And no, memcpy is very far from providing the best possible speed, since forcing the code to accept direct u32 access translates into huge performance gains. So some better possibility must exist.

A big thank-you to @artless_noise. The link to the godbolt service is invaluable. I've never been able to see so clearly the correspondence between C source code and its assembly representation. This is highly inspiring.

I completed one of @artless_noise's examples, and it gives the following:

#include <stdlib.h>
#include <string.h>   /* memcpy */
typedef unsigned int u32;

u32 reada32(const void* ptr) { return *(const u32*) ptr; }

u32 readu32(const void* ptr) 
{ 
    u32 result; 
    memcpy(&result, ptr, 4); 
    return result; 
}

Once compiled using ARM GCC 4.8.2 at -O3 or -O2:

reada32(void const*):
    ldr r0, [r0]
    bx  lr
readu32(void const*):
    ldr r0, [r0]    @ unaligned
    sub sp, sp, #8
    str r0, [sp, #4]    @ unaligned
    ldr r0, [sp, #4]
    add sp, sp, #8
    bx  lr

Quite interesting...

Recommended answer

OK, the situation is more confusing than one would like. So, in an effort to clarify, here are the findings from this journey:

  1. The only portable, standard C solution for accessing unaligned memory is the memcpy one. I was hoping to get another one through this question, but apparently it's the only one found so far.

Example code:

u32 read32(const void* ptr)
{
    u32 value;
    memcpy(&value, ptr, sizeof(value));
    return value;
}

This solution is safe in all circumstances. It also compiles into a trivial load register operation on x86 targets using GCC.

However, on ARM targets using GCC, it translates into a far too large and useless assembly sequence, which bogs down performance.

Using Clang on ARM targets, memcpy works fine (see @notlikethat's comment below). It would be easy to blame GCC at large, but it's not that simple: the memcpy solution works fine on GCC with x86/x64, PPC and ARM64 targets. Lastly, with another compiler, icc13, the memcpy version is surprisingly heavier on x86/x64 (4 instructions, where one should be enough). And those are just the combinations I could test so far.

I have to thank the godbolt project for making such statements easy to observe.

  2. The second solution is to use __packed structures. This solution is not standard C and depends entirely on compiler extensions. As a consequence, the way to write it depends on the compiler, and sometimes on its version. This is a mess for the maintenance of portable code.

That being said, in most circumstances it leads to better code generation than memcpy. In most circumstances only...

For example, for the cases above where the memcpy solution does not work well, here are the findings:

  • on x86 with ICC: the __packed solution works
  • on ARMv7 with GCC: the __packed solution works
  • on ARMv6 with GCC: it does not work. The assembly looks even uglier than with memcpy.
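For reference, one common way to spell the packed approach with GCC or Clang is sketched below. It uses the __attribute__((packed)) syntax, which is an assumption here: the original only mentions __packed, and other compilers spell the extension differently. The names unalign32 and read32_packed are hypothetical.

```c
#include <stdint.h>

typedef uint32_t u32;   /* assumption: the original uses "unsigned int" */

/* GCC/Clang extension, not standard C: a packed struct has alignment 1,
   so the compiler may not assume that `value` is 4-byte aligned. */
typedef struct { u32 value; } __attribute__((packed)) unalign32;

/* Reads a u32 through the packed wrapper type. */
static u32 read32_packed(const void* ptr)
{
    return ((const unalign32*)ptr)->value;
}
```

This addresses only the alignment assumption. Whether the access is also acceptable under the compiler's aliasing rules remains compiler-specific, which is part of why this route is harder to maintain than memcpy.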

  3. The last solution is to use direct u32 access to unaligned memory positions. This solution worked for decades on x86 CPUs, but it is not recommended, as it violates C standard principles: the compiler is entitled to treat this statement as a guarantee that the data is properly aligned, which can lead to buggy code generation.

Unfortunately, in at least one case it is the only solution able to extract performance from the target, namely GCC on ARMv6.

Do not use this solution on ARMv7 though: GCC can generate instructions that are reserved for aligned memory accesses, namely LDM (Load Multiple), leading to a crash.

Even on x86/x64, it is becoming dangerous to write code this way, as new-generation compilers may try to auto-vectorize compatible loops, generating SSE/AVX code based on the assumption that these memory positions are properly aligned, which crashes the program.

As a recap, here are the results summarized as a table, using the order of preference memcpy > packed > direct. Each cell gives the first method in that order that produces good code for that compiler and target.

| compiler  | x86/x64 | ARMv7  | ARMv6  | ARM64  |  PPC   |
|-----------|---------|--------|--------|--------|--------|
| GCC 4.8   | memcpy  | packed | direct | memcpy | memcpy |
| clang 3.6 | memcpy  | memcpy | memcpy | memcpy |   ?    |
| icc 13    | packed  | N/A    | N/A    | N/A    | N/A    |
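One way to act on this table is to pick the read method at compile time, as in the sketch below. The macro name UNALIGNED_READ_METHOD and its numeric values are hypothetical, and the packed branch assumes the GCC/Clang __attribute__((packed)) syntax. The default is the portable memcpy version.

```c
#include <stdint.h>
#include <string.h>

typedef uint32_t u32;   /* assumption: the original uses "unsigned int" */

/* Hypothetical build-time knob:
   0 = memcpy (portable default)
   1 = packed struct (GCC/Clang extension)
   2 = direct access (undefined behavior when misaligned; only for
       targets where the generated code has been inspected) */
#ifndef UNALIGNED_READ_METHOD
#  define UNALIGNED_READ_METHOD 0
#endif

#if UNALIGNED_READ_METHOD == 2
static u32 read32(const void* ptr) { return *(const u32*)ptr; }
#elif UNALIGNED_READ_METHOD == 1
typedef struct { u32 value; } __attribute__((packed)) unalign32;
static u32 read32(const void* ptr) { return ((const unalign32*)ptr)->value; }
#else
static u32 read32(const void* ptr)
{
    u32 value;
    memcpy(&value, ptr, sizeof(value));
    return value;
}
#endif
```

Following the table, a GCC 4.8 build for ARMv7 would pass -DUNALIGNED_READ_METHOD=1, while Clang builds can keep the default. Since the table reflects specific compiler versions, the choice is worth rechecking against the generated assembly whenever the toolchain changes.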
