Performance of unaligned SIMD load/store on aarch64
Question
An older answer indicates that aarch64 supports unaligned reads/writes and mentions a performance cost, but it's unclear whether that answer covers only ALU operations or SIMD (128-bit register) operations, too.
Relative to aligned 128-bit NEON loads and stores, how much slower (if at all) are unaligned 128-bit NEON loads and stores on aarch64?
Are there separate instructions for unaligned SIMD loads and stores (as is the case with SSE2), or are the known-aligned loads/stores the same instructions as potentially-unaligned loads/stores?
Recommended answer
The Cortex-A57 Software Optimization Guide, section 4.6 (Load/Store Alignment), says:
The ARMv8-A architecture allows many types of load and store accesses to be arbitrarily aligned. The Cortex-A57 processor handles most unaligned accesses without performance penalties. However, there are cases which reduce bandwidth or incur additional latency, as described below:
- Load operations that cross a cache-line (64-byte) boundary
- Store operations that cross a 16-byte boundary
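The two conditions listed above can be checked for a given address with a little arithmetic. The following is a minimal sketch in C, not code from the guide: the helper names (`crosses_boundary`, `a57_load_may_be_penalized`, `a57_store_may_be_penalized`) are hypothetical, and the 64-byte and 16-byte figures are the Cortex-A57 values quoted above, which may differ on other cores.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: returns true if an access of `size` bytes starting at
 * `addr` touches bytes on both sides of a `boundary`-byte boundary.
 * `boundary` must be a power of two and `size` must be non-zero. */
static bool crosses_boundary(uintptr_t addr, size_t size, size_t boundary)
{
    assert(size > 0);
    assert(boundary != 0 && (boundary & (boundary - 1)) == 0);
    uintptr_t first_block = addr / boundary;              /* block of first byte */
    uintptr_t last_block = (addr + size - 1) / boundary;  /* block of last byte */
    return first_block != last_block;
}

/* Assumption from the Cortex-A57 guide: loads may be penalized when they
 * cross a 64-byte cache-line boundary. */
static bool a57_load_may_be_penalized(const void *p, size_t size)
{
    return crosses_boundary((uintptr_t)p, size, 64);
}

/* Assumption from the Cortex-A57 guide: stores may be penalized when they
 * cross a 16-byte boundary. */
static bool a57_store_may_be_penalized(const void *p, size_t size)
{
    return crosses_boundary((uintptr_t)p, size, 16);
}
```

Under these assumptions, a 16-byte store to an address that is not a multiple of 16 always crosses a 16-byte boundary, while a 16-byte load at an arbitrary address only crosses a cache line when it starts in the last 15 bytes of one.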
So it may depend on the processor you are using: out-of-order (Cortex-A57, A72, A75) or in-order (Cortex-A35, A53, A55). I didn't find an optimization guide for the in-order processors, but they do have a hardware performance counter that you can use to check whether the number of unaligned accesses actually affects performance:
0x0F  UNALIGNED_LDST_RETIRED  Unaligned load-store
This counter can be used with the perf tool.
There are no special instructions for unaligned accesses in AArch64; the same load/store instructions are used whether or not the address is aligned.
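To illustrate the point about instructions, here is a minimal sketch in C using the NEON intrinsic `vld1q_u8`, which takes a plain `uint8_t` pointer with no alignment requirement beyond that of a byte. The function name `sum16` is hypothetical, the non-aarch64 branch is only a portable fallback so the sketch compiles elsewhere, and the sketch assumes a typical configuration in which strict alignment checking is not enabled.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#if defined(__aarch64__)
#include <arm_neon.h>
#endif

/* Hypothetical example: sums the 16 bytes starting at `p`.
 * `p` may have any alignment; the same load is used either way. */
static unsigned sum16(const uint8_t *p)
{
    assert(p != NULL);
#if defined(__aarch64__)
    uint8x16_t v = vld1q_u8(p);  /* 128-bit load, aligned or not */
    return vaddlvq_u8(v);        /* widening horizontal add of all 16 lanes */
#else
    /* Portable fallback for non-aarch64 builds (not NEON). */
    uint8_t tmp[16];
    unsigned s = 0;
    memcpy(tmp, p, sizeof tmp);
    for (int i = 0; i < 16; i++)
        s += tmp[i];
    return s;
#endif
}
```

Calling `sum16(buf)` and `sum16(buf + 1)` on the same buffer goes through the same code path; whether the second call is slower depends on the core and on which boundaries the access crosses.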