Performance of unaligned SIMD load/store on aarch64
Question
An older answer indicates that aarch64 supports unaligned reads/writes and mentions a performance cost, but it's unclear whether that answer covers only ALU (general-purpose register) accesses or SIMD (128-bit register) operations as well.
Relative to aligned 128-bit NEON loads and stores, how much slower (if at all) are unaligned 128-bit NEON loads and stores on aarch64?
Are there separate instructions for unaligned SIMD loads and stores (as is the case with SSE2), or are the known-aligned loads/stores the same instructions as potentially-unaligned loads/stores?
Recommended answer
The Cortex-A57 Software Optimization Guide says, in section 4.6 "Load/Store Alignment":
The ARMv8-A architecture allows many types of load and store accesses to be arbitrarily aligned. The Cortex-A57 processor handles most unaligned accesses without performance penalties. However, there are cases which reduce bandwidth or incur additional latency, as described below:
- Load operations that cross a cache-line (64-byte) boundary
- Store operations that cross a 16-byte boundary
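The two cases above come down to simple address arithmetic. As a minimal sketch (the helper name `crosses_boundary` is made up for illustration and is not part of any ARM library), a check for whether an access spans a given power-of-two boundary could look like this:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: returns true if the access [addr, addr + size)
 * touches more than one `boundary`-sized block.
 * `boundary` must be a power of two (64 for a cache line, 16 for stores). */
static bool crosses_boundary(uintptr_t addr, size_t size, size_t boundary)
{
    assert(size > 0);
    assert(boundary != 0 && (boundary & (boundary - 1)) == 0);

    uintptr_t first_block = addr / boundary;              /* block of first byte */
    uintptr_t last_block  = (addr + size - 1) / boundary; /* block of last byte  */
    return first_block != last_block;
}
```

For a 128-bit (16-byte) access, `crosses_boundary(addr, 16, 64)` models the load case and `crosses_boundary(addr, 16, 16)` models the store case. Note that a 16-byte store crosses a 16-byte boundary whenever its address is not 16-byte aligned, so under this reading of the guide every unaligned 128-bit store falls into the second case, while only some unaligned 128-bit loads fall into the first.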
So it may depend on the processor you are using: out-of-order (Cortex-A57, A72, A75) or in-order (Cortex-A35, A53, A55). I didn't find any optimization guide for the in-order processors, but they do have a hardware performance counter event that you can use to check whether the number of unaligned accesses affects performance:
0x0F  UNALIGNED_LDST_RETIRED  Unaligned load-store
This can be used with the perf tool.
There are no special instructions for unaligned accesses in AArch64.
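To illustrate that last point, here is a small sketch using the NEON intrinsics from `arm_neon.h`: the same `vld1q_u8` / `vst1q_u8` calls are used whether or not the pointers are 16-byte aligned. The function name `add16_u8` is made up for this example, and the scalar fallback exists only so the snippet also builds on non-aarch64 hosts.

```c
#include <stdint.h>
#if defined(__aarch64__)
#include <arm_neon.h>
#endif

/* Hypothetical example: out[i] = a[i] + b[i] for 16 bytes (wrapping add).
 * None of the pointers needs to be 16-byte aligned. */
static void add16_u8(const uint8_t *a, const uint8_t *b, uint8_t *out)
{
#if defined(__aarch64__)
    /* Same intrinsics for aligned and unaligned addresses. */
    uint8x16_t va = vld1q_u8(a);
    uint8x16_t vb = vld1q_u8(b);
    vst1q_u8(out, vaddq_u8(va, vb));
#else
    /* Portable fallback so the sketch compiles elsewhere. */
    for (int i = 0; i < 16; i++) {
        out[i] = (uint8_t)(a[i] + b[i]);
    }
#endif
}
```

Calling it with deliberately misaligned pointers, for example `add16_u8(buf + 1, buf + 3, buf + 35)` on a `uint8_t buf[64]`, works the same way as with aligned ones; any difference would show up only in timing, not in the instructions used.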