Performance of unaligned SIMD load/store on aarch64


Question

An older answer indicates that aarch64 supports unaligned reads and writes and mentions a performance cost, but it is unclear whether that answer covers only ALU operations or SIMD (128-bit register) operations as well.

Relative to aligned 128-bit NEON loads and stores, how much slower (if at all) are unaligned 128-bit NEON loads and stores on aarch64?

Are there separate instructions for unaligned SIMD loads and stores (as is the case with SSE2), or are known-aligned loads/stores the same instructions as potentially-unaligned loads/stores?

Recommended answer

Section 4.6, Load/Store Alignment, of the Cortex-A57 Software Optimization Guide says:

The ARMv8-A architecture allows many types of load and store accesses to be arbitrarily aligned. The Cortex-A57 processor handles most unaligned accesses without performance penalties. However, there are cases which reduce bandwidth or incur additional latency, as described below:

  • Load operations that cross a cache-line (64-byte) boundary
  • Store operations that cross a 16-byte boundary
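The two cases above can be checked with simple address arithmetic. The following is a minimal sketch, not something taken from the guide: the helper names (crosses_boundary, load_may_be_penalized, store_may_be_penalized) and the two macros are made up for illustration, and the 64-byte and 16-byte values are the Cortex-A57 figures quoted above, so other cores may use different boundaries.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Boundaries quoted from the Cortex-A57 guide; other cores may differ. */
#define LOAD_PENALTY_BOUNDARY  64u  /* cache-line size in bytes */
#define STORE_PENALTY_BOUNDARY 16u

/* Returns true if the byte range [addr, addr + size) spans a multiple of
 * `boundary`. `boundary` must be a power of two and `size` non-zero. */
static bool crosses_boundary(uintptr_t addr, size_t size, size_t boundary)
{
    assert(size > 0 && boundary > 0 && (boundary & (boundary - 1)) == 0);
    /* Round the first and last byte addresses down to the boundary. */
    uintptr_t first = addr & ~(uintptr_t)(boundary - 1);
    uintptr_t last  = (addr + size - 1) & ~(uintptr_t)(boundary - 1);
    return first != last;
}

/* Hypothetical helper: would a load of `size` bytes at p cross a cache line? */
static bool load_may_be_penalized(const void *p, size_t size)
{
    return crosses_boundary((uintptr_t)p, size, LOAD_PENALTY_BOUNDARY);
}

/* Hypothetical helper: would a store of `size` bytes at p cross 16 bytes? */
static bool store_may_be_penalized(const void *p, size_t size)
{
    return crosses_boundary((uintptr_t)p, size, STORE_PENALTY_BOUNDARY);
}
```

Under these assumptions, a 16-byte load at offset 56 within a cache line covers bytes 56 to 71 and is flagged, while the same load at offset 48 is not. Any 16-byte store whose address is not a multiple of 16 is flagged.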

The cost may therefore depend on the processor you are using: out-of-order (A57, A72, A75) or in-order (A35, A53, A55). I didn't find any optimization guide for the in-order processors, but they do have a hardware performance counter that you can use to check whether the number of unaligned accesses affects performance:

    0x0F UNALIGNED_LDST_RETIRED Unaligned load-store

This counter can be read with the perf tool, for example as a raw PMU event.

AArch64 has no special instructions for unaligned accesses: the same load/store instructions are used whether or not the address is aligned.
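As a rough illustration, the sketch below uses the same NEON intrinsic, vld1q_u8, for any pointer, aligned or not. The function name sum16_any_alignment is hypothetical. The sketch assumes ordinary (Normal) memory with alignment checking disabled, which is the usual situation for user-space code. The non-aarch64 branch is only a portable fallback so the example builds elsewhere.

```c
#include <stdint.h>
#include <string.h>

#if defined(__aarch64__)
#include <arm_neon.h>
#endif

/* Sums the 16 bytes starting at p; p may have any alignment. */
static unsigned sum16_any_alignment(const uint8_t *p)
{
#if defined(__aarch64__)
    /* One intrinsic for both aligned and unaligned addresses. */
    uint8x16_t v = vld1q_u8(p);
    /* Widening horizontal add; the result fits in 16 bits. */
    return vaddlvq_u8(v);
#else
    /* Portable fallback: copy the bytes out and add them one by one. */
    uint8_t tmp[16];
    unsigned sum = 0;
    memcpy(tmp, p, sizeof tmp);
    for (int i = 0; i < 16; i++)
        sum += tmp[i];
    return sum;
#endif
}
```

Calling it with buf, buf + 1 or buf + 3 goes through the same code path. Whether a given call is slower depends only on which boundaries the 16 bytes happen to cross on the core in question.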
