Choice between aligned vs. unaligned x86 SIMD instructions

Question

There are generally two types of SIMD instructions:

A. Ones that work with aligned memory addresses and raise a general-protection (#GP) exception if the address is not aligned on the operand-size boundary:

movaps  xmm0, xmmword ptr [rax]
vmovaps ymm0, ymmword ptr [rax]
vmovaps zmm0, zmmword ptr [rax]

B. Ones that work with unaligned memory addresses and do not raise such an exception:

movups  xmm0, xmmword ptr [rax]
vmovups ymm0, ymmword ptr [rax]
vmovups zmm0, zmmword ptr [rax]

But I'm just curious: why would I want to shoot myself in the foot and use the aligned memory instructions from the first group at all?

Recommended answer

  • Unaligned access: Only movups/vmovups can be used. The penalties discussed under aligned access (next item) apply here too. In addition, accesses that cross a cache-line or virtual-page boundary always incur a penalty on all processors.
  • Aligned access:
    • On Intel Nehalem and later (including Silvermont and later) and AMD Bulldozer and later: after predecoding, movaps/vmovaps and movups/vmovups are executed in exactly the same way for the same operands. This includes support for move elimination. In the fetch and predecode stages, they also consume exactly the same resources for the same operands.
    • On Intel pre-Nehalem and Bonnell, and AMD pre-Bulldozer: the two forms are decoded into different fused-domain and unfused-domain uops. movups/vmovups consume more resources (up to twice as many) in the frontend and the backend of the pipeline. In other words, movups/vmovups can be up to twice as slow as movaps/vmovaps in terms of latency and/or throughput.
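To make the boundary-crossing penalty concrete, here is a minimal C sketch that checks whether an access of a given size straddles a cache line or page. The helper name `crosses_boundary` is hypothetical, and the 64-byte line and 4096-byte page sizes in the usage note are assumptions that hold for typical x86 processors, not values taken from the answer.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: returns true if the access [addr, addr + size)
   touches more than one `boundary`-sized block.
   Assumes size > 0 and that `boundary` is a power of two. */
static bool crosses_boundary(uintptr_t addr, size_t size, size_t boundary)
{
    assert(size > 0);
    assert(boundary != 0 && (boundary & (boundary - 1)) == 0);

    uintptr_t mask = ~(uintptr_t)(boundary - 1);
    uintptr_t first_block = addr & mask;              /* block of the first byte */
    uintptr_t last_block  = (addr + size - 1) & mask; /* block of the last byte  */
    return first_block != last_block;
}
```

For example, with an assumed 64-byte line, a 16-byte load at offset 56 covers bytes 56 to 71 and so crosses a line, while the same load at offset 48 does not. A naturally aligned access (address a multiple of its own size) can never cross either kind of boundary.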
Therefore, if you don't care about the older microarchitectures, the two are technically equivalent. However, if you know or expect the data to be aligned, you should use the aligned instructions: they verify that the data is indeed aligned without your having to add explicit checks to the code.
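The last point can be sketched in C with the SSE intrinsics `_mm_load_ps` and `_mm_loadu_ps`, which correspond to movaps and movups respectively. The helper names below are hypothetical, and the exact instructions emitted depend on the compiler and flags (for example, a load may be folded into another instruction's memory operand), so treat this as an illustration under those assumptions rather than a guarantee.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <xmmintrin.h> /* SSE intrinsics: _mm_load_ps, _mm_loadu_ps */

/* Hypothetical helper: true if p is aligned on a 16-byte boundary. */
static bool is_aligned16(const void *p)
{
    return ((uintptr_t)p & 15u) == 0;
}

/* Aligned form: no explicit check is written. When this is emitted as
   movaps, a misaligned p raises #GP, so the load itself enforces the
   alignment assumption. */
static __m128 load4_aligned(const float *p)
{
    return _mm_load_ps(p);
}

/* Unaligned form: never faults on misalignment, so enforcing the same
   assumption requires an explicit check (compiled out under NDEBUG). */
static __m128 load4_unaligned_checked(const float *p)
{
    assert(is_aligned16(p));
    return _mm_loadu_ps(p);
}

/* Unaligned form without any check: accepts any address. */
static __m128 load4_unaligned(const float *p)
{
    return _mm_loadu_ps(p);
}
```

With the aligned form, a violated alignment assumption shows up immediately as a fault at the offending load instead of going unnoticed as a potential slowdown.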
