Why is there no floating point intrinsic for the `PSHUFD` instruction?


Problem description

The task I'm facing is to shuffle one __m128 vector and store the result in another one.

The way I see it, there are two basic ways to shuffle a packed floating point __m128 vector:

  • _mm_shuffle_ps, which uses the SHUFPS instruction. It is not necessarily the best option if you want values from one vector only: it takes two of its values from the destination operand, which implies an extra move.
  • _mm_shuffle_epi32, which uses the PSHUFD instruction. It seems to do exactly what is expected here and can have better latency/throughput than SHUFPS.
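To make the two options concrete, here is a minimal sketch in C that reverses the four lanes of a vector both ways. The helper names `reverse_with_shufps` and `reverse_with_pshufd` are hypothetical, chosen only for illustration. The second helper needs the `_mm_castps_si128` / `_mm_castsi128_ps` casts because `_mm_shuffle_epi32` only accepts `__m128i`.

```c
#include <xmmintrin.h>  /* SSE:  __m128, _mm_shuffle_ps, _MM_SHUFFLE */
#include <emmintrin.h>  /* SSE2: __m128i, _mm_shuffle_epi32, cast intrinsics */

/* Hypothetical helper: reverse the lanes with SHUFPS by passing the
 * same vector as both source operands. */
static inline __m128 reverse_with_shufps(__m128 x)
{
    return _mm_shuffle_ps(x, x, _MM_SHUFFLE(0, 1, 2, 3));
}

/* Hypothetical helper: the same permutation with PSHUFD. The casts only
 * reinterpret the register type so that the integer intrinsic accepts it. */
static inline __m128 reverse_with_pshufd(__m128 x)
{
    __m128i xi = _mm_castps_si128(x);
    __m128i shuffled = _mm_shuffle_epi32(xi, _MM_SHUFFLE(0, 1, 2, 3));
    return _mm_castsi128_ps(shuffled);
}
```

Both helpers return the same lanes. Which one is faster, if either, depends on the target microarchitecture, so treat this as an illustration of the syntax rather than a recommendation.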

The latter intrinsic, however, works with integer vectors (__m128i), and there seems to be no floating point counterpart, so using it with __m128 would require some ugly explicit casting. The absence of such a counterpart probably also means there is a proper reason for it that I am not aware of.

The question is: why is there no intrinsic to shuffle one floating point vector and store the result in another?
If _mm_shuffle_ps(x, x, ...) can generate PSHUFD, can that be guaranteed?
If PSHUFD should not be used for floating point values, what is the reason for that?

Thanks!

Recommended answer

Intrinsics are supposed to map one-to-one to instructions. It would be very undesirable for _mm_shuffle_ps to generate PSHUFD; it should always generate SHUFPS. The documentation does not suggest that there is any case where it would do otherwise.

There is a performance penalty on certain processors when data is cast to single- or double-precision floating-point. This is because the processor augments the SSE registers with internal registers containing the FP classification of the data, e.g. zero, NaN, infinity, or normal. When switching types, you incur a stall while the processor performs that classification step. I don't know if this is still true of modern processors, but you can consult the Intel Architecture Optimization manuals for that information.

SHUFPS is not significantly slower than PSHUFD on modern processors. According to Agner Fog's instruction tables (http://www.agner.org/optimize/instruction_tables.pdf), they have identical latency and throughput on Haswell (4th gen. Core i7). On Nehalem (1st gen. Core i7), they have identical latency, but PSHUFD has a throughput of 2/cycle and SHUFPS has a throughput of 1/cycle. So you cannot say that one instruction should be preferred over the other across all processors, even if you ignore the performance penalty associated with switching types.

There is also a way to cast between __m128, __m128d, and __m128i: the _mm_castXX_YY intrinsics (https://software.intel.com/en-us/node/695375?language=es), where XX and YY are each one of ps, pd, or si128. For example, _mm_castps_pd(). Using these casts to run PSHUFD on floating-point data is really a bad idea, because the processors on which PSHUFD is faster also suffer from the performance penalty associated with switching back to FP afterward. In other words, there is no faster way to do a SHUFPS than doing a SHUFPS.
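As a small illustration of what the cast intrinsics do, the sketch below contrasts a cast with a numeric conversion. The helper names are hypothetical. A cast such as `_mm_castps_si128` leaves the bits of the register untouched and only changes the C type, whereas `_mm_cvtps_epi32` converts each float value to an integer.

```c
#include <emmintrin.h>  /* SSE2: cast and convert intrinsics */
#include <stdint.h>

/* Hypothetical helper: raw 32-bit pattern of lane 0 after a cast.
 * The cast is a pure reinterpretation; no value conversion happens. */
static inline uint32_t lane0_bits_after_cast(__m128 v)
{
    return (uint32_t)_mm_cvtsi128_si32(_mm_castps_si128(v));
}

/* Hypothetical helper: lane 0 after a numeric float-to-int conversion. */
static inline int32_t lane0_after_convert(__m128 v)
{
    return _mm_cvtsi128_si32(_mm_cvtps_epi32(v));
}
```

For an input of 1.0f in lane 0, the cast yields the IEEE 754 bit pattern 0x3F800000, while the conversion yields the integer 1. The cast is free at the source level. The cost discussed above comes from then executing an integer instruction on data that later FP instructions consume.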
