SSE instruction MOVSD (extended: floating point scalar & vector operations on x86, x86-64)


Problem description

I am somehow confused by the MOVSD assembly instruction. I wrote some numerical code computing some matrix multiplication, simply using ordinary C code with no SSE intrinsics. I do not even include the header file for SSE2 intrinsics for compilation. But when I check the assembler output, I see that:

1) 128-bit vector registers XMM are used; 2) SSE2 instruction MOVSD is invoked.

I understand that MOVSD essentially operates on a single double-precision floating-point value. It only uses the lower 64 bits of an XMM register and sets the upper 64 bits to 0. But I just don't understand two things:

1) I never gave the compiler any hint to use SSE2. Also, I am using GCC, not the Intel compiler. As far as I know, the Intel compiler automatically seeks vectorization opportunities, but GCC does not. So how does GCC know to use MOVSD? Or has this x86 instruction been around since long before the SSE instruction set, with the _mm_load_sd() intrinsic in SSE2 merely providing backward compatibility for scalar computation in XMM registers?

2) Why does the compiler not use other floating point registers, such as the 80-bit floating point stack or 64-bit floating point registers? Why must it pay the cost of using an XMM register (setting the upper 64 bits to 0 and essentially wasting that storage)? Do XMM registers provide faster access?

By the way, I have another question regarding SSE2. I just can't see the difference between _mm_store_sd() and _mm_storel_sd(). Both store the lower 64-bit value to an address. What is the difference? A performance difference? An alignment difference?

Thank you.

UPDATE 1:

OK, obviously when I first asked this question I lacked some basic knowledge of how a CPU manages floating point operations, so experts tended to think my question was nonsense. Since I did not include even the shortest sample C code, people might have found the question vague as well. Here I provide a review as an answer, which hopefully will be useful to anyone unclear about floating point operations on modern CPUs.

Answer

A review of scalar and vector floating point processing on modern CPUs

The idea of vector processing dates back to old-time vector processors, but those processors have been superseded by modern architectures with cache systems. So we focus on modern CPUs, especially x86 and x86-64. These architectures are the mainstream in high performance scientific computing.

Intel introduced the floating point stack with the 8087 coprocessor (integrated on-die since the 486DX), where floating point numbers up to 80 bits wide can be held. This stack is commonly known as the x87 or 387 floating point "registers", with a set of x87 FPU instructions (see http://home.agh.edu.pl/~amrozek/x87.pdf). The x87 stack registers are not real, directly addressable registers like the general purpose registers, as they live on a stack. Access to register st(i) is by offsetting from the stack top register %st(0), or simply %st. With the help of the FXCH instruction, which swaps the contents of the current stack top %st and some offset register %st(i), random access can be achieved, though FXCH can impose a (minimized) performance penalty. The x87 stack provides high precision computation by calculating intermediate results in 80-bit extended precision by default, to minimize roundoff error in numerically unstable algorithms. However, x87 instructions are completely scalar.

The first effort at vectorization was the MMX instruction set, which implemented integer vector operations. The vector registers under MMX are the 64-bit wide registers MMX0, MMX1, ..., MMX7. Each can hold either one 64-bit integer or multiple smaller integers in a "packed" format: a single instruction can then be applied to two 32-bit integers, four 16-bit integers, or eight 8-bit integers at once. So there are the legacy general purpose registers for scalar integer operations, plus the new MMX registers for integer vector operations, with no shared execution resources between them. But MMX shared execution resources with the scalar x87 FPU: each MMX register corresponds to the lower 64 bits of an x87 register, and the upper 16 bits of that x87 register go unused. This aliasing made it difficult to mix floating point and integer vector operations in the same application. To maximize performance, programmers often used the processor exclusively in one mode or the other, deferring the relatively slow switch between them as long as possible.

Later, SSE created a separate set of 128-bit wide registers, XMM0-XMM7, alongside the x87 stack. SSE instructions focused exclusively on single-precision (32-bit) floating-point operations; integer vector operations were still performed using the MMX registers and the MMX instruction set. But now the two kinds of operations can proceed at the same time, as they share no execution resources. It is important to know that SSE does not only floating point vector operations, but also floating point scalar operations. Essentially, it provides a new place where floating point operations can take place, and the x87 stack is no longer the preferred choice for them. Using XMM registers for scalar floating point operations is faster than using the x87 stack, as all XMM registers are directly accessible, while the x87 stack cannot be randomly accessed without FXCH. When I posted my question, I was clearly unaware of this fact. The other concept I was not clear about is that the general purpose registers are integer/address registers. Even though they are 64-bit on x86-64, they cannot hold 64-bit floating point values. The main reason is that the execution unit associated with the general purpose registers is the ALU (arithmetic & logic unit), which does not do floating point computation.

SSE2 was a major step forward, as it extended the vector data types so that SSE2 instructions, whether scalar or vector, can work with all standard C data types. This extension in fact made MMX obsolete. Also, the x87 stack is no longer as important as it once was. Since there are two alternative places where floating point operations can take place, you can specify your choice to the compiler. For example, with GCC, compiling with the flag

-mfpmath=387

will schedule floating point operations on the legacy x87 stack. Note that this seems to be the default for 32-bit x86, even when SSE is available. For example, I have an Intel Core 2 Duo laptop made in 2007, which supports SSE up to SSE4, yet GCC would still use the x87 stack by default, making scientific computations unnecessarily slow. In this case, we need to compile with the flag

-mfpmath=sse

and GCC will schedule floating point operations on the XMM registers. Users of 64-bit x86-64 need not worry about this configuration, as it is the default there. This flag only affects scalar floating point operations. If we have written code using vector instructions and compile the code with the flag

-msse2

then the XMM registers are where the vector computation takes place. Note, however, that -msse2 by itself does not imply -mfpmath=sse on 32-bit x86; the two flags are typically combined. For more information see GCC's x86/x86-64 configuration options. For examples of writing SSE2 C code, see my other post: http://stackoverflow.com/questions/36110591/how-to-ask-gcc-to-completely-unroll-this-loop-i-e-peel-this-loop.

The SSE instruction sets, though very useful, are not the latest vector extensions. AVX, the Advanced Vector Extensions, enhances SSE by providing 3-operand and 4-operand instructions (see "number of operands" in an instruction set reference if you are unclear what this means). A 3-operand instruction optimizes the fused multiply-add (FMA) operation, common in scientific computing, by 1) using one fewer register, 2) reducing the explicit amount of data movement between registers, and 3) speeding up the FMA computation itself. For an example of using AVX, see @Nominal Animal's answer to my post.

