Why is the AVX-256 VMOVAPS instruction only copying four single-precision floats instead of 8?

Problem description

I am trying to familiarize myself with the 256-bit AVX instructions available on some of the newer Intel processors. I have already verified that my i7-4720HQ supports 256-bit AVX instructions. The problem I am having is that the VMOVAPS instruction, which should copy 8 single precision floating point values, is only copying 4.

dot PROC
    VMOVAPS YMM1, ymmword ptr [RCX]                
    VDPPS   YMM2, YMM1, ymmword ptr [RDX], 255      
    VMOVAPS ymmword ptr [RCX], YMM2                 
    MOVSS   XMM0, DWORD PTR [RCX]                  
    RET
dot ENDP

In case you aren't familiar with the calling convention, Visual C++ 2015 expects the return of this function (since it is a float) to be in XMM0 upon return.

In addition to this, the standard is for the first argument to be passed in RCX and the second argument to be passed in RDX.

Here is the C code that calls this function.

_declspec(align(32)) float d1[] = { 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f };
_declspec(align(32)) float d2[] = { 2.0f, 2.0f, 2.0f, 2.0f, 2.0f, 2.0f, 2.0f, 2.0f };
printf("Dot Product Test: %f\n", dot(d1, d2));

The return value of the dot function is always 8.0. In addition to this, I have debugged the function and found that after the first assembly instruction, only four values get copied into YMM1. The rest of YMM1 remains zeroed.

Am I doing something wrong here? I've looked through the Intel documentation and some third party documentation. As far as I can tell I'm doing everything right. Am I using the wrong instruction? By the way, if you are here to tell me to use the Intel compiler intrinsics, don't bother.

Solution

You forgot to read the instruction set reference page for VDPPS. It mentions that the result is produced in two halves:

VDPPS (VEX.256 encoded version)
DEST[127:0] ← DP_Primitive(SRC1[127:0], SRC2[127:0]);
DEST[255:128] ← DP_Primitive(SRC1[255:128], SRC2[255:128]);

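It's not the VMOVAPS that's wrong. With an immediate of 255, each 128-bit half of the destination gets its own four-element dot product, so for the inputs above element 0 of YMM2 holds 4 × (1.0 × 2.0) = 8.0. The results from the two halves still have to be added together to get the full eight-element dot product (16.0).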
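To make the two-halves behaviour concrete, here is a minimal scalar sketch in plain C of what VDPPS does when the immediate is 255 (0xFF). The function names `vdpps256_ff_model` and `dot8_model` are made up for illustration, and the model only covers the imm8 = 0xFF case. It also sums the four products in simple left-to-right order, which may not match the hardware's rounding order for arbitrary inputs.

```c
/* Scalar model of VDPPS ymm, ymm, ymm/m256 with imm8 = 0xFF.
   Hypothetical helper for illustration only, not a hardware-exact definition.
   Each 128-bit lane (4 floats) is processed independently: the four
   products are summed and the sum is written to all four elements
   of that same lane. Nothing crosses the lane boundary. */
static void vdpps256_ff_model(const float *src1, const float *src2, float *dest)
{
    for (int lane = 0; lane < 2; ++lane) {
        const int base = lane * 4;          /* element 0 or element 4 */
        float sum = 0.0f;
        for (int i = 0; i < 4; ++i)
            sum += src1[base + i] * src2[base + i];
        for (int i = 0; i < 4; ++i)
            dest[base + i] = sum;           /* broadcast within the lane */
    }
}

/* Full 8-element dot product: take one element from each lane and add them. */
static float dot8_model(const float *a, const float *b)
{
    float r[8];
    vdpps256_ff_model(a, b, r);
    return r[0] + r[4];
}
```

With the arrays from the question, the model leaves 8.0 in every element of both lanes, which is the value the original routine returns via element 0, and `dot8_model` returns 16.0. In the assembly routine the equivalent step would presumably be to extract the upper 128 bits of YMM2 (for example with VEXTRACTF128) and add its low element to the low element of the lower half (for example with VADDSS) before returning in XMM0.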
