Forcing AVX intrinsics to use SSE instructions instead

Question

Unfortunately, I have an AMD Piledriver CPU, which seems to have problems with AVX instructions:

Memory writes with the 256-bit AVX registers are exceptionally slow. The measured throughput is 5 - 6 times slower than on the previous model (Bulldozer), and 8 - 9 times slower than two 128-bit writes.

In my own experience, I've found mm256 intrinsics to be much slower than their mm128 counterparts, and I assume the slow 256-bit writes described above are the cause.

I really want to code for the newest instruction set, AVX, while still being able to test builds on my machine at a reasonable speed. Is there a way to force mm256 intrinsics to use SSE instructions instead? I'm using VS 2015.

If there is no easy way, what about a hard way? Could I replace <immintrin.h> with a custom header containing my own definitions of the intrinsics, coded to use SSE? I'm not sure how plausible this is, and I'd prefer an easier way if one exists before I go through that work.

Recommended answer

Use Agner Fog's Vector Class Library and add this to the command line in Visual Studio: -D__SSE4_2__ -D__XOP__.

Then use an AVX-sized vector such as Vec8f for eight floats. When you compile without AVX enabled, it will use the file vectorf256e.h, which emulates AVX with two SSE registers. For example, Vec8f inherits from Vec256fe, which starts like this:

class Vec256fe {
protected:
    __m128 y0;                         // low half
    __m128 y1;                         // high half
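
The two-register idea can be illustrated outside the VCL with plain SSE intrinsics. The following is only a minimal sketch, not the library's actual code: the names Float8Emu, load8, store8, add8 and add_arrays8 are hypothetical and were chosen for this example. Each 8-float operation is simply carried out as two 4-float SSE operations on the low and high halves.

```cpp
#include <xmmintrin.h>  // SSE intrinsics: __m128, _mm_loadu_ps, _mm_add_ps, ...

// Hypothetical stand-in for an emulated 256-bit float vector:
// two 128-bit SSE registers, mirroring the y0/y1 layout shown above.
struct Float8Emu {
    __m128 lo;  // elements 0..3
    __m128 hi;  // elements 4..7
};

// Load eight floats as two unaligned 128-bit loads.
inline Float8Emu load8(const float* p) {
    return { _mm_loadu_ps(p), _mm_loadu_ps(p + 4) };
}

// Store eight floats as two unaligned 128-bit stores.
inline void store8(float* p, Float8Emu v) {
    _mm_storeu_ps(p, v.lo);
    _mm_storeu_ps(p + 4, v.hi);
}

// An 8-wide add becomes two 4-wide SSE adds.
inline Float8Emu add8(Float8Emu a, Float8Emu b) {
    return { _mm_add_ps(a.lo, b.lo), _mm_add_ps(a.hi, b.hi) };
}

// Convenience wrapper: out[i] = a[i] + b[i] for i in 0..7.
inline void add_arrays8(const float* a, const float* b, float* out) {
    store8(out, add8(load8(a), load8(b)));
}
```

Code written against such a wrapper never touches a 256-bit register, which is roughly what happens when Vec8f falls back to the emulation header.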

If you compile with /arch:AVX -D__XOP__ the VCL will instead use the file vectorf256.h and one AVX register. Then your code works for AVX and SSE with only a compiler switch change.

If you don't want to use XOP, leave out -D__XOP__.

As Peter Cordes pointed out in his answer, if your goal is only to avoid 256-bit loads and stores, then you may still want VEX-encoded instructions (though it's not clear this will make a difference except in some special cases). You can do that with the vector class like this:

Vec8f a;
Vec4f lo = a.get_low();  // a is a Vec8f type
Vec4f hi = a.get_high();
lo.store(&b[0]);         // b is a float array
hi.store(&b[4]);

Then compile with /arch:AVX -D__XOP__.

Another option would be a single source file that uses a Vecnf type selected at compile time:

//foo.cpp
#include "vectorclass.h"
#if SIMDWIDTH == 4
typedef Vec4f Vecnf;
#else
typedef Vec8f Vecnf;
#endif  

and compile it like this:

cl /O2 /DSIMDWIDTH=4                     foo.cpp /Fofoo_sse
cl /O2 /DSIMDWIDTH=4 /arch:AVX /D__XOP__ foo.cpp /Fofoo_avx128
cl /O2 /DSIMDWIDTH=8 /arch:AVX           foo.cpp /Fofoo_avx256

This would create three executables from one source file. Instead of linking each one separately, you could compile them with /c and then write a CPU dispatcher that picks among them. I used XOP with avx128 because I don't think there is a good reason to use avx128 except on AMD.
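
A dispatcher along these lines might look roughly like the sketch below. It rests on several assumptions that are not in the answer: the three builds are represented by hypothetical placeholder functions (kernel_sse, kernel_avx128, kernel_avx256) that in a real project would be the same code from foo.cpp compiled three times under distinct symbol names, and the CpuFeatures struct is filled in by the caller. Real detection would query CPUID (for example via __cpuid from <intrin.h> in Visual Studio) and also confirm that the operating system saves the AVX register state; that part is omitted here.

```cpp
#include <cstddef>

// Signature shared by every build of the kernel (hypothetical).
using KernelFn = void (*)(const float* in, float* out, std::size_t n);

// Placeholder bodies standing in for the three object files;
// each simply doubles its input so the sketch is runnable.
void kernel_sse(const float* in, float* out, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) out[i] = in[i] * 2.0f;
}
void kernel_avx128(const float* in, float* out, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) out[i] = in[i] * 2.0f;
}
void kernel_avx256(const float* in, float* out, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) out[i] = in[i] * 2.0f;
}

// Feature flags supplied by the caller (assumed to come from CPUID).
struct CpuFeatures {
    bool avx;
    bool xop;
};

enum class Build { Sse, Avx128Xop, Avx256 };

// AVX plus XOP is treated as "AMD, prefer 128-bit VEX code";
// AVX alone selects the 256-bit build; otherwise fall back to SSE.
inline Build select_build(CpuFeatures f) {
    if (f.avx && f.xop) return Build::Avx128Xop;
    if (f.avx) return Build::Avx256;
    return Build::Sse;
}

// Map the chosen build to its function pointer.
inline KernelFn kernel_for(Build b) {
    switch (b) {
        case Build::Avx128Xop: return kernel_avx128;
        case Build::Avx256:    return kernel_avx256;
        default:               return kernel_sse;
    }
}
```

The selection would normally run once at startup, with the resulting function pointer cached so the hot path pays only an indirect call.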
