gcc (6.1.0) using 'wrong' instructions in SSE intrinsics


Problem description

Background: I develop a computationally intensive tool, written in C/C++, that has to run on a variety of different x86_64 processors. To speed up the calculations, which are both floating-point and integer, the code contains rather a lot of SSE* intrinsics, with different paths tailored to different CPU SSE capabilities. (As the CPU flags are detected at the start of the program and used to set Booleans, I've assumed that branch prediction for the tailored blocks of code will work very effectively.)

For simplicity I've assumed only SSE2 through to SSE4.2 need to be considered.

In order to access SSE4.2 intrinsics for the 4.2 paths, I need to use gcc's -msse4.2 option.

The problem: The issue I'm having is that, at least with 6.1.0, gcc implements the SSE2 intrinsic _mm_cvtsi32_si128 with the instruction pinsrd, which is not an SSE2 instruction (strictly, it was introduced with SSE4.1, which -msse4.2 also enables).

If I limit the compilation by using -msse2, it uses the SSE2 instruction movd, i.e. the one that the Intel intrinsics guide says it's supposed to use.

This is annoying on two counts.

1) The critical problem is that the program now crashes with an illegal instruction when it is run on a pre-SSE4.2 CPU. I don't have control over what hardware is used, so the executable needs to be compatible with older machines, yet needs to take advantage of features on newer hardware where available.

2) According to the Intel intrinsics guide, the pinsrd instruction is quite a lot slower than the movd it replaces. (pinsrd is more general, but that generality is not needed here.)

Does anyone know how to make gcc just use the instructions that the intrinsics guide says should be used yet still allow access to all SSE2 through SSE4* in the same compilation unit?

Update: I should also note that the same code is compiled under Linux, Windows and OSX using a variety of different compilers, so I would rather avoid compiler-specific extensions, or at least keep them to a minimum, if possible.

Update2: (Thanks to @PeterCordes) It seems that if optimisation is enabled, gcc goes back to using movd instead of pinsrd where appropriate.

Solution

If you pass the -msse4.2 flag on gcc's command line during a compilation step, it will assume that it is free to use anything up to the SSE4.2 instruction set for the entire translation unit. This can lead to the behavior that you described. If you need code that only uses SSE2 and below, then you must compile it with -msse2 (or with no flag at all if you're building for x86_64, where SSE2 is part of the baseline).
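A quick way to see the effect of the flag, offered as a sketch rather than a definitive test, is to put the intrinsic in a small function and compare the assembly gcc emits under each setting. The function name low_lane_roundtrip and the file name probe.cpp are made up for illustration.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics

// Hypothetical probe: moves an int into the low lane of an XMM register
// and reads it back. Both intrinsics are SSE2, so this builds on any x86_64.
int low_lane_roundtrip(int x) {
    __m128i v = _mm_cvtsi32_si128(x);  // low lane = x, upper lanes zeroed
    return _mm_cvtsi128_si32(v);       // read the low lane back
}

// Compare the generated code (probe.cpp is an assumed file name):
//   g++ -S -O0 -msse2   probe.cpp -o probe_sse2.s
//   g++ -S -O0 -msse4.2 probe.cpp -o probe_sse42.s
//   g++ -S -O2 -msse4.2 probe.cpp -o probe_sse42_O2.s
```

According to the question, gcc 6.1.0 picked pinsrd with -msse4.2 when optimisation was off and went back to movd with optimisation on; other versions may choose differently, so it is worth checking the output for the compiler actually in use.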

Some options that I can think of are:

  • If you can easily break down your code at the function level, then gcc's multiversioning feature can help. It requires a relatively recent version of the compiler, but it allows you to do things like this (taken from the link above):

     #include <cassert>

     __attribute__ ((target ("default")))
     int foo ()
     {
       // The default version of foo.
       return 0;
     }
    
     __attribute__ ((target ("sse4.2")))
     int foo ()
     {
       // foo version for SSE4.2
       return 1;
     }
    
     __attribute__ ((target ("arch=atom")))
     int foo ()
     {
       // foo version for the Intel ATOM processor
       return 2;
     }
    
     __attribute__ ((target ("arch=amdfam10")))
     int foo ()
     {
       // foo version for the AMD Family 0x10 processors.
       return 3;
     }
    
     int main ()
     {
       int (*p)() = &foo;
       assert ((*p) () == foo ());
       return 0;
     }
    

    In this example, gcc will automatically compile the different versions of foo() and dispatch to the appropriate one at runtime based on the CPU's capabilities.

  • You can break the different implementations (SSE2, SSE4.2, etc.) into different translation units, then dispatch appropriately to the right implementation at runtime.

  • You can put all of the SIMD code into a shared library and build the shared library multiple times with different compiler flags. Then at runtime, you can detect the CPU's capabilities and load the appropriate version of the shared library. This is the approach taken by libraries like Intel's Math Kernel Library.
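The runtime dispatch mentioned in the second option could look roughly like the sketch below. It is only an illustration under stated assumptions: the function names (max4_sse2, max4_sse41, select_max4) are hypothetical, and the target attribute is used here only so that the example fits in one file. In the separate-translation-unit layout, each implementation would instead live in its own file compiled with -msse2 or -msse4.1 and would need no attribute. The CPU check uses __builtin_cpu_supports, which is a GCC/Clang builtin, so other compilers would need their own CPUID query.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics
#include <smmintrin.h>  // SSE4.1 intrinsics
#include <cstdint>

// Baseline path: element-wise max of four int32 lanes using SSE2 only.
// In the multi-file layout this would sit in a file built with -msse2.
void max4_sse2(const std::int32_t* a, const std::int32_t* b, std::int32_t* out) {
    __m128i va = _mm_loadu_si128(reinterpret_cast<const __m128i*>(a));
    __m128i vb = _mm_loadu_si128(reinterpret_cast<const __m128i*>(b));
    __m128i gt = _mm_cmpgt_epi32(va, vb);  // all-ones in lanes where a > b
    // Select a where a > b, otherwise b.
    __m128i r = _mm_or_si128(_mm_and_si128(gt, va), _mm_andnot_si128(gt, vb));
    _mm_storeu_si128(reinterpret_cast<__m128i*>(out), r);
}

// SSE4.1 path: the same operation with a single packed-max intrinsic.
// Assumption: the attribute stands in for a separate file built with -msse4.1.
__attribute__((target("sse4.1")))
void max4_sse41(const std::int32_t* a, const std::int32_t* b, std::int32_t* out) {
    __m128i va = _mm_loadu_si128(reinterpret_cast<const __m128i*>(a));
    __m128i vb = _mm_loadu_si128(reinterpret_cast<const __m128i*>(b));
    _mm_storeu_si128(reinterpret_cast<__m128i*>(out), _mm_max_epi32(va, vb));
}

using max4_fn = void (*)(const std::int32_t*, const std::int32_t*, std::int32_t*);

// Choose an implementation based on what the running CPU reports.
max4_fn select_max4() {
    return __builtin_cpu_supports("sse4.1") ? max4_sse41 : max4_sse2;
}
```

A caller would typically store the result of select_max4() in a function pointer once at startup and call through it in the hot loop, so the feature check is not repeated on every call and the SSE4.1 code is never reached on a CPU that lacks it.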
