Building backward compatible binaries with newer CPU instructions support


Problem description

    What is the best way to implement multiple versions of the same function, one that uses specific CPU instructions if they are available (tested at run time) and one that falls back to a slower implementation if not?

    For example, x86 BMI2 provides a very useful PDEP instruction. How would I write C code that tests the executing CPU for BMI2 availability on startup and uses one of two implementations: one that calls _pdep_u64 (available with -mbmi2), and another that does the bit manipulation "by hand" in C? Is there any built-in support for such cases? How would I make GCC compile for an older arch while still providing access to the newer intrinsic? I suspect execution is faster if the function is invoked via a global function pointer rather than through an if/else on every call?

    Solution

    You can declare a function pointer and, at program startup, point it to the correct version by calling cpuid to determine which features the current CPU supports.
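As a rough sketch of that function-pointer approach, assuming GCC or Clang on x86-64: the names pdep_generic, pdep_bmi2, pdep and pdep_init are made up for illustration, and the GCC builtin __builtin_cpu_supports is used here in place of a hand-written cpuid call. The target("bmi2") attribute lets one function use _pdep_u64 while the rest of the file is still compiled for the baseline architecture.

```c
#include <stdint.h>

#if defined(__x86_64__) && (defined(__GNUC__) || defined(__clang__))
#include <immintrin.h>
#define HAVE_X86_DISPATCH 1   /* hypothetical macro, only used in this sketch */
#endif

/* Portable fallback: deposit the low bits of src at the set-bit positions of mask. */
static uint64_t pdep_generic(uint64_t src, uint64_t mask)
{
    uint64_t result = 0;
    for (uint64_t bit = 1; mask != 0; bit <<= 1) {
        uint64_t lowest = mask & (~mask + 1);  /* isolate the lowest set bit of mask */
        if (src & bit)
            result |= lowest;
        mask &= mask - 1;                      /* clear that bit and move on */
    }
    return result;
}

#ifdef HAVE_X86_DISPATCH
/* Only this function is compiled with BMI2 enabled. */
static __attribute__((target("bmi2"))) uint64_t pdep_bmi2(uint64_t src, uint64_t mask)
{
    return _pdep_u64(src, mask);
}
#endif

typedef uint64_t (*pdep_fn)(uint64_t, uint64_t);

/* Global function pointer; safe to call even before pdep_init() runs. */
static pdep_fn pdep = pdep_generic;

/* Call once at startup, before any hot loop uses pdep(). */
static void pdep_init(void)
{
#ifdef HAVE_X86_DISPATCH
    __builtin_cpu_init();
    if (__builtin_cpu_supports("bmi2"))
        pdep = pdep_bmi2;
#endif
}
```

After pdep_init() has run, callers simply write pdep(src, mask); the cost per call is one indirect call rather than a feature test.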

    But it's better to use the support that many modern compilers provide. Intel's ICC has long had automatic function dispatching, which selects the optimized version for each architecture. I don't know the details, but it looks like it only applies to Intel's libraries. Besides, it only dispatches to the efficient version on Intel CPUs, and so is unfair to other manufacturers. There are many patches and workarounds for that in Agner's CPU blog.

    Later, a feature called Function Multiversioning was introduced in GCC 4.8. It adds the target attribute, which you declare on each version of your function. Defining several versions under one name like this works in GCC's C++ front end, not in plain C:

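    // C++ only: GCC picks one of these versions at run time.
    __attribute__ ((target ("default")))
    int foo() { return 0; }   // fallback; GCC rejects calls if no default version exists
    
    __attribute__ ((target ("sse4.2")))
    int foo() { return 1; }
    
    __attribute__ ((target ("arch=atom")))
    int foo() { return 2; }
    
    int main() {
        int (*p)() = &foo;    // the pointer goes through the same dispatcher
        return foo() + p();
    }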
    

    That duplicates a lot of code and is cumbersome, so GCC 6 added target_clones, which tells GCC to compile a single function into multiple clones. For example, __attribute__((target_clones("avx2","arch=atom","default"))) void foo() {} creates three different versions of foo. More information can be found in GCC's documentation on function attributes.

    The syntax was then adopted by Clang and ICC. Performance can even be better than with a global function pointer, because the function symbols can be resolved at process load time instead of at run time. It's one of the reasons Intel's Clear Linux runs so fast. ICC may also create multiple versions of a single loop during auto-vectorization.
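Load-time resolution can also be written out by hand. The sketch below assumes GCC or Clang on x86-64 with glibc and uses the GNU ifunc attribute, which the answer does not name explicitly; popcount64, popcount64_generic, popcount64_hw and resolve_popcount64 are illustrative names. The dynamic loader calls the resolver once and binds the symbol to whichever implementation it returns, so later calls need no feature check.

```c
#include <stdint.h>

/* Portable fallback: clear the lowest set bit until none remain. */
static int popcount64_generic(uint64_t x)
{
    int n = 0;
    while (x) {
        x &= x - 1;
        n++;
    }
    return n;
}

#if defined(__x86_64__) && defined(__GLIBC__) && defined(__GNUC__)

/* Compiled with POPCNT enabled, so the builtin can become a single instruction. */
static __attribute__((target("popcnt"))) int popcount64_hw(uint64_t x)
{
    return __builtin_popcountll(x);
}

/* Resolver: invoked by the dynamic loader while it binds popcount64. */
static int (*resolve_popcount64(void))(uint64_t)
{
    __builtin_cpu_init();   /* must run before __builtin_cpu_supports in a resolver */
    return __builtin_cpu_supports("popcnt") ? popcount64_hw : popcount64_generic;
}

/* The public symbol; its address is whatever the resolver returned. */
int popcount64(uint64_t x) __attribute__((ifunc("resolve_popcount64")));

#else

/* Elsewhere (assumption: no ifunc support), fall back to the portable version. */
int popcount64(uint64_t x)
{
    return popcount64_generic(x);
}

#endif
```

This is roughly what target_clones generates automatically; writing it manually is mainly useful when the versions differ in source code, as with an intrinsic-based PDEP versus a hand-written loop.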

    Here's an example from "The one with multi-versioning (Part II)", along with its demo. It is about popcnt, but you get the idea:

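    #include <stdint.h>
    
    // Helper called by the loop; its body is not shown in the article, so this
    // definition is an assumption. It is inlined into each clone.
    static inline int popcount64_builtin_multiarch_loop(uint64_t x) {
        return __builtin_popcountll(x);
    }
    
    __attribute__((target_clones("popcnt","default")))
    int runPopcount64_builtin_multiarch_loop(const uint8_t* bitfield, int64_t size, int repeat) {
        int res = 0;
        const uint64_t* data = (const uint64_t*)bitfield;
    
        // size is in bytes; each iteration counts one 64-bit word
        for (int r=0; r<repeat; r++)
        for (int i=0; i<size/8; i++) {
            res += popcount64_builtin_multiarch_loop(data[i]);
        }
    
        return res;
    }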
    

    Note that PDEP and PEXT are very slow on AMD CPUs older than Zen 3, where they are implemented in microcode, so the BMI2 path should only be enabled on Intel or on newer AMD parts.
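One conservative way to act on that note is to gate the fast path on the CPU vendor as well as on the feature bit. This is only a sketch, assuming GCC or Clang on x86-64; use_hw_pdep is a hypothetical helper name, and the policy of accepting only Intel is a simplification that also excludes AMD parts where PDEP is fast.

```c
/* Returns 1 if the hardware PDEP/PEXT path should be used, 0 otherwise. */
static int use_hw_pdep(void)
{
#if defined(__x86_64__) && (defined(__GNUC__) || defined(__clang__))
    __builtin_cpu_init();
    /* Require both the BMI2 feature bit and an Intel CPU. */
    return __builtin_cpu_supports("bmi2") && __builtin_cpu_is("intel");
#else
    return 0;   /* not x86-64, or no CPU-detection builtins available */
#endif
}
```

A startup routine would call this once and select the intrinsic-based or the hand-written implementation accordingly.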
