Building backward compatible binaries with newer CPU instructions support
Problem description
What is the best way to implement multiple versions of the same function, so that it uses specific CPU instructions if they are available (tested at run time) and falls back to a slower implementation if not?
For example, x86 BMI2 provides a very useful PDEP instruction. How would I write C code that tests the executing CPU for BMI2 availability at startup and then uses one of two implementations: one that calls _pdep_u64 (available with -mbmi2), and another that does the bit manipulation "by hand" in C? Is there any built-in support for such cases? How would I make GCC compile for an older architecture while still providing access to the newer intrinsic? I suspect execution is faster if the function is invoked through a global function pointer rather than an if/else on every call?
You can declare a function pointer and, at program startup, point it to the correct version by calling cpuid to determine which features the current CPU supports.
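The function pointer approach could be sketched roughly as follows. This assumes GCC or Clang, and uses their __builtin_cpu_supports builtin instead of issuing cpuid by hand. The names pdep_generic, pdep_bmi2, pdep_impl, pdep_init and pdep are made up for illustration, and the x86-64 guards are only there so the file still builds on other targets.

```c
#include <stdint.h>

#if defined(__x86_64__)
#include <immintrin.h>
#endif

/* Portable fallback: deposit the low bits of src into the positions of
   the set bits of mask, from least to most significant. */
static uint64_t pdep_generic(uint64_t src, uint64_t mask)
{
    uint64_t result = 0;
    for (uint64_t bit = 1; mask != 0; bit <<= 1) {
        uint64_t lowest = mask & -mask;   /* isolate lowest set bit of mask */
        if (src & bit)
            result |= lowest;
        mask &= mask - 1;                 /* clear that bit */
    }
    return result;
}

#if defined(__x86_64__)
/* The target attribute enables BMI2 for this one function only, so the
   rest of the file can still be compiled for an older baseline. */
__attribute__((target("bmi2")))
static uint64_t pdep_bmi2(uint64_t src, uint64_t mask)
{
    return _pdep_u64(src, mask);
}
#endif

/* Global function pointer; starts out pointing at the safe fallback. */
static uint64_t (*pdep_impl)(uint64_t, uint64_t) = pdep_generic;

/* Call once at startup to pick the implementation for this CPU. */
void pdep_init(void)
{
#if defined(__x86_64__)
    __builtin_cpu_init();
    if (__builtin_cpu_supports("bmi2"))
        pdep_impl = pdep_bmi2;
#endif
}

uint64_t pdep(uint64_t src, uint64_t mask)
{
    return pdep_impl(src, mask);
}
```

Because the pointer defaults to the fallback, forgetting to call pdep_init() costs speed but not correctness.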
But it's better to use the support that many modern compilers provide. Intel's ICC has long had automatic function dispatching that selects the optimized version for each architecture. I don't know the details, but it looks like it only applies to Intel's libraries. Besides, it only dispatches to the efficient version on Intel CPUs, and is therefore unfair to other manufacturers. There are many patches and workarounds for that in Agner's CPU blog.
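Later, a feature called Function Multiversioning was introduced in GCC 4.8. It adds the target attribute, which you declare on each version of your function. GCC documents this form for its C++ front end, and a version marked "default" is needed as the fallback: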
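// C++ with GCC function multiversioning
__attribute__ ((target ("default")))
int foo() { return 0; }   // fallback when no other version matches

__attribute__ ((target ("sse4.2")))
int foo() { return 1; }

__attribute__ ((target ("arch=atom")))
int foo() { return 2; }

int main() {
    int (*p)() = &foo;    // taking the address also goes through the dispatcher
    return foo() + p();
}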
That duplicates a lot of code and is cumbersome, so GCC 6 added the target_clones attribute, which tells GCC to compile one function into multiple clones. For example, __attribute__((target_clones("avx2","arch=atom","default"))) void foo() {} creates 3 different versions of foo. More information about both attributes can be found in GCC's documentation on function attributes.
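As a rough illustration of target_clones on a function with a real body, here is a small sketch in C. The function sum_bytes and the MULTIVERSIONED macro are invented for this example. The guard assumes x86-64 Linux with glibc, since GCC implements the clones with GNU indirect functions; on other platforms the macro expands to nothing and only one version is built.

```c
#include <stddef.h>
#include <stdint.h>

/* Only request clones where the mechanism is assumed to be available
   (x86-64, Linux, glibc, GCC-compatible compiler). */
#if defined(__x86_64__) && defined(__linux__) && defined(__GLIBC__) && defined(__GNUC__)
#define MULTIVERSIONED __attribute__((target_clones("avx2", "default")))
#else
#define MULTIVERSIONED
#endif

/* One source definition; the compiler emits an AVX2 clone and a default
   clone, plus a resolver that picks one when the program is loaded. */
MULTIVERSIONED
uint64_t sum_bytes(const uint8_t *data, size_t len)
{
    uint64_t total = 0;
    for (size_t i = 0; i < len; i++)
        total += data[i];
    return total;
}
```

Callers just call sum_bytes() as usual. Whether the AVX2 clone is actually faster depends on the optimization level and on whether the compiler vectorizes the loop.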
The syntax was later adopted by Clang and ICC. Performance can even be better than with a global function pointer, because the function symbols can be resolved by the dynamic loader at process load time (GCC uses GNU indirect functions, ifunc, for this) instead of being checked at run time. It's one of the reasons Intel's Clear Linux runs so fast. ICC may also create multiple versions of a single loop during auto-vectorization.
- Function multi-versioning in GCC 6
- Function Multi-Versioning
- The - surprisingly limited - usefulness of function multiversioning in GCC
- Generate code for multiple SIMD architectures
Here's an example from "The one with multi-versioning (Part II)", along with its demo. It is about popcnt, but you get the idea:
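#include <stdint.h>

// popcount64_builtin_multiarch_loop() is a helper from the linked article;
// its definition is not reproduced here.
__attribute__((target_clones("popcnt","default")))
int runPopcount64_builtin_multiarch_loop(const uint8_t* bitfield, int64_t size, int repeat) {
    int res = 0;
    const uint64_t* data = (const uint64_t*)bitfield;

    for (int r = 0; r < repeat; r++)
        for (int64_t i = 0; i < size / 8; i++) {   // size is in bytes, data is 64-bit words
            res += popcount64_builtin_multiarch_loop(data[i]);
        }

    return res;
}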
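Note that PDEP and PEXT are very slow on AMD CPUs older than Zen 3, where they are implemented in microcode, so on those CPUs the fallback is the better choice even though BMI2 is reported as available. The simplest safe rule is to enable them only on Intel.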
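A dispatch check that follows this note could look something like the sketch below. It assumes GCC or Clang on x86-64 and uses their __builtin_cpu_supports and __builtin_cpu_is builtins. The helper names pdep_policy and should_use_hw_pdep are hypothetical, and the rule is deliberately the conservative "Intel only" one from the note; a finer rule keyed on the AMD core generation is not shown.

```c
#include <stdbool.h>

/* Pure policy function (hypothetical): use the hardware instruction only
   when BMI2 is present and the vendor is Intel. */
static bool pdep_policy(bool has_bmi2, bool is_intel)
{
    return has_bmi2 && is_intel;
}

/* Queries the running CPU and applies the policy. On anything other than
   x86-64 with a GCC-compatible compiler it always picks the fallback. */
bool should_use_hw_pdep(void)
{
#if defined(__x86_64__) && defined(__GNUC__)
    __builtin_cpu_init();
    bool has_bmi2 = __builtin_cpu_supports("bmi2") != 0;
    bool is_intel = __builtin_cpu_is("intel") != 0;
    return pdep_policy(has_bmi2, is_intel);
#else
    return false;
#endif
}
```

The result would then be used once at startup, for example to decide which implementation a global function pointer should point to.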