When is assembly faster than C?


Question


One of the stated reasons for knowing assembler is that, on occasion, it can be employed to write code that will be more performant than writing that code in a higher-level language, C in particular. However, I've also heard it stated many times that although that's not entirely false, the cases where assembler can actually be used to generate more performant code are both extremely rare and require expert knowledge of and experience with assembly.

This question doesn't even get into the fact that assembler instructions will be machine-specific and non-portable, or any of the other aspects of assembler. There are plenty of good reasons for knowing assembly besides this one, of course, but this is meant to be a specific question soliciting examples and data, not an extended discourse on assembler versus higher-level languages.

Can anyone provide some specific examples of cases where assembly will be faster than well-written C code using a modern compiler, and can you support that claim with profiling evidence? I am pretty confident these cases exist, but I really want to know exactly how esoteric these cases are, since it seems to be a point of some contention.

Solution

Here is a real-world example: fixed-point multiplies on old compilers.

These don't only come in handy on devices without floating point; they also shine when it comes to precision, as they give you 32 bits of precision with a predictable error (float only has 23 bits, and it's harder to predict precision loss). That is, uniform absolute precision over the entire range, instead of the close-to-uniform relative precision of float.
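
For concreteness: the code below uses a Q16.16 layout, 16 integer bits and 16 fraction bits in a 32-bit word. A minimal sketch of that representation (the type and helper names are mine, for illustration, not from the answer):

#include <stdint.h>

// Q16.16 fixed point: the value represented by raw bits f is f / 65536.0.
typedef int32_t fix16_t;        // hypothetical name, for illustration only
#define FIX16_ONE (1 << 16)     // 1.0 in Q16.16

static inline fix16_t fix16_from_double(double d) { return (fix16_t)(d * FIX16_ONE); }
static inline double  fix16_to_double(fix16_t f)  { return f / (double)FIX16_ONE; }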


Modern compilers optimize this fixed-point example nicely, so for more modern examples that still need compiler-specific code, see:

  • Getting the high part of 64 bit integer multiplication: a portable version using uint64_t for 32x32 => 64-bit multiplies fails to optimize on a 64-bit CPU, so efficient code on 64-bit systems needs intrinsics or __int128.
  • _umul128 on Windows 32 bits: MSVC doesn't always do a good job when multiplying 32-bit integers cast to 64-bit, so the intrinsic helps a lot.

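To illustrate the first bullet, here is a minimal sketch of a 64x64 => 128-bit high-half multiply, assuming GNU C's unsigned __int128 extension; on MSVC x64 the _umul128 intrinsic plays the same role:

#include <stdint.h>

// High 64 bits of a 64x64-bit unsigned multiply (GNU C).
// Compilers lower this to a single widening mul on x86-64.
uint64_t mulhi64(uint64_t a, uint64_t b)
{
    return (uint64_t)(((unsigned __int128)a * b) >> 64);
}

#ifdef _MSC_VER
#include <intrin.h>
// MSVC x64 equivalent via the _umul128 intrinsic.
uint64_t mulhi64_msvc(uint64_t a, uint64_t b)
{
    uint64_t hi;
    _umul128(a, b, &hi);   // returns the low half, writes the high half
    return hi;
}
#endif
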
C doesn't have a full-multiplication operator (2N-bit result from N-bit inputs). The usual way to express it in C is to cast the inputs to the wider type and hope the compiler recognizes that the upper bits of the inputs aren't interesting:

// on a 32-bit machine, int can hold 32-bit fixed-point integers.
int inline FixedPointMul (int a, int b)
{
  long long a_long = a; // cast to 64 bit.

  long long product = a_long * b; // perform multiplication

  return (int) (product >> 16);  // shift by the fixed point bias
}

The problem with this code is that we do something that can't be directly expressed in the C language. We want to multiply two 32-bit numbers and get a 64-bit result, of which we return the middle 32 bits. However, in C this multiply does not exist. All you can do is promote the integers to 64 bit and do a 64*64 = 64 multiply.

x86 (and ARM, MIPS and others) can, however, do the multiply in a single instruction. Some compilers used to ignore this fact and generate code that calls a runtime library function to do the multiply. The shift by 16 is also often done by a library routine (even though x86 can do such shifts directly).

So we're left with one or two library calls just for a multiply. This has serious consequences: not only is the shift slower, but registers must be preserved across the function calls, and it does not help inlining or code-unrolling either.

If you rewrite the same code in (inline) assembler you can gain a significant speed boost.
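
For illustration, a sketch of such a rewrite using GNU C inline assembly for 32-bit x86 (the answer doesn't show its own asm; this is my reconstruction under that assumption):

// One widening imul plus one shrd replaces the 64x64 library multiply:
// imull leaves the full 64-bit product in edx:eax, and shrdl shifts the
// middle 32 bits of that pair down into eax.
static inline int FixedPointMul_asm(int a, int b)
{
    int hi, lo;
    __asm__ ("imull %3\n\t"
             "shrdl $16, %%edx, %%eax"
             : "=a" (lo), "=d" (hi)
             : "a" (a), "r" (b)
             : "cc");
    return lo;
}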

In addition to this: using ASM is not the best way to solve the problem. Most compilers allow you to use some assembler instructions in intrinsic form if you can't express them in C. The VS.NET2008 compiler for example exposes the 32*32=64 bit mul as __emul and the 64 bit shift as __ll_rshift.

Using intrinsics you can rewrite the function in a way that the C compiler has a chance to understand what's going on. This allows the code to be inlined and register-allocated; common subexpression elimination and constant propagation can be done as well. You'll get a huge performance improvement over the hand-written assembler code that way.

For reference: The end-result for the fixed-point mul for the VS.NET compiler is:

int inline FixedPointMul (int a, int b)
{
    return (int) __ll_rshift(__emul(a,b),16);
}

The performance difference of fixed point divides is even bigger. I had improvements up to factor 10 for division heavy fixed point code by writing a couple of asm-lines.
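
The answer doesn't show its division code; as a minimal sketch of the shape of the problem (my reconstruction, assuming the same Q16.16 format), the portable C version widens the dividend, which old compilers turn into a 64/64 library call (e.g. __divdi3) even though x86's idiv can do the 64/32 => 32 case in one instruction:

// Q16.16 fixed-point divide: pre-scale the dividend so the quotient
// keeps its 16 fraction bits, then divide.
static inline int FixedPointDiv(int a, int b)
{
    long long a_wide = (long long)a << 16;  // widen and pre-scale
    return (int)(a_wide / b);               // 64/32 divide; often a library call
}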


Using Visual C++ 2013 gives the same assembly code for both ways.

gcc4.1 from 2007 also optimizes the pure C version nicely. (The Godbolt compiler explorer doesn't have any earlier versions of gcc installed, but presumably even older GCC versions could do this without intrinsics.)

See source + asm for x86 (32-bit) and ARM on the Godbolt compiler explorer. (Unfortunately it doesn't have any compilers old enough to produce bad code from the simple pure C version.)


Modern CPUs can do things C doesn't have operators for at all, like popcnt or bit-scan to find the first or last set bit. (POSIX has a ffs() function, but its semantics don't match x86 bsf / bsr. See https://en.wikipedia.org/wiki/Find_first_set).
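
For illustration, the GNU C spellings of those bit-scan operations (not from the answer itself); note the mismatch just mentioned: ffs() is 1-based and defined for an input of 0, while __builtin_ctz mirrors x86 bsf and is undefined for 0:

// __builtin_ctz(x): count of trailing zeros = index of the lowest set bit;
// compiles to bsf (or tzcnt) on x86. Like bsf, it's undefined for x == 0.
// __builtin_clz(x): count of leading zeros; related to bsr the same way.
unsigned index_of_lowest_set_bit(unsigned x)   // caller must ensure x != 0
{
    return (unsigned)__builtin_ctz(x);
}

unsigned index_of_highest_set_bit(unsigned x)  // caller must ensure x != 0
{
    return 31u - (unsigned)__builtin_clz(x);
}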

Some compilers can sometimes recognize a loop that counts the number of set bits in an integer and compile it to a popcnt instruction (if enabled at compile time), but it's much more reliable to use __builtin_popcount in GNU C, or on x86 if you're only targeting hardware with SSE4.2: _mm_popcnt_u32 from <immintrin.h>.

Or in C++, assign to a std::bitset<32> and use .count(). (This is a case where the language has found a way to portably expose an optimized implementation of popcount through the standard library, in a way that will always compile to something correct, and can take advantage of whatever the target supports.) See also https://en.wikipedia.org/wiki/Hamming_weight#Language_support.
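
A sketch of the C options just described (the std::bitset variant is C++ and omitted here). The loop is the pattern a compiler may or may not recognize; the builtin and the intrinsic are the reliable spellings:

#include <immintrin.h>   // _mm_popcnt_u32; compile with SSE4.2/POPCNT enabled (e.g. -msse4.2)

// The kind of loop some compilers can turn into a single popcnt:
unsigned popcount_loop(unsigned x)
{
    unsigned n = 0;
    while (x) { x &= x - 1; n++; }   // clears the lowest set bit each iteration
    return n;
}

unsigned popcount_builtin(unsigned x) { return (unsigned)__builtin_popcount(x); }  // GNU C
unsigned popcount_sse42(unsigned x)   { return (unsigned)_mm_popcnt_u32(x); }      // x86 SSE4.2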

Similarly, ntohl can compile to bswap (x86 32-bit byte swap for endian conversion) on some C implementations that have it.
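
The same pattern-matching applies here: a sketch of the portable shift/mask idiom that modern GCC and clang recognize and compile to a single bswap on x86 (GNU C also has __builtin_bswap32 for spelling it directly):

#include <stdint.h>

// Portable 32-bit byte swap; modern compilers compile this to one bswap.
uint32_t swap32(uint32_t x)
{
    return  (x >> 24)
          | ((x >> 8) & 0x0000FF00u)
          | ((x << 8) & 0x00FF0000u)
          |  (x << 24);
}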


Another major area for intrinsics or hand-written asm is manual vectorization with SIMD instructions. Compilers are not bad with simple loops like dst[i] += src[i] * 10.0;, but often do badly or don't auto-vectorize at all when things get more complicated. For example, you're unlikely to get anything like "How to implement atoi using SIMD?" generated automatically by the compiler from scalar code.
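
For the simple loop mentioned above, a sketch of what manual vectorization with SSE2 intrinsics looks like (assumes n is a multiple of 2 and the pointers are 16-byte aligned; a real version needs a scalar tail and alignment handling):

#include <emmintrin.h>   // SSE2
#include <stddef.h>

// Hand-vectorized equivalent of: for (i = 0; i < n; i++) dst[i] += src[i] * 10.0;
void scale_accumulate(double *dst, const double *src, size_t n)
{
    const __m128d ten = _mm_set1_pd(10.0);
    for (size_t i = 0; i < n; i += 2) {             // two doubles per 128-bit vector
        __m128d s = _mm_load_pd(src + i);           // aligned load
        __m128d d = _mm_load_pd(dst + i);
        _mm_store_pd(dst + i, _mm_add_pd(d, _mm_mul_pd(s, ten)));
    }
}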
