为什么__int128_t在x86-64 GCC上快于long long? [英] Why is __int128_t faster than long long on x86-64 GCC?

查看:165
本文介绍了为什么__int128_t在x86-64 GCC上快于long long?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

这是我的测试代码:

#include <chrono>
#include <iostream>
#include <cstdlib>
using namespace std;

using ll = long long;

int main()
{
    __int128_t a, b;
    ll x, y;

    a = rand() + 10000000;
    b = rand() % 50000;
    auto t0 = chrono::steady_clock::now();
    for (int i = 0; i < 100000000; i++)
    {
        a += b;
        a /= b;
        b *= a;
        b -= a;
        a %= b;
    }
    cout << chrono::duration_cast<chrono::milliseconds>(chrono::steady_clock::now() - t0).count() << ' '
         << (ll)a % 100000 << '\n';

    x = rand() + 10000000;
    y = rand() % 50000;
    t0 = chrono::steady_clock::now();
    for (int i = 0; i < 100000000; i++)
    {
        x += y;
        x /= y;
        y *= x;
        y -= x;
        x %= y;
    }
    cout << chrono::duration_cast<chrono::milliseconds>(chrono::steady_clock::now() - t0).count() << ' '
         << (ll)x % 100000 << '\n';

    return 0;
}

这是测试结果:

$ g++ main.cpp -o main -O2
$ ./main
2432 1
2627 1

在x64 GNU / Linux上使用GCC 10.1.0,无论是使用-O2优化还是未优化, __ int128_t 总是比 long long 快一点。

Using GCC 10.1.0 on x64 GNU/Linux, no matter if it's using optimization of -O2 or un-optimized, __int128_t is always a little faster than long long.

int double 都比 long long 快得多; long long 已成为最慢的类型。

int and double are both significantly faster than long long; long long has become the slowest type.

这是怎么回事?

推荐答案

性能差异来自在这种情况下,使用GCC / Clang 的128位除法/模量的效率

实际上,在我的系统以及 godbolt 上, sizeof(long long)= 8 sizeof(__ int128_t)= 16 。因此,对前者的操作由本机指令执行,而对后者则不执行(因为我们专注于64位平台)。 __ int128_t 的加,减和减慢。但是,内置函数可对16字节类型(x86 GCC / Clang上的 __ divti3 __ modti3 )进行除法/模运算)比本机 idiv 指令(这是非常慢的,至少在Intel处理器上是很慢的)要快得多。

Indeed, on my system as well as on godbolt, sizeof(long long) = 8 and sizeof(__int128_t) = 16. Thus operation on the former are performed by native instruction while not the latter (since we focus on 64 bit platforms). Additions, multiplications and subtractions are slower with __int128_t. But, built-in functions for divisions/modulus on 16-byte types (__divti3 and __modti3 on x86 GCC/Clang) are surprisingly faster than the native idiv instruction (which is pretty slow, at least on Intel processors).

如果我们深入研究在实现GCC / Clang内置函数(此处仅用于 __ int128_t )中,我们可以看到 __ modti3 使用条件(在调用 __ udivmodti4 )。 英特尔处理器可以更快地执行代码,原因是:

If we look deeper in the implementation of GCC/Clang built-in functions (used only for __int128_t here), we can see that __modti3 uses conditionals (when calling __udivmodti4). Intel processors can execute the code faster because:


  • 在这种情况下,分支可以很好地预测因为它们始终是相同的(并且还因为循环执行了数百万次);

  • 除法/模数被拆分为更快的本机指令,这些指令通常可以并行执行通过多个CPU端口(并且可以从乱序执行中受益)。 div 指令在大多数可能的路径中(特别是在这种情况下)仍使用

  • 执行 div / idiv 指令的时间由于其高延迟<。由于循环依赖性,不能并行执行 div / idiv 指令。但是, div 延迟时间低于 idiv 延迟,从而使前者更快。 / li>
  • taken branches can be well predicted in this case since they are always the same (and also because the loop is executed millions of times);
  • the division/modulus is split in faster native instructions that can mostly be executed in parallel by multiple CPU ports (and that benefit from out-of-order execution). A div instruction is still used in most possible paths (especially in this case);
  • The execution time of the div/idiv instructions covers most of the overall execution time because of their very high latencies. The div/idiv instructions cannot be executed in parallel because of the loop dependencies. However, the latency of a div lower than a idiv making the former faster.

请注意,两种实现的性能可能从一种架构到另一种架构有很大差异( (因为CPU端口数量,分支预测功能和 idiv 指令的延迟/吞吐量))。
实际上, 64位 idiv 的延迟指令在Skylake上需要41-95个周期,而在AMD Ryzen处理器上则需要8-41个周期。 div 的延迟分别在Skylake上约为6-89个周期,而在Ryzen上仍然相同。这意味着基准性能结果在Ryzen处理器上应该有显着差异(由于在128位情况下的额外指令/分支成本,可能会看到相反的效果)。

Please note that the performance of the two implementations can highly differ from one architecture to another (because of the number of CPU ports, the branch prediction capability and the latency/througput of the idiv instruction). Indeed, the latency of a 64-bit idiv instruction takes 41-95 cycles on Skylake while it takes 8-41 cycles on AMD Ryzen processors for example. Respectively the latency of a div is about 6-89 cycles on Skylake and still the same on Ryzen. This means that the benchmark performance results should be significantly different on Ryzen processors (the opposite effect may be seen due to the additional instructions/branch costs in the 128-bits case).

这篇关于为什么__int128_t在x86-64 GCC上快于long long?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆