Why is uint_least16_t faster than uint_fast16_t for multiplication in x86_64?


Question

The C standard is quite unclear about the uint_fast*_t family of types. On a gcc-4.4.4 Linux x86_64 system, the types uint_fast16_t and uint_fast32_t are both 8 bytes in size. However, multiplication of 8-byte numbers seems to be noticeably slower than multiplication of 4-byte numbers. The following piece of code demonstrates that:

#include <stdio.h>
#include <stdint.h>
#include <inttypes.h>

int
main ()
{
  uint_least16_t p, x;
  int count;

  p = 1;
  for (count = 100000; count != 0; --count)
    for (x = 1; x != 50000; ++x)
      p*= x;

  printf("%"PRIuLEAST16, p);
  return 0;
}

Running the time command on the program, I get:

real 0m7.606s
user 0m7.557s
sys  0m0.019s

If I change the type to uint_fast16_t (and the printf modifier to match), the timing becomes:

real 0m12.609s
user 0m12.593s
sys  0m0.009s

So, would it not be much better if the stdint.h header defined uint_fast16_t (and also uint_fast32_t) to be a 4-byte type?

Recommended answer

AFAIK compilers only define their own versions of the (u)int_(fast/least)XX_t types if these are not already defined by the system. That is because it is very important that these types are defined identically across all libraries and binaries on a single system. Otherwise, if different compilers defined those types differently, a library built with CompilerA might have a different uint_fast32_t type than a binary built with CompilerB, yet this binary could still link against the library. There is no formal standard requirement that all executable code on a system be built by the same compiler (on some systems, e.g. Windows, it is actually rather common for code to have been compiled by all kinds of different compilers). If this binary then calls a function of the library, things will break!

So the question is: is it really GCC that defines uint_fast16_t here, or is it actually Linux (I mean the kernel) or perhaps the standard C library (glibc in most cases)? If Linux or glibc defines these types, a GCC built on that system has no choice but to adopt whatever conventions they have established. The same is true for all the other variable-width types: char, short, int, long, long long. All of them have only a minimum guaranteed bit width in the C standard (for int it is actually 16 bits, so on platforms where int is 32 bits it is already much bigger than the standard requires).

Other than that, I actually wonder what is wrong with your CPU/compiler/system. On my system, 64-bit multiplication is just as fast as 32-bit multiplication. I modified your code to test 16, 32, and 64 bit:

#include <time.h>
#include <stdio.h>
#include <inttypes.h>

#define RUNS 100000

#define TEST(type)                                  \
    static type test ## type ()                     \
    {                                               \
        int count;                                  \
        type p, x;                                  \
                                                    \
        p = 1;                                      \
        for (count = RUNS; count != 0; count--) {   \
            for (x = 1; x != 50000; x++) {          \
                p *= x;                             \
            }                                       \
        }                                           \
        return p;                                   \
    }

TEST(uint16_t)
TEST(uint32_t)
TEST(uint64_t)

/* Convert a clock_t tick count to seconds (uses the macro argument). */
#define CLOCK_TO_SEC(ticks) ((double)(ticks) / CLOCKS_PER_SEC)

#define RUN_TEST(type)                             \
    {                                              \
        clock_t clockTime;                         \
        unsigned long long result;                 \
                                                   \
        clockTime = clock();                       \
        result = test ## type ();                  \
        clockTime = clock() - clockTime;           \
        printf("Test %s took %2.4f s. (%llu)\n",   \
            #type, CLOCK_TO_SEC(clockTime), result \
        );                                         \
    }

int main ()
{
    RUN_TEST(uint16_t)
    RUN_TEST(uint32_t)
    RUN_TEST(uint64_t)
    return 0;
}

Using unoptimized code (-O0), I get:

Test uint16_t took 13.6286 s. (0)
Test uint32_t took 12.5881 s. (0)
Test uint64_t took 12.6006 s. (0)

Using optimized code (-O3), I get:

Test uint16_t took 13.6385 s. (0)
Test uint32_t took 4.5455 s. (0)
Test uint64_t took 4.5382 s. (0)

The second output is quite interesting. @R.. wrote in a comment above:

"On x86_64, 32-bit arithmetic should never be slower than 64-bit arithmetic, period."

The second output shows that the same cannot be said of 16-bit versus 32-bit arithmetic. 16-bit arithmetic can be significantly slower on a 32/64-bit CPU, even though my x86 CPU can perform 16-bit arithmetic natively (unlike some other CPUs, a PPC for example, that can only perform 32-bit arithmetic). However, on my CPU this only seems to apply to multiplication: when I change the code to do addition, subtraction, or division, there is no longer a significant difference between 16 and 32 bit.

The results above are from an Intel Core i7 (2.66 GHz). If anyone is interested, I can also run this benchmark on an Intel Core 2 Duo (one CPU generation older) and on a Motorola PowerPC G4.
