Which is faster, imm64 or m64 for x86-64?

Problem description

After about 10 billion test runs on AMD64, m64 seems to be about 0.1 nanoseconds faster than imm64, but I don't really understand why. Isn't the address of val_ptr in the following code an immediate value itself?

# Text section
.section __TEXT,__text,regular,pure_instructions
# 64-bit code
.code64
# Intel syntax
.intel_syntax noprefix
# Target macOS High Sierra
.macosx_version_min 10,13,0

# Make those two test functions global for the C measurer
.globl _test1
.globl _test2

# Test 1, imm64
_test1:
  # Move the immediate value 0xDEADBEEFFEEDFACE to RAX (return value)
  movabs rax, 0xDEADBEEFFEEDFACE
  ret
# Test 2, m64
_test2:
  # Move from the RAM (val_ptr) to RAX (return value)
  mov rax, qword ptr [rip + val_ptr]
  ret
# Data section
.section __DATA,__data
val_ptr:
  .quad 0xDEADBEEFFEEDFACE

The measuring code is:

#include <stdio.h>            // For printf
#include <stdlib.h>           // For EXIT_SUCCESS
#include <math.h>             // For fabs
#include <stdint.h>           // For uint64_t
#include <stddef.h>           // For size_t
#include <string.h>           // For memset
#include <mach/mach_time.h>   // For time stuff

#define FUNCTION_COUNT  2     // Number of functions to test
#define TEST_COUNT      0x10000000  // Number of times to test each function

// Type aliases
typedef uint64_t rettype_t;
typedef rettype_t(*function_t)();

// External test functions (defined in Assembly)
rettype_t test1();
rettype_t test2();

// Program entry point
int main() {

  // Time measurement stuff
  mach_timebase_info_data_t info;
  mach_timebase_info(&info);

  // Sums to divide by the test count to get average
  double sums[FUNCTION_COUNT];

  // Initialize sums to 0
  memset(&sums, 0, FUNCTION_COUNT * sizeof (double));

  // Functions to test
  function_t functions[FUNCTION_COUNT] = {test1, test2};

  // Useless results (should be 0xDEADBEEFFEEDFACE), but good to have
  rettype_t results[FUNCTION_COUNT];

  // Function loop, may get unrolled based on optimization level
  for (size_t test_fn = 0; test_fn < FUNCTION_COUNT; test_fn++) {
    // Test this MANY times
    for (size_t test_num = 0; test_num < TEST_COUNT; test_num++) {
      // Get the nanoseconds before the action
      double nanoseconds = mach_absolute_time();
      // Do the action
      results[test_fn] = functions[test_fn]();
      // Measure the time it took
      nanoseconds = mach_absolute_time() - nanoseconds;

      // Convert it to nanoseconds
      nanoseconds *= info.numer;
      nanoseconds /= info.denom;

      // Add the nanosecond count to the sum
      sums[test_fn] += nanoseconds;
    }
  }
  // Compute the average
  for (size_t i = 0; i < FUNCTION_COUNT; i++) {
    sums[i] /= TEST_COUNT;
  }

  if (FUNCTION_COUNT == 2) {
    // Print some fancy information
    printf("Test 1 took %f nanoseconds average.\n", sums[0]);
    printf("Test 2 took %f nanoseconds average.\n", sums[1]);
    printf("Test %d was faster, with %f nanoseconds difference\n", sums[0] < sums[1] ? 1 : 2, fabs(sums[0] - sums[1]));
  } else {
    // Else, just print something
    for (size_t fn_i = 0; fn_i < FUNCTION_COUNT; fn_i++) {
      printf("Test %zu took %f clock ticks average.\n", fn_i + 1, sums[fn_i]);
    }
  }

  // Everything went fine!
  return EXIT_SUCCESS;
}

So, which one is faster, m64 or imm64?

By the way, I'm using Intel Core i7 Ivy Bridge and DDR3 RAM. I'm running macOS High Sierra.

EDIT: I inserted the ret instructions, and now imm64 turned out to be faster.

Answer

You don't show the actual loop you tested with, or say anything about how you measured time. Apparently you measured wall-clock time, not core clock cycles (with performance counters). So your sources of measurement noise include turbo / power-saving as well as sharing a physical core with another logical thread (on an i7).

On Intel IvyBridge:

movabs rax, 0xDEADBEEFFEEDFACE is an ALU instruction:
  • Take 10 bytes of code-size (which might or might not matter depending on surrounding code).
  • Decodes to 1 uop for any ALU port (p0, p1, or p5). (max throughput = 3 per clock)
  • Takes 2 entries in the uop cache (because of the 64-bit immediate), and takes 2 cycles to read from the uop cache. (So running from the loop buffer is a significant advantage for front-end throughput, if that's the bottleneck in code containing this).

mov rax, [RIP + val_ptr] is a load:
  • takes 7 bytes (REX + opcode + modrm + rel32)
  • decodes to 1 uop for either load port (p2 or p3). (max throughput = 2 per clock)
  • fits in 1 entry in the uop cache (no immediate, and a 32-bit or smaller address offset).
  • runs a lot slower if the load is split across a page boundary, even on Skylake.
  • can miss in cache the first time.

Source: Agner Fog's microarch pdf and instruction tables. See Table 9.1 for uop-cache stuff. See also other performance links in the x86 tag wiki.

Compilers usually choose to generate 64-bit constants with a mov r64, imm64. (Related: What are the best instruction sequences to generate vector constants on the fly?, but in practice those never come up for scalar integer because there's no short single-instruction way to get a 64-bit -1.)

That's generally the right choice, although in a long-running loop where you expect the constant to stay hot in cache it could be a win to load it from .rodata. Especially if that lets you do something like and rax, [constant] instead of movabs r8, imm64 / and rax, r8.

If your 64-bit constant is an address, use a RIP-relative lea instead, if possible. lea rax, [rel my_symbol] in NASM syntax, lea my_symbol(%rip), %rax in AT&T.

The surrounding code matters a lot when considering tiny sequences of asm, especially when they compete for different throughput resources.
