汇编语言 (x86):如何创建循环来计算斐波那契数列 [英] Assembly Language (x86): How to create a loop to calculate Fibonacci sequence

查看:48
本文介绍了汇编语言 (x86):如何创建循环来计算斐波那契数列的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我正在使用 Visual Studio 2013 Ultimate 在 MASM 中编程汇编语言 (x86).我正在尝试使用数组来计算使用数组的 n 个元素的斐波那契数列.换句话说,我试图转到一个数组元素,获取它之前的两个元素,将它们相加,并将结果存储在另一个数组中.

I am programming assembly language (x86) in MASM using Visual Studio 2013 Ultimate. I am trying to use an array to calculate a Fibonacci sequence for n elements using an array. In other words, I am trying to go to an array element, obtain the two elements before it, add those up, and store the result in another array.

我在设置索引寄存器以使其正常工作时遇到问题.

I am having trouble setting up the index registers to make this work.

我的程序设置如下:

TITLE fibonacci.asm

INCLUDE Irvine32.inc

.data
    fibInitial  BYTE 0, 1, 2, 3, 4, 5, 6
    fibComputed BYTE 5 DUP(0)

.code
main PROC

    MOVZX si, fibInitial
    MOVZX di, fibComputed
    MOV   cl, LENGTHOF fibInitial

L1:
    MOV   ax, [si - 1]
    MOV   dx, [si - 2]
    MOV   bp, ax + dx
    MOV   dl, TYPE fibInitial
    MOVZX si, dl
    MOV   [edi], bp
    MOV   dh, TYPE fibComputed
    MOVZX di, dl
    loop L1

exit
main ENDP
END main

我无法编译此代码,因为 MOV ebp, ax + dx 行出现错误 A2031:必须是索引或基址寄存器"的错误消息.但是,我确定我忽略了其他逻辑错误.

I cannot compile this because of an error message that says "error A2031: must be index or base register" for the line MOV ebp, ax + dx. However, I'm certain that there are other logic errors I am overlooking.

推荐答案

相关:Code-golf打印Fib(10**9)的前1000位数字:我的 x86 asm 答案 使用扩展精度 adc 循环,并将二进制转换为字符串.内环针对速度进行了优化,其他部分针对尺寸进行了优化.

related: Code-golf print the first 1000 digits of Fib(10**9): my x86 asm answer using an extended-precision adc loop, and converting binary to strings. The inner loop is optimized for speed, other parts for size.

计算斐波那契数列只需要保持两部分状态:当前元素和前一个元素.我不知道你想用 fibInitial 做什么,除了计算它的长度.这不是你在 for $n (0..5) 中做的 perl.

Computing a Fibonacci sequence only requires keeping two pieces of state: the current and previous element. I have no idea what you're trying to do with fibInitial, other than counting its length. This isn't perl where you do for $n (0..5).

我知道你只是在学习 asm,但我还是要谈谈性能.没有太多理由学习 asm 而不知道什么是快速的,什么不是.如果您不需要性能,让编译器从 C 源代码为您制作 asm.另请参阅 https://stackoverflow.com/tags/x86/info

I know you're just learning asm at all, but I'm still going to talk about performance. There's not much reason to learn asm without knowing what's fast and what's not. If you don't need performance, let a compiler make the asm for you, from C sources. Also see the other links at https://stackoverflow.com/tags/x86/info

为您的状态使用寄存器简化了在计算 a[1] 时需要查看 a[-1] 的问题.您以 curr=1prev=0 开头,然后以 a[0] = curr 开头.生产现代"斐波那契数的从零开始的序列,以 curr=0 开头代码>, prev=1.

Using registers for your state simplifies the problem of needing to look at a[-1] while calculating a[1]. You start with curr=1, prev=0, and start with a[0] = curr. To produce the "modern" starting-with-zero sequence of Fibonacci numbers, start with curr=0, prev=1.

幸运的是,我最近正在考虑斐波那契代码的高效循环,所以我花时间写了一个完整的函数.请参阅下面的展开和矢量化版本(节省存储指令,但即使在为 32 位 CPU 编译时也使 64 位整数更快):

Lucky for you, I was just thinking about an efficient loop for fibonacci code recently, so I took the time to write up a complete function. See below for an unrolled and a vectorized version (saves on store instructions, but also makes 64bit ints fast even when compiling for a 32bit CPU):

; fib.asm
;void fib(int32_t *dest, uint32_t count);
; not-unrolled version.  See below for a version which avoids all the mov instructions
global fib
fib:
    ; 64bit SysV register-call ABI:
    ; args: rdi: output buffer pointer.  esi: count  (and you can assume the upper32 are zeroed, so using rsi is safe)

    ;; locals:  rsi: endp
    ;; eax: current   edx: prev
    ;; ecx: tmp
    ;; all of these are caller-saved in the SysV ABI, like r8-r11
    ;; so we can use them without push/pop to save/restore them.
    ;; The Windows ABI is different.

    test   esi, esi       ; test a reg against itself instead of cmp esi, 0
    jz     .early_out     ; count == 0.  

    mov    eax, 1         ; current = 1
    xor    edx, edx       ; prev    = 0

    lea    rsi, [rdi + rsi * 4]  ; endp = &out[count];  // loop-end pointer
    ;; lea is very useful for combining add, shift, and non-destructive operation
    ;; this is equivalent to shl rsi, 4  /  add rsi, rdi

align 16
.loop:                    ; do {
    mov    [rdi], eax     ;   *buf = current
    add    rdi, 4         ;   buf++

    lea    ecx, [rax + rdx] ; tmp = curr+prev = next_cur
    mov    edx,  eax      ; prev = curr
    mov    eax,  ecx      ; curr=tmp
 ;; see below for an unrolled version that doesn't need any reg->reg mov instructions

    ; you might think this would be faster:
    ; add  edx, eax    ; but it isn't
    ; xchg eax, edx    ; This is as slow as 3 mov instructions, but we only needed 2 thanks to using lea

    cmp    rdi, rsi       ; } while(buf < endp);
    jb    .loop           ; jump if (rdi BELOW rsi).  unsigned compare
    ;; the LOOP instruction is very slow, avoid it

.early_out:
    ret

替代循环条件可能是

    dec     esi         ; often you'd use ecx for counts, but we had it in esi
    jnz     .loop

AMD CPU 可以融合 cmp/branch,但不能融合 dec/branch.Intel CPU 还可以宏熔断器dec/jnz.(或符号小于零/大于零).dec/inc 不更新进位标志,因此您不能将它们与上/下未签名的 ja/jb 一起使用.我认为这个想法是你可以在循环中做一个 adc(加进位),使用 inc/dec 作为循环计数器来不干扰进位标志,但是部分标志变慢使现代 CPU 变得很糟糕.

AMD CPUs can fuse cmp/branch, but not dec/branch. Intel CPUs can also macro-fuse dec/jnz. (Or signed less than zero / greater than zero). dec/inc don't update the Carry flag, so you can't use them with above/below unsigned ja/jb. I think the idea is that you could do an adc (add with carry) in a loop, using inc/dec for the loop counter to not disturb the carry flag, but partial-flags slowdowns make this bad on modern CPUs.

lea ecx, [eax + edx] 需要一个额外的字节(地址大小前缀),这就是我使用 32 位 dest 和 64 位地址的原因.(这些是 64 位模式下 lea 的默认操作数大小).对速度没有直接影响,只是通过代码大小间接影响.

lea ecx, [eax + edx] needs an extra byte (address-size prefix), which is why I used a 32bit dest and a 64bit address. (Those are the default operand sizes for lea in 64bit mode). No direct impact on speed, just indirect through code size.

另一个循环体可以是:

    mov  ecx, eax      ; tmp=curr.  This stays true after every iteration
.loop:

    mov  [rdi], ecx
    add  ecx, edx      ; tmp+=prev  ;; shorter encoding than lea
    mov  edx, eax      ; prev=curr
    mov  eax, ecx      ; curr=tmp

展开循环以进行更多迭代将意味着更少的改组.而不是 mov 指令,您只需跟踪哪个寄存器保存哪个变量.即您通过某种寄存器重命名来处理分配.

Unrolling the loop to do more iterations would mean less shuffling. Instead of mov instructions, you just keep track of which register is holding which variable. i.e. you handle assignments with a sort of register renaming.

.loop:     ;; on entry:       ; curr:eax  prev:edx
    mov  [rdi], eax             ; store curr
    add  edx, eax             ; curr:edx  prev:eax
.oddentry:
    mov  [rdi + 4], edx         ; store curr
    add  eax, edx             ; curr:eax  prev:edx

    ;; we're back to our starting state, so we can loop
    add  rdi, 8
    cmp  rdi, rsi
    jb   .loop

展开的问题是您需要清理剩余的任何奇怪的迭代.二次展开因子的幂可以使清理循环稍微容易一些,但添加 12 并不比添加 16 快.(请参阅本文的先前修订版,了解使用 lea 在第三个寄存器中生成 curr + prev,因为我没有意识到你实际上并不需要临时值.感谢 rcgldr 捕捉到.)

The thing with unrolling is that you need to clean up any odd iterations that are left over. Power-of-two unroll factors can make the cleanup loop slightly easier, but adding 12 isn't any faster than adding 16. (See the previous revision of this post for a silly unroll-by-3 version using lea to produce curr + prev in a 3rd register, because I failed to realize that you don't actually need a temp. Thanks to rcgldr for catching that.)

有关处理任何计数的完整工作展开版本,请参见下文.

See below for a complete working unrolled version which handles any count.

测试前端(此版本中的新功能:用于检测写入缓冲区末尾的 asm 错误的金丝雀元素.)

Test frontend (new in this version: a canary element to detect asm bugs writing past the end of the buffer.)

// fib-main.c
#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>

void fib(uint32_t *buf, uint32_t count);

int main(int argc, const char *argv[]) {
    uint32_t count = 15;
    if (argc > 1) {
        count = atoi(argv[1]);
    }
    uint32_t buf[count+1]; // allocated on the stack
    // Fib overflows uint32 at count = 48, so it's not like a lot of space is useful

    buf[count] = 0xdeadbeefUL;
    // uint32_t count = sizeof(buf)/sizeof(buf[0]);
    fib(buf, count);
    for (uint32_t i ; i < count ; i++){
        printf("%u ", buf[i]);
    }
    putchar('
');

    if (buf[count] != 0xdeadbeefUL) {
        printf("fib wrote past the end of buf: sentinel = %x
", buf[count]);
    }
}


此代码完全有效并经过测试(除非我错过了将本地文件中的更改复制回答案中>.<):


This code is fully working and tested (unless I missed copying a change in my local file back into the answer >.<):

peter@tesla:~/src/SO$ yasm -f elf64 fib.asm && gcc -std=gnu11 -g -Og fib-main.c fib.o
peter@tesla:~/src/SO$ ./a.out 48
1 1 2 3 5 8 13 21 34 55 89 144 233 377 610 987 1597 2584 4181 6765 10946 17711 28657 46368 75025 121393 196418 317811 514229 832040 1346269 2178309 3524578 5702887 9227465 14930352 24157817 39088169 63245986 102334155 165580141 267914296 433494437 701408733 1134903170 1836311903 2971215073 512559680 


展开版本

再次感谢 rcgldr 让我思考如何在循环设置中处理奇数与偶数,而不是在最后进行清理迭代.


unrolled version

Thanks again to rcgldr for getting me thinking about how to handle odd vs. even count in the loop setup, rather than with a cleanup iteration at the end.

我选择了无分支设置代码,它将 4 * count%2 添加到起始指针.这可以是零,但添加零比查看我们是否应该使用分支更便宜.斐波那契数列会很快溢出寄存器,因此保持序言代码的紧凑和高效很重要,而不仅仅是循环内的代码.(如果我们要进行优化,我们会希望针对许多长度较短的调用进行优化).

I went for branchless setup code, which adds 4 * count%2 to the starting pointer. That can be zero, but adding zero is cheaper than branching to see if we should or not. The Fibonacci sequence overflows a register very quickly, so keeping the prologue code tight and efficient is important, not just the code inside the loop. (If we're optimizing at all, we'd want to optimize for many calls with short length).

    ; 64bit SysV register-call ABI
    ; args: rdi: output buffer pointer.  rsi: count

    ;; locals:  rsi: endp
    ;; eax: current   edx: prev
    ;; ecx: tmp
    ;; all of these are caller-saved in the SysV ABI, like r8-r11
    ;; so we can use them without push/pop to save/restore them.
    ;; The Windows ABI is different.

;void fib(int32_t *dest, uint32_t count);  // unrolled version
global fib
fib:
    cmp    esi, 1
    jb     .early_out       ; count below 1  (i.e. count==0, since it's unsigned)

    mov    eax, 1           ; current = 1
    mov    [rdi], eax
    je     .early_out       ; count == 1, flags still set from cmp
    ;; need this 2nd early-out because the loop always does 2 iterations

;;; branchless handling of odd counts:
;;;   always do buf[0]=1, then start the loop from 0 or 1
;;; Writing to an address you just wrote to is very cheap
;;; mov/lea is about as cheap as best-case for branching (correctly-predicted test/jcc for count%2==0)
;;; and saves probably one unconditional jump that would be needed either in the odd or even branch

    mov    edx, esi         ;; we could save this mov by using esi for prev, and loading the end pointer into a different reg
    and    edx, eax         ; prev = count & 1 = count%2

    lea    rsi, [rdi + rsi*4] ; end pointer: same regardless of starting at 0 or 1

    lea    rdi, [rdi + rdx*4] ; buf += count%2
    ;; even count: loop starts at buf[0], with curr=1, prev=0
    ;; odd  count: loop starts at buf[1], with curr=1, prev=1

align 16  ;; the rest of this func is just *slightly* longer than 16B, so there's a lot of padding.  Tempting to omit this alignment for CPUs with a loop buffer.
.loop:                      ;; do {
    mov    [rdi], eax       ;;   *buf = current
             ; on loop entry: curr:eax  prev:edx
    add   edx, eax          ; curr:edx  prev:eax

;.oddentry: ; unused, we used a branchless sequence to handle odd counts
    mov   [rdi+4], edx
    add   eax, edx          ; curr:eax  prev:edx
                            ;; back to our starting arrangement
    add    rdi, 8           ;;   buf++
    cmp    rdi, rsi         ;; } while(buf < endp);
    jb    .loop

;   dec   esi   ;  set up for this version with sub esi, edx; instead of lea
;   jnz   .loop
.early_out:
    ret

要生成从零开始的序列,请执行

To produce the starting-with-zero sequence, do

curr=count&1;   // and esi, 1
buf += curr;    // lea [rdi], [rdi + rsi*4]
prev= 1 ^ curr; // xor eax, esi

代替当前

curr = 1;
prev = count & 1;
buf += count & 1;

我们还可以通过使用esi 来保存prev 来保存两个版本中的mov 指令,现在prev 取决于 count.

We can also save a mov instruction in both versions by using esi to hold prev, now that prev depends on count.

  ;; loop prologue for sequence starting with 1 1 2 3
  ;; (using different regs and optimized for size by using fewer immediates)
    mov    eax, 1               ; current = 1
    cmp    esi, eax
    jb     .early_out           ; count below 1
    mov    [rdi], eax
    je     .early_out           ; count == 1, flags still set from cmp

    lea    rdx, [rdi + rsi*4]   ; endp
    and    esi, eax             ; prev = count & 1
    lea    rdi, [rdi + rsi*4]   ; buf += count & 1
  ;; eax:curr esi:prev    rdx:endp  rdi:buf
  ;; end of old code

  ;; loop prologue for sequence starting with 0 1 1 2
    cmp    esi, 1
    jb     .early_out           ; count below 1, no stores
    mov    [rdi], 0             ; store first element
    je     .early_out           ; count == 1, flags still set from cmp

    lea    rdx, [rdi + rsi*4]   ; endp
    mov    eax, 1               ; prev = 1
    and    esi, eax             ; curr = count&1
    lea    rdi, [rdi + rsi*4]   ; buf += count&1
    xor    eax, esi             ; prev = 1^curr
    ;; ESI:curr EAX:prev  (opposite of other setup)
  ;;


  ;; optimized for code size, NOT speed.  Prob. could be smaller, esp. if we want to keep the loop start aligned, and jump between before and after it.
  ;; most of the savings are from avoiding mov reg, imm32,
  ;; and from counting down the loop counter, instead of checking an end-pointer.
  ;; loop prologue for sequence starting with 0 1 1 2
    xor    edx, edx
    cmp    esi, 1
    jb     .early_out         ; count below 1, no stores
    mov    [rdi], edx         ; store first element
    je     .early_out         ; count == 1, flags still set from cmp

    xor    eax, eax  ; movzx after setcc would be faster, but one more byte
    shr    esi, 1             ; two counts per iteration, divide by two
  ;; shift sets CF = the last bit shifted out
    setc   al                 ; curr =   count&1
    setnc  dl                 ; prev = !(count&1)

    lea    rdi, [rdi + rax*4] ; buf+= count&1

  ;; extra uop or partial register stall internally when reading eax after writing al, on Intel (except P4 & silvermont)
  ;; EAX:curr EDX:prev  (same as 1 1 2 setup)
  ;; even count: loop starts at buf[0], with curr=0, prev=1
  ;; odd  count: loop starts at buf[1], with curr=1, prev=0

  .loop:
       ...
    dec  esi                  ; 1B smaller than 64b cmp, needs count/2 in esi
    jnz .loop
  .early_out:
    ret


矢量化:

斐波那契数列不是特别可并行化的.没有简单的方法可以从 F(i) 和 F(i-4) 或类似的东西中得到 F(i+4) .我们可以用向量做的是减少内存中的存储.开始:


vectorized:

The Fibonacci sequence isn't particularly parallelizable. There's no simple way to get F(i+4) from F(i) and F(i-4), or anything like that. What we can do with vectors is fewer stores to memory. Start with:

a = [f3 f2 f1 f0 ]   -> store this to buf
b = [f2 f1 f0 f-1]

然后 a+=b;b+=a;a+=b;b+=a; 产生:

a = [f7 f6 f5 f4 ]   -> store this to buf
b = [f6 f5 f4 f3 ]

使用两个 64 位整数打包成一个 128 位向量时,这不那么愚蠢.即使在 32 位代码中,您也可以使用 SSE 进行 64 位整数数学运算.

This is less silly when working with two 64bit ints packed into a 128b vector. Even in 32bit code, you can use SSE to do 64bit integer math.

此答案的先前版本有一个未完成的打包 32 位矢量版本,无法正确处理 count%4 != 0.为了加载序列的前 4 个值,我使用了 pmovzxbd,因此当我只能使用 4B 时,我不需要 16B 的数据.将序列的第一个 -1 .. 1 值放入向量寄存器要容易得多,因为只有一个非零值可供加载和混洗.

A previous version of this answer has an unfinished packed-32bit vector version that doesn't properly handle count%4 != 0. To load the first 4 values of the sequence, I used pmovzxbd so I didn't need 16B of data when I could use only 4B. Getting the first -1 .. 1 values of the sequence into vector registers is a lot easier, because there's only one non-zero value to load and shuffle around.

;void fib64_sse(uint64_t *dest, uint32_t count);
; using SSE for fewer but larger stores, and for 64bit integers even in 32bit mode
global fib64_sse
fib64_sse:
    mov eax, 1
    movd    xmm1, eax               ; xmm1 = [0 1] = [f0 f-1]
    pshufd  xmm0, xmm1, 11001111b   ; xmm0 = [1 0] = [f1 f0]

    sub esi, 2
    jae .entry  ; make the common case faster with fewer branches
    ;; could put the handling for count==0 and count==1 right here, with its own ret

    jmp .cleanup
align 16
.loop:                          ; do {
    paddq   xmm0, xmm1          ; xmm0 = [ f3 f2 ]
.entry:
    ;; xmm1: [ f0 f-1 ]         ; on initial entry, count already decremented by 2
    ;; xmm0: [ f1 f0  ]
    paddq   xmm1, xmm0          ; xmm1 = [ f4 f3 ]  (or [ f2 f1 ] on first iter)
    movdqu  [rdi], xmm0         ; store 2nd last compute result, ready for cleanup of odd count
        add     rdi, 16         ;   buf += 2
    sub esi, 2
        jae   .loop             ; } while((count-=2) >= 0);
    .cleanup:
    ;; esi <= 0 : -2 on the count=0 special case, otherwise -1 or 0

    ;; xmm1: [ f_rc   f_rc-1 ]  ; rc = count Rounded down to even: count & ~1
    ;; xmm0: [ f_rc+1 f_rc   ]  ; f(rc+1) is the value we need to store if count was odd
    cmp esi, -1
    jne   .out  ; this could be a test on the Parity flag, with no extra cmp, if we wanted to be really hard to read and need a big comment explaining the logic
    ;; xmm1 = [f1 f0]
    movhps  [rdi], xmm1         ; store the high 64b of xmm0.  There is no integer version of this insn, but that doesn't matter
    .out:
        ret

没有必要进一步展开这一点,dep 链延迟限制了吞吐量,因此我们总是可以平均每个周期存储一个元素.减少 uops 中的循环开销有助于超线程,但这非常小.

No point unrolling this further, the dep chain latency limits throughput, so we can always average storing one element per cycle. Reducing the loop overhead in uops can help for hyperthreading, but that's pretty minor.

如您所见,即使在两个展开时处理所有极端情况也很难跟踪.它需要额外的启动开销,即使您正在尝试对其进行优化以将其保持在最低水平.很容易以很多条件分支结束.

As you can see, handling all the corner cases even when unrolling by two is quite complex to keep track of. It requires extra startup overhead, even when you're trying to optimize that to keep it to a minimum. It's easy to end up with a lot of conditional branches.

更新主要内容:

#include <stdio.h>
#include <stdint.h>
#include <inttypes.h>
#include <stdlib.h>

#ifdef USE32
void fib(uint32_t *buf, uint32_t count);
typedef uint32_t buftype_t;
#define FMTx PRIx32
#define FMTu PRIu32
#define FIB_FN fib
#define CANARY 0xdeadbeefUL
#else
void fib64_sse(uint64_t *buf, uint32_t count);
typedef uint64_t buftype_t;
#define FMTx PRIx64
#define FMTu PRIu64
#define FIB_FN fib64_sse
#define CANARY 0xdeadbeefdeadc0deULL
#endif

#define xstr(s) str(s)
#define str(s) #s

int main(int argc, const char *argv[]) {
    uint32_t count = 15;
    if (argc > 1) {
        count = atoi(argv[1]);
    }
    int benchmark = argc > 2;

    buftype_t buf[count+1]; // allocated on the stack
    // Fib overflows uint32 at count = 48, so it's not like a lot of space is useful

    buf[count] = CANARY;
    // uint32_t count = sizeof(buf)/sizeof(buf[0]);
    if (benchmark) {
    int64_t reps = 1000000000 / count;
    for (int i=0 ; i<=reps ; i++)
        FIB_FN(buf, count);

    } else {
    FIB_FN(buf, count);
    for (uint32_t i ; i < count ; i++){
        printf("%" FMTu " ", buf[i]);
    }
    putchar('
');
    }
    if (buf[count] != CANARY) {
        printf(xstr(FIB_FN) " wrote past the end of buf: sentinel = %" FMTx "
", buf[count]);
    }
}


性能

对于略低于 8192 的计数,在我的 Sandybridge i5 上,由两个展开的非向量版本的运行接近其理论最大吞吐量,即每个周期 1 个存储(每个周期 3.5 条指令).8192 * 4B/int = 32768 = L1 缓存大小.在实践中,我看到 ~3.3 到 ~3.4 insn/周期.不过,我正在用 Linux perf 计算整个程序,而不仅仅是紧密循环.


Performance

For count just below 8192, the unrolled-by-two non-vector version runs near its theoretical-max throughput of 1 store per cycle (3.5 instructions per cycle), on my Sandybridge i5. 8192 * 4B/int = 32768 = L1 cache size. In practice, I see ~3.3 to ~3.4 insn / cycle. I'm counting the entire program with Linux perf, though, not just the tight loop.

无论如何,进一步展开真的没有任何意义.很明显,在 count=47 之后这不再是斐波那契数列,因为我们使用了 uint32_t.然而,对于大count,吞吐量受限于内存带宽,低至~2.6 insn/cycle.在这一点上,我们基本上是在研究如何优化 memset.

Anyway, there's not really any point unrolling further. And obviously this stopped being a Fibonacci sequence after count=47, since we use uint32_t. However, for large count, the throughput is limited by memory bandwidth, down to ~2.6 insn / cycle. At this point we're basically looking at how to optimize memset.

64 位向量版本以每个周期 3 个 insns(每两个时钟一个 128b 存储)运行,阵列大小约为 L2 缓存大小的 1.5 倍.(即 ./fib64 49152).随着阵列大小上升到 L3 缓存大小的更大部分,性能下降到每周期约 2 insn(每 3 个时钟存储一个),L3 缓存大小的 3/4.它在大小 > 时每 6 个周期平衡到 1 个存储.三级缓存.

The 64bit vector version runs at 3 insns per cycle (one 128b store per two clocks) up to an array size of about 1.5 times L2 cache size. (i.e. ./fib64 49152). As the array size goes up to larger fractions of L3 cache size, performance decreases down to ~2 insn per cycle (one store per 3 clocks) at 3/4 of L3 cache size. It levels out to 1 store per 6 cycles at sizes > L3 cache.

所以当我们适合 L2 而不是 L1 缓存时,使用向量存储会更好.

So storing with vectors does better when we fit in L2 but not L1 cache.

这篇关于汇编语言 (x86):如何创建循环来计算斐波那契数列的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆