How can I create a spectre gadget in practice?

Views: 106 · Published: 2020/9/12 22:34:39 · Tags: caching, assembly, x86, spectre


Problem description


I'm developing (NASM + GCC targeting ELF64) a PoC that uses a spectre gadget that measures the time to access a set of cache lines (FLUSH+RELOAD).

How can I make a reliable spectre gadget?

I believe I understand the theory behind the FLUSH+RELOAD technique; in practice, however, even allowing for some noise, I'm unable to produce a working PoC.

Since I'm using the timestamp counter and the loads are very regular, I use this script to disable the prefetchers and turbo boost, and to fix the CPU frequency:

#!/bin/bash

sudo modprobe msr

#Disable turbo
sudo wrmsr -a 0x1a0 0x4000850089

#Disable prefetchers
sudo wrmsr -a 0x1a4 0xf

#Set performance governor
sudo cpupower frequency-set -g performance

#Minimum freq
sudo cpupower frequency-set -d 2.2GHz

#Maximum freq
sudo cpupower frequency-set -u 2.2GHz
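
The two wrmsr values in the script can be sanity-checked bit by bit. The Python sketch below assumes the bit layout Intel documents for many Core-family processors: bit 38 of MSR 0x1a0 as the turbo disable flag, and bits 0 to 3 of MSR 0x1a4 as the four prefetcher disable flags. These assignments are model-specific, so check them against the manual for your CPU. The function and table names are made up for illustration.

```python
# Assumed layout: bit 38 of IA32_MISC_ENABLE (MSR 0x1a0) disables turbo.
TURBO_DISABLE_BIT = 38

# Assumed layout of MSR 0x1a4: a set bit disables the named prefetcher.
PREFETCHER_BITS = {
    0: "L2 hardware prefetcher",
    1: "L2 adjacent cache line prefetcher",
    2: "DCU (L1D) hardware prefetcher",
    3: "DCU IP prefetcher",
}


def turbo_disabled(misc_enable):
    """Return True if the turbo disable bit is set in the given MSR value."""
    return bool((misc_enable >> TURBO_DISABLE_BIT) & 1)


def disabled_prefetchers(value):
    """Return the names of the prefetchers whose disable bit is set."""
    return [name for bit, name in sorted(PREFETCHER_BITS.items())
            if (value >> bit) & 1]


print(turbo_disabled(0x4000850089))   # value written to 0x1a0 by the script
print(disabled_prefetchers(0xF))      # value written to 0x1a4 by the script
```

Under these assumptions, 0x4000850089 has bit 38 set and 0xf sets all four disable flags, including the one for the DCU IP prefetcher.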

I have a contiguous buffer, aligned on 4KiB, large enough to span 256 cache lines separated by an integral number GAP of lines.

SECTION .bss ALIGN=4096

 buffer:    resb 256 * (1 + GAP) * 64

I use this function to flush the 256 lines.

flush_all:
 lea rdi, [buffer]              ;Start pointer
 mov esi, 256                   ;How many lines to flush

.flush_loop:
  lfence                        ;Prevent the previous clflush to be reordered after the load
  mov eax, [rdi]                ;Touch the page
  lfence                        ;Prevent the current clflush to be reordered before the load

  clflush  [rdi]                ;Flush a line
  add rdi, (1 + GAP)*64         ;Move to the next line

  dec esi
 jnz .flush_loop                ;Repeat

 lfence                         ;clflush are ordered with respect of fences ..
                                ;.. and lfence is ordered (locally) with respect of all instructions
 ret

The function loops through all the lines, touching every page in between (each page more than once) and flushing each line.

Then I use this function to profile the accesses.

profile:
 lea rdi, [buffer]           ;Pointer to the buffer
 mov esi, 256                ;How many lines to test
 lea r8, [timings_data]      ;Pointer to timings results

 mfence                      ;I'm pretty sure this is useless, but I included it to rule out ..
                             ;.. silly, hard to debug, scenarios

.profile: 
  mfence
  rdtscp
  lfence                     ;Read the TSC in-order (ignoring stores global visibility)

  mov ebp, eax               ;Read the low DWORD only (this is a short delay)

  ;PERFORM THE LOADING
  mov eax, DWORD [rdi]

  rdtscp
  lfence                     ;Again, read the TSC in-order

  sub eax, ebp               ;Compute the delta

  mov DWORD [r8], eax        ;Save it

  ;Advance the loop

  add r8, 4                  ;Move the results pointer
  add rdi, (1 + GAP)*64      ;Move to the next line

  dec esi                    ;Advance the loop
 jnz .profile

 ret

An MCVE is given in the appendix and a repository is available to clone.

When assembled with GAP set to 0, linked, and executed with taskset -c 0, the cycles necessary to fetch each line are shown below.

Only 64 lines are loaded from memory.

The output is stable across different runs. If I set GAP to 1, only 32 lines are fetched from memory. Of course, 64 * (1+0) * 64 = 32 * (1+1) * 64 = 4096, so this may be related to paging?

If a store is executed before the profiling (but after the flush) to one of the first 64 lines, the output changes to this

Any store to the other lines gives the first type of output.

I suspect the math is broken, but I need another pair of eyes to find out where.

EDIT

Hadi Brais pointed out a misuse of a volatile register; after fixing that, the output is now inconsistent.
I mostly see runs where the timings are low (~50 cycles) and sometimes runs where the timings are higher (~130 cycles).
I don't know where the 130-cycle figure comes from (too low for memory, too high for the cache?).

Code is fixed in the MCVE (and the repository).

If a store to any of the first lines is executed before the profiling, no change is reflected in the output.

APPENDIX - MCVE

BITS 64
DEFAULT REL

GLOBAL main

EXTERN printf
EXTERN exit

;Space between lines in the buffer
%define GAP 0

SECTION .bss ALIGN=4096



 buffer:    resb 256 * (1 + GAP) * 64   


SECTION .data

 timings_data:  TIMES 256 dd 0


 strNewLine db `\n0x%02x: `, 0
 strHalfLine    db "  ", 0
 strTiming  db `\e[48;5;16`,
  .importance   db "0",
        db `m\e[38;5;15m%03u\e[0m `, 0  

 strEnd     db `\n\n`, 0

SECTION .text

;'._ .''._ .''._ .''._ .''._ .''._ .''._ .''._ .''._ .''._ .''._ .' 
;   '     '     '     '     '     '     '     '     '     '     '   
; _' \  _' \  _' \  _' \  _' \  _' \  _' \  _' \  _' \  _' \  _' \ 
;/    \/    \/    \/    \/    \/    \/    \/    \/    \/    \/    \
;
;
;FLUSH ALL THE LINES OF A BUFFER FROM THE CACHES
;
;

flush_all:
 lea rdi, [buffer]  ;Start pointer
 mov esi, 256       ;How many lines to flush

.flush_loop:
  lfence        ;Prevent the previous clflush to be reordered after the load
  mov eax, [rdi]    ;Touch the page
  lfence        ;Prevent the current clflush to be reordered before the load

  clflush  [rdi]    ;Flush a line
  add rdi, (1 + GAP)*64 ;Move to the next line

  dec esi
 jnz .flush_loop    ;Repeat

 lfence         ;clflush are ordered with respect of fences ..
            ;.. and lfence is ordered (locally) with respect of all instructions
 ret


;'._ .''._ .''._ .''._ .''._ .''._ .''._ .''._ .''._ .''._ .''._ .' 
;   '     '     '     '     '     '     '     '     '     '     '   
; _' \  _' \  _' \  _' \  _' \  _' \  _' \  _' \  _' \  _' \  _' \ 
;/    \/    \/    \/    \/    \/    \/    \/    \/    \/    \/    \
;
;
;PROFILE THE ACCESS TO EVERY LINE OF THE BUFFER
;
;


profile:
 lea rdi, [buffer]      ;Pointer to the buffer
 mov esi, 256           ;How many lines to test
 lea r8, [timings_data]     ;Pointer to timings results


 mfence             ;I'm pretty sure this is useless, but I included it to rule out ..
                ;.. silly, hard to debug, scenarios

.profile: 
  mfence
  rdtscp
  lfence            ;Read the TSC in-order (ignoring stores global visibility)

  mov ebp, eax          ;Read the low DWORD only (this is a short delay)

  ;PERFORM THE LOADING
  mov eax, DWORD [rdi]

  rdtscp
  lfence            ;Again, read the TSC in-order

  sub eax, ebp          ;Compute the delta

  mov DWORD [r8], eax       ;Save it

  ;Advance the loop

  add r8, 4         ;Move the results pointer
  add rdi, (1 + GAP)*64     ;Move to the next line

  dec esi           ;Advance the loop
 jnz .profile

 ret

;'._ .''._ .''._ .''._ .''._ .''._ .''._ .''._ .''._ .''._ .''._ .' 
;   '     '     '     '     '     '     '     '     '     '     '   
; _' \  _' \  _' \  _' \  _' \  _' \  _' \  _' \  _' \  _' \  _' \ 
;/    \/    \/    \/    \/    \/    \/    \/    \/    \/    \/    \
;
;
;SHOW THE RESULTS
;
;

show_results:
 lea rbx, [timings_data]    ;Pointer to the timings
 xor r12, r12           ;Counter (up to 256)

.print_line:

 ;Format the output

 xor eax, eax
 mov esi, r12d
 lea rdi, [strNewLine]      ;Setup for a call to printf

 test r12d, 0fh
 jz .print          ;Test if counter is a multiple of 16

 lea rdi, [strHalfLine]     ;Setup for a call to printf

 test r12d, 07h         ;Test if counter is a multiple of 8
 jz .print

.print_timing:

  ;Print
  mov esi, DWORD [rbx]      ;Timing value

  ;Compute the color
  mov r10d, 60          ;Used to compute the color 
  mov eax, esi
  xor edx, edx
  div r10d          ;eax = Timing value / 60

  ;Update the color 


  add al, '0'
  mov edx, '5'
  cmp eax, edx
  cmova eax, edx
  mov BYTE [strTiming.importance], al

  xor eax, eax
  lea rdi, [strTiming]
  call printf WRT ..plt     ;Print a 3-digits number

  ;Advance the loop 

  inc r12d          ;Increment the counter
  add rbx, 4            ;Move to the next timing
  cmp r12d, 256
 jb .print_line         ;Advance the loop

  xor eax, eax
  lea rdi, [strEnd]
  call printf WRT ..plt     ;Print a new line

  ret

.print:

  call printf WRT ..plt     ;Print a string

jmp .print_timing

;'._ .''._ .''._ .''._ .''._ .''._ .''._ .''._ .''._ .''._ .''._ .' 
;   '     '     '     '     '     '     '     '     '     '     '   
; _' \  _' \  _' \  _' \  _' \  _' \  _' \  _' \  _' \  _' \  _' \ 
;/    \/    \/    \/    \/    \/    \/    \/    \/    \/    \/    \
;
;
;E N T R Y   P O I N T
;
;
;'._ .''._ .''._ .''._ .''._ .''._ .''._ .''._ .''._ .''._ .''._ .' 
;   '     '     '     '     '     '     '     '     '     '     '   
; _' \  _' \  _' \  _' \  _' \  _' \  _' \  _' \  _' \  _' \  _' \ 
;/    \/    \/    \/    \/    \/    \/    \/    \/    \/    \/    \

main:

 ;Flush all the lines of the buffer
 call flush_all

 ;Test the access times
 call profile

 ;Show the results
 call show_results

 ;Exit
 xor edi, edi
 call exit WRT ..plt

Solution

The buffer is allocated from the bss section and so when the program is loaded, the OS will map all of the buffer cache lines to the same CoW physical page. After flushing all of the lines, only the accesses to the first 64 lines in the virtual address space miss in all cache levels¹ because all² later accesses are to the same 4K page. That's why the latencies of the first 64 accesses fall in the range of the main memory latency and the latencies of all later accesses are equal to the L1 hit latency³ when GAP is zero.

When GAP is 1, every other line of the same physical page is accessed and so the number of main memory accesses (L3 misses) is 32 (half of 64). That is, the first 32 latencies will be in the range of the main memory latency and all later latencies will be L1 hits. Similarly, when GAP is 63, all accesses are to the same line. Therefore, only the first access will miss all caches.
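
The counts of 64, 32 and 1 follow from simple modular arithmetic. Here is a minimal Python sketch, assuming 4 KiB pages, 64-byte lines, and that every virtual page of the buffer is backed by the same physical page, as described above. The function names are illustrative and not part of the MCVE.

```python
PAGE_SIZE = 4096    # bytes per 4 KiB page
LINE_SIZE = 64      # bytes per cache line
NUM_ACCESSES = 256  # lines probed by the profile loop


def distinct_physical_lines(gap):
    """Count the physical lines touched when all virtual pages of the
    buffer alias one physical page: only the page offset matters."""
    stride = (1 + gap) * LINE_SIZE
    offsets = {(i * stride) % PAGE_SIZE for i in range(NUM_ACCESSES)}
    return len(offsets)


def first_touch_pattern(gap):
    """True for each access that is the first one to its physical line,
    i.e. the accesses expected to miss after the flush."""
    stride = (1 + gap) * LINE_SIZE
    seen = set()
    pattern = []
    for i in range(NUM_ACCESSES):
        offset = (i * stride) % PAGE_SIZE
        pattern.append(offset not in seen)
        seen.add(offset)
    return pattern


for gap in (0, 1, 63):
    print(gap, distinct_physical_lines(gap))
# prints:
# 0 64
# 1 32
# 63 1
```

In this model the misses are always the leading accesses: the first 64 for GAP 0, the first 32 for GAP 1, and only the very first for GAP 63.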

The solution is to change mov eax, [rdi] in flush_all to mov dword [rdi], 0 to ensure that the buffer is allocated in unique physical pages. (The lfence instructions in flush_all can be removed because the Intel manual states that clflush cannot be reordered with writes⁴.) This guarantees that, after initializing and flushing all lines, all accesses will miss all cache levels (but not the TLB, see: Does clflush also remove TLB entries?).

You can refer to Why are the user-mode L1 store miss events only counted when there is a store initialization loop? for another example where CoW pages can be deceiving.

In the previous version of this answer, I suggested removing the call to flush_all and using a GAP value of 63. With these changes, all of the access latencies appeared to be very high, and I incorrectly concluded that all of the accesses were missing all cache levels. As I said above, with a GAP value of 63, all of the accesses are to the same cache line, which is actually resident in the L1 cache. The real reason all of the latencies were high is that every access was to a different virtual page, and the TLB didn't hold a mapping for any of these virtual pages (all to the same physical page) because, with the call to flush_all removed, none of the virtual pages had been touched before. So the measured latencies represent the TLB miss latency, even though the line being accessed is in the L1 cache.
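
To see why GAP 63 stresses the TLB rather than the caches, count the virtual pages the profile loop walks through. This is a minimal Python sketch assuming 4 KiB pages and 64-byte lines; the function name is illustrative.

```python
PAGE_SIZE = 4096    # bytes per 4 KiB page
LINE_SIZE = 64      # bytes per cache line
NUM_ACCESSES = 256  # lines probed by the profile loop


def distinct_virtual_pages(gap):
    """Count the virtual pages touched by the 256 profiled accesses."""
    stride = (1 + gap) * LINE_SIZE
    pages = {(i * stride) // PAGE_SIZE for i in range(NUM_ACCESSES)}
    return len(pages)


for gap in (0, 1, 63):
    print(gap, distinct_virtual_pages(gap))
# prints:
# 0 4
# 1 8
# 63 256
```

With GAP 63, each of the 256 accesses lands on its own virtual page, so each one needs its own translation if none of the pages has been touched before.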

I also incorrectly claimed in the previous version of this answer that there is an L3 prefetching logic that cannot be disabled through MSR 0x1A4. If a particular prefetcher is turned off by setting its flag in MSR 0x1A4, then it is fully switched off. Also, there are no data prefetchers other than the ones documented by Intel.

Footnotes:

(1) If you don't disable the DCU IP prefetcher, it will actually prefetch back all the lines into the L1 after flushing them, so all accesses will still hit in the L1.

(2) In rare cases, the execution of interrupt handlers or scheduling other threads on the same core may cause some of the lines to be evicted from the L1 and potentially other levels of the cache hierarchy.

(3) Remember that you need to subtract the overhead of the rdtscp instructions. Note that the measurement method you used actually doesn't enable you to reliably distinguish between an L1 hit and an L2 hit. See: Memory latency measurement with time stamp counter.

(4) The Intel manual doesn't seem to specify whether clflush is ordered with reads, but it appears to me that it is.
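
As a rough illustration of footnote (3), the overhead correction can be sketched in Python. The calibration method (timing an empty region between the two rdtscp reads and taking the minimum) is an assumption, not something specified above, and all numbers are hypothetical placeholders rather than measurements.

```python
def estimate_overhead(empty_deltas):
    """Estimate the measurement overhead as the smallest delta observed
    when nothing is executed between the two timestamp reads."""
    return min(empty_deltas)


def subtract_overhead(raw_deltas, overhead):
    """Remove the overhead from each raw delta, clamping at zero."""
    return [max(0, delta - overhead) for delta in raw_deltas]


# Hypothetical values for illustration only.
empty_deltas = [32, 30, 31, 30]   # back-to-back timestamp reads
raw_deltas = [230, 34, 33]        # timed loads

overhead = estimate_overhead(empty_deltas)
print(overhead, subtract_overhead(raw_deltas, overhead))
# prints: 30 [200, 4, 3]
```

Even after this correction, differences of only a few cycles are within the noise of the method, which is why an L1 hit cannot be reliably told apart from an L2 hit this way.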
